# TensorRT-LLM Deployment on Jetson Orin
## 1 Installation
### 1.1 Install JetPack 6.1

Install JetPack 6.1 (including CUDA, cuDNN, and TensorRT) with the help of SDK Manager, then boost and lock the Jetson to the MAX-N clock with the following commands:

```bash
sudo nvpmodel -m 0
sudo jetson_clocks
```
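To confirm that the power mode and clocks took effect, you can query them (the output format varies slightly across JetPack releases):

```bash
# Query the active power model; MAX-N corresponds to mode 0 set above
sudo nvpmodel -q
# Show the current clock configuration
sudo jetson_clocks --show
```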
### 1.2 Install Prerequisites

```bash
sudo apt-get update
sudo apt-get install -y python3-pip libopenblas-dev git-lfs ccache
wget https://raw.githubusercontent.com/pytorch/pytorch/9b424aac1d70f360479dd919d6b7933b5a9181ac/.ci/docker/common/install_cusparselt.sh
export CUDA_VERSION=12.6
sudo -E bash ./install_cusparselt.sh
python3 -m pip install numpy=='1.26.1'
```
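As an optional sanity check before building, verify that the pinned NumPy version and the JetPack CUDA toolkit are visible (if `nvcc` is not found, add `/usr/local/cuda/bin` to your PATH):

```bash
# Should print 1.26.1 as pinned above
python3 -c "import numpy; print(numpy.__version__)"
# Should report the CUDA 12.6 toolkit shipped with JetPack 6.1
nvcc --version
```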
### 1.3 Install Jetson TensorRT-LLM

```bash
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git checkout v0.12.0-jetson
git lfs pull
python3 scripts/build_wheel.py --clean --cuda_architectures 87 -DENABLE_MULTI_DEVICE=0 --build_type Release --benchmarks --use_ccache
pip install build/tensorrt_llm-*.whl
```
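A minimal import check (assuming the wheel installed without errors) confirms that the build is usable:

```bash
# Should print the installed TensorRT-LLM version without raising an ImportError
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
```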
## 2 Build and Run

We take Meta-Llama-3-8B-Instruct with INT4-GPTQ quantization as an example.
### 2.1 Build the Engine with INT4-GPTQ

```bash
git clone https://huggingface.co/MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ
python convert_checkpoint.py --model_dir Meta-Llama-3-8B-Instruct-GPTQ --output_dir tllm_checkpoint_1gpu_gptq --dtype float16 --use_weight_only --weight_only_precision int4_gptq --per_group
export PATH=/home/nvidia/.local/bin:$PATH
trtllm-build --checkpoint_dir tllm_checkpoint_1gpu_gptq --output_dir engine_1gpu_gptq --gemm_plugin float16
```
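If the build succeeds, the engine directory should contain the serialized engine and its config (exact file names can vary between TensorRT-LLM versions):

```bash
ls engine_1gpu_gptq
# typically: config.json  rank0.engine
```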
### 2.2 Run the Engine

```bash
python3 ../run.py --max_output_len=50 --tokenizer_dir Meta-Llama-3-8B-Instruct --engine_dir=engine_1gpu_gptq --use_mmap
```
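run.py also accepts a custom prompt via `--input_text`; a hedged example (the prompt text below is purely illustrative):

```bash
# Same engine and tokenizer as above, with a user-supplied prompt
python3 ../run.py --max_output_len=50 --tokenizer_dir Meta-Llama-3-8B-Instruct --engine_dir=engine_1gpu_gptq --use_mmap --input_text "Explain what TensorRT-LLM does in one sentence."
```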
## 3 Reference Memory Usage

### Static Batching

- Without CUDA_LAZY_LOADING
| w/wo mmap | GPU Memory Usage (GB) | Total Memory Usage (GB) |
|---|---|---|
| without mmap | 7.8 | 16.2 |
| with mmap | 7.8 | 11.2 |
- With CUDA_LAZY_LOADING (Default)
| w/wo mmap | GPU Memory Usage (GB) | Total Memory Usage (GB) |
|---|---|---|
| without mmap | 6.8 | 12.3 |
| with mmap | 6.8 | 7.3 |
### Inflight Batching

In inflight batching mode, 90% of the free memory is allocated by default to the paged KV cache. Therefore, using mmap does little to reduce the overall memory consumption; instead, it increases the amount of hardware memory that remains available.
| w/wo mmap | GPU Memory Usage (GB) | Total Memory Usage (GB) |
|---|---|---|
| without mmap | 44.0 | 50.5 |
| with mmap | 48.9 | 49.7 |
NOTE: To control the memory allocation for the KV cache, you can use the `--kv_cache_free_gpu_memory_fraction` parameter when executing run.py.
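For example, to cap the paged KV cache at roughly 20% of the free memory (the 0.2 value below is purely illustrative):

```bash
# Limit the paged KV cache to ~20% of free memory instead of the 90% default
python3 ../run.py --max_output_len=50 --tokenizer_dir Meta-Llama-3-8B-Instruct --engine_dir=engine_1gpu_gptq --use_mmap --kv_cache_free_gpu_memory_fraction 0.2
```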
## 4 Reference Benchmark Performance

- Platform: Jetson Orin 64GB, MAX-N, JetPack 6.1
- Framework: TRT-LLM v0.12.0-jetson
- Model: Llama-3-8B

Reference command:

```bash
cpp/build/benchmarks/gptSessionBenchmark --engine_dir path/to/engine --batch_size "1" --input_output_len "128,128" --enable_cuda_graph
```
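The longer-sequence rows in the tables below use the same command with a different `--input_output_len`, for example:

```bash
# 512 input tokens / 512 output tokens, matching the second row of each table
cpp/build/benchmarks/gptSessionBenchmark --engine_dir path/to/engine --batch_size "1" --input_output_len "512,512" --enable_cuda_graph
```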
### INT4 Default

This configuration is obtained by specifying `--weight_only_precision int4` when running convert_checkpoint.py.
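A sketch of the corresponding conversion step, assuming the original FP16 Hugging Face checkpoint has been downloaded locally as Meta-Llama-3-8B-Instruct (the model and output directory names here are illustrative):

```bash
# Quantize to plain INT4 weight-only (no GPTQ); directory names are assumptions
python convert_checkpoint.py --model_dir Meta-Llama-3-8B-Instruct --output_dir tllm_checkpoint_1gpu_int4 --dtype float16 --use_weight_only --weight_only_precision int4
```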
| Batch Size | Input Length | Output Length | Context (ms) | Decode (token/s) |
|---|---|---|---|---|
| 1 | 128 | 128 | 91 | 35.9 |
| 1 | 512 | 512 | 260 | 35.2 |
| 1 | 1024 | 512 | 582 | 34.7 |
### INT4 GPTQ

This configuration is obtained by specifying `--weight_only_precision int4_gptq --per_group` when running convert_checkpoint.py, as in section 2.1 above.
| Batch Size | Input Length | Output Length | Context (ms) | Decode (token/s) |
|---|---|---|---|---|
| 1 | 128 | 128 | 105 | 34.5 |
| 1 | 512 | 512 | 292 | 33.7 |
| 1 | 1024 | 512 | 337 | 33.3 |