Mirror of https://github.com/NVIDIA/TensorRT-LLM.git (synced 2026-01-13 22:18:36 +08:00)
doc: fix path after examples migration (#3814)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
parent 635dcdcb9e
commit dfbcb543ce
@@ -10,4 +10,4 @@ examples/**/*.bin
 examples/**/*.engine
 examples/**/*.onnx
 examples/**/c-model
-examples/gpt/gpt*
+examples/models/core/gpt/gpt*
@@ -197,7 +197,7 @@ Several popular models are pre-defined and can be easily customized or extended
 To get started with TensorRT-LLM, visit our documentation:
 
 - [Quick Start Guide](https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html)
-- [Running DeepSeek](./examples/deepseek_v3)
+- [Running DeepSeek](./examples/models/core/deepseek_v3)
 - [Installation Guide for Linux](https://nvidia.github.io/TensorRT-LLM/installation/linux.html)
 - [Installation Guide for Grace Hopper](https://nvidia.github.io/TensorRT-LLM/installation/grace-hopper.html)
 - [Supported Hardware, Models, and other Software](https://nvidia.github.io/TensorRT-LLM/reference/support-matrix.html)
@@ -112,7 +112,7 @@ cd cpp/build
 Take GPT-350M as an example for 2-GPU inflight batching
 ```
 mpirun -n 2 ./benchmarks/gptManagerBenchmark \
---engine_dir ../../examples/gpt/trt_engine/gpt2-ib/fp16/2-gpu/ \
+--engine_dir ../../examples/models/core/gpt/trt_engine/gpt2-ib/fp16/2-gpu/ \
 --request_rate 10 \
 --dataset ../../benchmarks/cpp/preprocessed_dataset.json \
 --max_num_samples 500
@@ -125,7 +125,7 @@ cd cpp/build
 
 Currently encoder-decoder engines only support `--api executor`, `--type IFB`, `--enable_kv_cache_reuse false`, which are all default values so no specific settings required.
 
-Prepare t5-small engine from [examples/enc_dec](/examples/enc_dec/README.md#convert-and-split-weights) for the encoder-decoder 4-GPU inflight batching example.
+Prepare t5-small engine from [examples/models/core/enc_dec](/examples/models/core/enc_dec/README.md#convert-and-split-weights) for the encoder-decoder 4-GPU inflight batching example.
 
 Prepare the dataset suitable for engine input lengths.
 ```
@@ -147,8 +147,8 @@ cd cpp/build
 Run the benchmark
 ```
 mpirun --allow-run-as-root -np 4 ./benchmarks/gptManagerBenchmark \
---encoder_engine_dir ../../examples/enc_dec/tmp/trt_engines/t5-small-4gpu/bfloat16/encoder \
---decoder_engine_dir ../../examples/enc_dec/tmp/trt_engines/t5-small-4gpu/bfloat16/decoder \
+--encoder_engine_dir ../../examples/models/core/enc_dec/tmp/trt_engines/t5-small-4gpu/bfloat16/encoder \
+--decoder_engine_dir ../../examples/models/core/enc_dec/tmp/trt_engines/t5-small-4gpu/bfloat16/decoder \
 --dataset cnn_dailymail.json
 ```
 
@@ -173,7 +173,7 @@ Datasets with fixed input/output lengths for benchmarking can be generated with
 Take GPT-350M as an example for single GPU with static batching
 ```
 ./benchmarks/gptManagerBenchmark \
---engine_dir ../../examples/gpt/trt_engine/gpt2/fp16/1-gpu/ \
+--engine_dir ../../examples/models/core/gpt/trt_engine/gpt2/fp16/1-gpu/ \
 --request_rate -1 \
 --static_emulated_batch_size 32 \
 --static_emulated_timeout 100 \
@@ -213,7 +213,7 @@ CPP_LORA=chinese-llama-2-lora-13b-cpp
 EG_DIR=/tmp/lora-eg
 
 # Build lora enabled engine
-python examples/llama/convert_checkpoint.py --model_dir ${MODEL_CHECKPOINT} \
+python examples/models/core/llama/convert_checkpoint.py --model_dir ${MODEL_CHECKPOINT} \
 --output_dir ${CONVERTED_CHECKPOINT} \
 --dtype ${DTYPE} \
 --tp_size ${TP} \
@@ -59,9 +59,9 @@ The weights and built engines are stored under [cpp/tests/resources/models](reso
 To build the engines from the top-level directory:
 
 ```bash
-PYTHONPATH=examples/gpt:$PYTHONPATH python3 cpp/tests/resources/scripts/build_gpt_engines.py
-PYTHONPATH=examples/gptj:$PYTHONPATH python3 cpp/tests/resources/scripts/build_gptj_engines.py
-PYTHONPATH=examples/llama:$PYTHONPATH python3 cpp/tests/resources/scripts/build_llama_engines.py
+PYTHONPATH=examples/models/core/gpt:$PYTHONPATH python3 cpp/tests/resources/scripts/build_gpt_engines.py
+PYTHONPATH=examples/models/contrib/gpt:$PYTHONPATH python3 cpp/tests/resources/scripts/build_gptj_engines.py
+PYTHONPATH=examples/models/core/llama:$PYTHONPATH python3 cpp/tests/resources/scripts/build_llama_engines.py
 PYTHONPATH=examples/chatglm:$PYTHONPATH python3 cpp/tests/resources/scripts/build_chatglm_engines.py
 PYTHONPATH=examples/medusa:$PYTHONPATH python3 cpp/tests/resources/scripts/build_medusa_engines.py
 PYTHONPATH=examples/eagle:$PYTHONPATH python3 cpp/tests/resources/scripts/build_eagle_engines.py
@@ -71,7 +71,7 @@ PYTHONPATH=examples/redrafter:$PYTHONPATH python3 cpp/tests/resources/scripts/bu
 It is possible to build engines with tensor and pipeline parallelism for LLaMA using 4 GPUs.
 
 ```bash
-PYTHONPATH=examples/llama python3 cpp/tests/resources/scripts/build_llama_engines.py --only_multi_gpu
+PYTHONPATH=examples/models/core/llama python3 cpp/tests/resources/scripts/build_llama_engines.py --only_multi_gpu
 ```
 
 If there is an issue finding model_spec.so in engine building, manually build model_spec.so by
@@ -30,7 +30,9 @@ import tensorrt_llm.bindings as _tb
 
 
 def get_ckpt_without_quatization(model_dir, output_dir):
-build_args = [_sys.executable, "examples/gptj/convert_checkpoint.py"] + [
+build_args = [
+_sys.executable, "examples/models/contrib/gpt/convert_checkpoint.py"
+] + [
 '--model_dir={}'.format(model_dir),
 '--output_dir={}'.format(output_dir),
 ]
@@ -17,7 +17,7 @@ LLaMA, for example.
 
 Complete support of encoder-decoder models, like T5, will be added to
 TensorRT-LLM in a future release. An experimental version, only in Python for
-now, can be found in the [`examples/enc_dec`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/enc_dec) folder.
+now, can be found in the [`examples/models/core/enc_dec`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/enc_dec) folder.
 
 ## Overview
 
@@ -9,7 +9,7 @@ git-lfs clone https://huggingface.co/qychen/luotuo-lora-7b-0.1
 git-lfs clone https://huggingface.co/kunishou/Japanese-Alpaca-LoRA-7b-v0
 BASE_MODEL=llama-7b-hf
 
-python examples/llama/convert_checkpoint.py --model_dir ${BASE_MODEL} \
+python examples/models/core/llama/convert_checkpoint.py --model_dir ${BASE_MODEL} \
 --output_dir /tmp/llama_7b/trt_ckpt/fp16/1-gpu/ \
 --dtype float16
 
@@ -12,7 +12,7 @@ Here is an example to run llama-7b with Weight Streaming:
 ```bash
 
 # Convert model as normal. Assume hugging face model is in llama-7b-hf/
-python3 examples/llama/convert_checkpoint.py \
+python3 examples/models/core/llama/convert_checkpoint.py \
 --model_dir llama-7b-hf/ \
 --output_dir /tmp/llama_7b/trt_ckpt/fp16/1-gpu/ \
 --dtype float16
@@ -103,7 +103,7 @@ class Linear(Module):
 self.weight = Parameter(shape=(self.out_features, self.in_features), dtype=dtype)
 self.bias = Parameter(shape=(self.out_features, ), dtype=dtype)
 
-# The parameters are bound to the weights before compiling the model. See examples/gpt/weight.py:
+# The parameters are bound to the weights before compiling the model. See examples/models/core/gpt/weight.py:
 tensorrt_llm_gpt.layers[i].mlp.fc.weight.value = fromfile(...)
 tensorrt_llm_gpt.layers[i].mlp.fc.bias.value = fromfile(...)
 ```
@@ -277,7 +277,7 @@ max_output_len=128
 max_batch_size=4
 workers=$(( tp_size * pp_size ))
 
-python ${folder_trt_llm}/examples/llama/convert_checkpoint.py \
+python ${folder_trt_llm}/examples/models/core/llama/convert_checkpoint.py \
 --output_dir ${ckpt_dir} \
 --model_dir ${model_dir} \
 --dtype ${dtype} \
@@ -329,7 +329,7 @@ max_output_len=128
 max_batch_size=4
 workers=8
 
-python ${folder_trt_llm}/examples/llama/convert_checkpoint.py \
+python ${folder_trt_llm}/examples/models/core/llama/convert_checkpoint.py \
 --output_dir ${ckpt_dir} \
 --model_dir ${model_dir} \
 --dtype ${dtype} \
@@ -48,7 +48,7 @@ class LLaMAForCausalLM (DecoderModelForCausalLM):
 
 
 Then, in the convert_checkpoint.py script in the
-[`examples/llama/`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama/) directory of the GitHub repo,
+[`examples/models/core/llama/`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama/) directory of the GitHub repo,
 the logic can be greatly simplified. Even if the model definition code of TensorRT-LLM LLaMA class is changed due to some reason, the `from_hugging_face` API will keep the same, thus the existing workflow using this interface will not be affected.
 
 
@@ -68,7 +68,7 @@ In the 0.9 release, only LLaMA is refactored. Since popular LLaMA (and its varia
 
 
 In future releases, there might be `from_jax`, `from_nemo`, `from_keras` or other factory methods for different training checkpoints added.
-For example, the Gemma 2B model and the convert_checkpoint.py file in the [`examples/gemma`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/gemma/)
+For example, the Gemma 2B model and the convert_checkpoint.py file in the [`examples/models/core/gemma`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/gemma/)
 directory support JAX and Keras formats in addition to Hugging Face. The model developers can choose to implement **any subset** of these factory methods for the models they contributed to TensorRT-LLM.
 
 
@@ -117,7 +117,7 @@ These improvements will be published in the `main` branch soon, and will be
 included in the v0.7 & v0.8 releases.
 
 Similar examples running Llama-70B in TensorRT-LLM are published in
-[examples/llama](/examples/llama).
+[examples/models/core/llama](/examples/models/core/llama).
 
 For more information about H200, please see the [H200 announcement blog](./H200launch.md).
 
@@ -61,7 +61,7 @@ Using this model is subject to a [particular](https://ai.meta.com/resources/mode
 There are two ways to build a TensorRT-LLM engine:
 
 1. You can build the TensorRT-LLM engine from the Hugging Face model directly with the [`trtllm-build`](../commands/trtllm-build.rst) tool and then save the engine to disk for later use.
-Refer to the [README](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama) in the [`examples/llama`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama) repository on GitHub.
+Refer to the [README](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama) in the [`examples/models/core/llama`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama) repository on GitHub.
 
 After the engine building is finished, we can load the model:
 
@@ -655,7 +655,7 @@ To prepare a dataset, follow the same process as specified in [](#preparing-a-da
 To quantize the checkpoint:
 
 ```shell
-cd tensorrt_llm/examples/llama
+cd tensorrt_llm/examples/models/core/llama
 python ../quantization/quantize.py \
 --model_dir $checkpoint_dir \
 --dtype bfloat16 \
@@ -73,10 +73,10 @@ if __name__ == '__main__':
 
 TensorRT-LLM also has a command line interface for building and saving engines. This workflow consists of two steps
 
-1. Convert model checkpoint (HuggingFace, Nemo) to TensorRT-LLM checkpoint via `convert_checkpoint.py`. Each supported model has a `convert_checkpoint.py` associated it with it and can be found in the examples folder. For example, the `convert_checkpoint.py` script for Llama models can be found [here](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama/convert_checkpoint.py)
+1. Convert model checkpoint (HuggingFace, Nemo) to TensorRT-LLM checkpoint via `convert_checkpoint.py`. Each supported model has a `convert_checkpoint.py` associated it with it and can be found in the examples folder. For example, the `convert_checkpoint.py` script for Llama models can be found [here](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama/convert_checkpoint.py)
 2. Build engine by passing TensorRT-LLM checkpoint to `trtllm-build` command. The `trtllm-build` command is installed automatically when the `tensorrt_llm` package is installed.
 
-The README in the examples folder for supported models walks through building engines using this flow for a wide variety of situations. The examples folder for Llama models can be found at [https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama).
+The README in the examples folder for supported models walks through building engines using this flow for a wide variety of situations. The examples folder for Llama models can be found at [https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama).
 
 ## Benchmarking with `trtllm-bench`
 
@@ -42,7 +42,7 @@ The `LLM` class takes `tensor_parallel_size` and `pipeline_parallel_size` as par
 If you are using the [CLI flow for building engines](./benchmarking-default-performance.md#building-and-saving-engines-via-cli) you can specify tensor parallelism and pipeline parallelism by providing the `--tp_size` and `--tp_size` arguments to `convert_checkpoint.py`
 
 ```
-python examples/llama/convert_checkpoint.py --model_dir ./tmp/llama/405B/ \
+python examples/models/core/llama/convert_checkpoint.py --model_dir ./tmp/llama/405B/ \
 --output_dir ./tllm_checkpoint_16gpu_tp8_pp2 \
 --dtype float16 \
 --tp_size 8
@@ -52,7 +52,7 @@ if __name__ == '__main__':
 main()
 ```
 
-For an example of how to build an fp8 engine using the [TensorRT-LLM CLI workflow](./benchmarking-default-performance.md#building-and-saving-engines-via-cli) flow see [TensorRT-LLM LLaMA examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama). In short you first run [`examples/quantization/quantize.py`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization) to quantize and convert the model checkpoint to TensorRT-LLM format and then use `trtllm-build`.
+For an example of how to build an fp8 engine using the [TensorRT-LLM CLI workflow](./benchmarking-default-performance.md#building-and-saving-engines-via-cli) flow see [TensorRT-LLM LLaMA examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama). In short you first run [`examples/quantization/quantize.py`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization) to quantize and convert the model checkpoint to TensorRT-LLM format and then use `trtllm-build`.
 
 > ***Note: While quantization aims to preserve model accuracy this is not guaranteed and it is extremely important you check that the quality of outputs remains sufficient after quantization.***
 
@@ -92,7 +92,7 @@ For examples and command syntax, refer to the [trtllm-serve](commands/trtllm-ser
 (quick-start-guide-compile)=
 ### Compile the Model into a TensorRT Engine
 
-Use the [Llama model definition](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama) from the `examples/llama` directory of the GitHub repository.
+Use the [Llama model definition](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama) from the `examples/models/core/llama` directory of the GitHub repository.
 The model definition is a minimal example that shows some of the optimizations available in TensorRT-LLM.
 
 ```console
@@ -104,7 +104,7 @@ make -C docker release_run LOCAL_USER=1
 huggingface-cli login --token *****
 
 # Convert the model into TensorRT-LLM checkpoint format
-cd examples/llama
+cd examples/models/core/llama
 pip install -r requirements.txt
 pip install --upgrade transformers # Llama 3.1 requires transformer 4.43.0+ version.
 python3 convert_checkpoint.py --model_dir Meta-Llama-3.1-8B-Instruct --output_dir llama-3.1-8b-ckpt
@@ -117,7 +117,7 @@ trtllm-build --checkpoint_dir llama-3.1-8b-ckpt \
 
 When you create a model definition with the TensorRT-LLM API, you build a graph of operations from [NVIDIA TensorRT](https://developer.nvidia.com/tensorrt) primitives that form the layers of your neural network. These operations map to specific kernels; prewritten programs for the GPU.
 
-In this example, we included the `gpt_attention` plugin, which implements a FlashAttention-like fused attention kernel, and the `gemm` plugin, that performs matrix multiplication with FP32 accumulation. We also called out the desired precision for the full model as FP16, matching the default precision of the weights that you downloaded from Hugging Face. For more information about plugins and quantizations, refer to the [Llama example](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama) and {ref}`precision` section.
+In this example, we included the `gpt_attention` plugin, which implements a FlashAttention-like fused attention kernel, and the `gemm` plugin, that performs matrix multiplication with FP32 accumulation. We also called out the desired precision for the full model as FP16, matching the default precision of the weights that you downloaded from Hugging Face. For more information about plugins and quantizations, refer to the [Llama example](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama) and {ref}`precision` section.
 
 ### Run the Model
 
@@ -85,8 +85,8 @@ The activations are encoded using floating-point values (FP16 or BF16).
 To use INT4/INT8 Weight-Only methods, the user must determine the scaling
 factors to use to quantize and dequantize the weights of the model.
 
-This release includes examples for [GPT](source:examples/gpt) and
-[LLaMA](source:examples/llama).
+This release includes examples for [GPT](source:examples/models/core/gpt) and
+[LLaMA](source:examples/models/core/llama).
 
 ## GPTQ and AWQ (W4A16)
 
@@ -101,9 +101,9 @@ plugin and the corresponding
 [`weight_only_groupwise_quant_matmul`](source:tensorrt_llm/quantization/functional.py)
 Python function, for details.
 
-This release includes examples of applying GPTQ to [GPT-NeoX](source:examples/gpt)
-and [LLaMA-v2](source:examples/llama), as well as an example of using AWQ with
-[GPT-J](source:examples/gptj). Those examples are experimental implementations and
+This release includes examples of applying GPTQ to [GPT-NeoX](source:examples/models/core/gpt)
+and [LLaMA-v2](source:examples/models/core/llama), as well as an example of using AWQ with
+[GPT-J](source:examples/models/contrib/gpt). Those examples are experimental implementations and
 are likely to evolve in a future release.
 
 ## FP8 (Hopper)
@@ -10,69 +10,69 @@ TensorRT-LLM optimizes the performance of a range of well-known models on NVIDIA
 
 - [Arctic](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/arctic)
 - [Baichuan/Baichuan2](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/baichuan)
-- [BART](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/enc_dec)
-- [BERT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/bert)
+- [BART](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/enc_dec)
+- [BERT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/bert)
 - [BLOOM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/bloom)
-- [ByT5](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/enc_dec)
+- [ByT5](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/enc_dec)
 - [GLM/ChatGLM/ChatGLM2/ChatGLM3/GLM-4](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/chatglm)
-- [Code LLaMA](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama)
+- [Code LLaMA](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama)
 - [DBRX](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/dbrx)
-- [Exaone](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/exaone)
-- [FairSeq NMT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/enc_dec)
+- [Exaone](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/exaone)
+- [FairSeq NMT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/enc_dec)
 - [Falcon](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/falcon)
-- [Flan-T5](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/enc_dec) [^encdec]
-- [Gemma/Gemma2](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/gemma)
-- [GPT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/gpt)
-- [GPT-J](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/gptj)
-- [GPT-Nemo](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/gpt)
-- [GPT-NeoX](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/gptneox)
-- [Granite-3.0](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/granite)
+- [Flan-T5](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/enc_dec) [^encdec]
+- [Gemma/Gemma2](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/gemma)
+- [GPT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/gpt)
+- [GPT-J](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/contrib/gpt)
+- [GPT-Nemo](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/gpt)
+- [GPT-NeoX](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/gptneox)
+- [Granite-3.0](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/granite)
 - [Grok-1](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/grok)
 - [InternLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/internlm)
-- [InternLM2](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/internlm2)
-- [LLaMA/LLaMA 2/LLaMA 3/LLaMA 3.1](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama)
-- [Mamba](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/mamba)
-- [mBART](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/enc_dec)
-- [Minitron](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/nemotron)
-- [Mistral](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama)
-- [Mistral NeMo](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama)
-- [Mixtral](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/mixtral)
+- [InternLM2](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/internlm2)
+- [LLaMA/LLaMA 2/LLaMA 3/LLaMA 3.1](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama)
+- [Mamba](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/mamba)
+- [mBART](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/enc_dec)
+- [Minitron](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/nemotron)
+- [Mistral](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama)
+- [Mistral NeMo](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama)
+- [Mixtral](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/mixtral)
 - [MPT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/mpt)
-- [Nemotron](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/nemotron)
-- [mT5](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/enc_dec)
+- [Nemotron](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/nemotron)
+- [mT5](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/enc_dec)
 - [OPT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/opt)
-- [Phi-1.5/Phi-2/Phi-3](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/phi)
-- [Qwen/Qwen1.5/Qwen2](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/qwen)
-- [Qwen-VL](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/qwenvl)
-- [RecurrentGemma](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/recurrentgemma)
+- [Phi-1.5/Phi-2/Phi-3](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/phi)
+- [Qwen/Qwen1.5/Qwen2](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/qwen)
+- [Qwen-VL](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/qwenvl)
+- [RecurrentGemma](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/recurrentgemma)
 - [Replit Code](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/mpt) [^replitcode]
-- [RoBERTa](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/bert)
-- [SantaCoder](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/gpt)
+- [RoBERTa](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/bert)
+- [SantaCoder](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/gpt)
 - [Skywork](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/skywork)
 - [Smaug](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/smaug)
-- [StarCoder](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/gpt)
-- [T5](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/enc_dec)
-- [Whisper](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/whisper)
+- [StarCoder](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/gpt)
+- [T5](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/enc_dec)
+- [Whisper](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/whisper)
 
 
 ### Multi-Modal Models [^multimod]
 
-- [BLIP2 w/ OPT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal)
-- [BLIP2 w/ T5](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal)
-- [CogVLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal) [^bf16only]
-- [Deplot](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal)
-- [Fuyu](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal)
-- [Kosmos](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal)
-- [LLaVA-v1.5](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal)
-- [LLaVa-Next](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal)
-- [LLaVa-OneVision](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal)
-- [NeVA](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal)
-- [Nougat](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal)
-- [Phi-3-vision](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal)
-- [Video NeVA](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal)
-- [VILA](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal)
-- [MLLaMA](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal)
-- [LLama 3.2 VLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal)
+- [BLIP2 w/ OPT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/multimodal)
+- [BLIP2 w/ T5](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/multimodal)
+- [CogVLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/multimodal) [^bf16only]
+- [Deplot](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/multimodal)
+- [Fuyu](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/multimodal)
+- [Kosmos](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/multimodal)
+- [LLaVA-v1.5](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/multimodal)
+- [LLaVa-Next](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/multimodal)
+- [LLaVa-OneVision](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/multimodal)
+- [NeVA](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/multimodal)
- [Nougat](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/multimodal)
+- [Phi-3-vision](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/multimodal)
+- [Video NeVA](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/multimodal)
+- [VILA](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/multimodal)
+- [MLLaMA](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/multimodal)
+- [LLama 3.2 VLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/multimodal)
 
 
 (support-matrix-hardware)=
@@ -76,7 +76,7 @@ Here is an example to print the values of the MLP output tensor in the GPT model
 Enable the `--enable_debug_output` option when building engines with `trtllm-build`
 
 ```bash
-cd examples/gpt
+cd examples/models/core/gpt
 
 # Download hf gpt2 model
 rm -rf gpt2 && git clone https://huggingface.co/gpt2-medium gpt2
@@ -323,6 +323,6 @@ As a rule of thumb, if you are running TensorRT-LLM interactively on a Slurm
 node, prefix your commands with `mpirun -n 1` to run TensorRT-LLM in a
 dedicated MPI environment, not the one provided by your Slurm allocation.
 
-For example: `mpirun -n 1 python3 examples/gpt/build.py ...`
+For example: `mpirun -n 1 python3 examples/models/core/gpt/build.py ...`
 
 It's critical that it's always `-n 1` regardless of how many GPUs are being used. If you'd use `-n 2` for a 2 GPU program it will not work. `mpirun` here isn't being used to orchestrate multiple processes, but to invoke the right environment on SLURM. The internal MPI implementation deals with spawning the additional processes.
@@ -43,14 +43,14 @@ Each request can specify a Lookahead configuration, noted as `(w, n, g)`. If non
 ### Convert Checkpoint
 
 This example is based on the Vicuna-7b v1.3 model, a fine-tuned Llama model.
-Checkpoint conversion is similar to any standard autoregressive model, such as the models located in the [examples/llama](../../examples/llama) directory.
+Checkpoint conversion is similar to any standard autoregressive model, such as the models located in the [examples/models/core/llama](../../examples/models/core/llama) directory.
 
 ```bash
 MODEL_DIR=/path/to/vicuna-7b-v1.3
 ENGINE_DIR=tmp/engine
 CKPT_DIR=tmp/engine/ckpt
 
-python3 examples/llama/convert_checkpoint.py \
+python3 examples/models/core/llama/convert_checkpoint.py \
 --model_dir=$MODEL_DIR \
 --output_dir=$CKPT_DIR \
 --dtype=float16 \
@@ -4,7 +4,7 @@ This document shows how to build and run a [Arctic](https://huggingface.co/Snowf
 
 The TensorRT-LLM Arctic implementation is based on the LLaMA model, with Mixture of Experts (MoE) enabled. The implementation can
 be found in [llama/model.py](../../../../tensorrt_llm/models/llama/model.py).
-See the LLaMA example [`examples/llama`](../../../llama) for details.
+See the LLaMA example [`examples/models/core/llama`](../../../llama) for details.
 
 - [Arctic](#arctic)
 - [Download model checkpoints](#download-model-checkpoints)
@@ -80,7 +80,7 @@ Test your engine with the [run.py](../run.py) script:
 mpirun -n ${TP} --allow-run-as-root python ../../../run.py --engine_dir ./tmp/trt_engines/${ENGINE} --tokenizer_dir tmp/hf_checkpoints/${HF_MODEL} --max_output_len 20 --input_text "The future of AI is" |& tee tmp/trt_engines/${ENGINE}_run.log
 ```
 
-For more examples see [`examples/llama/README.md`](../../../llama/README.md)
+For more examples see [`examples/models/core/llama/README.md`](../../../llama/README.md)
 
 
 ### OOTB
@@ -29,7 +29,7 @@ This document explains how to build the [ChatGLM-6B](https://huggingface.co/THUD
 The TensorRT-LLM ChatGLM implementation can be found in [`tensorrt_llm/models/chatglm/model.py`](../../tensorrt_llm/models/chatglm/model.py).
 The TensorRT-LLM ChatGLM example code is located in [`examples/models/contrib/chatglm-6b`](./). There is one main file:
 
-* [`examples/glm-4-9b/convert_checkpoint.py`](../../../glm-4-9b/convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT-LLM format.
+* [`examples/models/core/glm-4-9b/convert_checkpoint.py`](../../../glm-4-9b/convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT-LLM format.
 
 In addition, there are two shared files in the parent folder [`examples`](../../../) for inference and evaluation:
 
@@ -103,4 +103,4 @@ cp chatglm_6b/tokenization_chatglm.py chatglm_6b/tokenization_chatglm.py-backup
 cp tokenization_chatglm.py chatglm_6b
 ```
 
-For more example codes, please refer to the [examples/glm-4-9b/README.md](../../../glm-4-9b/README.md).
+For more example codes, please refer to the [examples/models/core/glm-4-9b/README.md](../../../glm-4-9b/README.md).
@@ -29,7 +29,7 @@ This document explains how to build the [ChatGLM2-6B](https://huggingface.co/THU
 The TensorRT-LLM ChatGLM implementation can be found in [`tensorrt_llm/models/chatglm/model.py`](../../tensorrt_llm/models/chatglm/model.py).
 The TensorRT-LLM ChatGLM example code is located in [`examples/models/contrib/chatglm2-6b`](./). There is one main file:
 
-* [`examples/glm-4-9b/convert_checkpoint.py`](../../../glm-4-9b/convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT-LLM format.
+* [`examples/models/core/glm-4-9b/convert_checkpoint.py`](../../../glm-4-9b/convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT-LLM format.
 
 In addition, there are two shared files in the parent folder [`examples`](../../../) for inference and evaluation:
 
@@ -99,4 +99,4 @@ git clone https://huggingface.co/THUDM/chatglm2-6b chatglm2_6b
 git clone https://huggingface.co/THUDM/chatglm2-6b-32k chatglm2_6b_32k
 ```
 
-For more example codes, please refer to the [examples/glm-4-9b/README.md](../../../glm-4-9b/README.md).
+For more example codes, please refer to the [examples/models/core/glm-4-9b/README.md](../../../glm-4-9b/README.md).
@@ -29,7 +29,7 @@ This document explains how to build the [ChatGLM3-6B](https://huggingface.co/THU
 The TensorRT-LLM ChatGLM implementation can be found in [`tensorrt_llm/models/chatglm/model.py`](../../tensorrt_llm/models/chatglm/model.py).
 The TensorRT-LLM ChatGLM example code is located in [`examples/models/contrib/chatglm3-6b-32k`](./). There is one main file:
 
-* [`examples/glm-4-9b/convert_checkpoint.py`](../../../glm-4-9b/convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT-LLM format.
+* [`examples/models/core/glm-4-9b/convert_checkpoint.py`](../../../glm-4-9b/convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT-LLM format.
 
 In addition, there are two shared files in the parent folder [`examples`](../../../) for inference and evaluation:
 
@@ -103,4 +103,4 @@ git clone https://huggingface.co/THUDM/chatglm3-6b-base chatglm3_6b_base
 git clone https://huggingface.co/THUDM/chatglm3-6b-32k chatglm3_6b_32k
 ```
 
-For more example codes, please refer to the [examples/glm-4-9b/README.md](../../../glm-4-9b/README.md).
+For more example codes, please refer to the [examples/models/core/glm-4-9b/README.md](../../../glm-4-9b/README.md).
@@ -59,7 +59,7 @@ Here're some examples:
 # Build a single-GPU float16 engine from HF weights.
 # gpt_attention_plugin is necessary in InternLM.
 # Try use_gemm_plugin to prevent accuracy issue.
-cd examples/llama
+cd examples/models/core/llama
 
 # Convert the InternLM 7B model using a single GPU and FP16.
 python convert_checkpoint.py --model_dir ./internlm-chat-7b/ \
@@ -117,7 +117,7 @@ and then export the scaling factors needed for INT8 KV cache inference.
 Example:
 
 ```bash
-cd examples/llama
+cd examples/models/core/llama
 
 # For 7B models
 python convert_checkpoint.py --model_dir ./internlm-chat-7b \
@@ -135,7 +135,7 @@ trtllm-build --checkpoint_dir ./internlm-chat-7b/smooth_internlm/int8_kv_cache/
 
 
 ```bash
-cd examples/llama
+cd examples/models/core/llama
 
 # For 20B models
 python convert_checkpoint.py --model_dir ./internlm-chat-20b \
@@ -182,7 +182,7 @@ Unlike the FP16 build where the HF weights are processed and loaded into the Ten
 
 Example:
 ```bash
-cd examples/llama
+cd examples/models/core/llama
 
 # For 7B models
 python convert_checkpoint.py --model_dir ./internlm-chat-7b --output_dir ./internlm-chat-7b/smooth_internlm/sq0.5/ --dtype float16 --smoothquant 0.5
@@ -192,7 +192,7 @@ trtllm-build --checkpoint_dir ./internlm-chat-7b/smooth_internlm/sq0.5/ \
 --gemm_plugin float16
 
 # For 20B models
-cd examples/llama
+cd examples/models/core/llama
 
 python convert_checkpoint.py --model_dir ./internlm-chat-20b --output_dir ./internlm-chat-20b/smooth_internlm/sq0.5/ --dtype float16 --smoothquant 0.5
 trtllm-build --checkpoint_dir ./internlm-chat-20b/smooth_internlm/sq0.5/ \
@@ -211,7 +211,7 @@ Examples of build invocations:
 
 ```bash
 # Build model for SmoothQuant in the _per_token_ + _per_channel_ mode
-cd examples/llama
+cd examples/models/core/llama
 
 # 7B model
 python convert_checkpoint.py --model_dir ./internlm-chat-7b --output_dir ./internlm-chat-7b/smooth_internlm/sq0.5/ --dtype float16 --smoothquant 0.5 --per_channel --per_token
@@ -39,7 +39,7 @@ git clone https://huggingface.co/Skywork/Skywork-13B-base
 ### 2. Convert HF Model to TRT Checkpoint
 
 ```bash
-cd examples/llama
+cd examples/models/core/llama
 
 # fp16 model
 python3 convert_checkpoint.py --model_dir ./Skywork-13B-base \
@@ -5,7 +5,7 @@ This document explains how to build the BERT family, specifically [BERT](https:/
 ## Overview
 
 The TensorRT-LLM BERT family implementation can be found in [`tensorrt_llm/models/bert/model.py`](../../../../tensorrt_llm/models/bert/model.py).
-The TensorRT-LLM BERT family example code is located in [`examples/bert`](./). There are two main files in that folder:
+The TensorRT-LLM BERT family example code is located in [`examples/models/core/bert`](./). There are two main files in that folder:
 
 * [`convert_checkpoint.py`](./convert_checkpoint.py) to convert the BERT model into tensorrt-llm checkpoint format.
 * [`run.py`](./run.py) to run the inference on an input text,
@@ -19,7 +19,7 @@ This document explains how to build the [C4AI Command-R](https://huggingface.co/
 ## Overview
 
 The TensorRT-LLM Command-R implementation can be found in [`tensorrt_llm/models/commandr/model.py`](../../../../tensorrt_llm/models/commandr/model.py).
-The TensorRT-LLM Command-R example code is located in [`examples/commandr`](./). There is one main file:
+The TensorRT-LLM Command-R example code is located in [`examples/models/core/commandr`](./). There is one main file:
 
 * [`convert_checkpoint.py`](./convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT-LLM format.
 
@@ -27,7 +27,7 @@ This document shows how to build and run an Encoder-Decoder (Enc-Dec) model in T
 
 ## Overview
 
-The TensorRT-LLM Enc-Dec implementation can be found in [tensorrt_llm/models/enc_dec/model.py](../../../../tensorrt_llm/models/enc_dec/model.py). The TensorRT-LLM Enc-Dec example code is located in [`examples/enc_dec`](./):
+The TensorRT-LLM Enc-Dec implementation can be found in [tensorrt_llm/models/enc_dec/model.py](../../../../tensorrt_llm/models/enc_dec/model.py). The TensorRT-LLM Enc-Dec example code is located in [`examples/models/core/enc_dec`](./):
 
 * `trtllm-build` to build the [TensorRT](https://developer.nvidia.com/tensorrt) engine(s) needed to run the Enc-Dec model,
 * [`run.py`](./run.py) to run the inference on an example input text.
@@ -35,7 +35,7 @@ The TensorRT-LLM Enc-Dec implementation can be found in [tensorrt_llm/models/enc
 * [`convert_checkpoint.py`](./convert_checkpoint.py) to convert weights from HuggingFace or FairSeq format to TRT-LLM format, and split weights for multi-GPU inference,
 ## Usage
 
-The TensorRT-LLM Enc-Dec example code locates at [examples/enc_dec](./). It takes HuggingFace or FairSeq model name as input, and builds the corresponding TensorRT engines. On each GPU, there will be two TensorRT engines, one for Encoder and one for Decoder.
+The TensorRT-LLM Enc-Dec example code locates at [examples/models/core/enc_dec](./). It takes HuggingFace or FairSeq model name as input, and builds the corresponding TensorRT engines. On each GPU, there will be two TensorRT engines, one for Encoder and one for Decoder.
 
 ## Encoder-Decoder Model Support
 
@@ -225,7 +225,7 @@ For pure C++ runtime, there is no example given yet. Please check the [`Executor
 
 #### Run Python runtime
 
-For pure Python runtime, you can still use the encoder-decoder specific script under `examples/enc_dec/`.
+For pure Python runtime, you can still use the encoder-decoder specific script under `examples/models/core/enc_dec/`.
 
 ```bash
 # Inferencing w/ single GPU greedy search, compare results with HuggingFace FP32
@@ -3,7 +3,7 @@
 This document shows how to build and run a [EXAONE](https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct) model in TensorRT-LLM.
 
 The TensorRT-LLM EXAONE implementation is based on the LLaMA model. The implementation can be found in [llama/model.py](../../../../tensorrt_llm/models/llama/model.py).
-See the LLaMA example [`examples/llama`](../llama) for details.
+See the LLaMA example [`examples/models/core/llama`](../llama) for details.
 
 - [EXAONE](#exaone)
 - [Support Matrix](#support-matrix)
@@ -211,4 +211,4 @@ python ../../../summarize.py \
 --engine_dir trt_engines/exaone/fp16/1-gpu
 ```
 
-For more examples see [`examples/llama/README.md`](../llama/README.md)
+For more examples see [`examples/models/core/llama/README.md`](../llama/README.md)
@@ -601,7 +601,7 @@ UNIFIED_CKPT_PATH=/tmp/checkpoints/tmp_$variant_it_tensorrt_llm/bf16/tp1/
 ENGINE_PATH=/tmp/gemma2/$variant/bf16/1-gpu/
 VOCAB_FILE_PATH=gemma-2-$variant-it/tokenizer.model
 
-python3 ./examples/gemma/convert_checkpoint.py \
+python3 ./examples/models/core/gemma/convert_checkpoint.py \
 --ckpt-type hf \
 --model-dir ${CKPT_PATH} \
 --dtype bfloat16 \
@@ -27,7 +27,7 @@ This document explains how to build the [glm-4-9b](https://huggingface.co/THUDM/
 ## Overview
 
 The TensorRT-LLM ChatGLM implementation can be found in [`tensorrt_llm/models/chatglm/model.py`](../../../../tensorrt_llm/models/chatglm/model.py).
-The TensorRT-LLM ChatGLM example code is located in [`examples/glm-4-9b`](./). There is one main file:
+The TensorRT-LLM ChatGLM example code is located in [`examples/models/core/glm-4-9b`](./). There is one main file:
 
 * [`convert_checkpoint.py`](./convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT-LLM format.
 
@@ -6,7 +6,7 @@ This document shows how to build and run InternLM2 7B / 20B models in TensorRT-L
 
 The TensorRT-LLM InternLM2 implementation is based on the LLaMA model. The implementation can
 be found in [model.py](../../../../tensorrt_llm/models/llama/model.py).
-The TensorRT-LLM InternLM2 example code lies in [`examples/internlm2`](./):
+The TensorRT-LLM InternLM2 example code lies in [`examples/models/core/internlm2`](./):
 
 * [`convert_checkpoint.py`](./convert_checkpoint.py) converts the Huggingface Model of InternLM2 into TensorRT-LLM checkpoint.
 
@@ -231,7 +231,7 @@ commands still work.
 Note that the `rope_theta` and `vocab_size` are larger in LLaMA v3 models and these values are now inferred
 or pickup up from the `params.json` when using the `meta_ckpt_dir`.
 
-LLaMA 3.2 models are also supported now. For text only model like [Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B), the steps are same to v3.0. For vision model like [Llama-3.2-11B-Vision](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision), please refer to the [examples/mllama/README.md](../mllama/README.md)
+LLaMA 3.2 models are also supported now. For text only model like [Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B), the steps are same to v3.0. For vision model like [Llama-3.2-11B-Vision](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision), please refer to the [examples/models/core/mllama/README.md](../mllama/README.md)
 
 ```bash
 # Build LLaMA v3 8B TP=1 using HF checkpoints directly.
@@ -1,3 +1,3 @@
 # MLLaMA (llama-3.2 Vision model)
 
-MLLaMA is a multimodal model, and reuse the multimodal modules in [examples/multimodal](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal)
+MLLaMA is a multimodal model, and reuse the multimodal modules in [examples/models/core/multimodal](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/multimodal)
@@ -256,7 +256,7 @@ Currently, CogVLM only support bfloat16 precision.
 ## Deplot
 
 1. Download Huggingface weights and convert original checkpoint to TRT-LLM checkpoint format
-following example in `examples/enc_dec/README.md`.
+following example in `examples/models/core/enc_dec/README.md`.
 
 ```bash
 export MODEL_NAME="deplot"
@@ -320,7 +320,7 @@ Currently, CogVLM only support bfloat16 precision.
 git clone https://huggingface.co/adept/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
 ```
 
-2. Convert Huggingface weights into TRT-LLM checkpoints and build TRT engines using scripts in `examples/gpt`.
+2. Convert Huggingface weights into TRT-LLM checkpoints and build TRT engines using scripts in `examples/models/core/gpt`.
 The LLM portion of Fuyu uses a Persimmon model
 ```bash
 python ../gpt/convert_checkpoint.py \
@@ -489,7 +489,7 @@ Firstly, please install transformers with 4.37.2
 git clone https://huggingface.co/microsoft/kosmos-2-patch14-224 tmp/hf_models/${MODEL_NAME}
 ```
 
-2. Convert Huggingface weights into TRT-LLM checkpoints and build TRT engines using scripts in `examples/gpt`.
+2. Convert Huggingface weights into TRT-LLM checkpoints and build TRT engines using scripts in `examples/models/core/gpt`.
 ```bash
 python ../gpt/convert_checkpoint.py \
 --model_dir tmp/hf_models/${MODEL_NAME} \
@@ -560,7 +560,7 @@ Firstly, please install transformers with 4.37.2
 git clone https://huggingface.co/Efficient-Large-Model/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
 ```
 
-2. Generate TRT-LLM engine for LLaMA following example in `examples/llama/README.md` and `examples/qwen/README.md`
+2. Generate TRT-LLM engine for LLaMA following example in `examples/models/core/llama/README.md` and `examples/models/core/qwen/README.md`
 
 ```bash
 python ../llama/convert_checkpoint.py \
@@ -725,7 +725,7 @@ Firstly, please install transformers with 4.37.2
 
 This section shows how to build and run a LLaMA-3.2 Vision model in TensorRT-LLM. We use [Llama-3.2-11B-Vision/](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision) as an example.
 
-For LLaMA-3.2 text model, please refer to the [examples/llama/README.md](../llama/README.md) because it shares the model architecture of llama.
+For LLaMA-3.2 text model, please refer to the [examples/models/core/llama/README.md](../llama/README.md) because it shares the model architecture of llama.
 
 ### Support data types <!-- omit from toc -->
 * BF16
@@ -831,7 +831,7 @@ Note that for instruct Vision model, please set the `max_encoder_input_len` as `
 
 [NeVA](https://docs.nvidia.com/nemo-framework/user-guide/latest/multimodalmodels/neva/index.html) is a groundbreaking addition to the NeMo Multimodal ecosystem. This model seamlessly integrates large language-centric models with a vision encoder, that can be deployed in TensorRT-LLM.
 
-1. Generate TRT-LLM engine for NVGPT following example in `examples/gpt/README.md`. To adhere to the NVGPT conventions of the conversion script, some layer keys have to be remapped using `--nemo_rename_key`.
+1. Generate TRT-LLM engine for NVGPT following example in `examples/models/core/gpt/README.md`. To adhere to the NVGPT conventions of the conversion script, some layer keys have to be remapped using `--nemo_rename_key`.
 
 ```bash
 export MODEL_NAME="neva"
@@ -886,11 +886,11 @@ Note that for instruct Vision model, please set the `max_encoder_input_len` as `
 git clone https://huggingface.co/facebook/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
 ```
 
-2. Convert Huggingface weights into TRT-LLM checkpoints and build TRT engines using scripts in `examples/enc_dec`
+2. Convert Huggingface weights into TRT-LLM checkpoints and build TRT engines using scripts in `examples/models/core/enc_dec`
 
 Nougat uses mBART architecture but replaces the LLM encoder with a Swin Transformer encoder.
 To achieve this, we add an extra `--nougat` flag (over mBART example) to
-`convert_checkpoint.py` in `examples/enc_dec` and `trtllm-build`.
+`convert_checkpoint.py` in `examples/models/core/enc_dec` and `trtllm-build`.
 
 ```bash
 python ../enc_dec/convert_checkpoint.py --model_type bart \
@@ -938,7 +938,7 @@ Note that for instruct Vision model, please set the `max_encoder_input_len` as `
 git clone https://huggingface.co/microsoft/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
 ```
 
-2. Convert Huggingface weights into TRT-LLM checkpoints and build TRT engines using scripts in `examples/phi`.
+2. Convert Huggingface weights into TRT-LLM checkpoints and build TRT engines using scripts in `examples/models/core/phi`.
 ```bash
 python ../phi/convert_checkpoint.py \
 --model_dir tmp/hf_models/${MODEL_NAME} \
@@ -1045,7 +1045,7 @@ pip install -r requirements-qwen2vl.txt
 
 [Video NeVA](https://github.com/NVIDIA/NeMo/blob/main/docs/source/multimodal/mllm/video_neva.rst) is a groundbreaking addition to the NeMo Multimodal ecosystem that could work with video modality. This model seamlessly integrates large language-centric models with a vision encoder, that can be deployed in TensorRT-LLM.
 
-1. Generate TRT-LLM engine for Nemotron model following example in `examples/nemotron/README.md`. To adhere to the NVGPT conventions of the conversion script. This will be used as our base LM for inference.
+1. Generate TRT-LLM engine for Nemotron model following example in `examples/models/core/nemotron/README.md`. To adhere to the NVGPT conventions of the conversion script. This will be used as our base LM for inference.
 
 ```bash
 pip install decord # used for loading video
@@ -1094,7 +1094,7 @@ This section explains how to evaluate datasets using our provided script, includ
 To run an evaluation, use the following command:
 
 ```bash
-python ./examples/multimodal/eval.py \
+python ./examples/models/core/multimodal/eval.py \
 --model_type <model_type> \
 --engine_dir <engine_dir> \
 --hf_model_dir <hf_model_dir> \
@@ -43,7 +43,7 @@ Due the non-uniform architecture of the model, the different pipeline parallelis
 
 ## Usage
 
-The TensorRT-LLM example code is located at [examples/nemotron_nas](./).
+The TensorRT-LLM example code is located at [examples/models/core/nemotron_nas](./).
 The `convert_checkpoint.py` script accepts Hugging Face weights as input, and builds the corresponding TensorRT engines.
 The number of TensorRT engines depends on the number of GPUs used to run inference.
 
@@ -30,7 +30,7 @@ LLaVA-NeXT is an extension of LLaVA. TRT-LLM currently supports [Mistral-7b](htt
 # copy the image newlines tensor to engine directory
 cp tmp/trt_models/${MODEL_NAME}/fp16/1-gpu/vision/image_newlines.safetensors tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/vision
 ```
-3. Generate TRT-LLM engine for LLaMA following example in `examples/llama/README.md`
+3. Generate TRT-LLM engine for LLaMA following example in `examples/models/core/llama/README.md`
 
 ```bash
 python ../llama/convert_checkpoint.py \
@@ -319,7 +319,7 @@ def main(args):
 if is_enc_dec:
 logger.warning(
 "This path is an encoder-decoder model. Using different handling.")
-assert not args.use_py_session, "Encoder-decoder models don't have a unified python runtime, please use its own examples/enc_dec/run.py instead."
+assert not args.use_py_session, "Encoder-decoder models don't have a unified python runtime, please use its own examples/models/core/enc_dec/run.py instead."
 
 model_name, model_version = read_model_name(
 args.engine_dir if not is_enc_dec else os.path.
@@ -62,7 +62,7 @@ wget https://huggingface.co/EleutherAI/gpt-j-6b/resolve/main/merges.txt
 ```
 
 2. Convert the Hugging Face checkpoint into TensorRT-LLM format.
-Run below command lines in [`examples/gptj`](../gptj) directory.
+Run below command lines in [`examples/models/contrib/gpt`](../gptj) directory.
 ```bash
 # Build a float16 checkpoint using HF weights.
 python convert_checkpoint.py --model_dir ./gpt-j-6b \
@@ -113,7 +113,7 @@ python3 ../summarize.py --engine_dir ./trt_engines/gptj_fp16_tp1.refit \
 1. Download the llama-7b-hf checkpoint and saved in /llm-models/llama-models/llama-7b-hf/.
 
 2. Calibrate the checkpoint and convert into TensorRT-LLM format.
-Run below command lines in [`examples/llama`](../llama) directory.
+Run below command lines in [`examples/models/core/llama`](../llama) directory.
 ```bash
 # Calibrate INT4 using AMMO.
 python ../quantization/quantize.py --model_dir /llm-models/llama-models/llama-7b-hf/ \
@@ -154,7 +154,7 @@ python3 ../summarize.py --engine_dir trt_int4_AWQ_full_from_wtless \
 1. Download the llama-7b-hf checkpoint and saved in /llm-models/llama-models/llama-7b-hf/.
 
 2. Convert the checkpoint into TensorRT-LLM format.
-Run below command lines in [`examples/llama`](../llama) directory.
+Run below command lines in [`examples/models/core/llama`](../llama) directory.
 ```bash
 python3 convert_checkpoint.py --model_dir /llm-models/llama-models/llama-7b-hf/ \
 --output_dir ./llama-7b-hf-fp16-woq \
@@ -194,7 +194,7 @@ python3 ../summarize.py --engine_dir ./engines/llama-7b-hf-fp16-woq-1gpu-wtless-
 1. Download the llama-v2-70b-hf checkpoint and saved in /llm-models/llama-models-v2/llama-v2-70b-hf/.
 
 2. Calibrate the checkpoint and convert into TensorRT-LLM format.
-Run below command lines in [`examples/llama`](../llama) directory.
+Run below command lines in [`examples/models/core/llama`](../llama) directory.
 ```bash
 # Calibrate FP8 using AMMO.
 python ../quantization/quantize.py --model_dir /llm-models/llama-models-v2/llama-v2-70b-hf/ \
@@ -250,7 +250,7 @@ Building an engine from a pruned checkpoint will also allow the engine to be [re
 #### Pruning a TensorRT-LLM Checkpoint
 
 1. Install [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/blob/main/README.md) either through [pip](https://github.com/NVIDIA/TensorRT-LLM/blob/main/README.md#installation) or [from the source](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/installation/build-from-source-linux.md).
-2. Download a model of your choice and convert it to a TensorRT-LLM checkpoint ([llama instructions](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llama/README.md#usage)).
+2. Download a model of your choice and convert it to a TensorRT-LLM checkpoint ([llama instructions](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/models/core/llama/README.md#usage)).
 3. (Optional) Run the `trtllm-prune` command.
 ```bash
 # Prunes the TRT-LLM checkpoint at ${CHECKPOINT_DIR}, and stores it in the directory ${CHECKPOINT_DIR}.pruned
@@ -515,7 +515,7 @@ class BertForSequenceClassification(BertBase):
 remove_input_padding = default_net().plugin_config.remove_input_padding
 
 # required as explicit input in remove_input_padding mode
-# see examples/bert/run_remove_input_padding.py for how to create them from input_ids and input_lengths
+# see examples/models/core/bert/run_remove_input_padding.py for how to create them from input_ids and input_lengths
 if remove_input_padding:
 assert token_type_ids is not None and \
 position_ids is not None and \
@@ -138,7 +138,7 @@ def install_additional_requirements(python_exe, root_dir):
 "pip",
 "install",
 "-r",
-"examples/recurrentgemma/requirements.txt",
+"examples/models/core/recurrentgemma/requirements.txt",
 ],
 cwd=root_dir,
 env=_os.environ,
@@ -5,7 +5,7 @@ hf_model_dir=$1
 engine_dir=$2
 
 # fake a 1-layer LLaMA model for CI
-python3 ../../examples/llama/build.py \
+python3 ../../examples/models/core/llama/build.py \
 --use_gemm_plugin \
 --enable_context_fmha \
 --use_gpt_attention_plugin \