doc: fix path after examples migration (#3814)

Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Author: Kaiyu Xie
Date: 2025-04-24 02:36:45 +08:00 (committed by GitHub)
parent 635dcdcb9e
commit dfbcb543ce
44 changed files with 139 additions and 137 deletions


@ -10,4 +10,4 @@ examples/**/*.bin
examples/**/*.engine
examples/**/*.onnx
examples/**/c-model
examples/gpt/gpt*
examples/models/core/gpt/gpt*
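Since every hunk in this commit follows the same `examples/<model>` → `examples/models/core/<model>` (or `examples/models/contrib/<model>`) pattern, a repository-wide search is a quick way to confirm that no stale pre-migration paths remain. A minimal sketch, assuming a checkout at the repository root (the directory list and model names below are illustrative, not exhaustive):

```bash
# List Markdown files that still reference the old examples/<model>/ layout.
grep -rnE 'examples/(gpt|gptj|llama|enc_dec|gemma|internlm2)/' \
    --include='*.md' docs/ benchmarks/ examples/ \
    || echo "no stale references found"
```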


@ -197,7 +197,7 @@ Several popular models are pre-defined and can be easily customized or extended
To get started with TensorRT-LLM, visit our documentation:
- [Quick Start Guide](https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html)
- [Running DeepSeek](./examples/deepseek_v3)
- [Running DeepSeek](./examples/models/core/deepseek_v3)
- [Installation Guide for Linux](https://nvidia.github.io/TensorRT-LLM/installation/linux.html)
- [Installation Guide for Grace Hopper](https://nvidia.github.io/TensorRT-LLM/installation/grace-hopper.html)
- [Supported Hardware, Models, and other Software](https://nvidia.github.io/TensorRT-LLM/reference/support-matrix.html)


@ -112,7 +112,7 @@ cd cpp/build
Take GPT-350M as an example for 2-GPU inflight batching
```
mpirun -n 2 ./benchmarks/gptManagerBenchmark \
--engine_dir ../../examples/gpt/trt_engine/gpt2-ib/fp16/2-gpu/ \
--engine_dir ../../examples/models/core/gpt/trt_engine/gpt2-ib/fp16/2-gpu/ \
--request_rate 10 \
--dataset ../../benchmarks/cpp/preprocessed_dataset.json \
--max_num_samples 500
@ -125,7 +125,7 @@ cd cpp/build
Currently encoder-decoder engines only support `--api executor`, `--type IFB`, `--enable_kv_cache_reuse false`, which are all default values, so no specific settings are required.
Prepare t5-small engine from [examples/enc_dec](/examples/enc_dec/README.md#convert-and-split-weights) for the encoder-decoder 4-GPU inflight batching example.
Prepare t5-small engine from [examples/models/core/enc_dec](/examples/models/core/enc_dec/README.md#convert-and-split-weights) for the encoder-decoder 4-GPU inflight batching example.
Prepare the dataset suitable for engine input lengths.
```
@ -147,8 +147,8 @@ cd cpp/build
Run the benchmark
```
mpirun --allow-run-as-root -np 4 ./benchmarks/gptManagerBenchmark \
--encoder_engine_dir ../../examples/enc_dec/tmp/trt_engines/t5-small-4gpu/bfloat16/encoder \
--decoder_engine_dir ../../examples/enc_dec/tmp/trt_engines/t5-small-4gpu/bfloat16/decoder \
--encoder_engine_dir ../../examples/models/core/enc_dec/tmp/trt_engines/t5-small-4gpu/bfloat16/encoder \
--decoder_engine_dir ../../examples/models/core/enc_dec/tmp/trt_engines/t5-small-4gpu/bfloat16/decoder \
--dataset cnn_dailymail.json
```
@ -173,7 +173,7 @@ Datasets with fixed input/output lengths for benchmarking can be generated with
Take GPT-350M as an example for single GPU with static batching
```
./benchmarks/gptManagerBenchmark \
--engine_dir ../../examples/gpt/trt_engine/gpt2/fp16/1-gpu/ \
--engine_dir ../../examples/models/core/gpt/trt_engine/gpt2/fp16/1-gpu/ \
--request_rate -1 \
--static_emulated_batch_size 32 \
--static_emulated_timeout 100 \
@ -213,7 +213,7 @@ CPP_LORA=chinese-llama-2-lora-13b-cpp
EG_DIR=/tmp/lora-eg
# Build lora enabled engine
python examples/llama/convert_checkpoint.py --model_dir ${MODEL_CHECKPOINT} \
python examples/models/core/llama/convert_checkpoint.py --model_dir ${MODEL_CHECKPOINT} \
--output_dir ${CONVERTED_CHECKPOINT} \
--dtype ${DTYPE} \
--tp_size ${TP} \

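For readers reproducing the gptManagerBenchmark runs above, the engine directory (for example `examples/models/core/gpt/trt_engine/gpt2-ib/fp16/2-gpu/`) has to be produced first. A minimal sketch of that step under the new layout; the checkpoint directory name and plugin choice are assumptions, so consult the GPT example README for the authoritative flags:

```bash
cd examples/models/core/gpt
# Convert the Hugging Face GPT-2 checkpoint to TensorRT-LLM format with TP=2.
python3 convert_checkpoint.py --model_dir gpt2 \
    --output_dir ./gpt2_ckpt_tp2 \
    --dtype float16 \
    --tp_size 2
# Build the 2-GPU engine that the benchmark's --engine_dir points at.
trtllm-build --checkpoint_dir ./gpt2_ckpt_tp2 \
    --output_dir ./trt_engine/gpt2-ib/fp16/2-gpu \
    --gemm_plugin float16
```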

@ -59,9 +59,9 @@ The weights and built engines are stored under [cpp/tests/resources/models](reso
To build the engines from the top-level directory:
```bash
PYTHONPATH=examples/gpt:$PYTHONPATH python3 cpp/tests/resources/scripts/build_gpt_engines.py
PYTHONPATH=examples/gptj:$PYTHONPATH python3 cpp/tests/resources/scripts/build_gptj_engines.py
PYTHONPATH=examples/llama:$PYTHONPATH python3 cpp/tests/resources/scripts/build_llama_engines.py
PYTHONPATH=examples/models/core/gpt:$PYTHONPATH python3 cpp/tests/resources/scripts/build_gpt_engines.py
PYTHONPATH=examples/models/contrib/gpt:$PYTHONPATH python3 cpp/tests/resources/scripts/build_gptj_engines.py
PYTHONPATH=examples/models/core/llama:$PYTHONPATH python3 cpp/tests/resources/scripts/build_llama_engines.py
PYTHONPATH=examples/chatglm:$PYTHONPATH python3 cpp/tests/resources/scripts/build_chatglm_engines.py
PYTHONPATH=examples/medusa:$PYTHONPATH python3 cpp/tests/resources/scripts/build_medusa_engines.py
PYTHONPATH=examples/eagle:$PYTHONPATH python3 cpp/tests/resources/scripts/build_eagle_engines.py
@ -71,7 +71,7 @@ PYTHONPATH=examples/redrafter:$PYTHONPATH python3 cpp/tests/resources/scripts/bu
It is possible to build engines with tensor and pipeline parallelism for LLaMA using 4 GPUs.
```bash
PYTHONPATH=examples/llama python3 cpp/tests/resources/scripts/build_llama_engines.py --only_multi_gpu
PYTHONPATH=examples/models/core/llama python3 cpp/tests/resources/scripts/build_llama_engines.py --only_multi_gpu
```
If there is an issue finding model_spec.so in engine building, manually build model_spec.so by

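The `PYTHONPATH=...` prefix in the commands above simply makes the per-model `convert_checkpoint.py` importable for the test script; an equivalent two-step form of the multi-GPU LLaMA line, as a sketch under the new layout:

```bash
export PYTHONPATH=examples/models/core/llama:$PYTHONPATH
python3 cpp/tests/resources/scripts/build_llama_engines.py --only_multi_gpu
```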

@ -30,7 +30,9 @@ import tensorrt_llm.bindings as _tb
def get_ckpt_without_quatization(model_dir, output_dir):
build_args = [_sys.executable, "examples/gptj/convert_checkpoint.py"] + [
build_args = [
_sys.executable, "examples/models/contrib/gpt/convert_checkpoint.py"
] + [
'--model_dir={}'.format(model_dir),
'--output_dir={}'.format(output_dir),
]


@ -17,7 +17,7 @@ LLaMA, for example.
Complete support of encoder-decoder models, like T5, will be added to
TensorRT-LLM in a future release. An experimental version, only in Python for
now, can be found in the [`examples/enc_dec`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/enc_dec) folder.
now, can be found in the [`examples/models/core/enc_dec`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/enc_dec) folder.
## Overview


@ -9,7 +9,7 @@ git-lfs clone https://huggingface.co/qychen/luotuo-lora-7b-0.1
git-lfs clone https://huggingface.co/kunishou/Japanese-Alpaca-LoRA-7b-v0
BASE_MODEL=llama-7b-hf
python examples/llama/convert_checkpoint.py --model_dir ${BASE_MODEL} \
python examples/models/core/llama/convert_checkpoint.py --model_dir ${BASE_MODEL} \
--output_dir /tmp/llama_7b/trt_ckpt/fp16/1-gpu/ \
--dtype float16


@ -12,7 +12,7 @@ Here is an example to run llama-7b with Weight Streaming:
```bash
# Convert model as normal. Assume hugging face model is in llama-7b-hf/
python3 examples/llama/convert_checkpoint.py \
python3 examples/models/core/llama/convert_checkpoint.py \
--model_dir llama-7b-hf/ \
--output_dir /tmp/llama_7b/trt_ckpt/fp16/1-gpu/ \
--dtype float16


@ -103,7 +103,7 @@ class Linear(Module):
self.weight = Parameter(shape=(self.out_features, self.in_features), dtype=dtype)
self.bias = Parameter(shape=(self.out_features, ), dtype=dtype)
# The parameters are bound to the weights before compiling the model. See examples/gpt/weight.py:
# The parameters are bound to the weights before compiling the model. See examples/models/core/gpt/weight.py:
tensorrt_llm_gpt.layers[i].mlp.fc.weight.value = fromfile(...)
tensorrt_llm_gpt.layers[i].mlp.fc.bias.value = fromfile(...)
```
@ -277,7 +277,7 @@ max_output_len=128
max_batch_size=4
workers=$(( tp_size * pp_size ))
python ${folder_trt_llm}/examples/llama/convert_checkpoint.py \
python ${folder_trt_llm}/examples/models/core/llama/convert_checkpoint.py \
--output_dir ${ckpt_dir} \
--model_dir ${model_dir} \
--dtype ${dtype} \
@ -329,7 +329,7 @@ max_output_len=128
max_batch_size=4
workers=8
python ${folder_trt_llm}/examples/llama/convert_checkpoint.py \
python ${folder_trt_llm}/examples/models/core/llama/convert_checkpoint.py \
--output_dir ${ckpt_dir} \
--model_dir ${model_dir} \
--dtype ${dtype} \

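The multi-GPU conversion above computes `workers=$(( tp_size * pp_size ))`; a hypothetical full invocation that uses it might look as follows (the `--workers`, `--tp_size`, and `--pp_size` flags are assumptions beyond what the hunks above show):

```bash
python ${folder_trt_llm}/examples/models/core/llama/convert_checkpoint.py \
    --output_dir ${ckpt_dir} \
    --model_dir ${model_dir} \
    --dtype ${dtype} \
    --tp_size ${tp_size} \
    --pp_size ${pp_size} \
    --workers ${workers}
```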

@ -48,7 +48,7 @@ class LLaMAForCausalLM (DecoderModelForCausalLM):
Then, in the convert_checkpoint.py script in the
[`examples/llama/`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama/) directory of the GitHub repo,
[`examples/models/core/llama/`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama/) directory of the GitHub repo,
the logic can be greatly simplified. Even if the model definition code of the TensorRT-LLM LLaMA class changes for some reason, the `from_hugging_face` API will stay the same, so existing workflows built on this interface will not be affected.
@ -68,7 +68,7 @@ In the 0.9 release, only LLaMA is refactored. Since popular LLaMA (and its varia
In future releases, there might be `from_jax`, `from_nemo`, `from_keras` or other factory methods for different training checkpoints added.
For example, the Gemma 2B model and the convert_checkpoint.py file in the [`examples/gemma`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/gemma/)
For example, the Gemma 2B model and the convert_checkpoint.py file in the [`examples/models/core/gemma`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/gemma/)
directory support JAX and Keras formats in addition to Hugging Face. The model developers can choose to implement **any subset** of these factory methods for the models they contributed to TensorRT-LLM.


@ -117,7 +117,7 @@ These improvements will be published in the `main` branch soon, and will be
included in the v0.7 & v0.8 releases.
Similar examples running Llama-70B in TensorRT-LLM are published in
[examples/llama](/examples/llama).
[examples/models/core/llama](/examples/models/core/llama).
For more information about H200, please see the [H200 announcement blog](./H200launch.md).


@ -61,7 +61,7 @@ Using this model is subject to a [particular](https://ai.meta.com/resources/mode
There are two ways to build a TensorRT-LLM engine:
1. You can build the TensorRT-LLM engine from the Hugging Face model directly with the [`trtllm-build`](../commands/trtllm-build.rst) tool and then save the engine to disk for later use.
Refer to the [README](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama) in the [`examples/llama`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama) repository on GitHub.
Refer to the [README](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama) in the [`examples/models/core/llama`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama) repository on GitHub.
After the engine building is finished, we can load the model:


@ -655,7 +655,7 @@ To prepare a dataset, follow the same process as specified in [](#preparing-a-da
To quantize the checkpoint:
```shell
cd tensorrt_llm/examples/llama
cd tensorrt_llm/examples/models/core/llama
python ../quantization/quantize.py \
--model_dir $checkpoint_dir \
--dtype bfloat16 \


@ -73,10 +73,10 @@ if __name__ == '__main__':
TensorRT-LLM also has a command line interface for building and saving engines. This workflow consists of two steps
1. Convert model checkpoint (HuggingFace, Nemo) to TensorRT-LLM checkpoint via `convert_checkpoint.py`. Each supported model has a `convert_checkpoint.py` associated it with it and can be found in the examples folder. For example, the `convert_checkpoint.py` script for Llama models can be found [here](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama/convert_checkpoint.py)
1. Convert model checkpoint (HuggingFace, Nemo) to TensorRT-LLM checkpoint via `convert_checkpoint.py`. Each supported model has a `convert_checkpoint.py` associated it with it and can be found in the examples folder. For example, the `convert_checkpoint.py` script for Llama models can be found [here](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama/convert_checkpoint.py)
2. Build engine by passing TensorRT-LLM checkpoint to `trtllm-build` command. The `trtllm-build` command is installed automatically when the `tensorrt_llm` package is installed.
The README in the examples folder for supported models walks through building engines using this flow for a wide variety of situations. The examples folder for Llama models can be found at [https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama).
The README in the examples folder for supported models walks through building engines using this flow for a wide variety of situations. The examples folder for Llama models can be found at [https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama).
## Benchmarking with `trtllm-bench`


@ -42,7 +42,7 @@ The `LLM` class takes `tensor_parallel_size` and `pipeline_parallel_size` as par
If you are using the [CLI flow for building engines](./benchmarking-default-performance.md#building-and-saving-engines-via-cli), you can specify tensor parallelism and pipeline parallelism by providing the `--tp_size` and `--pp_size` arguments to `convert_checkpoint.py`
```
python examples/llama/convert_checkpoint.py --model_dir ./tmp/llama/405B/ \
python examples/models/core/llama/convert_checkpoint.py --model_dir ./tmp/llama/405B/ \
--output_dir ./tllm_checkpoint_16gpu_tp8_pp2 \
--dtype float16 \
--tp_size 8
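The checkpoint directory name `tllm_checkpoint_16gpu_tp8_pp2` implies a tensor-parallel size of 8 and a pipeline-parallel size of 2; a sketch of the full split (the `--pp_size 2` flag is an assumption inferred from that name):

```bash
python examples/models/core/llama/convert_checkpoint.py --model_dir ./tmp/llama/405B/ \
    --output_dir ./tllm_checkpoint_16gpu_tp8_pp2 \
    --dtype float16 \
    --tp_size 8 \
    --pp_size 2
```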


@ -52,7 +52,7 @@ if __name__ == '__main__':
main()
```
For an example of how to build an fp8 engine using the [TensorRT-LLM CLI workflow](./benchmarking-default-performance.md#building-and-saving-engines-via-cli) flow see [TensorRT-LLM LLaMA examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama). In short you first run [`examples/quantization/quantize.py`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization) to quantize and convert the model checkpoint to TensorRT-LLM format and then use `trtllm-build`.
For an example of how to build an fp8 engine using the [TensorRT-LLM CLI workflow](./benchmarking-default-performance.md#building-and-saving-engines-via-cli) flow see [TensorRT-LLM LLaMA examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama). In short you first run [`examples/quantization/quantize.py`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization) to quantize and convert the model checkpoint to TensorRT-LLM format and then use `trtllm-build`.
> ***Note: While quantization aims to preserve model accuracy this is not guaranteed and it is extremely important you check that the quality of outputs remains sufficient after quantization.***


@ -92,7 +92,7 @@ For examples and command syntax, refer to the [trtllm-serve](commands/trtllm-ser
(quick-start-guide-compile)=
### Compile the Model into a TensorRT Engine
Use the [Llama model definition](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama) from the `examples/llama` directory of the GitHub repository.
Use the [Llama model definition](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama) from the `examples/models/core/llama` directory of the GitHub repository.
The model definition is a minimal example that shows some of the optimizations available in TensorRT-LLM.
```console
@ -104,7 +104,7 @@ make -C docker release_run LOCAL_USER=1
huggingface-cli login --token *****
# Convert the model into TensorRT-LLM checkpoint format
cd examples/llama
cd examples/models/core/llama
pip install -r requirements.txt
pip install --upgrade transformers # Llama 3.1 requires transformer 4.43.0+ version.
python3 convert_checkpoint.py --model_dir Meta-Llama-3.1-8B-Instruct --output_dir llama-3.1-8b-ckpt
@ -117,7 +117,7 @@ trtllm-build --checkpoint_dir llama-3.1-8b-ckpt \
When you create a model definition with the TensorRT-LLM API, you build a graph of operations from [NVIDIA TensorRT](https://developer.nvidia.com/tensorrt) primitives that form the layers of your neural network. These operations map to specific kernels: prewritten programs for the GPU.
In this example, we included the `gpt_attention` plugin, which implements a FlashAttention-like fused attention kernel, and the `gemm` plugin, that performs matrix multiplication with FP32 accumulation. We also called out the desired precision for the full model as FP16, matching the default precision of the weights that you downloaded from Hugging Face. For more information about plugins and quantizations, refer to the [Llama example](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama) and {ref}`precision` section.
In this example, we included the `gpt_attention` plugin, which implements a FlashAttention-like fused attention kernel, and the `gemm` plugin, that performs matrix multiplication with FP32 accumulation. We also called out the desired precision for the full model as FP16, matching the default precision of the weights that you downloaded from Hugging Face. For more information about plugins and quantizations, refer to the [Llama example](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama) and {ref}`precision` section.
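A sketch of how those plugin choices map onto `trtllm-build` flags, reusing the checkpoint converted above (the engine output directory name and the per-version defaults are assumptions):

```bash
trtllm-build --checkpoint_dir llama-3.1-8b-ckpt \
    --gpt_attention_plugin float16 \
    --gemm_plugin float16 \
    --output_dir llama-3.1-8b-engine
```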
### Run the Model


@ -85,8 +85,8 @@ The activations are encoded using floating-point values (FP16 or BF16).
To use INT4/INT8 Weight-Only methods, the user must determine the scaling
factors to use to quantize and dequantize the weights of the model.
This release includes examples for [GPT](source:examples/gpt) and
[LLaMA](source:examples/llama).
This release includes examples for [GPT](source:examples/models/core/gpt) and
[LLaMA](source:examples/models/core/llama).
## GPTQ and AWQ (W4A16)
@ -101,9 +101,9 @@ plugin and the corresponding
[`weight_only_groupwise_quant_matmul`](source:tensorrt_llm/quantization/functional.py)
Python function, for details.
This release includes examples of applying GPTQ to [GPT-NeoX](source:examples/gpt)
and [LLaMA-v2](source:examples/llama), as well as an example of using AWQ with
[GPT-J](source:examples/gptj). Those examples are experimental implementations and
This release includes examples of applying GPTQ to [GPT-NeoX](source:examples/models/core/gpt)
and [LLaMA-v2](source:examples/models/core/llama), as well as an example of using AWQ with
[GPT-J](source:examples/models/contrib/gpt). Those examples are experimental implementations and
are likely to evolve in a future release.
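As a concrete illustration of the INT8 weight-only path described above, a minimal sketch for LLaMA under the new example layout (the `--use_weight_only`/`--weight_only_precision` flags and the directory names are assumptions based on the standard `convert_checkpoint.py` interface):

```bash
cd examples/models/core/llama
# Quantize weights to INT8 while keeping FP16 activations.
python convert_checkpoint.py --model_dir ./llama-7b-hf \
    --output_dir ./llama-7b-int8-woq-ckpt \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int8
trtllm-build --checkpoint_dir ./llama-7b-int8-woq-ckpt \
    --output_dir ./llama-7b-int8-woq-engine \
    --gemm_plugin float16
```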
## FP8 (Hopper)


@ -10,69 +10,69 @@ TensorRT-LLM optimizes the performance of a range of well-known models on NVIDIA
- [Arctic](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/arctic)
- [Baichuan/Baichuan2](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/baichuan)
- [BART](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/enc_dec)
- [BERT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/bert)
- [BART](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/enc_dec)
- [BERT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/bert)
- [BLOOM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/bloom)
- [ByT5](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/enc_dec)
- [ByT5](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/enc_dec)
- [GLM/ChatGLM/ChatGLM2/ChatGLM3/GLM-4](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/chatglm)
- [Code LLaMA](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama)
- [Code LLaMA](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama)
- [DBRX](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/dbrx)
- [Exaone](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/exaone)
- [FairSeq NMT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/enc_dec)
- [Exaone](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/exaone)
- [FairSeq NMT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/enc_dec)
- [Falcon](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/falcon)
- [Flan-T5](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/enc_dec) [^encdec]
- [Gemma/Gemma2](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/gemma)
- [GPT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/gpt)
- [GPT-J](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/gptj)
- [GPT-Nemo](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/gpt)
- [GPT-NeoX](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/gptneox)
- [Granite-3.0](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/granite)
- [Flan-T5](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/enc_dec) [^encdec]
- [Gemma/Gemma2](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/gemma)
- [GPT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/gpt)
- [GPT-J](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/contrib/gpt)
- [GPT-Nemo](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/gpt)
- [GPT-NeoX](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/gptneox)
- [Granite-3.0](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/granite)
- [Grok-1](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/grok)
- [InternLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/internlm)
- [InternLM2](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/internlm2)
- [LLaMA/LLaMA 2/LLaMA 3/LLaMA 3.1](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama)
- [Mamba](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/mamba)
- [mBART](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/enc_dec)
- [Minitron](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/nemotron)
- [Mistral](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama)
- [Mistral NeMo](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama)
- [Mixtral](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/mixtral)
- [InternLM2](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/internlm2)
- [LLaMA/LLaMA 2/LLaMA 3/LLaMA 3.1](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama)
- [Mamba](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/mamba)
- [mBART](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/enc_dec)
- [Minitron](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/nemotron)
- [Mistral](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama)
- [Mistral NeMo](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama)
- [Mixtral](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/mixtral)
- [MPT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/mpt)
- [Nemotron](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/nemotron)
- [mT5](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/enc_dec)
- [Nemotron](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/nemotron)
- [mT5](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/enc_dec)
- [OPT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/opt)
- [Phi-1.5/Phi-2/Phi-3](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/phi)
- [Qwen/Qwen1.5/Qwen2](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/qwen)
- [Qwen-VL](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/qwenvl)
- [RecurrentGemma](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/recurrentgemma)
- [Phi-1.5/Phi-2/Phi-3](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/phi)
- [Qwen/Qwen1.5/Qwen2](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/qwen)
- [Qwen-VL](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/qwenvl)
- [RecurrentGemma](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/recurrentgemma)
- [Replit Code](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/mpt) [^replitcode]
- [RoBERTa](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/bert)
- [SantaCoder](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/gpt)
- [RoBERTa](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/bert)
- [SantaCoder](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/gpt)
- [Skywork](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/skywork)
- [Smaug](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/smaug)
- [StarCoder](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/gpt)
- [T5](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/enc_dec)
- [Whisper](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/whisper)
- [StarCoder](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/gpt)
- [T5](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/enc_dec)
- [Whisper](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/whisper)
### Multi-Modal Models [^multimod]
- [BLIP2 w/ OPT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal)
- [BLIP2 w/ T5](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal)
- [CogVLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal) [^bf16only]
- [Deplot](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal)
- [Fuyu](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal)
- [Kosmos](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal)
- [LLaVA-v1.5](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal)
- [LLaVa-Next](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal)
- [LLaVa-OneVision](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal)
- [NeVA](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal)
- [Nougat](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal)
- [Phi-3-vision](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal)
- [Video NeVA](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal)
- [VILA](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal)
- [MLLaMA](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal)
- [LLama 3.2 VLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal)
- [BLIP2 w/ OPT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/multimodal)
- [BLIP2 w/ T5](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/multimodal)
- [CogVLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/multimodal) [^bf16only]
- [Deplot](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/multimodal)
- [Fuyu](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/multimodal)
- [Kosmos](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/multimodal)
- [LLaVA-v1.5](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/multimodal)
- [LLaVa-Next](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/multimodal)
- [LLaVa-OneVision](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/multimodal)
- [NeVA](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/multimodal)
- [Nougat](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/multimodal)
- [Phi-3-vision](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/multimodal)
- [Video NeVA](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/multimodal)
- [VILA](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/multimodal)
- [MLLaMA](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/multimodal)
- [LLama 3.2 VLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/multimodal)
(support-matrix-hardware)=


@ -76,7 +76,7 @@ Here is an example to print the values of the MLP output tensor in the GPT model
Enable the `--enable_debug_output` option when building engines with `trtllm-build`
```bash
cd examples/gpt
cd examples/models/core/gpt
# Download hf gpt2 model
rm -rf gpt2 && git clone https://huggingface.co/gpt2-medium gpt2
@ -323,6 +323,6 @@ As a rule of thumb, if you are running TensorRT-LLM interactively on a Slurm
node, prefix your commands with `mpirun -n 1` to run TensorRT-LLM in a
dedicated MPI environment, not the one provided by your Slurm allocation.
For example: `mpirun -n 1 python3 examples/gpt/build.py ...`
For example: `mpirun -n 1 python3 examples/models/core/gpt/build.py ...`
It's critical that it's always `-n 1`, regardless of how many GPUs are being used; using `-n 2` for a 2-GPU program will not work. `mpirun` here isn't being used to orchestrate multiple processes, but to invoke the right environment on Slurm. The internal MPI implementation deals with spawning the additional processes.


@ -43,14 +43,14 @@ Each request can specify a Lookahead configuration, noted as `(w, n, g)`. If non
### Convert Checkpoint
This example is based on the Vicuna-7b v1.3 model, a fine-tuned Llama model.
Checkpoint conversion is similar to any standard autoregressive model, such as the models located in the [examples/llama](../../examples/llama) directory.
Checkpoint conversion is similar to any standard autoregressive model, such as the models located in the [examples/models/core/llama](../../examples/models/core/llama) directory.
```bash
MODEL_DIR=/path/to/vicuna-7b-v1.3
ENGINE_DIR=tmp/engine
CKPT_DIR=tmp/engine/ckpt
python3 examples/llama/convert_checkpoint.py \
python3 examples/models/core/llama/convert_checkpoint.py \
--model_dir=$MODEL_DIR \
--output_dir=$CKPT_DIR \
--dtype=float16 \


@ -4,7 +4,7 @@ This document shows how to build and run a [Arctic](https://huggingface.co/Snowf
The TensorRT-LLM Arctic implementation is based on the LLaMA model, with Mixture of Experts (MoE) enabled. The implementation can
be found in [llama/model.py](../../../../tensorrt_llm/models/llama/model.py).
See the LLaMA example [`examples/llama`](../../../llama) for details.
See the LLaMA example [`examples/models/core/llama`](../../../llama) for details.
- [Arctic](#arctic)
- [Download model checkpoints](#download-model-checkpoints)
@ -80,7 +80,7 @@ Test your engine with the [run.py](../run.py) script:
mpirun -n ${TP} --allow-run-as-root python ../../../run.py --engine_dir ./tmp/trt_engines/${ENGINE} --tokenizer_dir tmp/hf_checkpoints/${HF_MODEL} --max_output_len 20 --input_text "The future of AI is" |& tee tmp/trt_engines/${ENGINE}_run.log
```
For more examples see [`examples/llama/README.md`](../../../llama/README.md)
For more examples see [`examples/models/core/llama/README.md`](../../../llama/README.md)
### OOTB


@ -29,7 +29,7 @@ This document explains how to build the [ChatGLM-6B](https://huggingface.co/THUD
The TensorRT-LLM ChatGLM implementation can be found in [`tensorrt_llm/models/chatglm/model.py`](../../tensorrt_llm/models/chatglm/model.py).
The TensorRT-LLM ChatGLM example code is located in [`examples/models/contrib/chatglm-6b`](./). There is one main file:
* [`examples/glm-4-9b/convert_checkpoint.py`](../../../glm-4-9b/convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT-LLM format.
* [`examples/models/core/glm-4-9b/convert_checkpoint.py`](../../../glm-4-9b/convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT-LLM format.
In addition, there are two shared files in the parent folder [`examples`](../../../) for inference and evaluation:
@ -103,4 +103,4 @@ cp chatglm_6b/tokenization_chatglm.py chatglm_6b/tokenization_chatglm.py-backup
cp tokenization_chatglm.py chatglm_6b
```
For more example codes, please refer to the [examples/glm-4-9b/README.md](../../../glm-4-9b/README.md).
For more example codes, please refer to the [examples/models/core/glm-4-9b/README.md](../../../glm-4-9b/README.md).


@ -29,7 +29,7 @@ This document explains how to build the [ChatGLM2-6B](https://huggingface.co/THU
The TensorRT-LLM ChatGLM implementation can be found in [`tensorrt_llm/models/chatglm/model.py`](../../tensorrt_llm/models/chatglm/model.py).
The TensorRT-LLM ChatGLM example code is located in [`examples/models/contrib/chatglm2-6b`](./). There is one main file:
* [`examples/glm-4-9b/convert_checkpoint.py`](../../../glm-4-9b/convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT-LLM format.
* [`examples/models/core/glm-4-9b/convert_checkpoint.py`](../../../glm-4-9b/convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT-LLM format.
In addition, there are two shared files in the parent folder [`examples`](../../../) for inference and evaluation:
@ -99,4 +99,4 @@ git clone https://huggingface.co/THUDM/chatglm2-6b chatglm2_6b
git clone https://huggingface.co/THUDM/chatglm2-6b-32k chatglm2_6b_32k
```
For more example codes, please refer to the [examples/glm-4-9b/README.md](../../../glm-4-9b/README.md).
For more example codes, please refer to the [examples/models/core/glm-4-9b/README.md](../../../glm-4-9b/README.md).


@ -29,7 +29,7 @@ This document explains how to build the [ChatGLM3-6B](https://huggingface.co/THU
The TensorRT-LLM ChatGLM implementation can be found in [`tensorrt_llm/models/chatglm/model.py`](../../tensorrt_llm/models/chatglm/model.py).
The TensorRT-LLM ChatGLM example code is located in [`examples/models/contrib/chatglm3-6b-32k`](./). There is one main file:
* [`examples/glm-4-9b/convert_checkpoint.py`](../../../glm-4-9b/convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT-LLM format.
* [`examples/models/core/glm-4-9b/convert_checkpoint.py`](../../../glm-4-9b/convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT-LLM format.
In addition, there are two shared files in the parent folder [`examples`](../../../) for inference and evaluation:
@ -103,4 +103,4 @@ git clone https://huggingface.co/THUDM/chatglm3-6b-base chatglm3_6b_base
git clone https://huggingface.co/THUDM/chatglm3-6b-32k chatglm3_6b_32k
```
For more example codes, please refer to the [examples/glm-4-9b/README.md](../../../glm-4-9b/README.md).
For more example codes, please refer to the [examples/models/core/glm-4-9b/README.md](../../../glm-4-9b/README.md).


@ -59,7 +59,7 @@ Here're some examples:
# Build a single-GPU float16 engine from HF weights.
# gpt_attention_plugin is necessary in InternLM.
# Try use_gemm_plugin to prevent accuracy issue.
cd examples/llama
cd examples/models/core/llama
# Convert the InternLM 7B model using a single GPU and FP16.
python convert_checkpoint.py --model_dir ./internlm-chat-7b/ \
@ -117,7 +117,7 @@ and then export the scaling factors needed for INT8 KV cache inference.
Example:
```bash
cd examples/llama
cd examples/models/core/llama
# For 7B models
python convert_checkpoint.py --model_dir ./internlm-chat-7b \
@ -135,7 +135,7 @@ trtllm-build --checkpoint_dir ./internlm-chat-7b/smooth_internlm/int8_kv_cache/
```bash
cd examples/llama
cd examples/models/core/llama
# For 20B models
python convert_checkpoint.py --model_dir ./internlm-chat-20b \
@ -182,7 +182,7 @@ Unlike the FP16 build where the HF weights are processed and loaded into the Ten
Example:
```bash
cd examples/llama
cd examples/models/core/llama
# For 7B models
python convert_checkpoint.py --model_dir ./internlm-chat-7b --output_dir ./internlm-chat-7b/smooth_internlm/sq0.5/ --dtype float16 --smoothquant 0.5
@ -192,7 +192,7 @@ trtllm-build --checkpoint_dir ./internlm-chat-7b/smooth_internlm/sq0.5/ \
--gemm_plugin float16
# For 20B models
cd examples/llama
cd examples/models/core/llama
python convert_checkpoint.py --model_dir ./internlm-chat-20b --output_dir ./internlm-chat-20b/smooth_internlm/sq0.5/ --dtype float16 --smoothquant 0.5
trtllm-build --checkpoint_dir ./internlm-chat-20b/smooth_internlm/sq0.5/ \
@ -211,7 +211,7 @@ Examples of build invocations:
```bash
# Build model for SmoothQuant in the _per_token_ + _per_channel_ mode
cd examples/llama
cd examples/models/core/llama
# 7B model
python convert_checkpoint.py --model_dir ./internlm-chat-7b --output_dir ./internlm-chat-7b/smooth_internlm/sq0.5/ --dtype float16 --smoothquant 0.5 --per_channel --per_token
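Putting the SmoothQuant steps above together, an end-to-end sketch for the 7B model in per-token + per-channel mode, assembled from the commands shown in this file:

```bash
cd examples/models/core/llama
# Quantize and convert the HF checkpoint with SmoothQuant (alpha = 0.5).
python convert_checkpoint.py --model_dir ./internlm-chat-7b \
    --output_dir ./internlm-chat-7b/smooth_internlm/sq0.5/ \
    --dtype float16 --smoothquant 0.5 --per_channel --per_token
# Build the engine from the quantized checkpoint.
trtllm-build --checkpoint_dir ./internlm-chat-7b/smooth_internlm/sq0.5/ \
    --gemm_plugin float16
```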


@ -39,7 +39,7 @@ git clone https://huggingface.co/Skywork/Skywork-13B-base
### 2. Convert HF Model to TRT Checkpoint
```bash
cd examples/llama
cd examples/models/core/llama
# fp16 model
python3 convert_checkpoint.py --model_dir ./Skywork-13B-base \


@ -5,7 +5,7 @@ This document explains how to build the BERT family, specifically [BERT](https:/
## Overview
The TensorRT-LLM BERT family implementation can be found in [`tensorrt_llm/models/bert/model.py`](../../../../tensorrt_llm/models/bert/model.py).
The TensorRT-LLM BERT family example code is located in [`examples/bert`](./). There are two main files in that folder:
The TensorRT-LLM BERT family example code is located in [`examples/models/core/bert`](./). There are two main files in that folder:
* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert the BERT model into tensorrt-llm checkpoint format.
* [`run.py`](./run.py) to run the inference on an input text,


@ -19,7 +19,7 @@ This document explains how to build the [C4AI Command-R](https://huggingface.co/
## Overview
The TensorRT-LLM Command-R implementation can be found in [`tensorrt_llm/models/commandr/model.py`](../../../../tensorrt_llm/models/commandr/model.py).
The TensorRT-LLM Command-R example code is located in [`examples/commandr`](./). There is one main file:
The TensorRT-LLM Command-R example code is located in [`examples/models/core/commandr`](./). There is one main file:
* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT-LLM format.


@ -27,7 +27,7 @@ This document shows how to build and run an Encoder-Decoder (Enc-Dec) model in T
## Overview
The TensorRT-LLM Enc-Dec implementation can be found in [tensorrt_llm/models/enc_dec/model.py](../../../../tensorrt_llm/models/enc_dec/model.py). The TensorRT-LLM Enc-Dec example code is located in [`examples/enc_dec`](./):
The TensorRT-LLM Enc-Dec implementation can be found in [tensorrt_llm/models/enc_dec/model.py](../../../../tensorrt_llm/models/enc_dec/model.py). The TensorRT-LLM Enc-Dec example code is located in [`examples/models/core/enc_dec`](./):
* `trtllm-build` to build the [TensorRT](https://developer.nvidia.com/tensorrt) engine(s) needed to run the Enc-Dec model,
* [`run.py`](./run.py) to run the inference on an example input text.
@ -35,7 +35,7 @@ The TensorRT-LLM Enc-Dec implementation can be found in [tensorrt_llm/models/enc
* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert weights from HuggingFace or FairSeq format to TRT-LLM format, and split weights for multi-GPU inference,
## Usage
The TensorRT-LLM Enc-Dec example code locates at [examples/enc_dec](./). It takes HuggingFace or FairSeq model name as input, and builds the corresponding TensorRT engines. On each GPU, there will be two TensorRT engines, one for Encoder and one for Decoder.
The TensorRT-LLM Enc-Dec example code locates at [examples/models/core/enc_dec](./). It takes HuggingFace or FairSeq model name as input, and builds the corresponding TensorRT engines. On each GPU, there will be two TensorRT engines, one for Encoder and one for Decoder.
## Encoder-Decoder Model Support
@ -225,7 +225,7 @@ For pure C++ runtime, there is no example given yet. Please check the [`Executor
#### Run Python runtime
For pure Python runtime, you can still use the encoder-decoder specific script under `examples/enc_dec/`.
For pure Python runtime, you can still use the encoder-decoder specific script under `examples/models/core/enc_dec/`.
```bash
# Inferencing w/ single GPU greedy search, compare results with HuggingFace FP32


@ -3,7 +3,7 @@
This document shows how to build and run a [EXAONE](https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct) model in TensorRT-LLM.
The TensorRT-LLM EXAONE implementation is based on the LLaMA model. The implementation can be found in [llama/model.py](../../../../tensorrt_llm/models/llama/model.py).
See the LLaMA example [`examples/llama`](../llama) for details.
See the LLaMA example [`examples/models/core/llama`](../llama) for details.
- [EXAONE](#exaone)
- [Support Matrix](#support-matrix)
@ -211,4 +211,4 @@ python ../../../summarize.py \
--engine_dir trt_engines/exaone/fp16/1-gpu
```
For more examples see [`examples/llama/README.md`](../llama/README.md)
For more examples see [`examples/models/core/llama/README.md`](../llama/README.md)


@ -601,7 +601,7 @@ UNIFIED_CKPT_PATH=/tmp/checkpoints/tmp_$variant_it_tensorrt_llm/bf16/tp1/
ENGINE_PATH=/tmp/gemma2/$variant/bf16/1-gpu/
VOCAB_FILE_PATH=gemma-2-$variant-it/tokenizer.model
python3 ./examples/gemma/convert_checkpoint.py \
python3 ./examples/models/core/gemma/convert_checkpoint.py \
--ckpt-type hf \
--model-dir ${CKPT_PATH} \
--dtype bfloat16 \

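A hypothetical follow-up build step for the converted Gemma checkpoint, reusing the variables defined above; treating `${UNIFIED_CKPT_PATH}` as the conversion output and the plugin choice as assumptions:

```bash
trtllm-build --checkpoint_dir ${UNIFIED_CKPT_PATH} \
    --output_dir ${ENGINE_PATH} \
    --gemm_plugin bfloat16
```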

@ -27,7 +27,7 @@ This document explains how to build the [glm-4-9b](https://huggingface.co/THUDM/
## Overview
The TensorRT-LLM ChatGLM implementation can be found in [`tensorrt_llm/models/chatglm/model.py`](../../../../tensorrt_llm/models/chatglm/model.py).
The TensorRT-LLM ChatGLM example code is located in [`examples/glm-4-9b`](./). There is one main file:
The TensorRT-LLM ChatGLM example code is located in [`examples/models/core/glm-4-9b`](./). There is one main file:
* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT-LLM format.


@ -6,7 +6,7 @@ This document shows how to build and run InternLM2 7B / 20B models in TensorRT-L
The TensorRT-LLM InternLM2 implementation is based on the LLaMA model. The implementation can
be found in [model.py](../../../../tensorrt_llm/models/llama/model.py).
The TensorRT-LLM InternLM2 example code lies in [`examples/internlm2`](./):
The TensorRT-LLM InternLM2 example code lies in [`examples/models/core/internlm2`](./):
* [`convert_checkpoint.py`](./convert_checkpoint.py) converts the Huggingface Model of InternLM2 into TensorRT-LLM checkpoint.


@ -231,7 +231,7 @@ commands still work.
Note that the `rope_theta` and `vocab_size` are larger in LLaMA v3 models and these values are now inferred
or picked up from the `params.json` when using the `meta_ckpt_dir`.
LLaMA 3.2 models are also supported now. For text only model like [Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B), the steps are same to v3.0. For vision model like [Llama-3.2-11B-Vision](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision), please refer to the [examples/mllama/README.md](../mllama/README.md)
LLaMA 3.2 models are also supported now. For text only model like [Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B), the steps are same to v3.0. For vision model like [Llama-3.2-11B-Vision](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision), please refer to the [examples/models/core/mllama/README.md](../mllama/README.md)
```bash
# Build LLaMA v3 8B TP=1 using HF checkpoints directly.


@ -1,3 +1,3 @@
# MLLaMA (llama-3.2 Vision model)
MLLaMA is a multimodal model, and reuse the multimodal modules in [examples/multimodal](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal)
MLLaMA is a multimodal model, and reuse the multimodal modules in [examples/models/core/multimodal](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/multimodal)


@ -256,7 +256,7 @@ Currently, CogVLM only support bfloat16 precision.
## Deplot
1. Download Huggingface weights and convert original checkpoint to TRT-LLM checkpoint format
following example in `examples/enc_dec/README.md`.
following example in `examples/models/core/enc_dec/README.md`.
```bash
export MODEL_NAME="deplot"
@ -320,7 +320,7 @@ Currently, CogVLM only support bfloat16 precision.
git clone https://huggingface.co/adept/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
```
2. Convert Huggingface weights into TRT-LLM checkpoints and build TRT engines using scripts in `examples/gpt`.
2. Convert Huggingface weights into TRT-LLM checkpoints and build TRT engines using scripts in `examples/models/core/gpt`.
The LLM portion of Fuyu uses a Persimmon model
```bash
python ../gpt/convert_checkpoint.py \
@ -489,7 +489,7 @@ Firstly, please install transformers with 4.37.2
git clone https://huggingface.co/microsoft/kosmos-2-patch14-224 tmp/hf_models/${MODEL_NAME}
```
2. Convert Huggingface weights into TRT-LLM checkpoints and build TRT engines using scripts in `examples/gpt`.
2. Convert Huggingface weights into TRT-LLM checkpoints and build TRT engines using scripts in `examples/models/core/gpt`.
```bash
python ../gpt/convert_checkpoint.py \
--model_dir tmp/hf_models/${MODEL_NAME} \
@ -560,7 +560,7 @@ Firstly, please install transformers with 4.37.2
git clone https://huggingface.co/Efficient-Large-Model/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
```
2. Generate TRT-LLM engine for LLaMA following example in `examples/llama/README.md` and `examples/qwen/README.md`
2. Generate TRT-LLM engine for LLaMA following example in `examples/models/core/llama/README.md` and `examples/models/core/qwen/README.md`
```bash
python ../llama/convert_checkpoint.py \
@ -725,7 +725,7 @@ Firstly, please install transformers with 4.37.2
This section shows how to build and run a LLaMA-3.2 Vision model in TensorRT-LLM. We use [Llama-3.2-11B-Vision/](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision) as an example.
For LLaMA-3.2 text model, please refer to the [examples/llama/README.md](../llama/README.md) because it shares the model architecture of llama.
For LLaMA-3.2 text model, please refer to the [examples/models/core/llama/README.md](../llama/README.md) because it shares the model architecture of llama.
### Support data types <!-- omit from toc -->
* BF16
@ -831,7 +831,7 @@ Note that for instruct Vision model, please set the `max_encoder_input_len` as `
[NeVA](https://docs.nvidia.com/nemo-framework/user-guide/latest/multimodalmodels/neva/index.html) is a groundbreaking addition to the NeMo Multimodal ecosystem. This model seamlessly integrates large language-centric models with a vision encoder, that can be deployed in TensorRT-LLM.
1. Generate TRT-LLM engine for NVGPT following example in `examples/gpt/README.md`. To adhere to the NVGPT conventions of the conversion script, some layer keys have to be remapped using `--nemo_rename_key`.
1. Generate TRT-LLM engine for NVGPT following example in `examples/models/core/gpt/README.md`. To adhere to the NVGPT conventions of the conversion script, some layer keys have to be remapped using `--nemo_rename_key`.
```bash
export MODEL_NAME="neva"
@ -886,11 +886,11 @@ Note that for instruct Vision model, please set the `max_encoder_input_len` as `
git clone https://huggingface.co/facebook/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
```
2. Convert Huggingface weights into TRT-LLM checkpoints and build TRT engines using scripts in `examples/enc_dec`
2. Convert Huggingface weights into TRT-LLM checkpoints and build TRT engines using scripts in `examples/models/core/enc_dec`
Nougat uses mBART architecture but replaces the LLM encoder with a Swin Transformer encoder.
To achieve this, we add an extra `--nougat` flag (over mBART example) to
`convert_checkpoint.py` in `examples/enc_dec` and `trtllm-build`.
`convert_checkpoint.py` in `examples/models/core/enc_dec` and `trtllm-build`.
```bash
python ../enc_dec/convert_checkpoint.py --model_type bart \
@ -938,7 +938,7 @@ Note that for instruct Vision model, please set the `max_encoder_input_len` as `
git clone https://huggingface.co/microsoft/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
```
2. Convert Huggingface weights into TRT-LLM checkpoints and build TRT engines using scripts in `examples/phi`.
2. Convert Huggingface weights into TRT-LLM checkpoints and build TRT engines using scripts in `examples/models/core/phi`.
```bash
python ../phi/convert_checkpoint.py \
--model_dir tmp/hf_models/${MODEL_NAME} \
@ -1045,7 +1045,7 @@ pip install -r requirements-qwen2vl.txt
[Video NeVA](https://github.com/NVIDIA/NeMo/blob/main/docs/source/multimodal/mllm/video_neva.rst) is a groundbreaking addition to the NeMo Multimodal ecosystem that could work with video modality. This model seamlessly integrates large language-centric models with a vision encoder, that can be deployed in TensorRT-LLM.
1. Generate TRT-LLM engine for Nemotron model following example in `examples/nemotron/README.md`. To adhere to the NVGPT conventions of the conversion script. This will be used as our base LM for inference.
1. Generate TRT-LLM engine for Nemotron model following example in `examples/models/core/nemotron/README.md`. To adhere to the NVGPT conventions of the conversion script. This will be used as our base LM for inference.
```bash
pip install decord # used for loading video
@ -1094,7 +1094,7 @@ This section explains how to evaluate datasets using our provided script, includ
To run an evaluation, use the following command:
```bash
python ./examples/multimodal/eval.py \
python ./examples/models/core/multimodal/eval.py \
--model_type <model_type> \
--engine_dir <engine_dir> \
--hf_model_dir <hf_model_dir> \


@ -43,7 +43,7 @@ Due the non-uniform architecture of the model, the different pipeline parallelis
## Usage
The TensorRT-LLM example code is located at [examples/nemotron_nas](./).
The TensorRT-LLM example code is located at [examples/models/core/nemotron_nas](./).
The `convert_checkpoint.py` script accepts Hugging Face weights as input, and builds the corresponding TensorRT engines.
The number of TensorRT engines depends on the number of GPUs used to run inference.


@ -30,7 +30,7 @@ LLaVA-NeXT is an extension of LLaVA. TRT-LLM currently supports [Mistral-7b](htt
# copy the image newlines tensor to engine directory
cp tmp/trt_models/${MODEL_NAME}/fp16/1-gpu/vision/image_newlines.safetensors tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/vision
```
3. Generate TRT-LLM engine for LLaMA following example in `examples/llama/README.md`
3. Generate TRT-LLM engine for LLaMA following example in `examples/models/core/llama/README.md`
```bash
python ../llama/convert_checkpoint.py \


@ -319,7 +319,7 @@ def main(args):
if is_enc_dec:
logger.warning(
"This path is an encoder-decoder model. Using different handling.")
assert not args.use_py_session, "Encoder-decoder models don't have a unified python runtime, please use its own examples/enc_dec/run.py instead."
assert not args.use_py_session, "Encoder-decoder models don't have a unified python runtime, please use its own examples/models/core/enc_dec/run.py instead."
model_name, model_version = read_model_name(
args.engine_dir if not is_enc_dec else os.path.


@ -62,7 +62,7 @@ wget https://huggingface.co/EleutherAI/gpt-j-6b/resolve/main/merges.txt
```
2. Convert the Hugging Face checkpoint into TensorRT-LLM format.
Run below command lines in [`examples/gptj`](../gptj) directory.
Run below command lines in [`examples/models/contrib/gpt`](../gptj) directory.
```bash
# Build a float16 checkpoint using HF weights.
python convert_checkpoint.py --model_dir ./gpt-j-6b \
@ -113,7 +113,7 @@ python3 ../summarize.py --engine_dir ./trt_engines/gptj_fp16_tp1.refit \
1. Download the llama-7b-hf checkpoint and saved in /llm-models/llama-models/llama-7b-hf/.
2. Calibrate the checkpoint and convert into TensorRT-LLM format.
Run below command lines in [`examples/llama`](../llama) directory.
Run below command lines in [`examples/models/core/llama`](../llama) directory.
```bash
# Calibrate INT4 using AMMO.
python ../quantization/quantize.py --model_dir /llm-models/llama-models/llama-7b-hf/ \
@ -154,7 +154,7 @@ python3 ../summarize.py --engine_dir trt_int4_AWQ_full_from_wtless \
1. Download the llama-7b-hf checkpoint and saved in /llm-models/llama-models/llama-7b-hf/.
2. Convert the checkpoint into TensorRT-LLM format.
Run below command lines in [`examples/llama`](../llama) directory.
Run below command lines in [`examples/models/core/llama`](../llama) directory.
```bash
python3 convert_checkpoint.py --model_dir /llm-models/llama-models/llama-7b-hf/ \
--output_dir ./llama-7b-hf-fp16-woq \
@ -194,7 +194,7 @@ python3 ../summarize.py --engine_dir ./engines/llama-7b-hf-fp16-woq-1gpu-wtless-
1. Download the llama-v2-70b-hf checkpoint and saved in /llm-models/llama-models-v2/llama-v2-70b-hf/.
2. Calibrate the checkpoint and convert into TensorRT-LLM format.
Run below command lines in [`examples/llama`](../llama) directory.
Run below command lines in [`examples/models/core/llama`](../llama) directory.
```bash
# Calibrate FP8 using AMMO.
python ../quantization/quantize.py --model_dir /llm-models/llama-models-v2/llama-v2-70b-hf/ \
@ -250,7 +250,7 @@ Building an engine from a pruned checkpoint will also allow the engine to be [re
#### Pruning a TensorRT-LLM Checkpoint
1. Install [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/blob/main/README.md) either through [pip](https://github.com/NVIDIA/TensorRT-LLM/blob/main/README.md#installation) or [from the source](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/installation/build-from-source-linux.md).
2. Download a model of your choice and convert it to a TensorRT-LLM checkpoint ([llama instructions](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llama/README.md#usage)).
2. Download a model of your choice and convert it to a TensorRT-LLM checkpoint ([llama instructions](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/models/core/llama/README.md#usage)).
3. (Optional) Run the `trtllm-prune` command.
```bash
# Prunes the TRT-LLM checkpoint at ${CHECKPOINT_DIR}, and stores it in the directory ${CHECKPOINT_DIR}.pruned


@ -515,7 +515,7 @@ class BertForSequenceClassification(BertBase):
remove_input_padding = default_net().plugin_config.remove_input_padding
# required as explicit input in remove_input_padding mode
# see examples/bert/run_remove_input_padding.py for how to create them from input_ids and input_lengths
# see examples/models/core/bert/run_remove_input_padding.py for how to create them from input_ids and input_lengths
if remove_input_padding:
assert token_type_ids is not None and \
position_ids is not None and \


@ -138,7 +138,7 @@ def install_additional_requirements(python_exe, root_dir):
"pip",
"install",
"-r",
"examples/recurrentgemma/requirements.txt",
"examples/models/core/recurrentgemma/requirements.txt",
],
cwd=root_dir,
env=_os.environ,


@ -5,7 +5,7 @@ hf_model_dir=$1
engine_dir=$2
# fake a 1-layer LLaMA model for CI
python3 ../../examples/llama/build.py \
python3 ../../examples/models/core/llama/build.py \
--use_gemm_plugin \
--enable_context_fmha \
--use_gpt_attention_plugin \