# Python Bindings Example

This example shows how to use the Python bindings interface to generate tokens using a TensorRT engine.

## Setup

Build a TensorRT engine for one of the supported TensorRT-LLM models by following the instructions in the corresponding examples folder.

## Usage

### Basic example

Run `example_basic.py`, passing in the directory where the TensorRT engine was generated. For example:

```
cd examples/bindings
python3 example_basic.py --model_path=../llama/tmp/7B/trt_engines/fp16/1-gpu/
```
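
At its core, `example_basic.py` creates an executor from the engine directory, enqueues a request, and waits for the generated tokens. Below is a minimal sketch of that flow, assuming the `tensorrt_llm.bindings.executor` module layout and a single-GPU, decoder-only engine:

```python
import tensorrt_llm.bindings.executor as trtllm

# Create the executor from a pre-built decoder-only engine directory.
executor = trtllm.Executor("path/to/engine_dir", trtllm.ModelType.DECODER_ONLY,
                           trtllm.ExecutorConfig(max_beam_width=1))

if executor.can_enqueue_requests():
    # Enqueue a single request and wait for the generated tokens.
    request = trtllm.Request(input_token_ids=[1, 2, 3, 4], max_tokens=10)
    request_id = executor.enqueue_request(request)
    responses = executor.await_responses(request_id)
    print(responses[0].result.output_token_ids)
```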

### Debug example

This example shows how you can define which engine IO tensors should be kept or dumped to numpy files. Run `example_debug.py`, passing in the directory where the TensorRT engine was generated. For example:

```
cd examples/bindings
python3 example_debug.py --model_path=../llama/tmp/7B/trt_engines/fp16/1-gpu/
```
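
The tensors to keep or dump are selected through the executor's debug configuration. The following sketch illustrates the idea; the `DebugConfig` field names and the tensor name are assumptions for illustration:

```python
import tensorrt_llm.bindings.executor as trtllm

# Dump the selected engine IO tensors to numpy files at each iteration
# (the tensor name below is illustrative).
debug_config = trtllm.DebugConfig(debug_input_tensors=True,
                                  debug_output_tensors=True,
                                  debug_tensor_names=["sequence_length"])

executor = trtllm.Executor(
    "path/to/engine_dir", trtllm.ModelType.DECODER_ONLY,
    trtllm.ExecutorConfig(max_beam_width=1, debug_config=debug_config))
```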

### Advanced example

This example shows how you can use the Python bindings to generate tokens for a large number of requests concurrently, and demonstrates how tokens can be returned in a streaming fashion.

The full list of supported input parameters can be obtained with:

```
python3 example_advanced.py -h
```

For example, assuming a CSV file named `input_tokens.csv` exists which contains the following input tokens:

```
1, 2, 3, 4, 5, 6
1, 2, 3, 4
1, 2, 3, 4, 5, 6, 7, 8, 9, 10
```

one can generate output tokens for those 3 prompts with:

```
python3 example_advanced.py --model_path <model_path> --input_tokens_csv_file input_tokens.csv
```

Upon successful completion, the output tokens will be written to the file `output_tokens.csv`.
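
Conceptually, the advanced example enqueues all requests up front and then drains responses as they arrive; with `streaming=True`, each response carries the newly generated tokens. A simplified sketch of that loop, under the same assumed bindings module as above:

```python
import datetime
import tensorrt_llm.bindings.executor as trtllm

executor = trtllm.Executor("path/to/engine_dir", trtllm.ModelType.DECODER_ONLY,
                           trtllm.ExecutorConfig(max_beam_width=1))

# Enqueue all prompts at once; streaming returns tokens as they are produced.
prompts = [[1, 2, 3, 4, 5, 6], [1, 2, 3, 4], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]]
request_ids = [
    executor.enqueue_request(
        trtllm.Request(input_token_ids=p, max_tokens=10, streaming=True))
    for p in prompts
]

# Drain responses until every request has reported its final tokens.
output_tokens = {rid: [] for rid in request_ids}
num_finished = 0
while num_finished < len(request_ids):
    for response in executor.await_responses(datetime.timedelta(seconds=1)):
        result = response.result
        output_tokens[response.request_id].extend(result.output_token_ids[0])
        num_finished += result.is_final
```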

### Multi-GPU example

To run these examples for models requiring more than one GPU, you can run them with MPI.

For example, the basic example can be run as follows:

```
mpirun -n 4 --allow-run-as-root python3 example_basic.py --model_path=../llama/tmp/7B/trt_engines/fp16/4gpu_tp4_pp1/
```

The advanced example can also be run in `ORCHESTRATOR` mode, where the additional processes needed for multi-GPU runs are spawned automatically. This can be done by running:

```
python3 example_advanced.py --model_path=../llama/tmp/7B/trt_engines/fp16/4gpu_tp4_pp1/ --use_orchestrator_mode
```
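
Under the hood, orchestrator mode is selected through the executor's parallel configuration. The class and argument names below are assumptions about the bindings, so treat this as a sketch rather than a definitive recipe:

```python
import tensorrt_llm.bindings.executor as trtllm

# This process acts as the orchestrator; worker processes for the remaining
# ranks are spawned automatically (the worker path is illustrative).
parallel_config = trtllm.ParallelConfig(
    trtllm.CommunicationType.MPI,
    trtllm.CommunicationMode.ORCHESTRATOR,
    orchestrator_config=trtllm.OrchestratorConfig(
        is_orchestrator=True,
        worker_executable_path="path/to/executorWorker"))

executor_config = trtllm.ExecutorConfig(max_beam_width=1,
                                        parallel_config=parallel_config)
```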

### Logits post processor example

This example shows how to generate JSON structured output using the `LogitsPostProcessor` API.

```
python3 example_logits_processor.py -t <tokenizer_path> -e <engine_path> --batch_size 8
```

`LogitsPostProcessorBatched`, which fuses logits processing for all samples in a batch into a single callback, can be enabled with the `--lpp_batched` flag.
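
As a sketch of what such a callback looks like, the snippet below masks the logits so that only one (purely illustrative) token can be sampled at each step; the callback signature and the `LogitsPostProcessorConfig` wiring are assumptions about the bindings:

```python
import torch
import tensorrt_llm.bindings.executor as trtllm

def allow_only_token_42(req_id, logits, token_ids, stream_ptr, client_id):
    """Mask all logits except one illustrative token id at every step."""
    with torch.cuda.stream(torch.cuda.ExternalStream(stream_ptr)):
        logits[:] = float("-inf")
        logits[..., 42] = 0.0

executor_config = trtllm.ExecutorConfig(
    max_beam_width=1,
    logits_post_processor_config=trtllm.LogitsPostProcessorConfig(
        processor_map={"allow_only_42": allow_only_token_42}))

# A request opts in to the processor by name.
request = trtllm.Request(input_token_ids=[1, 2, 3, 4],
                         max_tokens=10,
                         logits_post_processor_name="allow_only_42")
```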