# Executor API examples
This directory contains several examples that demonstrate how to use the Executor API:
- The example defined in `executorExampleBasic.cpp` shows how you can generate output tokens for a single prompt in only a few lines of code (a minimal sketch of this flow follows the list).
- The example defined in `executorExampleAdvanced.cpp` supports more options, such as providing an arbitrary number of input requests with arbitrary tokens per request, and running in streaming mode.
- The example defined in `executorExampleLogitsProcessor.cpp` shows how to use `LogitsPostProcessor` to control the output tokens.
- The example defined in `executorExampleFastLogits.cpp` shows how to use `ExternalDraftTokensConfig` for speculative decoding, and optionally use the fast logits feature.
- The example defined in `executorExampleKvEvents.cpp` shows how to use the KV cache event API.
- The example defined in `executorExampleDisaggregated.cpp` shows how to use the disaggregated executor API.
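For reference, the following is a minimal sketch of the flow that `executorExampleBasic.cpp` implements: construct an `Executor`, enqueue a `Request`, and await the `Response`. It is based on the executor API in `cpp/include/tensorrt_llm/executor/executor.h`; exact constructor arguments (for example, the name of the max-new-tokens parameter) may differ between releases, so treat it as an illustration rather than a substitute for the example itself.

```cpp
#include "tensorrt_llm/executor/executor.h"

#include <iostream>
#include <string>

namespace tle = tensorrt_llm::executor;

int main(int argc, char* argv[])
{
    // Path to the directory containing the TensorRT engine files.
    // (The real example also initializes the TRT-LLM plugins before this point.)
    std::string enginePath = argv[1];

    // Create an executor for a decoder-only engine with a default configuration.
    tle::Executor executor(enginePath, tle::ModelType::kDECODER_ONLY, tle::ExecutorConfig{});

    // Enqueue a request with a dummy token prompt, asking for up to 10 new tokens.
    tle::Request request({1, 2, 3, 4}, /*maxTokens=*/10);
    auto requestId = executor.enqueueRequest(request);

    // Block until the request completes, then print beam 0 of the result.
    auto responses = executor.awaitResponses(requestId);
    for (auto token : responses.at(0).getResult().outputTokenIds.at(0))
    {
        std::cout << token << " ";
    }
    std::cout << std::endl;
    return 0;
}
```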
## Building the examples
To build the examples, you first need to build the TensorRT LLM C++ shared libraries (`libtensorrt_llm.so` and `libnvinfer_plugin_tensorrt_llm.so`) using the `build_wheel.py` script. Alternatively, if you have already built the TensorRT LLM libraries, you can modify the provided `CMakeLists.txt` so that `libtensorrt_llm.so` and `libnvinfer_plugin_tensorrt_llm.so` are imported properly.
Once the TensorRT LLM libraries are built, you can run

```bash
mkdir build
cd build
cmake ..
make -j
```

from the `./examples/cpp/executor/` folder to build the examples.
## Preparing the TensorRT LLM engine(s)
Before you run the examples, please make sure that you have already built the engine(s) using the TensorRT LLM API. Use `trtllm-build` to build a TRT-LLM engine.
## Running the examples
### executorExampleBasic
From the `examples/cpp/executor/build` folder, you can run the `executorExampleBasic` example with:

```bash
./executorExampleBasic <path_to_engine_dir>
```

where `<path_to_engine_dir>` is the path to the directory containing the TensorRT engine files.
### executorExampleDebug
This example shows how you can define which engine IO tensors should be dumped to numpy files.
From the `examples/cpp/executor/build` folder, you can run the `executorExampleDebug` example with:

```bash
./executorExampleDebug <path_to_engine_dir>
```

where `<path_to_engine_dir>` is the path to the directory containing the TensorRT engine files.
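The dumping behavior is controlled through the executor's debug configuration. The fragment below is a hedged sketch of how such a configuration can be set up; it assumes the `DebugConfig` class from the executor API, and the exact setter names and available tensor names may vary, so consult `executorExampleDebug.cpp` for the authoritative version.

```cpp
#include "tensorrt_llm/executor/executor.h"

namespace tle = tensorrt_llm::executor;

int main()
{
    // Ask for the "logits" IO tensor to be dumped; the tensor name here is
    // illustrative and depends on the engine.
    tle::DebugConfig debugConfig;
    debugConfig.setDebugTensorNames({"logits"});

    tle::ExecutorConfig executorConfig;
    executorConfig.setDebugConfig(debugConfig);

    // The executor will now dump the selected engine IO tensors to numpy
    // files as it executes requests.
    tle::Executor executor("<path_to_engine_dir>", tle::ModelType::kDECODER_ONLY, executorConfig);
    return 0;
}
```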
### executorExampleAdvanced
From the `examples/cpp/executor/build` folder, you can also run the `executorExampleAdvanced` example. To get the full list of supported input arguments, type

```bash
./executorExampleAdvanced -h
```

For example, you can run:

```bash
./executorExampleAdvanced --engine_dir <path_to_engine_dir> --input_tokens_csv_file ../inputTokens.csv
```

to run with the provided dummy input tokens from `inputTokens.csv`. Upon successful completion, you should see the following in the logs:
```
[TensorRT-LLM][INFO] Creating request with 6 input tokens
[TensorRT-LLM][INFO] Creating request with 4 input tokens
[TensorRT-LLM][INFO] Creating request with 10 input tokens
[TensorRT-LLM][INFO] Got 20 tokens for beam 0 for requestId 3
[TensorRT-LLM][INFO] Request id 3 is completed.
[TensorRT-LLM][INFO] Got 14 tokens for beam 0 for requestId 2
[TensorRT-LLM][INFO] Request id 2 is completed.
[TensorRT-LLM][INFO] Got 16 tokens for beam 0 for requestId 1
[TensorRT-LLM][INFO] Request id 1 is completed.
[TensorRT-LLM][INFO] Writing output tokens to outputTokens.csv
[TensorRT-LLM][INFO] Exiting.
```
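The advanced example can also run in streaming mode, where output tokens are returned as soon as they are generated rather than all at once. The fragment below is a hedged sketch of the streaming pattern (error handling via `Response::hasError` is omitted); `streamRequest` and its arguments are placeholders, and the full version lives in `executorExampleAdvanced.cpp`.

```cpp
#include "tensorrt_llm/executor/executor.h"

#include <chrono>

namespace tle = tensorrt_llm::executor;

// Enqueue a streaming request and consume partial results as they arrive.
void streamRequest(tle::Executor& executor, tle::VecTokens inputTokens, tle::SizeType32 maxTokens)
{
    tle::Request request(std::move(inputTokens), maxTokens, /*streaming=*/true);
    auto requestId = executor.enqueueRequest(request);

    bool done = false;
    while (!done)
    {
        // Wait up to 1 ms for new responses; each response carries the tokens
        // generated since the previous one.
        auto responses = executor.awaitResponses(requestId, std::chrono::milliseconds(1));
        for (auto const& response : responses)
        {
            auto const& result = response.getResult();
            // Consume result.outputTokenIds here (beam 0 in the common case).
            done = result.isFinal;
        }
    }
}
```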
#### Multi-GPU run
To run the `executorExampleAdvanced` example on models that require multiple GPUs, you can run it using MPI as follows:

```bash
mpirun -n <num_ranks> --allow-run-as-root ./executorExampleAdvanced --engine_dir <path_to_engine_dir> --input_tokens_csv_file ../inputTokens.csv
```

where `<num_ranks>` must be equal to `tp*pp` for the TensorRT engine. By default, GPU device IDs `[0, ..., num_ranks-1]` will be used.
Alternatively, it is also possible to run multi-GPU models using the so-called Orchestrator communication mode, where the `Executor` instance automatically spawns additional processes to run the model on multiple GPUs. To use the Orchestrator communication mode, you can run the example with:

```bash
./executorExampleAdvanced --engine_dir <path_to_engine_dir> --input_tokens_csv_file ../inputTokens.csv --use_orchestrator_mode --worker_executable_path <path_to_executor_worker>
```

where `<path_to_executor_worker>` is the absolute path to the stand-alone executor worker executable, located at `cpp/build/tensorrt_llm/executor_worker/executorWorker` by default.
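Programmatically, the Orchestrator mode is selected through the executor's parallel configuration. The following is a sketch based on the `OrchestratorConfig` and `ParallelConfig` classes in the executor API; argument order and optional parameters are assumptions that may differ between releases.

```cpp
#include "tensorrt_llm/executor/executor.h"

#include <optional>

namespace tle = tensorrt_llm::executor;

int main()
{
    // The orchestrator spawns one worker process per rank using this executable.
    tle::OrchestratorConfig orchestratorConfig(
        /*isOrchestrator=*/true, "/path/to/cpp/build/tensorrt_llm/executor_worker/executorWorker");

    // Select MPI communication in ORCHESTRATOR mode rather than the default
    // LEADER mode, in which the calling MPI ranks run the model themselves.
    tle::ParallelConfig parallelConfig(tle::CommunicationType::kMPI, tle::CommunicationMode::kORCHESTRATOR,
        /*deviceIds=*/std::nullopt, /*participantIds=*/std::nullopt, orchestratorConfig);

    tle::ExecutorConfig executorConfig;
    executorConfig.setParallelConfig(parallelConfig);

    tle::Executor executor("<path_to_engine_dir>", tle::ModelType::kDECODER_ONLY, executorConfig);
    return 0;
}
```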
### executorExampleFastLogits
To run the `executorExampleFastLogits` example, you need two GPUs (one for the draft model and one for the target model). You can run it as follows:

```bash
mpirun -n 3 --allow-run-as-root ./executorExampleFastLogits --engine_dir <path_to_target_engine> --draft_engine_dir <path_to_draft_engine> --num_draft_tokens=3
```

The example uses 3 MPI ranks (one for the orchestrator, one for the draft model, and one for the target model).

Use `--fast_logits=false` to disable the fast logits feature.
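On the API side, speculative decoding with an external draft model amounts to attaching the draft model's tokens (and optionally its logits) to the target-model request. The helper below is a hedged sketch based on the `ExternalDraftTokensConfig` class; `makeTargetRequest` and its arguments are placeholders, and the optional-parameter layout is an assumption to verify against the header.

```cpp
#include "tensorrt_llm/executor/executor.h"

#include <optional>

namespace tle = tensorrt_llm::executor;

// Build a target-model request that carries draft tokens produced by the
// draft model. draftTokens and draftLogits stand in for the draft output.
tle::Request makeTargetRequest(tle::VecTokens inputTokens, tle::SizeType32 maxTokens, tle::VecTokens draftTokens,
    std::optional<tle::Tensor> draftLogits)
{
    tle::Request request(std::move(inputTokens), maxTokens);

    // With fastLogits enabled, the draft logits are transferred directly
    // between the draft and target model processes instead of being copied
    // through the client.
    tle::ExternalDraftTokensConfig draftConfig(
        std::move(draftTokens), std::move(draftLogits), /*acceptanceThreshold=*/std::nullopt, /*fastLogits=*/true);
    request.setExternalDraftTokensConfig(draftConfig);
    return request;
}
```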
### executorExampleKvEvents
From the `examples/cpp/executor/build` folder, you can run the `executorExampleKvEvents` example with:

```bash
./executorExampleKvEvents --engine_dir <path_to_engine_dir>
```

where `<path_to_engine_dir>` is the path to the directory containing the TensorRT engine files.
This example shows how the KV Cache Event API can be used to reconstruct the state of TRT-LLM's internal radix tree. This can be used in applications such as smart routing to route requests between multiple executor instances to maximize KV Cache reuse. Events are emitted when blocks are stored, removed, or updated in the radix tree.
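The following is a hedged sketch of the event-consumption loop, based on the `KVCacheEventManager` interface in the executor API; the event buffer size setter and the shape of the event payloads are the parts most likely to differ between releases, so check `executorExampleKvEvents.cpp` for the authoritative version.

```cpp
#include "tensorrt_llm/executor/executor.h"

#include <chrono>
#include <iostream>

namespace tle = tensorrt_llm::executor;

int main()
{
    // Enable KV cache block reuse and give the event buffer a nonzero size so
    // that events are actually recorded.
    tle::KvCacheConfig kvCacheConfig(/*enableBlockReuse=*/true);
    kvCacheConfig.setEventBufferMaxSize(16384);

    tle::ExecutorConfig executorConfig;
    executorConfig.setKvCacheConfig(kvCacheConfig);

    tle::Executor executor("<path_to_engine_dir>", tle::ModelType::kDECODER_ONLY, executorConfig);

    // Poll for events; each event's data describes a mutation of the radix
    // tree (blocks created, stored, removed, or updated).
    auto eventManager = executor.getKVCacheEventManager();
    if (eventManager)
    {
        auto events = (*eventManager)->getLatestEvents(std::chrono::milliseconds(100));
        for (auto const& event : events)
        {
            // Apply the event to a local mirror of the radix tree here.
            std::cout << "got event " << event.eventId << std::endl;
        }
    }
    return 0;
}
```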
### executorExampleDisaggregated
From the `examples/cpp/executor/build` folder, you can also run the `executorExampleDisaggregated` example. To get the full list of supported input arguments, type

```bash
./executorExampleDisaggregated -h
```

Note that setting `TRTLLM_USE_UCX_KVCACHE=1` is required to run the disaggregated executor.

For example, you can run:

```bash
export TRTLLM_USE_UCX_KVCACHE=1

mpirun -n <num_ranks> --allow-run-as-root --oversubscribe ./executorExampleDisaggregated --context_engine_dir <path_to_context_engine_dir> --context_rank_size <num_ranks_for_context> --generation_engine_dir <path_to_generation_engine_dir> --generation_rank_size <num_ranks_for_generation> --input_tokens ../inputTokens.csv
```
where `<num_ranks_for_context>` must be equal to `tp*pp` for the context engine, and `<num_ranks_for_generation>` must be equal to `tp*pp` for the generation engine; the context and generation engines can be heterogeneous in parallelism. `<num_ranks>` must be equal to `<num_ranks_for_context> + <num_ranks_for_generation> + 1`, where the additional rank is used as the orchestrator process.
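The core of the disaggregated flow is that the prompt is first processed by a context-only request on the context executor, and the resulting context phase parameters are then handed to a generation-only request on the generation executor. The helper below is a heavily hedged sketch of that hand-off, assuming the request-type and context-phase-parameter accessors from the executor API; `disaggregatedGenerate` is a placeholder name, and `executorExampleDisaggregated.cpp` remains the authoritative version.

```cpp
#include "tensorrt_llm/executor/executor.h"

#include <utility>

namespace tle = tensorrt_llm::executor;

// Hand a prompt from a context-only executor over to a generation-only executor.
tle::IdType disaggregatedGenerate(
    tle::Executor& contextExecutor, tle::Executor& genExecutor, tle::VecTokens inputTokens, tle::SizeType32 maxTokens)
{
    // Context (prefill) phase: process the prompt and populate the KV cache.
    tle::Request contextRequest(inputTokens, /*maxTokens=*/1);
    contextRequest.setRequestType(tle::RequestType::REQUEST_TYPE_CONTEXT_ONLY);
    auto contextId = contextExecutor.enqueueRequest(contextRequest);
    auto contextResult = contextExecutor.awaitResponses(contextId).at(0).getResult();

    // Generation phase: the context phase parameters tell the generation
    // executor where to fetch the KV cache from (over UCX in this setup).
    tle::Request genRequest(std::move(inputTokens), maxTokens);
    genRequest.setRequestType(tle::RequestType::REQUEST_TYPE_GENERATION_ONLY);
    genRequest.setContextPhaseParams(std::move(*contextResult.contextPhaseParams));
    return genExecutor.enqueueRequest(genRequest);
}
```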