# Running Disaggregated Serving with Triton TensorRT LLM Backend

## Overview

Disaggregated serving refers to a technique that uses separate GPUs for running the context and generation phases of LLM inference.

For Triton integration, a BLS model named `disaggregated_serving_bls` orchestrates the disaggregated serving pipeline. This BLS model requires the names of the TRT-LLM models to be used for the context and generation phases.
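These model names are surfaced to the BLS model through `parameters` entries in its `config.pbtxt`. A minimal sketch, assuming the parameter keys set in the steps below (the rest of the template configuration is omitted, and the `${...}` placeholders are replaced with the actual model names, e.g. `context` and `generation` as used later in this guide):

```
parameters: {
  key: "context_model_name"
  value: {
    string_value: "${context_model_name}"
  }
}
parameters: {
  key: "generation_model_name"
  value: {
    string_value: "${generation_model_name}"
  }
}
```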

This example assumes access to a two-GPU system with `CUDA_VISIBLE_DEVICES` set to `0,1`.

## Setting Up the Model Repository and Starting the Server

1. Set up the model repository as instructed in the LLaMa guide.

2. Create the context and generation models with the desired tensor-parallel configuration. We will use the model names `context` and `generation` for the context and generation models, respectively. Both models should be copies of the `tensorrt_llm` model configuration.

3. Set `participant_ids` to `1` for the context model and `2` for the generation model.

4. Set `gpu_device_ids` to `0` for the context model and `1` for the generation model.

5. Set `context_model_name` and `generation_model_name` to `context` and `generation` in the `disaggregated_serving_bls` model configuration (see the sketch after this list).
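A minimal sketch of steps 3 and 4, assuming the standard `parameters` layout of the template `config.pbtxt` files; step 5 fills in the BLS parameters sketched in the Overview with the values `context` and `generation`:

```
# context/config.pbtxt
parameters: { key: "participant_ids" value: { string_value: "1" } }
parameters: { key: "gpu_device_ids"  value: { string_value: "0" } }

# generation/config.pbtxt
parameters: { key: "participant_ids" value: { string_value: "2" } }
parameters: { key: "gpu_device_ids"  value: { string_value: "1" } }
```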

Your model repository should look like the following:

```
disaggregated_serving/
|-- context
|   |-- 1
|   `-- config.pbtxt
|-- disaggregated_serving_bls
|   |-- 1
|   |   `-- model.py
|   `-- config.pbtxt
|-- ensemble
|   |-- 1
|   `-- config.pbtxt
|-- generation
|   |-- 1
|   `-- config.pbtxt
|-- postprocessing
|   |-- 1
|   |   `-- model.py
|   `-- config.pbtxt
`-- preprocessing
    |-- 1
    |   `-- model.py
    `-- config.pbtxt
```

6. Rename the `tensorrt_llm` model in the ensemble `config.pbtxt` file to `disaggregated_serving_bls` (a sketch of this change appears after these steps).

7. Launch the Triton server:

```bash
python3 scripts/launch_triton_server.py --world_size 3 --tensorrt_llm_model_name context,generation --multi-model --disable-spawn-processes
```

> [!NOTE]
> The world size should be equal to (tp * pp of the context model) + (tp * pp of the generation model) + 1. The additional process is required for the orchestrator. For example, with tp=1 and pp=1 for both models, the world size is 1 + 1 + 1 = 3, matching `--world_size 3` above.

8. Send a request to the server:

```bash
python3 inflight_batcher_llm/client/end_to_end_grpc_client.py -S -p "Machine learning is"
```
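As referenced in step 6, the ensemble change amounts to pointing the scheduling step that previously referenced `tensorrt_llm` at the BLS model instead. A minimal sketch of just that step, assuming the standard preprocessing / TRT-LLM / postprocessing ensemble template (the `input_map` and `output_map` entries stay as they are in the template):

```
# ensemble/config.pbtxt (excerpt: the step that previously referenced tensorrt_llm)
{
  model_name: "disaggregated_serving_bls"  # was: "tensorrt_llm"
  model_version: -1
  # input_map / output_map entries unchanged from the template
}
```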

## Creating Multiple Copies of the Context and Generation Models (Data Parallelism)

You can also create multiple copies of the context and generation models. This can be achieved by setting the `participant_ids` and `gpu_device_ids` for each instance.

For example, if you have a context model with tp=2 and want to create 2 copies of it, set `participant_ids` to `1,2;3,4` and `gpu_device_ids` to `0,1;2,3` (assuming a 4-GPU system), and set the `count` in the `instance_group` section of the model configuration to 2. This creates 2 copies of the context model: the first copy on GPUs 0 and 1, and the second on GPUs 2 and 3. A sketch of this configuration is shown below.
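A minimal sketch of that example for the context model, assuming the same parameter layout as above; the semicolon separates the two model instances:

```
# context/config.pbtxt (tp=2, two copies)
parameters: { key: "participant_ids" value: { string_value: "1,2;3,4" } }
parameters: { key: "gpu_device_ids"  value: { string_value: "0,1;2,3" } }

instance_group [
  {
    count: 2
    kind: KIND_CPU  # kind as in the template tensorrt_llm config
  }
]
```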

## Known Issues

1. Only the C++ version of the backend is supported right now.