# Running Disaggregated Serving with Triton TensorRT-LLM Backend

## Overview
Disaggregated serving refers to a technique that uses separate GPUs for running the context and generation phases of LLM inference.

For Triton integration, a BLS model named `disaggregated_serving_bls` has been created that orchestrates the disaggregated serving pipeline. This BLS model requires the names of the TRT-LLM models that will be used for the context and generation phases.

This example assumes access to a two-GPU system with `CUDA_VISIBLE_DEVICES` set to `0,1`.
## Setting Up the Model Repository and Starting the Server
- Set up the model repository as instructed in the LLaMa guide.

- Create context and generation models with the desired tensor-parallel configuration. We will be using the `context` and `generation` model names for the context and generation models respectively. Both models should be copies of the `tensorrt_llm` model configuration.

- Set the `participant_ids` for the context and generation models to `1` and `2` respectively.

- Set the `gpu_device_ids` for the context and generation models to `0` and `1` respectively.

- Set `context_model_name` and `generation_model_name` to `context` and `generation` in the `disaggregated_serving_bls` model configuration. (A sketch of where these values go appears after the steps below.)
Your model repository should look like the following:

```
disaggregated_serving/
|-- context
|   |-- 1
|   `-- config.pbtxt
|-- disaggregated_serving_bls
|   |-- 1
|   |   `-- model.py
|   `-- config.pbtxt
|-- ensemble
|   |-- 1
|   `-- config.pbtxt
|-- generation
|   |-- 1
|   `-- config.pbtxt
|-- postprocessing
|   |-- 1
|   |   `-- model.py
|   `-- config.pbtxt
`-- preprocessing
    |-- 1
    |   `-- model.py
    `-- config.pbtxt
```
- Rename the `tensorrt_llm` model in the `ensemble` config.pbtxt file to `disaggregated_serving_bls`.
- Launch the Triton server:

  ```bash
  python3 scripts/launch_triton_server.py --world_size 3 --tensorrt_llm_model_name context,generation --multi-model --disable-spawn-processes
  ```
  > [!NOTE]
  > The world size should be equal to `tp*pp` of the context model + `tp*pp` of the generation model + 1. The additional process is required for the orchestrator. In this example both models use `tp=1` and `pp=1`, so the world size is 1 + 1 + 1 = 3.
- Send a request to the server:

  ```bash
  python3 inflight_batcher_llm/client/end_to_end_grpc_client.py -S -p "Machine learning is"
  ```
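For reference, the sketch below shows roughly where the values from the steps above end up. It is a minimal sketch that assumes the `parameters`/`string_value` layout used by the TensorRT-LLM backend `config.pbtxt` templates and omits every unrelated field; treat the templates in your checked-out model repository as the authoritative source for the exact key names.

```
# context/config.pbtxt (sketch; unrelated fields omitted)
parameters: {
  key: "participant_ids"
  value: { string_value: "1" }
}
parameters: {
  key: "gpu_device_ids"
  value: { string_value: "0" }
}

# generation/config.pbtxt (sketch)
parameters: {
  key: "participant_ids"
  value: { string_value: "2" }
}
parameters: {
  key: "gpu_device_ids"
  value: { string_value: "1" }
}

# disaggregated_serving_bls/config.pbtxt (sketch)
parameters: {
  key: "context_model_name"
  value: { string_value: "context" }
}
parameters: {
  key: "generation_model_name"
  value: { string_value: "generation" }
}

# ensemble/config.pbtxt: in the ensemble_scheduling step that previously
# referenced "tensorrt_llm", model_name now reads "disaggregated_serving_bls".
```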
## Creating Multiple Copies of the Context and Generation Models (Data Parallelism)
You can also create multiple copies of the context and generation models. This is achieved by setting the `participant_ids` and `gpu_device_ids` for each instance. For example, if you have a context model with `tp=2` and you want to create 2 copies of it, you can set `participant_ids` to `1,2;3,4`, set `gpu_device_ids` to `0,1;2,3` (assuming a 4-GPU system), and set the `count` in the `instance_group` section of the model configuration to 2. This will create 2 copies of the context model: the first copy runs on GPUs 0 and 1, and the second copy runs on GPUs 2 and 3.
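A minimal sketch of that data-parallel context configuration follows, under the same assumptions about the `parameters`/`string_value` layout as above; the `KIND_CPU` instance kind is taken from the default tensorrt_llm template and may differ in your setup.

```
# context/config.pbtxt (sketch; two tp=2 copies)

# Copy 1 uses ranks 1,2 on GPUs 0,1; copy 2 uses ranks 3,4 on GPUs 2,3.
parameters: {
  key: "participant_ids"
  value: { string_value: "1,2;3,4" }
}
parameters: {
  key: "gpu_device_ids"
  value: { string_value: "0,1;2,3" }
}

# One model instance per copy.
instance_group [
  {
    count: 2
    kind: KIND_CPU
  }
]
```

With a layout like this, the `--world_size` passed to `launch_triton_server.py` should presumably cover every rank listed in `participant_ids` across all models, plus one for the orchestrator process.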
## Known Issues
- Only the C++ version of the backend is supported right now.