TensorRT-LLM/examples/disaggregated

To launch the context and generation servers, run the following (TRTLLM_USE_MPI_KVCACHE=1 enables KV-cache transfer between the workers over MPI):

export TRTLLM_USE_MPI_KVCACHE=1
mpirun --allow-run-as-root -n 2 python3 launch_disaggregated_workers.py -c disagg_config.yaml &> output_workers &
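
The disagg_config.yaml file selects the model, the endpoint the disaggregated server listens on, and how many context and generation instances are used. As an illustrative sketch only (verify field names against the disagg_config.yaml shipped in this directory), a minimal configuration might look like:

# Illustrative sketch -- check against the actual disagg_config.yaml in this example.
model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
hostname: localhost           # address the disaggregated server listens on
port: 8000
context_servers:
  num_instances: 1
  urls:
      - "localhost:8001"      # endpoint of the context (prefill) worker
generation_servers:
  num_instances: 1
  urls:
      - "localhost:8002"      # endpoint of the generation (decode) worker

With one context and one generation instance, the mpirun command above uses -n 2, one rank per worker.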

Then, launch the disaggregated server, which orchestrates requests between the context and generation servers:

python3 launch_disaggregated_server.py -c disagg_config.yaml  &> output_disagg &

Once the context, generation, and disaggregated servers are running, you can send requests to the disaggregated server using curl:

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "prompt": "NVIDIA is a great company because",
        "max_tokens": 16,
        "temperature": 0
    }' -w "\n"
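
The same request can also be sent programmatically. Below is a minimal Python sketch using the requests library; it mirrors the curl command above (same endpoint and payload), so nothing beyond a standard HTTP POST to the completions endpoint is assumed:

import requests

# POST the same completion request as the curl example to the disaggregated server.
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "prompt": "NVIDIA is a great company because",
        "max_tokens": 16,
        "temperature": 0,
    },
)
response.raise_for_status()
print(response.json())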

Alternatively, use the provided client:

cd clients
python3 disagg_client.py -c ../disagg_config.yaml -p prompts.json
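
The exact schema of prompts.json is defined by disagg_client.py; as an assumption for illustration, a simple JSON array of prompt strings matching the curl example above would be:

[
    "NVIDIA is a great company because"
]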