TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00

README.md

To launch the context and generation servers, run:

mpirun --allow-run-as-root -n 2 python3 launch_disaggregated_workers.py -c disagg_config.yaml &> output_workers &

Then launch the disaggregated server, which orchestrates requests between the context and generation servers:

python3 launch_disaggregated_server.py -c disagg_config.yaml  &> output_disagg &
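
Both commands above run in the background (`&`), so the disaggregated server may not be accepting connections yet when the shell prompt returns. The following is a minimal sketch of a readiness check, not part of this example directory: `wait_for_port` is a hypothetical helper, and the host and port (`localhost`, `8000`) are assumptions taken from the curl example in this README rather than read from `disagg_config.yaml`.

```python
import socket
import time


def wait_for_port(host, port, timeout=120.0, interval=1.0):
    """Return True once a TCP connection to (host, port) succeeds, False on timeout.

    Hypothetical helper for illustration; it only checks that the port is open.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while nothing is listening yet.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, `wait_for_port("localhost", 8000)` could be called before sending the first request. An open port only shows that the server process is listening; it does not guarantee that the context and generation workers have finished loading the model, so `output_workers` and `output_disagg` remain the place to look if requests fail.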

Once the context, generation, and disaggregated servers are running, send requests to the disaggregated server with curl:

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "prompt": "NVIDIA is a great company because",
        "max_tokens": 16,
        "temperature": 0
    }' -w "\n"
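
The same request can be issued from Python using only the standard library. This is a sketch that mirrors the curl call above, not the provided client: `build_completion_request` and `send_completion` are hypothetical names, the URL and payload fields are copied from the curl example, and the response is returned as parsed JSON without assuming any particular fields in it.

```python
import json
import urllib.request

# Assumed endpoint: host, port, and path are copied from the curl example.
DISAGG_URL = "http://localhost:8000/v1/completions"


def build_completion_request(prompt,
                             model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                             max_tokens=16,
                             temperature=0,
                             url=DISAGG_URL):
    """Build (but do not send) a POST request with the same body as the curl call."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_completion(prompt, timeout=60.0):
    """Send the request to the disaggregated server and return the parsed JSON reply."""
    request = build_completion_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `print(send_completion("NVIDIA is a great company because"))` with the servers running should print whatever JSON document the disaggregated server returns for that prompt.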

Alternatively, use the provided client:

cd clients
python3 disagg_client.py -c ../disagg_config.yaml -p prompts.json