# Asynchronous generation in Python
## Install the requirements

```bash
pip install -r examples/server/requirements.txt
```
## Directly from Python, with the HL API

Due to a limitation in the current HLAPI implementation, only LLaMA models are supported:

```bash
python3 examples/server/async.py <path_to_hf_llama_dir>
```
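For illustration only, the sketch below shows the asyncio pattern behind asynchronous generation: several prompts are submitted concurrently and their completions are awaited together. The `generate_async` coroutine here is a hypothetical stand-in for the HL API call made inside `async.py`, not the actual API.

```python
import asyncio

async def generate_async(prompt: str) -> str:
    # Hypothetical stand-in for the HL API generation call used in async.py.
    # In the real script this would dispatch the prompt to the TensorRT-LLM
    # engine and await the generated tokens without blocking the event loop.
    await asyncio.sleep(0.1)  # simulate non-blocking generation latency
    return f"{prompt} ..."

async def main() -> None:
    prompts = ["In this example,", "The capital of France is"]
    # Submit all prompts concurrently; results are gathered as each finishes.
    results = await asyncio.gather(*(generate_async(p) for p in prompts))
    for prompt, completion in zip(prompts, results):
        print(f"{prompt!r} -> {completion!r}")

if __name__ == "__main__":
    asyncio.run(main())
```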
## Using the server interface for TensorRT-LLM

### Start the server

```bash
python3 -m examples.server.server <path_to_tllm_engine_dir> <tokenizer_type> &
```
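Before sending requests, you may want to wait until the server is accepting connections. Below is a minimal, optional readiness probe; it assumes the server listens on `http://localhost:8000/generate` as in the examples that follow, and it simply retries a tiny request until one succeeds.

```python
import json
import time
import urllib.error
import urllib.request

def wait_for_server(url: str = "http://localhost:8000/generate",
                    timeout: float = 60.0) -> bool:
    """Poll the generate endpoint with a tiny request until the server responds."""
    payload = json.dumps({"prompt": "ping", "max_new_tokens": 1}).encode()
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            req = urllib.request.Request(
                url, data=payload, headers={"Content-Type": "application/json"})
            with urllib.request.urlopen(req, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, ConnectionError, TimeoutError):
            time.sleep(1)  # server not up yet; retry
    return False

if __name__ == "__main__":
    print("server ready:", wait_for_server())
```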
### Send requests

You can pass request arguments such as `max_new_tokens`, `top_p`, and `top_k` in your JSON dict:

```bash
curl http://localhost:8000/generate -d '{"prompt": "In this example,", "max_new_tokens": 8}'
```
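The same request can also be sent from Python. This is a minimal sketch using only the standard library; the endpoint and payload keys are taken from the curl example above, and the exact layout of the JSON response depends on the server.

```python
import json
import urllib.request

# Same request as the curl example above, issued from Python.
payload = {"prompt": "In this example,", "max_new_tokens": 8}
req = urllib.request.Request(
    "http://localhost:8000/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    # Print the raw response body; its exact JSON layout depends on the server.
    print(resp.read().decode())
```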
You can also use the streaming interface with:

```bash
curl http://localhost:8000/generate -d '{"prompt": "In this example,", "max_new_tokens": 8, "streaming": true}' --output -
```
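To consume the stream from Python, one option is to read the response body incrementally. This is a minimal sketch under the assumption that the server flushes chunks as tokens are generated; the `streaming` flag is taken from the curl example above.

```python
import json
import urllib.request

payload = {"prompt": "In this example,", "max_new_tokens": 8, "streaming": True}
req = urllib.request.Request(
    "http://localhost:8000/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    # Read the body in small chunks instead of waiting for the full response,
    # so partial generations can be displayed as they arrive.
    while True:
        chunk = resp.read(1024)
        if not chunk:
            break
        print(chunk.decode(errors="replace"), end="", flush=True)
```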