# Asynchronous generation in Python
## Install the requirements

```bash
pip install -r examples/server/requirements.txt
```
## Directly from Python, with the HL API

Due to a limitation in the current HLAPI implementation, only LLaMA models are supported:

```bash
python3 examples/server/async.py <path_to_hf_llama_dir>
```
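For illustration only, the sketch below shows the asyncio pattern behind asynchronous generation: several prompts are submitted concurrently and their completions are awaited together. The `generate_async` coroutine here is a hypothetical stand-in for the HL API call made inside `async.py`, not the actual API.

```python
import asyncio

async def generate_async(prompt: str) -> str:
    # Hypothetical stand-in for the HL API generation call used in async.py.
    # In the real script this would dispatch the prompt to the TensorRT-LLM
    # engine and await the generated tokens without blocking the event loop.
    await asyncio.sleep(0.1)  # simulate non-blocking generation latency
    return f"{prompt} ..."

async def main() -> None:
    prompts = ["In this example,", "The capital of France is"]
    # Submit all prompts concurrently; results are gathered as each finishes.
    results = await asyncio.gather(*(generate_async(p) for p in prompts))
    for prompt, completion in zip(prompts, results):
        print(f"{prompt!r} -> {completion!r}")

if __name__ == "__main__":
    asyncio.run(main())
```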
## Using the server interface for TensorRT-LLM

### Start the server

```bash
python3 -m examples.server.server <path_to_tllm_engine_dir> <tokenizer_type> &
```
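Before sending requests, you may want to wait until the server is accepting connections. Below is a minimal, optional readiness probe; it assumes the server listens on `http://localhost:8000/generate` as in the examples that follow, and it simply retries a tiny request until one succeeds.

```python
import json
import time
import urllib.error
import urllib.request

def wait_for_server(url: str = "http://localhost:8000/generate",
                    timeout: float = 60.0) -> bool:
    """Poll the generate endpoint with a tiny request until the server responds."""
    payload = json.dumps({"prompt": "ping", "max_new_tokens": 1}).encode()
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            req = urllib.request.Request(
                url, data=payload, headers={"Content-Type": "application/json"})
            with urllib.request.urlopen(req, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, ConnectionError, TimeoutError):
            time.sleep(1)  # server not up yet; retry
    return False

if __name__ == "__main__":
    print("server ready:", wait_for_server())
```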
### Send requests

You can pass request arguments such as `max_new_tokens`, `top_p`, and `top_k` in your JSON dict:

```bash
curl http://localhost:8000/generate -d '{"prompt": "In this example,", "max_new_tokens": 8}'
```
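The same request can also be sent from Python. This is a minimal sketch using only the standard library; the endpoint and payload keys are taken from the curl example above, and the exact layout of the JSON response depends on the server.

```python
import json
import urllib.request

# Same request as the curl example above, issued from Python.
payload = {"prompt": "In this example,", "max_new_tokens": 8}
req = urllib.request.Request(
    "http://localhost:8000/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    # Print the raw response body; its exact JSON layout depends on the server.
    print(resp.read().decode())
```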
You can also use the streaming interface with:

```bash
curl http://localhost:8000/generate -d '{"prompt": "In this example,", "max_new_tokens": 8, "streaming": true}' --output -
```
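To consume the stream from Python, one option is to read the response body incrementally. This is a minimal sketch under the assumption that the server flushes chunks as tokens are generated; the `streaming` flag is taken from the curl example above.

```python
import json
import urllib.request

payload = {"prompt": "In this example,", "max_new_tokens": 8, "streaming": True}
req = urllib.request.Request(
    "http://localhost:8000/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    # Read the body in small chunks instead of waiting for the full response,
    # so partial generations can be displayed as they arrive.
    while True:
        chunk = resp.read(1024)
        if not chunk:
            break
        print(chunk.decode(errors="replace"), end="", flush=True)
```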