Asynchronous generation in Python

Install the requirements

pip install -r examples/server/requirements.txt

Directly from Python, with the HLAPI

Due to limitations in the HLAPI implementation, only LLaMA models are currently supported:

python3 examples/server/async.py <path_to_hf_llama_dir>
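
For readers new to asynchronous generation, the sketch below shows the asyncio pattern that async.py builds on: several prompts are submitted concurrently and each one consumes its tokens as they arrive. The generation function here is a stand-in for illustration, not the HLAPI call itself.

import asyncio

async def fake_generate_async(prompt, max_new_tokens=8):
    # Stand-in for the real HLAPI generation call: pretend each token takes time.
    for i in range(max_new_tokens):
        await asyncio.sleep(0.1)
        yield f"<token {i} of {prompt!r}>"

async def run_one(prompt):
    # Consume tokens for a single prompt as they are produced.
    async for token in fake_generate_async(prompt):
        print(token)

async def main():
    # Both prompts make progress concurrently instead of one after the other.
    await asyncio.gather(
        run_one("In this example,"),
        run_one("The weather today is"),
    )

if __name__ == "__main__":
    asyncio.run(main())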

Using the server interface for TensorRT-LLM

Start the server

python3 -m examples.server.server <path_to_tllm_engine_dir> <tokenizer_type> &
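
Because the server is launched in the background, you may want to wait until it accepts connections before sending requests. A minimal helper is sketched below; port 8000 is taken from the curl examples that follow and would need to change if the server listens elsewhere.

import socket
import time

def wait_for_server(host="localhost", port=8000, timeout=120.0):
    # Poll the server socket until it accepts a TCP connection or we time out.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return
        except OSError:
            time.sleep(1.0)
    raise TimeoutError(f"no server on {host}:{port} after {timeout}s")

if __name__ == "__main__":
    wait_for_server()
    print("server is accepting connections")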

Send requests

You can pass request arguments like "max_new_tokens", "top_p", "top_k" in your JSON dict:

curl http://localhost:8000/generate -d '{"prompt": "In this example,", "max_new_tokens": 8}'
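
The same request can be sent from Python, for example with the requests library; the endpoint and JSON fields are the ones shown in the curl command above.

import requests

# Mirrors the curl example: POST a JSON body to the /generate endpoint.
response = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "In this example,", "max_new_tokens": 8},
)
response.raise_for_status()
print(response.text)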

You can also use the streaming interface with:

curl http://localhost:8000/generate -d '{"prompt": "In this example,", "max_new_tokens": 8, "streaming": true}' --output -
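
A streaming client in Python can read the response incrementally; the sketch below only assumes that the server writes chunks of text as they are generated.

import requests

# Mirrors the streaming curl example: read the response as it is produced.
with requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "In this example,", "max_new_tokens": 8, "streaming": True},
    stream=True,
) as response:
    response.raise_for_status()
    for chunk in response.iter_content(chunk_size=None, decode_unicode=True):
        # Print each chunk as soon as it arrives instead of waiting for the end.
        print(chunk, end="", flush=True)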