# Asynchronous generation in Python
## Install the requirements
`pip install -r examples/server/requirements.txt`
## Directly from Python, with the HLAPI
Due to limitations in the HLAPI implementation, only LLaMA models are currently supported:
`python3 examples/server/async.py <path_to_hf_llama_dir>`
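For orientation, here is a minimal sketch of the asynchronous generation pattern the script demonstrates. The import path, the `LLM` class, and its `generate_async` method are assumptions for illustration only; refer to `examples/server/async.py` for the actual HLAPI usage:

```python
# Hypothetical sketch of HLAPI-style async generation. The class and
# method names below are illustrative assumptions, not the verified
# TensorRT-LLM API surface; see examples/server/async.py for the real code.
import asyncio

from tensorrt_llm.hlapi import LLM  # assumed import path


async def main() -> None:
    llm = LLM("<path_to_hf_llama_dir>")  # assumed constructor signature
    # Iterate over outputs as they are produced instead of blocking on
    # the full completion.
    async for output in llm.generate_async("In this example,", streaming=True):
        print(output)


asyncio.run(main())
```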
## Using the server interface for TensorRT-LLM
### Start the server
`python3 -m examples.server.server <path_to_tllm_engine_dir> <tokenizer_type> &`
### Send requests
You can pass request arguments such as "max_new_tokens", "top_p", or "top_k" in the JSON payload:
`curl http://localhost:8000/generate -d '{"prompt": "In this example,", "max_new_tokens": 8}'`
You can also use the streaming interface with:
`curl http://localhost:8000/generate -d '{"prompt": "In this example,", "max_new_tokens": 8, "streaming": true}' --output -`