# Asynchronous generation in Python
## Install the requirements
`pip install -r examples/server/requirements.txt`
## Directly from Python, with the HL API
Due to limitations in the HLAPI implementation, only LLaMA models are currently supported:
`python3 examples/server/async.py <path_to_hf_llama_dir>`
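For reference, here is a minimal sketch of the asynchronous flow the example implements. This is not a copy of `examples/server/async.py`; the `tensorrt_llm.hlapi` import path, `ModelConfig`, and `generate_async` names are assumptions about the HLAPI and may differ in your version:

```python
# A minimal sketch, NOT a verbatim copy of examples/server/async.py.
# The hlapi import path and method names below are assumptions.
import asyncio
import sys

from tensorrt_llm.hlapi.llm import LLM, ModelConfig  # assumed HLAPI location


async def main(hf_llama_dir: str) -> None:
    # Build the engine from a HuggingFace LLaMA checkpoint directory.
    llm = LLM(ModelConfig(model_dir=hf_llama_dir))

    # generate_async is assumed to yield partial results when streaming=True.
    async for output in llm.generate_async("In this example,", streaming=True):
        print(output)


if __name__ == "__main__":
    asyncio.run(main(sys.argv[1]))
```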
## Using the server interface for TensorRT-LLM
### Start the server
`python3 -m examples.server.server <path_to_tllm_engine_dir> <tokenizer_type> &`
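If you would rather launch the server from Python, here is a standard-library sketch. The angle-bracket placeholders are the same as in the shell command above, and the readiness poll against `localhost:8000` mirrors the curl examples below:

```python
# Sketch: start the server as a subprocess and wait until it answers.
# The two positional arguments are placeholders, exactly as in the
# shell command above.
import subprocess
import time
import urllib.error
import urllib.request

server = subprocess.Popen([
    "python3", "-m", "examples.server.server",
    "<path_to_tllm_engine_dir>", "<tokenizer_type>",
])

# Poll /generate until the server is reachable (assumed port 8000,
# as used by the curl examples below).
for _ in range(60):
    try:
        req = urllib.request.Request(
            "http://localhost:8000/generate",
            data=b'{"prompt": "ping", "max_new_tokens": 1}',
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req, timeout=2)
        break
    except (urllib.error.URLError, ConnectionError):
        time.sleep(1)
```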
### Send requests
You can pass request arguments such as `max_new_tokens`, `top_p`, and `top_k` in your JSON dict:
`curl http://localhost:8000/generate -d '{"prompt": "In this example,", "max_new_tokens": 8}'`
You can also use the streaming interface with:
`curl http://localhost:8000/generate -d '{"prompt": "In this example,", "max_new_tokens": 8, "streaming": true}' --output -`