Apps examples with GenerationExecutor / LLM API
OpenAI API
The trtllm-serve command launches an OpenAI-compatible server that supports the v1/version, v1/completions, and v1/chat/completions endpoints. openai_client.py is a simple example that uses the OpenAI client to query your model. To start the server, run
trtllm-serve <model>
Then you can query the APIs by running our example client or with curl.
v1/completions
Query by curl:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": <model_name>,
"prompt": "Where is New York?",
"max_tokens": 16,
"temperature": 0
}'
Query by our example:
python3 ./openai_client.py --prompt "Where is New York?" --api completions
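For reference, the same completions request made with the OpenAI Python SDK looks roughly like the sketch below. The base URL and the dummy API key are assumptions matching the local server started above; replace <model_name> with your served model.
# Minimal sketch: query v1/completions with the OpenAI Python SDK
from openai import OpenAI

# The server started above listens on localhost:8000; the API key is unused
# by the server but the client requires a non-empty value.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="tensorrt_llm")

completion = client.completions.create(
    model="<model_name>",          # same placeholder as in the curl example
    prompt="Where is New York?",
    max_tokens=16,
    temperature=0,
)
print(completion.choices[0].text)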
v1/chat/completions
Query by curl:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": <model_name>,
"messages":[{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Where is New York?"}],
"max_tokens": 16,
"temperature": 0
}'
Query by our example:
python3 ./openai_client.py --prompt "Where is New York?" --api chat
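The equivalent chat request with the OpenAI Python SDK is sketched below; again the base URL, dummy API key, and model placeholder are assumptions matching the curl example.
# Minimal sketch: query v1/chat/completions with the OpenAI Python SDK
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="tensorrt_llm")

response = client.chat.completions.create(
    model="<model_name>",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Where is New York?"},
    ],
    max_tokens=16,
    temperature=0,
)
print(response.choices[0].message.content)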
Python chat
chat.py provides a small example to play around with your model. Before running, install the additional requirements with pip install -r ./requirements.txt. Then you can run it with
python3 ./chat.py --model <model_dir> --tokenizer <tokenizer_path> --tp_size <tp_size>
Please run python3 ./chat.py --help for more information on the arguments.
Note that model_dir accepts the following formats:
- A path to a built TRT-LLM engine
- A path to a local HuggingFace model
- The name of a HuggingFace model such as "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
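Under the hood, chat.py is built on the LLM API. The snippet below is a rough, heavily simplified sketch of that API (not the script itself), using a HuggingFace model name purely as an illustration; any of the formats listed above should work for the model argument.
# Rough sketch of the LLM API that chat.py builds on (not the script itself)
from tensorrt_llm import LLM, SamplingParams

# Any of the model formats listed above should work here
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
params = SamplingParams(max_tokens=16, temperature=0.0)

for output in llm.generate(["Where is New York?"], params):
    print(output.outputs[0].text)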
FastAPI server
NOTE: This FastAPI-based server is only an example demonstrating usage of the TensorRT-LLM LLM API. It is not intended for production use. For production, use the trtllm-serve command, which exposes OpenAI-compatible API endpoints.
Install the additional requirements
pip install -r ./requirements.txt
Start the server
Start the server with:
python3 ./fastapi_server.py <model_dir> &
Note that model_dir accepts the same formats as in the chat example. If you are using an engine built with trtllm-build, remember to pass the tokenizer path:
python3 ./fastapi_server.py <model_dir> --tokenizer <tokenizer_dir> &
To get more information on all the arguments, please run python3 ./fastapi_server.py --help.
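To give a feel for what the example server does, here is a heavily condensed sketch of a /generate endpoint built on the LLM API. The real fastapi_server.py additionally handles streaming, tokenizer loading, and command-line arguments, so treat this only as an illustration.
# Condensed sketch of a /generate endpoint on top of the LLM API
# (illustration only; see fastapi_server.py for the actual example)
from fastapi import FastAPI, Request
from tensorrt_llm import LLM, SamplingParams

app = FastAPI()
llm = LLM(model="<model_dir>")  # placeholder; same formats as described above

@app.post("/generate")
async def generate(request: Request) -> dict:
    body = await request.json()
    prompt = body.pop("prompt", "")
    params = SamplingParams(**body)  # e.g. max_tokens, top_p, top_k
    output = llm.generate([prompt], params)[0]
    return {"text": output.outputs[0].text}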
Send requests
You can pass request arguments such as "max_tokens", "top_p", and "top_k" in the JSON body:
curl http://localhost:8000/generate -d '{"prompt": "In this example,", "max_tokens": 8}'
You can also use the streaming interface with:
curl http://localhost:8000/generate -d '{"prompt": "In this example,", "max_tokens": 8, "streaming": true}'
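If you prefer Python over curl, the same two requests can be made with the requests package (an assumption; any HTTP client works), as sketched below.
# Sketch: query the /generate endpoint from Python with the `requests` package
import requests

url = "http://localhost:8000/generate"

# Regular request
resp = requests.post(url, json={"prompt": "In this example,", "max_tokens": 8})
print(resp.json())

# Streaming request: print chunks as they arrive
payload = {"prompt": "In this example,", "max_tokens": 8, "streaming": True}
with requests.post(url, json=payload, stream=True) as resp:
    for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="", flush=True)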