# Apps examples with GenerationExecutor / LLM API

## OpenAI API

The `trtllm-serve` command launches an OpenAI-compatible server that supports the `v1/version`, `v1/completions`, and `v1/chat/completions` endpoints. [openai_client.py](./openai_client.py) is a simple example that uses the OpenAI client to query your model. To start the server, run

```
trtllm-serve <model>
```

Then you can query the APIs with our example client or with `curl`.
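
Before sending queries, you can check that the server is up via the `v1/version` endpoint. A minimal sketch, assuming the `requests` package is available (the exact shape of the response payload may vary by release):

```
import requests

# Sanity check: v1/version reports the server's version information.
resp = requests.get("http://localhost:8000/v1/version")
resp.raise_for_status()
print(resp.json())
```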

### v1/completions

Query with `curl`:

```
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": <model_name>,
        "prompt": "Where is New York?",
        "max_tokens": 16,
        "temperature": 0
    }'
```

Query with our example client:

```
python3 ./openai_client.py --prompt "Where is New York?" --api completions
```
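
The same request can be made from Python with the standard OpenAI client. A minimal sketch, assuming the `openai` package (v1 or later) is installed; the API key is a placeholder, since the local server does not validate it:

```
from openai import OpenAI

# Point the client at the local trtllm-serve instance; the key is a dummy.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="tensorrt_llm")

response = client.completions.create(
    model="<model_name>",  # replace with your served model's name
    prompt="Where is New York?",
    max_tokens=16,
    temperature=0,
)
print(response.choices[0].text)
```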

### v1/chat/completions

Query with `curl`:

```
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": <model_name>,
        "messages":[{"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": "Where is New York?"}],
        "max_tokens": 16,
        "temperature": 0
    }'
```

Query with our example client:

```
python3 ./openai_client.py --prompt "Where is New York?" --api chat
```
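
Likewise for the chat endpoint, a minimal Python sketch under the same assumptions as above:

```
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="tensorrt_llm")

response = client.chat.completions.create(
    model="<model_name>",  # replace with your served model's name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Where is New York?"},
    ],
    max_tokens=16,
    temperature=0,
)
print(response.choices[0].message.content)
```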

## Python chat

[chat.py](./chat.py) provides a small example to play around with your model. Before running it, install the additional requirements with `pip install -r ./requirements.txt`. Then you can run it with

```
python3 ./chat.py --model <model_dir> --tokenizer <tokenizer_path> --tp_size <tp_size>
```

Please run `python3 ./chat.py --help` for more information on the arguments.

Note that `model_dir` accepts the following formats:

1. A path to a built TRT-LLM engine
2. A path to a local HuggingFace model
3. The name of a HuggingFace model, such as "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
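
These examples are built on the LLM API, which accepts the same model formats. As a rough sketch of a direct call (assuming the `tensorrt_llm.LLM` API from this release; sampling parameter names may differ slightly between versions):

```
from tensorrt_llm import LLM, SamplingParams

# Any of the three formats above works here: engine dir, local HF dir, or HF model name.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

sampling_params = SamplingParams(max_tokens=16, temperature=0.0)
for output in llm.generate(["Where is New York?"], sampling_params):
    print(output.outputs[0].text)
```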

## FastAPI server

NOTE: This FastAPI-based server is only an example demonstrating the usage of the TensorRT-LLM LLM API; it is not intended for production use. For production, use the `trtllm-serve` command, which exposes OpenAI-compatible API endpoints.

### Install the additional requirements

```
pip install -r ./requirements.txt
```

### Start the server

Start the server with:

```
python3 ./fastapi_server.py <model_dir> &
```

Note that `model_dir` accepts the same formats as in the chat example. If you are using an engine built with `trtllm-build`, remember to pass the tokenizer path:

```
python3 ./fastapi_server.py <model_dir> --tokenizer <tokenizer_dir> &
```

For more information on all the arguments, run `python3 ./fastapi_server.py --help`.
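
For orientation, a FastAPI wrapper around the LLM API can be condensed to something like the following hypothetical sketch; the actual [fastapi_server.py](./fastapi_server.py) handles more arguments and the streaming path:

```
from fastapi import FastAPI
from fastapi.responses import JSONResponse
from tensorrt_llm import LLM, SamplingParams

# Hypothetical condensed sketch, not the actual fastapi_server.py.
app = FastAPI()
llm = LLM(model="<model_dir>")  # engine dir, local HF dir, or HF model name

@app.post("/generate")
def generate(request: dict) -> JSONResponse:
    prompt = request.pop("prompt", "")
    sampling_params = SamplingParams(**request)  # e.g. max_tokens, top_p, top_k
    output = llm.generate([prompt], sampling_params)[0]
    return JSONResponse({"text": output.outputs[0].text})
```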

### Send requests

You can pass request arguments such as "max_tokens", "top_p", and "top_k" in your JSON dict:

```
curl http://localhost:8000/generate -d '{"prompt": "In this example,", "max_tokens": 8}'
```
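
The same request from Python, as a minimal sketch assuming the `requests` package:

```
import requests

# Mirrors the curl call above: a non-streaming POST to /generate.
resp = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "In this example,", "max_tokens": 8},
)
print(resp.json())
```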

You can also use the streaming interface with:

```
curl http://localhost:8000/generate -d '{"prompt": "In this example,", "max_tokens": 8, "streaming": true}'
```
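
And a streaming variant from Python, under the same assumption; how the chunks are framed is up to the example server, so this sketch just prints them as they arrive:

```
import requests

# Streaming POST: consume the response incrementally instead of waiting for the full body.
with requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "In this example,", "max_tokens": 8, "streaming": True},
    stream=True,
) as resp:
    for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="", flush=True)
```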