Apps examples with GenerationExecutor / LLM API

OpenAI API

The trtllm-serve command launches an OpenAI-compatible server that supports the v1/version, v1/completions, and v1/chat/completions endpoints. openai_client.py is a simple example that uses the OpenAI client to query your model. To start the server, run

trtllm-serve <model>

Then you can query the APIs with our example client or with curl.

v1/completions

Query by curl:

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": <model_name>,
        "prompt": "Where is New York?",
        "max_tokens": 16,
        "temperature": 0
    }'

Query by our example:

python3 ./openai_client.py --prompt "Where is New York?" --api completions
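
If you want to call the endpoint from Python yourself, a minimal sketch using the openai package, assuming the server from above is listening on localhost:8000 (the API key is a placeholder, since the local server does not check it), could look like this:

# Minimal sketch, assuming the openai Python package is installed and
# trtllm-serve is running on localhost:8000; the API key is a dummy value.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")
response = client.completions.create(
    model="<model_name>",  # replace with the name of the served model
    prompt="Where is New York?",
    max_tokens=16,
    temperature=0,
)
print(response.choices[0].text)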

v1/chat/completions

Query by curl:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": <model_name>,
        "messages":[{"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": "Where is New York?"}],
        "max_tokens": 16,
        "temperature": 0
    }'

Query by our example:

python3 ./openai_client.py --prompt "Where is New York?" --api chat
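
The chat endpoint can be called from Python in the same way; a hedged sketch with the openai package, under the same assumptions as the completions sketch above:

# Minimal sketch, assuming the openai package and a local trtllm-serve instance.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")
response = client.chat.completions.create(
    model="<model_name>",  # replace with the name of the served model
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Where is New York?"},
    ],
    max_tokens=16,
    temperature=0,
)
print(response.choices[0].message.content)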

Python chat

chat.py provides a small example to play around with your model. Before running it, install the additional requirements with pip install -r ./requirements.txt. Then you can run it with

python3 ./chat.py --model <model_dir> --tokenizer <tokenizer_path> --tp_size <tp_size>

Please run python3 ./chat.py --help for more information on the arguments.

Note that model_dir accepts the following formats:

  1. A path to a built TRT-LLM engine
  2. A path to a local HuggingFace model
  3. The name of a HuggingFace model such as "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
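
chat.py is built on the LLM API; a rough sketch of that API (not chat.py itself), assuming a current TensorRT-LLM installation and using the TinyLlama checkpoint above purely as an example, looks like this:

# Rough sketch of the LLM API that chat.py builds on; argument names may
# differ slightly between TensorRT-LLM versions.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # engine dir, local HF dir, or HF name
params = SamplingParams(max_tokens=16, temperature=0.0)
for output in llm.generate(["Where is New York?"], params):
    print(output.outputs[0].text)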

FastAPI server

NOTE: This FastAPI-based server is only an example demonstrating usage of the TensorRT-LLM LLM API. It is not intended for production use. For production, use the trtllm-serve command, which exposes OpenAI-compatible API endpoints.

Install the additional requirements

pip install -r ./requirements.txt

Start the server

Start the server with:

python3 ./fastapi_server.py <model_dir> &

Note that model_dir accepts the same formats as in the chat example. If you are using an engine built with trtllm-build, remember to pass the tokenizer path:

python3 ./fastapi_server.py <model_dir> --tokenizer <tokenizer_dir> &

To get more information on all the arguments, please run python3 ./fastapi_server.py --help.
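
For orientation, the core of such a server fits in a few lines; a simplified sketch (not the actual fastapi_server.py, with names chosen here for illustration) might look like:

# Simplified sketch of an LLM-API-backed FastAPI server; fastapi_server.py in
# this directory is more complete (streaming, tokenizer handling, shutdown).
from fastapi import FastAPI
from tensorrt_llm import LLM, SamplingParams

app = FastAPI()
llm = LLM(model="<model_dir>")  # replace with your engine or HF model path


@app.post("/generate")
def generate(request: dict) -> dict:
    params = SamplingParams(max_tokens=request.get("max_tokens", 32))
    output = llm.generate([request["prompt"]], params)[0]
    return {"text": output.outputs[0].text}

You could serve such a sketch with uvicorn or any other ASGI server.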

Send requests

You can pass request arguments such as "max_tokens", "top_p", and "top_k" in your JSON dict:

curl http://localhost:8000/generate -d '{"prompt": "In this example,", "max_tokens": 8}'

You can also use the streaming interface with:

curl http://localhost:8000/generate -d '{"prompt": "In this example,", "max_tokens": 8, "streaming": true}'