TensorRT-LLMs/examples/apps
Yan Chunwei 0c26059703
chore: Cleanup deprecated APIs from LLM-API (part 1/2) (#3732)
* beam_width and max_new_token

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* remove beam_width

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* remove min_length

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* remove return_num_sequences

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

---------

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-05-07 13:20:25 +08:00
chat.py chore: Cleanup deprecated APIs from LLM-API (part 1/2) (#3732) 2025-05-07 13:20:25 +08:00
fastapi_server.py Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
README.md doc: refactor trtllm-serve examples and doc (#3187) 2025-04-04 11:40:43 +08:00
requirements.txt Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00

App examples with GenerationExecutor / LLM API

Python chat

chat.py provides a small example for experimenting with your model. Before running it, install the additional requirements with pip install -r ./requirements.txt. Then run it with:

python3 ./chat.py --model <model_dir> --tokenizer <tokenizer_path> --tp_size <tp_size>

Please run python3 ./chat.py --help for more information on the arguments.

Note that model_dir accepts any of the following formats:

  1. A path to a built TRT-LLM engine
  2. A path to a local HuggingFace model
  3. The name of a HuggingFace model such as "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

FastAPI server

NOTE: This FastAPI-based server is only an example that demonstrates the usage of the TensorRT-LLM LLM API. It is not intended for production use. For production, use the trtllm-serve command, which exposes OpenAI-compatible API endpoints.

Install the additional requirements

pip install -r ./requirements.txt

Start the server

Start the server with:

python3 ./fastapi_server.py <model_dir>&

Note that model_dir accepts the same formats as in the chat example. If you are using an engine built with trtllm-build, remember to pass the tokenizer path:

python3 ./fastapi_server.py <model_dir> --tokenizer <tokenizer_dir>&

To get more information on all the arguments, please run python3 ./fastapi_server.py --help.
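Because the commands above start the server in the background, a client script may need to wait until it is reachable. The following is a minimal sketch that polls the TCP port; the helper name wait_for_port is hypothetical, port 8000 is taken from the request examples in this README, and treating an open port as "ready" is an assumption.

```python
import socket
import time


def wait_for_port(host: str = "localhost", port: int = 8000,
                  timeout: float = 60.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds.

    Returns False if no connection could be made within `timeout` seconds.
    Hypothetical helper; an open port is only a proxy for server readiness.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A script could call wait_for_port() right after launching fastapi_server.py and only send requests once it returns True.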

Send requests

You can pass request arguments such as "max_tokens", "top_p", and "top_k" in the JSON body of the request:

curl http://localhost:8000/generate -d '{"prompt": "In this example,", "max_tokens": 8}'
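The same request can be sent from Python with only the standard library. This is a sketch rather than an official client: the helper names are hypothetical, the URL and field names are copied from the curl example, and the assumption that the non-streaming response body is JSON should be checked against fastapi_server.py.

```python
import json
import urllib.request


def build_generate_request(prompt: str,
                           base_url: str = "http://localhost:8000",
                           **sampling) -> urllib.request.Request:
    """Build a POST request for the /generate endpoint.

    Extra keyword arguments (e.g. max_tokens, top_p, top_k) are placed
    in the JSON body next to the prompt.
    """
    payload = {"prompt": prompt, **sampling}
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, **sampling):
    """Send the request and decode the reply.

    Assumption: the server answers non-streaming requests with a JSON body.
    """
    request = build_generate_request(prompt, **sampling)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, generate("In this example,", max_tokens=8) mirrors the curl command above.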

You can also use the streaming interface with:

curl http://localhost:8000/generate -d '{"prompt": "In this example,", "max_tokens": 8, "streaming": true}'
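A streaming client has to decode the response as it arrives. The sketch below assumes the streamed body is plain UTF-8 text delivered in arbitrary byte chunks; that format, and the helper names iter_text and stream_generate, are assumptions for illustration and not something this README specifies. An incremental decoder is used so that a multi-byte character split across two chunks is not corrupted.

```python
import codecs
import json
import urllib.request


def iter_text(byte_chunks, encoding: str = "utf-8"):
    """Yield decoded text pieces from an iterable of byte chunks.

    Bytes of an incomplete multi-byte character are buffered until the
    rest of the character arrives in a later chunk.
    """
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    for chunk in byte_chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush whatever is still buffered at end of stream.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


def stream_generate(prompt: str, url: str = "http://localhost:8000/generate",
                    **sampling):
    """Yield text pieces from a streaming /generate call.

    Assumption: the server streams raw UTF-8 text in the response body.
    """
    body = json.dumps({"prompt": prompt, "streaming": True, **sampling})
    request = urllib.request.Request(
        url, data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"}, method="POST")
    with urllib.request.urlopen(request) as response:
        # read1() returns as soon as some data is available; b"" means EOF.
        yield from iter_text(iter(lambda: response.read1(4096), b""))
```

With the server running, each piece yielded by stream_generate("In this example,", max_tokens=8) can be printed as soon as it arrives.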