
OpenAI API Compatibility Examples

This directory contains individual, self-contained examples demonstrating TensorRT-LLM's OpenAI API compatibility. Examples are organized by API endpoint.

Prerequisites

  1. Start the trtllm-serve server:

    trtllm-serve meta-llama/Llama-3.1-8B-Instruct
    

    For reasoning models or models with tool-calling ability, specify --reasoning_parser and --tool_parser, e.g.:

    trtllm-serve Qwen/Qwen3-8B --reasoning_parser "qwen3" --tool_parser "qwen3"
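
To confirm the server is up before running anything, you can list the served models. A minimal sketch, assuming the default local endpoint and that trtllm-serve exposes the standard OpenAI model-listing route:

from openai import OpenAI

# Default local endpoint; the API key is not validated by default (any string works).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="tensorrt_llm")

# Print the IDs of the models the server reports, confirming it is reachable.
for model in client.models.list():
    print(model.id)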
    

Running Examples

Each example is a standalone Python script. Run from the example's directory:

# From chat_completions directory
cd chat_completions
python example_01_basic_chat.py

Or run with full path from the repository root:

python examples/serve/compatibility/chat_completions/example_01_basic_chat.py

📋 Complete Example List

Chat Completions (/v1/chat/completions)

Example  File                                                    Description
01       chat_completions/example_01_basic_chat.py               Basic non-streaming chat completion
02       chat_completions/example_02_streaming_chat.py           Streaming responses with real-time delivery
03       chat_completions/example_03_multi_turn_conversation.py  Multi-turn conversation with context
04       chat_completions/example_04_streaming_with_usage.py     Streaming with continuous token usage stats
05       chat_completions/example_05_json_mode.py                Structured output with JSON schema
06       chat_completions/example_06_tool_calling.py             Function/tool calling
07       chat_completions/example_07_advanced_sampling.py        TensorRT-LLM extended sampling parameters
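
For orientation, here is a minimal sketch in the spirit of example_01_basic_chat.py, assuming the server from the prerequisites is running locally (the model name must match whatever is being served):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="tensorrt_llm")

# Basic non-streaming chat completion.
completion = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the served model
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(completion.choices[0].message.content)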

Responses (/v1/responses)

Example  File                                             Description
01       responses/example_01_basic_chat.py               Basic non-streaming response
02       responses/example_02_streaming_chat.py           Streaming with event handling
03       responses/example_03_multi_turn_conversation.py  Multi-turn using previous_response_id
04       responses/example_04_json_mode.py                Structured output with JSON schema
05       responses/example_05_tool_calling.py             Function/tool calling
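
A similar sketch for the Responses endpoint, assuming an openai Python SDK version recent enough to include the Responses API and the same local server:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="tensorrt_llm")

# Basic non-streaming response (see responses/example_01_basic_chat.py).
response = client.responses.create(
    model="Qwen/Qwen3-8B",  # must match the served model
    input="Say hello in one sentence.",
)
print(response.output_text)

# Multi-turn follow-up by chaining previous_response_id (as in example_03).
follow_up = client.responses.create(
    model="Qwen/Qwen3-8B",
    input="Now say it in French.",
    previous_response_id=response.id,
)
print(follow_up.output_text)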

Configuration

All examples use these default settings:

base_url = "http://localhost:8000/v1"
api_key = "tensorrt_llm"  # Can be any string

To use a different server:

from openai import OpenAI

client = OpenAI(
    base_url="http://YOUR_SERVER:PORT/v1",
    api_key="your_key",
)

Model Requirements

Some examples require specific model capabilities:

Feature       Model Requirement
JSON Mode     xgrammar support
Tool Calling  Tool-capable model (Qwen3, GPT-OSS, Kimi K2)
Others        Any model
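
As a concrete illustration of the JSON Mode row, the sketch below constrains output with a JSON schema via the OpenAI response_format field; the schema and model name are made-up placeholders, and the JSON-mode example scripts above are the authoritative versions:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="tensorrt_llm")

# Constrain the output to a JSON schema; relies on xgrammar-backed guided decoding.
completion = client.chat.completions.create(
    model="Qwen/Qwen3-8B",  # placeholder; use any served JSON-mode-capable model
    messages=[{"role": "user", "content": "Give me a city and its country."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "city",
            "schema": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "country": {"type": "string"},
                },
                "required": ["city", "country"],
            },
        },
    },
)
print(completion.choices[0].message.content)  # JSON conforming to the schema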