mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-02-06 03:01:50 +08:00

History

JunyiXu-nv beffbd6002 [TRTLLM-9242][doc] Add examples showcasing openai compatible APIs (#9520 ) Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com>		2025-12-03 11:47:02 +08:00
..
example_01_basic_chat.py	[TRTLLM-9242][doc] Add examples showcasing openai compatible APIs (#9520 )	2025-12-03 11:47:02 +08:00
example_02_streaming_chat.py	[TRTLLM-9242][doc] Add examples showcasing openai compatible APIs (#9520 )	2025-12-03 11:47:02 +08:00
example_03_multi_turn_conversation.py	[TRTLLM-9242][doc] Add examples showcasing openai compatible APIs (#9520 )	2025-12-03 11:47:02 +08:00
example_04_streaming_with_usage.py	[TRTLLM-9242][doc] Add examples showcasing openai compatible APIs (#9520 )	2025-12-03 11:47:02 +08:00
example_05_json_mode.py	[TRTLLM-9242][doc] Add examples showcasing openai compatible APIs (#9520 )	2025-12-03 11:47:02 +08:00
example_06_tool_calling.py	[TRTLLM-9242][doc] Add examples showcasing openai compatible APIs (#9520 )	2025-12-03 11:47:02 +08:00
example_07_advanced_sampling.py	[TRTLLM-9242][doc] Add examples showcasing openai compatible APIs (#9520 )	2025-12-03 11:47:02 +08:00
README.md	[TRTLLM-9242][doc] Add examples showcasing openai compatible APIs (#9520 )	2025-12-03 11:47:02 +08:00

README.md

Chat Completions API Examples

Examples for the /v1/chat/completions endpoint - the most versatile API for conversational AI.

Quick Start

# Run the basic example
python example_01_basic_chat.py

Examples Overview

Basic Examples

example_01_basic_chat.py - Start here!
- Simple request/response
- Shows token usage
- Non-streaming mode
example_02_streaming_chat.py - Real-time responses
- Stream tokens as generated
- Better UX for long responses
- Server-Sent Events (SSE)
example_03_multi_turn_conversation.py - Context management
- Multiple conversation turns
- Conversation history
- Follow-up questions
example_04_streaming_with_usage.py - Streaming + metrics
- Continuous token counts
- stream_options parameter
- Monitor resource usage

Advanced Examples

example_05_json_mode.py - Structured output
- JSON schema validation
- Structured data extraction
- Requires xgrammar
example_06_tool_calling.py - Function calling
- External tool integration
- Function definitions
- Requires compatible model (Qwen3, gpt_oss)
example_07_advanced_sampling.py - Fine-tuned control
- top_k, repetition_penalty
- Custom stop sequences
- TensorRT-LLM extensions

Key Concepts

Non-Streaming vs Streaming

Non-Streaming (stream=False):

Wait for complete response
Single response object
Simple to use

Streaming (stream=True):

Tokens delivered as generated
Better perceived latency
Server-Sent Events (SSE)

Conversation Context

Messages accumulate in the messages array:

messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"},
    {"role": "user", "content": "How are you?"},  # Next turn
]

Tool Calling

Define functions the model can call:

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {...}
    }
}]

Model Requirements

Feature	Requirement
Basic chat	Any model
Streaming	Any model
Multi-turn	Any model
JSON mode	xgrammar support
Tool calling	Compatible model (Qwen3 and gpt_oss.)