TensorRT-LLMs/examples/serve/compatibility/responses/README.md
JunyiXu-nv af899d2fe7
[TRTLLM-9860][doc] Add docs and examples for Responses API (#9946)
Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com>
2025-12-14 21:46:13 -08:00

103 lines
2.5 KiB
Markdown

# Responses API Examples
Examples for the `/v1/responses` endpoint. All examples in this directory use the Responses API, demonstrating features such as streaming, tool/function calling, and multi-turn dialogue.
## Quick Start
```bash
# Run the basic example
python example_01_basic_chat.py
```
## Examples Overview
### Basic Examples
1. **`example_01_basic_chat.py`** - Start here!
- Simple request/response
- Non-streaming mode
- Uses `input` parameter for user message
2. **`example_02_streaming_chat.py`** - Real-time responses
- Stream tokens as generated
- Handles various event types (`response.created`, `response.output_text.delta`, etc.)
- Server-Sent Events (SSE)
3. **`example_03_multi_turn_conversation.py`** - Context management
- Multiple conversation turns
- Uses `previous_response_id` to maintain context
- Follow-up questions without resending history
### Advanced Examples
4. **`example_04_json_mode.py`** - Structured output
- JSON schema validation via `text.format`
- Structured data extraction
- Requires xgrammar support
5. **`example_05_tool_calling.py`** - Function calling
- External tool integration
- Function definitions with `tools` parameter
- Tool result handling with `function_call_output`
- Requires compatible model (Qwen3, GPT-OSS, Kimi K2)
## Key Concepts
### Non-Streaming vs Streaming
**Non-Streaming** (`stream=False`):
- Wait for complete response
- Single response object
- Simple to use
**Streaming** (`stream=True`):
- Tokens delivered as generated
- Better perceived latency
- Server-Sent Events (SSE)
### Multi-turn Context
Use `previous_response_id` to continue conversations:
```python
# First turn
response1 = client.responses.create(
model=model,
input="What is 15 multiplied by 23?",
)
# Second turn - references previous response
response2 = client.responses.create(
model=model,
input="Now divide that result by 5",
previous_response_id=response1.id,
)
```
### Tool Calling
Define functions the model can call:
```python
tools = [{
"name": "get_weather",
"type": "function",
"description": "Get the current weather in a location",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string"},
},
"required": ["location"],
}
}]
```
## Model Requirements
| Feature | Requirement |
|---------|-------------|
| Basic chat | Any model |
| Streaming | Any model |
| Multi-turn | Any model |
| JSON mode | xgrammar support |
| Tool calling | Compatible model (Qwen3, GPT-OSS, Kimi K2) |