TensorRT-LLMs/examples/serve/compatibility/responses
JunyiXu-nv af899d2fe7
[TRTLLM-9860][doc] Add docs and examples for Responses API (#9946)
Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com>
2025-12-14 21:46:13 -08:00

Responses API Examples

Examples for the /v1/responses endpoint (the Responses API), demonstrating features such as streaming, tool/function calling, and multi-turn dialogue.

Quick Start

# Run the basic example
python example_01_basic_chat.py

Examples Overview

Basic Examples

  1. example_01_basic_chat.py - Start here!

    • Simple request/response
    • Non-streaming mode
    • Uses input parameter for user message
  2. example_02_streaming_chat.py - Real-time responses

    • Stream tokens as generated
    • Handles various event types (response.created, response.output_text.delta, etc.)
    • Server-Sent Events (SSE)
  3. example_03_multi_turn_conversation.py - Context management

    • Multiple conversation turns
    • Uses previous_response_id to maintain context
    • Follow-up questions without resending history
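
To make the basic request/response shape concrete, here is a minimal standard-library sketch of a non-streaming call. It is not taken from example_01_basic_chat.py; the base URL http://localhost:8000, the helper names, and the assumption that the reply follows the OpenAI Responses layout (an output list of message items holding output_text parts) are illustrative and may need adjusting for your deployment.

```python
import json
import urllib.request

# Assumed address of a locally running server; change it to match your setup.
BASE_URL = "http://localhost:8000"


def build_responses_request(model, user_input, base_url=BASE_URL, stream=False):
    """Build (but do not send) a POST request for the /v1/responses endpoint."""
    payload = {"model": model, "input": user_input, "stream": stream}
    return urllib.request.Request(
        url=f"{base_url}/v1/responses",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_output_text(response_body):
    """Concatenate every output_text part found in a decoded response body."""
    parts = []
    for item in response_body.get("output", []):
        if item.get("type") != "message":
            continue  # skip non-message items such as reasoning or tool calls
        for content in item.get("content", []):
            if content.get("type") == "output_text":
                parts.append(content.get("text", ""))
    return "".join(parts)


# Sending the request needs a running server, so it is left as a comment:
#   req = build_responses_request("my-model", "Hello!")  # placeholder model name
#   with urllib.request.urlopen(req) as resp:
#       print(extract_output_text(json.load(resp)))
```

Keeping request construction separate from sending makes the payload easy to inspect before any network traffic happens.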

Advanced Examples

  1. example_04_json_mode.py - Structured output

    • JSON schema validation via text.format
    • Structured data extraction
    • Requires xgrammar support
  2. example_05_tool_calling.py - Function calling

    • External tool integration
    • Function definitions with tools parameter
    • Tool result handling with function_call_output
    • Requires a compatible model (Qwen3, GPT-OSS, Kimi K2)
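
As a rough illustration of structured output via text.format, the sketch below builds request arguments carrying a JSON schema and then checks the returned text on the client side. The schema, its field names, and both helper functions are hypothetical. The json_schema layout follows the OpenAI Responses convention and is assumed, not confirmed, to match what example_04_json_mode.py sends.

```python
import json

# Illustrative schema; the field names are invented for this sketch.
PERSON_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}


def build_json_mode_kwargs(model, user_input, schema, schema_name="person"):
    """Return keyword arguments for a structured-output request via text.format."""
    return {
        "model": model,
        "input": user_input,
        "text": {
            "format": {
                "type": "json_schema",
                "name": schema_name,
                "schema": schema,
            }
        },
    }


def parse_structured_output(output_text, schema):
    """Decode the model's JSON text and check that required keys are present."""
    data = json.loads(output_text)
    missing = [key for key in schema.get("required", []) if key not in data]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    return data
```

With an OpenAI-compatible client, the result of build_json_mode_kwargs could be passed as client.responses.create(**kwargs). The required-key check is a lightweight client-side guard and not a full JSON Schema validator.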

Key Concepts

Non-Streaming vs Streaming

Non-Streaming (stream=False):

  • Wait for complete response
  • Single response object
  • Simple to use

Streaming (stream=True):

  • Tokens delivered as generated
  • Better perceived latency
  • Server-Sent Events (SSE)
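
To show what consuming the stream involves, here is a small sketch that parses Server-Sent Events lines and joins the text carried by response.output_text.delta events. It assumes each event's data field is a JSON object with a delta string, as in the OpenAI Responses streaming format. The function names are made up, and the [DONE] sentinel check is a precaution for servers that emit one, not something this README states.

```python
import json


def _decode(data_lines):
    """Turn the data lines of one event into a dict, or None for a [DONE] marker."""
    payload = "\n".join(data_lines)
    return None if payload == "[DONE]" else json.loads(payload)


def iter_sse_events(lines):
    """Yield (event_name, data_dict) pairs from an iterable of SSE text lines."""
    event_name, data_lines = None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("event:"):
            event_name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
        elif line == "":
            # A blank line ends the current event.
            if data_lines:
                decoded = _decode(data_lines)
                if decoded is not None:
                    yield event_name, decoded
            event_name, data_lines = None, []
    if data_lines:
        # Flush a final event that was not followed by a blank line.
        decoded = _decode(data_lines)
        if decoded is not None:
            yield event_name, decoded


def collect_text(lines):
    """Join the deltas of all response.output_text.delta events."""
    chunks = []
    for event_name, data in iter_sse_events(lines):
        kind = event_name or data.get("type")
        if kind == "response.output_text.delta":
            chunks.append(data.get("delta", ""))
    return "".join(chunks)
```

In practice a client library usually performs this parsing and hands over typed event objects. The sketch only illustrates why the streamed text arrives as a sequence of small deltas.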

Multi-turn Context

Use previous_response_id to continue conversations:

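# Assumes client is an OpenAI-compatible client pointed at the server
# and model is the name of the served model.
# First turn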
response1 = client.responses.create(
    model=model,
    input="What is 15 multiplied by 23?",
)

# Second turn - references previous response
response2 = client.responses.create(
    model=model,
    input="Now divide that result by 5",
    previous_response_id=response1.id,
)
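
The two-turn pattern above generalizes to any number of turns. The helper below is a hypothetical sketch, not part of the example scripts: it takes any callable with the same keyword interface as client.responses.create and threads each response's id into the next request.

```python
def run_conversation(create, model, prompts):
    """Send prompts in order, linking each turn to the previous response id."""
    previous_id = None
    responses = []
    for prompt in prompts:
        kwargs = {"model": model, "input": prompt}
        if previous_id is not None:
            # Only follow-up turns carry a reference to the earlier response.
            kwargs["previous_response_id"] = previous_id
        response = create(**kwargs)
        responses.append(response)
        previous_id = response.id
    return responses
```

A call such as run_conversation(client.responses.create, model, ["What is 15 multiplied by 23?", "Now divide that result by 5"]) would reproduce the two turns shown above. Passing the callable in makes the helper easy to exercise with a stub.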

Tool Calling

Define the functions the model can call and pass them through the tools parameter:

tools = [{
    "name": "get_weather",
    "type": "function",
    "description": "Get the current weather in a location",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string"},
        },
        "required": ["location"],
    }
}]
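
Once the model requests a call, the client runs the function and returns the result as a function_call_output item. The sketch below shows one way that step might look. It assumes the response's output items are plain dicts with type, name, arguments (a JSON string) and call_id fields, as in the OpenAI Responses format. The get_weather body and the registry are placeholders, and the forecast value is fabricated sample data.

```python
import json


def get_weather(location):
    """Stand-in tool; a real implementation would query a weather service."""
    return {"location": location, "forecast": "sunny"}  # fabricated sample data


# Hypothetical mapping from tool names to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def build_tool_outputs(output_items, registry=TOOL_REGISTRY):
    """Run each requested function call and return function_call_output items."""
    results = []
    for item in output_items:
        if item.get("type") != "function_call":
            continue  # ignore messages, reasoning items, and so on
        func = registry[item["name"]]
        arguments = json.loads(item["arguments"])  # arguments arrive as a JSON string
        results.append({
            "type": "function_call_output",
            "call_id": item["call_id"],
            "output": json.dumps(func(**arguments)),
        })
    return results
```

The returned list would then be sent as the input of a follow-up request, together with previous_response_id, so the model can use the tool result. If your client returns objects instead of dicts, convert them to dicts first.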

Model Requirements

Feature       Requirement
Basic chat    Any model
Streaming     Any model
Multi-turn    Any model
JSON mode     xgrammar support
Tool calling  Compatible model (Qwen3, GPT-OSS, Kimi K2)