# K2 (Kimi-K2-Instruct)

## Overview
Kimi K2 is Moonshot AI's Mixture-of-Experts model with 32 billion activated parameters and 1 trillion total parameters. It achieves state-of-the-art performance in frontier knowledge, math, and coding among non-thinking models. Notably, K2 also excels in agentic capabilities, demonstrating outstanding performance across complex, multi-step tasks.
## Prerequisites for Tool Calling in Kimi-K2
The K2 model supports tool calling. The official guide can be found at tool_call_guidance.
As described in the official guide, a tool calling process in Kimi-K2 includes:
- Passing function descriptions to Kimi-K2.
- Kimi-K2 decides to make a function call and returns the necessary information for the function call to the user.
- The user performs the function call, collects the call results, and passes the function call results back to Kimi-K2.
- Kimi-K2 continues to generate content based on the function call results until the model believes it has obtained sufficient information to respond to the user.
Tools are the primary way to define callable functions for K2. Each tool requires:
- A unique name
- A clear description
- A JSON schema defining the expected parameters
A possible example of a tool description (you may refer to Using tools for more information) is as follows:
```python
# Collect the tool descriptions in tools
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather information. Call this tool when the user needs to get weather information",
        "parameters": {
            "type": "object",
            "required": ["location"],
            "properties": {
                "location": {
                    "type": "string",
                    "description": "location name",
                }
            }
        }
    }
}]
```
Kimi-K2 currently supports two main approaches for tool calling:

- Use `openai.OpenAI` to send messages to Kimi-K2 together with the tool descriptions. In this mode, the tool descriptions are passed as an argument to `client.chat.completions.create`, and the tool-call details can be read directly from the corresponding fields in the response (see the sketch after this list).
- Manually parse the tool-call requests from the outputs generated by Kimi-K2. The tool-call requests generated by Kimi-K2 are wrapped by `<|tool_calls_section_begin|>` and `<|tool_calls_section_end|>`, and each individual tool call is wrapped by `<|tool_call_begin|>` and `<|tool_call_end|>`. Within a tool call, the tool ID and the arguments are separated by `<|tool_call_argument_begin|>`. The tool ID has the format `functions.{func_name}:{idx}`, from which we can parse the function name.
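For reference, a minimal sketch of the first approach with the `openai` SDK might look like the following. The endpoint URL and API key are placeholder assumptions, and `tools` is the list defined above:

```python
from openai import OpenAI

# Placeholder endpoint and key; point base_url at a server that supports
# the OpenAI-style tool-call fields for Kimi-K2.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2-Instruct",
    messages=[{"role": "user", "content": "What's the weather like in shanghai today?"}],
    tools=tools,  # the tool descriptions defined above
)

# When the model decides to call a tool, the details appear in structured
# fields of the response instead of the plain text content.
for tool_call in response.choices[0].message.tool_calls or []:
    print(tool_call.function.name, tool_call.function.arguments)
```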
Note that TensorRT-LLM does not support the first approach for now. If you deploy K2 with TensorRT-LLM, you need to manually parse the tool-call requests from the outputs, for example along the lines of the sketch below.
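As an illustration of the second approach, here is a minimal, hypothetical parser built from the token format described above; the bundled `kimi_k2_tool_calling_example.py` may implement this differently:

```python
import re

# Matches one tool call: <|tool_call_begin|>ID<|tool_call_argument_begin|>ARGS<|tool_call_end|>
TOOL_CALL_PATTERN = re.compile(
    r"<\|tool_call_begin\|>(.*?)<\|tool_call_argument_begin\|>(.*?)<\|tool_call_end\|>",
    re.DOTALL,
)

def parse_tool_calls(output: str) -> list[dict]:
    """Parse the tool-call requests wrapped in Kimi-K2's special tokens."""
    tool_calls = []
    for tool_id, arguments in TOOL_CALL_PATTERN.findall(output):
        tool_id = tool_id.strip()
        # The tool ID has the format functions.{func_name}:{idx}.
        func_name = tool_id.split(".", 1)[1].rsplit(":", 1)[0]
        tool_calls.append({
            "id": tool_id,
            "type": "function",
            "function": {"name": func_name, "arguments": arguments.strip()},
        })
    return tool_calls
```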
The next section walks through an example that deploys the K2 model using TensorRT-LLM and then manually parses the tool-call requests from its outputs.
## Example: Manually Parsing Tool-Call Requests from Kimi-K2 Outputs
First, launch a server using `trtllm-serve`:

```bash
cat > ./extra_llm_api_options.yaml <<EOF
# define your extra parameters here
cuda_graph_config:
  batch_sizes:
  - 1
  - 4
enable_attention_dp: False
EOF

trtllm-serve \
    --model /path_to_model/Kimi-K2-Instruct/ \
    --backend pytorch \
    --tp_size 8 \
    --ep_size 8 \
    --extra_llm_api_options extra_llm_api_options.yaml
```
Run the script `kimi_k2_tool_calling_example.py`, which performs the following steps:
- The client provides tool definitions and a user prompt to the LLM server.
- Instead of answering the prompt directly, the LLM server responds with a selected tool and corresponding arguments based on the user prompt.
- The client calls the selected tool with the arguments and retrieves the results (see the sketch after this list).
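As a hypothetical illustration of the last step, the client-side dispatch might look like the sketch below; the stub `get_weather` implementation and the `run_tool_call` helper are assumptions for illustration, not the script's actual code:

```python
import json

# Stub tool for illustration; a real implementation would query a weather service.
def get_weather(location: str) -> str:
    return "Cloudy"

# Map tool names from parsed tool-call requests to local Python functions.
TOOL_TABLE = {"get_weather": get_weather}

def run_tool_call(tool_call: dict) -> str:
    """Execute one parsed tool-call request and return its result."""
    func = TOOL_TABLE[tool_call["function"]["name"]]
    arguments = json.loads(tool_call["function"]["arguments"])
    return func(**arguments)
```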
For example, you can query "What's the weather like in shanghai today?" with the following command:
```bash
python kimi_k2_tool_calling_example.py \
    --model "moonshotai/Kimi-K2-Instruct" \
    --prompt "What's the weather like in shanghai today?"
```
The output would look similar to:
```
[The original output from Kimi-K2]: <|tool_calls_section_begin|>
<|tool_call_begin|>functions.get_weather:0<|tool_call_argument_begin|>{"location": "shanghai"}<|tool_call_end|>
<|tool_calls_section_end|>user
[The tool-call requests parsed from the output]: [{'id': 'functions.get_weather:0', 'type': 'function', 'function': {'name': 'get_weather', 'arguments': '{"location": "shanghai"}'}}]
[Tool call result]: tool_name=get_weather, tool_result=Cloudy
```
The tool call works successfully:

- In `[The original output from Kimi-K2]`, the LLM selects the correct tool `get_weather` and provides the appropriate arguments.
- In `[The tool-call requests parsed from the output]`, the client parses the LLM response.
- In `[Tool call result]`, the client executes the tool function and gets the result.
Let's try another query, "What's the weather like in beijing today?", using a predefined system prompt to specify the output format as shown below.
```bash
python kimi_k2_tool_calling_example.py \
    --model "moonshotai/Kimi-K2-Instruct" \
    --prompt "What's the weather like in beijing today?" \
    --specify_output_format
```
The output would look like:
```
[The original output from Kimi-K2]: [get_weather(location='beijing')]user
[The tool-call requests parsed from the output]: [{'type': 'function', 'function': {'name': 'get_weather', 'arguments': {'location': 'beijing'}}}]
[Tool call result]: tool_name=get_weather, tool_result=Sunny
```
Once again, the tool call works successfully, and this time the original output from Kimi-K2 follows the format specified by the system prompt.
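For this bracketed, pythonic format, the parsing can be done with Python's `ast` module. The following is a minimal sketch; `parse_pythonic_tool_calls` is a hypothetical name, and it assumes any trailing special tokens (such as the `user` marker above) have already been stripped:

```python
import ast

def parse_pythonic_tool_calls(output: str) -> list[dict]:
    """Parse calls like "[get_weather(location='beijing')]" into tool-call dicts."""
    calls = []
    # The output is a Python list literal containing call expressions.
    expr = ast.parse(output.strip(), mode="eval").body
    for call in expr.elts:
        calls.append({
            "type": "function",
            "function": {
                "name": call.func.id,
                # Keyword arguments become the arguments dict, e.g. {'location': 'beijing'}.
                "arguments": {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords},
            },
        })
    return calls
```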
Note that, without guided decoding or other deterministic tool adapters, K2 sometimes deviates from the specified output format. Because TensorRT-LLM does not support K2 with guided decoding for now, you have to parse the tool calls carefully from the raw model output to ensure they meet the required format.