Mirror of https://github.com/NVIDIA/TensorRT-LLM.git, synced 2026-01-14 06:27:45 +08:00
[None][docs] Add README for Nemotron Nano v3 (#10017)
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Co-authored-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>

Parent: 6b5ebaae3e
Commit: 28b02b4f5a
@@ -84,6 +84,8 @@ In addition, the following models have been officially validated using the defau
- nvidia/Llama-3_3-Nemotron-Super-49B-v1
- nvidia/Mistral-NeMo-Minitron-8B-Base
- nvidia/Nemotron-Flash-3B-Instruct
- nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
- nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8
- perplexity-ai/r1-1776-distill-llama-70b

</details>
@@ -18,6 +18,7 @@ The following is a table of supported models for the PyTorch backend:
| `MixtralForCausalLM` | Mixtral | `mistralai/Mixtral-8x7B-v0.1` |
| `MllamaForConditionalGeneration` | Llama 3.2 | `meta-llama/Llama-3.2-11B-Vision` |
| `NemotronForCausalLM` | Nemotron-3, Nemotron-4, Minitron | `nvidia/Minitron-8B-Base` |
| `NemotronHForCausalLM` | Nemotron-3-Nano | `nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8` |
| `NemotronNASForCausalLM` | NemotronNAS | `nvidia/Llama-3_3-Nemotron-Super-49B-v1` |
| `Phi3ForCausalLM` | Phi-4 | `microsoft/Phi-4` |
| `Qwen2ForCausalLM` | QwQ, Qwen2 | `Qwen/Qwen2-7B-Instruct` |
@@ -83,6 +83,8 @@ In addition, the following models have been officially validated using the defau
- nvidia/Llama-3_1-Nemotron-Ultra-253B-v1-FP8
- nvidia/Llama-3_3-Nemotron-Super-49B-v1
- nvidia/Mistral-NeMo-Minitron-8B-Base
- nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
- nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8
- perplexity-ai/r1-1776-distill-llama-70b

</details>
examples/models/core/nemotron/README_nemotron_nano_v3.md (new file, 194 lines)

@@ -0,0 +1,194 @@
# Nemotron Nano V3 model

## Overview

The Nemotron Nano V3 model uses a hybrid Mamba-Transformer MoE architecture and supports a 1M
token context length. This enables developers to build reliable, high-throughput agents across
complex, multi-document, and long-duration applications.

This document outlines the procedures for running Nemotron Nano V3 with TensorRT LLM. The
implementation supports both single- and multi-GPU configurations via the AutoDeploy backend.
Additionally, ModelOpt was used to derive FP8 and NVFP4 checkpoints from the source checkpoint.
The model repositories are:
* [BF16 repository](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16)
* [FP8 repository](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8)
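
If you prefer to pass a local path to `trtllm-serve` or to the offline examples below, the checkpoints can be pre-downloaded with `huggingface_hub`. The snippet below is a minimal sketch; the target directory is an arbitrary example, not a requirement:

```python
# Sketch: pre-download a Nemotron Nano V3 checkpoint from Hugging Face.
# Assumes `huggingface_hub` is installed; local_dir is an arbitrary example path.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
    local_dir="./nemotron-nano-v3-fp8",
)
print(f"Checkpoint downloaded to: {local_path}")
```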

Nemotron Nano V3 supports the following features:
* BF16, FP8 (with FP8 KV cache), and NVFP4 model formats.
* Single- and multi-GPU inference.
* 1M-token context length with long context/generation sequences.

# Usage

## Online serving example

A reference configuration file is available at [nano_v3.yaml](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/auto_deploy/nano_v3.yaml).

For the server:

```sh
# Example configuration:
cat > nano_v3.yaml <<EOF
runtime: trtllm
compile_backend: torch-cudagraph
max_batch_size: 384
max_seq_len: 65536
enable_chunked_prefill: true
attn_backend: flashinfer
model_factory: AutoModelForCausalLM
skip_loading_weights: false
free_mem_ratio: 0.9
cuda_graph_batch_sizes: [1, 2, 4, 8, 16, 24, 32, 64, 128, 256, 320, 384]
kv_cache_config:
  # disable kv_cache reuse since not supported for hybrid/ssm models
  enable_block_reuse: false
transforms:
  detect_sharding:
    sharding_dims: ['ep', 'bmm']
    allreduce_strategy: 'SYMM_MEM'
    manual_config:
      head_dim: 128
      tp_plan:
        # mamba SSM layer
        "in_proj": "mamba"
        "out_proj": "rowwise"
        # attention layer
        "q_proj": "colwise"
        "k_proj": "colwise"
        "v_proj": "colwise"
        "o_proj": "rowwise"
        # NOTE: consider not sharding shared experts and/or
        # latent projections at all, keeping them replicated.
        # To do so, comment out the corresponding entries.
        # moe layer: SHARED experts
        "up_proj": "colwise"
        "down_proj": "rowwise"
        # MoLE: latent projections: simple shard
        "fc1_latent_proj": "gather"
        "fc2_latent_proj": "gather"
  multi_stream_moe:
    stage: compile
    enabled: true
  insert_cached_ssm_attention:
    cache_config:
      mamba_dtype: float32
  fuse_mamba_a_log:
    stage: post_load_fusion
    enabled: true
EOF

# Launch trtllm-serve.
TRTLLM_ENABLE_PDL=1 trtllm-serve <model_path> \
  --host 0.0.0.0 \
  --port 8000 \
  --backend _autodeploy \
  --trust_remote_code \
  --extra_llm_api_options nano_v3.yaml

# OR launch trtllm-serve with reasoning content parsing enabled.
TRTLLM_ENABLE_PDL=1 trtllm-serve <model_path> \
  --host 0.0.0.0 \
  --port 8000 \
  --backend _autodeploy \
  --trust_remote_code \
  --reasoning_parser nano-v3 \
  --extra_llm_api_options nano_v3.yaml

# OR launch trtllm-serve with both reasoning parsing and tool-calling enabled.
TRTLLM_ENABLE_PDL=1 trtllm-serve <model_path> \
  --host 0.0.0.0 \
  --port 8000 \
  --backend _autodeploy \
  --trust_remote_code \
  --reasoning_parser nano-v3 \
  --tool_parser qwen3_coder \
  --extra_llm_api_options nano_v3.yaml
```

For the client:

```sh
# Simple query example from client.
curl -X 'POST' 'http://0.0.0.0:8000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-Nano-3-30B-A3B-BF16",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Hello, my name is"
          }
        ]
      }
    ],
    "max_tokens": 128,
    "temperature": 0
  }' | jq

# Simple query example (with reasoning disabled)
curl -X 'POST' 'http://0.0.0.0:8000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-Nano-3-30B-A3B-BF16",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Hello, my name is"
          }
        ]
      }
    ],
    "max_tokens": 128,
    "temperature": 0,
    "chat_template_kwargs": {"enable_thinking": false}
  }' | jq
```
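
The same requests can also be issued programmatically with any OpenAI-compatible client. Below is a minimal sketch using the `openai` Python package; the base URL and model name simply mirror the curl examples above, and `extra_body` is used to forward `chat_template_kwargs`:

```python
# Sketch: query the trtllm-serve endpoint through the OpenAI-compatible Python client.
# Assumes the `openai` package is installed and the server above is running on port 8000.
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

# Equivalent of the first curl example above.
response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-Nano-3-30B-A3B-BF16",
    messages=[{"role": "user", "content": "Hello, my name is"}],
    max_tokens=128,
    temperature=0,
)
print(response.choices[0].message.content)

# Equivalent of the second curl example, with reasoning disabled.
response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-Nano-3-30B-A3B-BF16",
    messages=[{"role": "user", "content": "Hello, my name is"}],
    max_tokens=128,
    temperature=0,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.content)
```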

## Offline inference example

```sh
python examples/auto_deploy/build_and_run_ad.py --model <model_path> --args.compile_backend torch-cudagraph
```

**More verbose offline inference example**:

Use a YAML file:

```sh
cat > nano_v3_offline.yaml <<EOF
model:
  nvidia/NVIDIA-Nemotron-Nano-31B-A3-v3
args:
  compile_backend: torch-cudagraph
  enable_chunked_prefill: true
  kv_cache_config:
    # disable kv_cache reuse since not supported for hybrid/ssm models
    enable_block_reuse: false
EOF

python examples/auto_deploy/build_and_run_ad.py --yaml-extra nano_v3_offline.yaml
```

The CLI can also be used to override certain config values:

```sh
python examples/auto_deploy/build_and_run_ad.py \
  --model nvidia/NVIDIA-Nemotron-Nano-31B-A3-v3 \
  --args.compile_backend torch-cudagraph \
  --args.enable_chunked_prefill true \
  --args.kv_cache_config.enable_block_reuse false
```

# Notes

* More examples can be found in [trtllm_cookbook](https://github.com/NVIDIA-NeMo/Nemotron/blob/main/usage-cookbook/Nemotron-3-Nano/trtllm_cookbook.ipynb).
* Prefix caching (KV block reuse) is not supported for Nano v3 yet, so please set `enable_block_reuse: false` when launching a server.
* The Mamba cache dtype (`mamba_dtype`) should be set to `float32` to better support long sequences when launching a server.
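
For offline runs, the `enable_block_reuse` note can also be applied through the generic `tensorrt_llm` LLM API. The snippet below is only a sketch under that assumption; the AutoDeploy examples above configure the same setting via YAML/CLI, and the Mamba cache dtype is configured as shown in `nano_v3.yaml` (`mamba_dtype: float32`):

```python
# Sketch: disable KV cache block reuse (prefix caching) via the Python LLM API.
# Assumes the generic tensorrt_llm LLM API; the AutoDeploy-specific entry point may differ.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

llm = LLM(
    model="<model_path>",  # path or HF id of the Nemotron Nano V3 checkpoint
    kv_cache_config=KvCacheConfig(enable_block_reuse=False),  # required for hybrid/SSM models
)
outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(max_tokens=128, temperature=0),
)
print(outputs[0].outputs[0].text)
```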