[None][docs] Add README for Nemotron Nano v3 (#10017)

Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Co-authored-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
William Zhang 2025-12-15 22:17:24 -08:00 committed by GitHub
parent 6b5ebaae3e
commit 28b02b4f5a
4 changed files with 199 additions and 0 deletions


@@ -84,6 +84,8 @@ In addition, the following models have been officially validated using the defau
- nvidia/Llama-3_3-Nemotron-Super-49B-v1
- nvidia/Mistral-NeMo-Minitron-8B-Base
- nvidia/Nemotron-Flash-3B-Instruct
- nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
- nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8
- perplexity-ai/r1-1776-distill-llama-70b
</details>


@@ -18,6 +18,7 @@ The following is a table of supported models for the PyTorch backend:
| `MixtralForCausalLM` | Mixtral | `mistralai/Mixtral-8x7B-v0.1` |
| `MllamaForConditionalGeneration` | Llama 3.2 | `meta-llama/Llama-3.2-11B-Vision` |
| `NemotronForCausalLM` | Nemotron-3, Nemotron-4, Minitron | `nvidia/Minitron-8B-Base` |
| `NemotronHForCausalLM` | Nemotron-3-Nano | `nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8` |
| `NemotronNASForCausalLM` | NemotronNAS | `nvidia/Llama-3_3-Nemotron-Super-49B-v1` |
| `Phi3ForCausalLM` | Phi-4 | `microsoft/Phi-4` |
| `Qwen2ForCausalLM` | QwQ, Qwen2 | `Qwen/Qwen2-7B-Instruct` |


@@ -83,6 +83,8 @@ In addition, the following models have been officially validated using the defau
- nvidia/Llama-3_1-Nemotron-Ultra-253B-v1-FP8
- nvidia/Llama-3_3-Nemotron-Super-49B-v1
- nvidia/Mistral-NeMo-Minitron-8B-Base
- nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
- nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8
- perplexity-ai/r1-1776-distill-llama-70b
</details>


@@ -0,0 +1,194 @@
# Nemotron Nano V3 model

## Overview

The Nemotron Nano V3 model uses a hybrid Mamba-Transformer MoE architecture and supports a 1M-token
context length. This enables developers to build reliable, high-throughput agents across
complex, multi-document, and long-duration applications.

This document outlines the procedure for running Nemotron Nano V3 with TensorRT LLM. The
implementation supports both single- and multi-GPU configurations via the AutoDeploy backend.
Additionally, ModelOpt was used to derive the FP8 and NVFP4 checkpoints from the source checkpoint.

The model repositories are:

* [BF16 repository](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16)
* [FP8 repository](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8)
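
To fetch one of these checkpoints ahead of time, the sketch below uses `huggingface-cli` from the
`huggingface_hub` package (an assumption about local tooling; `trtllm-serve` and the offline example
can also be pointed directly at the Hugging Face model ID):

```sh
# Optional: pre-download the BF16 checkpoint to a local directory.
# Assumes `pip install -U huggingface_hub`; the target directory is arbitrary.
huggingface-cli download nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
  --local-dir ./nemotron-3-nano-30b-a3b-bf16
```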

Nemotron Nano V3 supports the following features:

* BF16, FP8 (with FP8 KV cache), and NVFP4 model formats.
* Single- and multi-GPU inference.
* 1M-token context with long context/generation sequences.

# Usage

## Online serving example

The configuration below follows [nano_v3.yaml](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/auto_deploy/nano_v3.yaml).

For the server:

```sh
# Example configuration:
cat > nano_v3.yaml <<EOF
runtime: trtllm
compile_backend: torch-cudagraph
max_batch_size: 384
max_seq_len: 65536
enable_chunked_prefill: true
attn_backend: flashinfer
model_factory: AutoModelForCausalLM
skip_loading_weights: false
free_mem_ratio: 0.9
cuda_graph_batch_sizes: [1, 2, 4, 8, 16, 24, 32, 64, 128, 256, 320, 384]
kv_cache_config:
  # disable kv_cache reuse since not supported for hybrid/ssm models
  enable_block_reuse: false
transforms:
  detect_sharding:
    sharding_dims: ['ep', 'bmm']
    allreduce_strategy: 'SYMM_MEM'
    manual_config:
      head_dim: 128
      tp_plan:
        # mamba SSM layer
        "in_proj": "mamba"
        "out_proj": "rowwise"
        # attention layer
        "q_proj": "colwise"
        "k_proj": "colwise"
        "v_proj": "colwise"
        "o_proj": "rowwise"
        # NOTE: consider not sharding shared experts and/or
        # latent projections at all, keeping them replicated.
        # To do so, comment out the corresponding entries.
        # moe layer: SHARED experts
        "up_proj": "colwise"
        "down_proj": "rowwise"
        # MoLE: latent projections: simple shard
        "fc1_latent_proj": "gather"
        "fc2_latent_proj": "gather"
  multi_stream_moe:
    stage: compile
    enabled: true
  insert_cached_ssm_attention:
    cache_config:
      mamba_dtype: float32
  fuse_mamba_a_log:
    stage: post_load_fusion
    enabled: true
EOF

# Launch trtllm-serve.
TRTLLM_ENABLE_PDL=1 trtllm-serve <model_path> \
  --host 0.0.0.0 \
  --port 8000 \
  --backend _autodeploy \
  --trust_remote_code \
  --extra_llm_api_options nano_v3.yaml

# OR launch trtllm-serve with reasoning-content parsing enabled.
TRTLLM_ENABLE_PDL=1 trtllm-serve <model_path> \
  --host 0.0.0.0 \
  --port 8000 \
  --backend _autodeploy \
  --trust_remote_code \
  --reasoning_parser nano-v3 \
  --extra_llm_api_options nano_v3.yaml

# OR launch trtllm-serve with tool-calling support.
TRTLLM_ENABLE_PDL=1 trtllm-serve <model_path> \
  --host 0.0.0.0 \
  --port 8000 \
  --backend _autodeploy \
  --trust_remote_code \
  --reasoning_parser nano-v3 \
  --tool_parser qwen3_coder \
  --extra_llm_api_options nano_v3.yaml
```
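
Once the server is up, a quick sanity check against the standard OpenAI-compatible model-listing
endpoint exposed by `trtllm-serve` (an optional step):

```sh
# List the served model(s); a successful response confirms the endpoint is reachable.
curl -s http://0.0.0.0:8000/v1/models | jq
```
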
For the client:
```sh
# Simple query example from client.
curl -X 'POST' 'http://0.0.0.0:8000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-Nano-3-30B-A3B-BF16",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Hello, my name is"
          }
        ]
      }
    ],
    "max_tokens": 128,
    "temperature": 0
  }' | jq

# Simple query example (with reasoning disabled)
curl -X 'POST' 'http://0.0.0.0:8000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-Nano-3-30B-A3B-BF16",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Hello, my name is"
          }
        ]
      }
    ],
    "max_tokens": 128,
    "temperature": 0,
    "chat_template_kwargs": {"enable_thinking": false}
  }' | jq
```
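
If the server was launched with the tool-calling variant above (`--tool_parser qwen3_coder`), tool
definitions can be passed in the standard OpenAI chat-completions format. The `get_weather` tool in
the sketch below is purely illustrative and not part of the model or server:

```sh
# Tool-calling example; the get_weather tool definition is hypothetical.
curl -X 'POST' 'http://0.0.0.0:8000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-Nano-3-30B-A3B-BF16",
    "messages": [
      {"role": "user", "content": "What is the weather in Santa Clara right now?"}
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get the current weather for a city",
          "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"]
          }
        }
      }
    ],
    "max_tokens": 256,
    "temperature": 0
  }' | jq
```
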
## Offline inference example

```sh
python examples/auto_deploy/build_and_run_ad.py --model <model_path> --args.compile_backend torch-cudagraph
```

**More verbose offline inference example**:

Use a YAML file:

```sh
cat > nano_v3_offline.yaml <<EOF
model: nvidia/NVIDIA-Nemotron-Nano-31B-A3-v3
args:
  compile_backend: torch-cudagraph
  enable_chunked_prefill: true
  kv_cache_config:
    # disable kv_cache reuse since not supported for hybrid/ssm models
    enable_block_reuse: false
EOF

python examples/auto_deploy/build_and_run_ad.py --yaml-extra nano_v3_offline.yaml
```

The CLI can also be used to override certain config values:

```sh
python examples/auto_deploy/build_and_run_ad.py \
  --model nvidia/NVIDIA-Nemotron-Nano-31B-A3-v3 \
  --args.compile_backend torch-cudagraph \
  --args.enable_chunked_prefill true \
  --args.kv_cache_config.enable_block_reuse false
```
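
The same flow applies to the FP8 checkpoint from the repository list in the overview; only the model
ID changes (a sketch, assuming the checkpoint is accessible locally or via the Hugging Face Hub):

```sh
# Offline run against the FP8 checkpoint listed in the overview.
python examples/auto_deploy/build_and_run_ad.py \
  --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
  --args.compile_backend torch-cudagraph \
  --args.enable_chunked_prefill true \
  --args.kv_cache_config.enable_block_reuse false
```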

# Notes

* More examples can be found in the [trtllm_cookbook](https://github.com/NVIDIA-NeMo/Nemotron/blob/main/usage-cookbook/Nemotron-3-Nano/trtllm_cookbook.ipynb).
* Prefix caching is not yet supported for Nano v3, so set `enable_block_reuse: false` when launching a server.
* The Mamba cache dtype should be set to `float32` to better support long sequences when launching a server (see the snippet below).
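
For reference, both settings correspond to the following keys in the extra-options YAML. They are
already present in the `nano_v3.yaml` example above and are restated here in isolation only to make
the relevant keys easy to find (the file name below is arbitrary):

```sh
# Minimal snippet restating the two notes (same keys as in nano_v3.yaml above).
cat > nano_v3_notes.yaml <<EOF
kv_cache_config:
  enable_block_reuse: false   # prefix-cache reuse is not yet supported for Nano v3
transforms:
  insert_cached_ssm_attention:
    cache_config:
      mamba_dtype: float32    # float32 mamba cache for long sequences
EOF
```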