diff --git a/docs/source/features/auto_deploy/support_matrix.md b/docs/source/features/auto_deploy/support_matrix.md
index fec6d841af..9c9d56bea6 100644
--- a/docs/source/features/auto_deploy/support_matrix.md
+++ b/docs/source/features/auto_deploy/support_matrix.md
@@ -84,6 +84,8 @@ In addition, the following models have been officially validated using the defau
 - nvidia/Llama-3_3-Nemotron-Super-49B-v1
 - nvidia/Mistral-NeMo-Minitron-8B-Base
 - nvidia/Nemotron-Flash-3B-Instruct
+- nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
+- nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8
 - perplexity-ai/r1-1776-distill-llama-70b
diff --git a/docs/source/models/supported-models.md b/docs/source/models/supported-models.md
index 40f3840073..d4ada87f58 100644
--- a/docs/source/models/supported-models.md
+++ b/docs/source/models/supported-models.md
@@ -18,6 +18,7 @@ The following is a table of supported models for the PyTorch backend:
 | `MixtralForCausalLM` | Mixtral | `mistralai/Mixtral-8x7B-v0.1` |
 | `MllamaForConditionalGeneration` | Llama 3.2 | `meta-llama/Llama-3.2-11B-Vision` |
 | `NemotronForCausalLM` | Nemotron-3, Nemotron-4, Minitron | `nvidia/Minitron-8B-Base` |
+| `NemotronHForCausalLM` | Nemotron-3-Nano | `nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8` |
 | `NemotronNASForCausalLM` | NemotronNAS | `nvidia/Llama-3_3-Nemotron-Super-49B-v1` |
 | `Phi3ForCausalLM` | Phi-4 | `microsoft/Phi-4` |
 | `Qwen2ForCausalLM` | QwQ, Qwen2 | `Qwen/Qwen2-7B-Instruct` |
diff --git a/docs/source/torch/auto_deploy/support_matrix.md b/docs/source/torch/auto_deploy/support_matrix.md
index f0158253dd..037585461d 100644
--- a/docs/source/torch/auto_deploy/support_matrix.md
+++ b/docs/source/torch/auto_deploy/support_matrix.md
@@ -83,6 +83,8 @@ In addition, the following models have been officially validated using the defau
 - nvidia/Llama-3_1-Nemotron-Ultra-253B-v1-FP8
 - nvidia/Llama-3_3-Nemotron-Super-49B-v1
 - nvidia/Mistral-NeMo-Minitron-8B-Base
+- nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
+- nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8
 - perplexity-ai/r1-1776-distill-llama-70b
diff --git a/examples/models/core/nemotron/README_nemotron_nano_v3.md b/examples/models/core/nemotron/README_nemotron_nano_v3.md
new file mode 100644
index 0000000000..dac512f47e
--- /dev/null
+++ b/examples/models/core/nemotron/README_nemotron_nano_v3.md
@@ -0,0 +1,194 @@

# Nemotron Nano V3 model

## Overview

The Nemotron Nano V3 model uses a hybrid Mamba-Transformer MoE architecture and supports a 1M token context length. This enables developers to build reliable, high-throughput agents for complex, multi-document, and long-duration applications.

This document outlines how to run Nemotron Nano V3 with TensorRT LLM. The implementation supports both single- and multi-GPU configurations via the AutoDeploy backend. Additionally, ModelOpt was used to derive the FP8 and NVFP4 checkpoints from the source checkpoint. The model repositories are:

* [BF16 repository](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16)
* [FP8 repository](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8)

Nemotron Nano V3 supports the following features:

* BF16, FP8 (with FP8 KV cache), and NVFP4 model formats.
* Single- and multi-GPU inference.
* 1M token context, with support for long context and generation sequences.

# Usage

## Online serving example

You can start from the example configuration file [nano_v3.yaml](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/auto_deploy/nano_v3.yaml).
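
If you prefer not to write the configuration by hand, you can download the referenced file directly. The command below is a convenience sketch: it assumes the file is available at the linked path on the `main` branch, and the URL is simply the raw-content form of the GitHub link above.

```sh
# Fetch the example AutoDeploy configuration referenced above.
# Raw-content URL derived from the nano_v3.yaml link (main branch assumed).
curl -L -o nano_v3.yaml \
  https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/main/examples/auto_deploy/nano_v3.yaml
```

The server example below writes the same file inline with a heredoc; either way, the resulting `nano_v3.yaml` is the file passed to `--extra_llm_api_options`.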

For the server:

```sh
# Example configuration: write nano_v3.yaml
# (its full contents are in the examples/auto_deploy/nano_v3.yaml file linked above).
cat > nano_v3.yaml <<EOF
# ... contents of examples/auto_deploy/nano_v3.yaml ...
EOF

# Launch trtllm-serve with the AutoDeploy backend.
TRTLLM_ENABLE_PDL=1 trtllm-serve <model> \
--host 0.0.0.0 \
--port 8000 \
--backend _autodeploy \
--trust_remote_code \
--extra_llm_api_options nano_v3.yaml

# OR launch trtllm-serve with reasoning-content parsing enabled.
TRTLLM_ENABLE_PDL=1 trtllm-serve <model> \
--host 0.0.0.0 \
--port 8000 \
--backend _autodeploy \
--trust_remote_code \
--reasoning_parser nano-v3 \
--extra_llm_api_options nano_v3.yaml

# OR launch trtllm-serve with tool-calling support as well.
TRTLLM_ENABLE_PDL=1 trtllm-serve <model> \
--host 0.0.0.0 \
--port 8000 \
--backend _autodeploy \
--trust_remote_code \
--reasoning_parser nano-v3 \
--tool_parser qwen3_coder \
--extra_llm_api_options nano_v3.yaml
```

For the client:

```sh
# Simple query example from the client.
curl -X 'POST' 'http://0.0.0.0:8000/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
    "model": "nvidia/NVIDIA-Nemotron-Nano-3-30B-A3B-BF16",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Hello, my name is"
                }
            ]
        }
    ],
    "max_tokens": 128,
    "temperature": 0
}' | jq

# Simple query example (with reasoning disabled).
curl -X 'POST' 'http://0.0.0.0:8000/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
    "model": "nvidia/NVIDIA-Nemotron-Nano-3-30B-A3B-BF16",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Hello, my name is"
                }
            ]
        }
    ],
    "max_tokens": 128,
    "temperature": 0,
    "chat_template_kwargs": {"enable_thinking": false}
}' | jq
```

## Offline inference example

```sh
python examples/auto_deploy/build_and_run_ad.py --model <model> --args.compile_backend torch-cudagraph
```

**More verbose offline inference example:**

Use a YAML configuration file:

```sh
cat > nano_v3_offline.yaml <<EOF