[TRTC-43] [feat] Add config db and docs (#9420)

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
Co-authored-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Authored by Venky on 2025-12-11 12:00:03 -08:00; committed by GitHub
parent 24f92721f2
commit fd1270b9ab
GPG Key ID: B5690EEEBB952194 (no known key found for this signature in database)
195 changed files with 6234 additions and 45 deletions

.gitignore (vendored): 2 changes

@ -55,6 +55,8 @@ tensorrt_llm/scripts
*docs/source/_cpp_gen*
docs/source/**/*.rst
!docs/source/examples/index.rst
!docs/source/deployment-guide/config_table.rst
!docs/source/deployment-guide/note_sections.rst
*.swp
# Testing

File diff suppressed because it is too large.


@ -66,7 +66,7 @@ We maintain YAML configuration files with recommended performance settings in th
```shell
TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment
EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/deepseek-r1-throughput.yaml
EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/curated/deepseek-r1-throughput.yaml
```
Note: if you don't have access to the source code locally, you can manually create the YAML config file using the code in the dropdown below.
@ -74,7 +74,7 @@ Note: if you don't have access to the source code locally, you can manually crea
````{admonition} Show code
:class: dropdown
```{literalinclude} ../../../examples/configs/deepseek-r1-throughput.yaml
```{literalinclude} ../../../examples/configs/curated/deepseek-r1-throughput.yaml
---
language: shell
prepend: |
@ -90,7 +90,7 @@ To use the `DeepGEMM` MOE backend on B200/GB200, use this config instead:
```shell
TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment
EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/deepseek-r1-deepgemm.yaml
EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/curated/deepseek-r1-deepgemm.yaml
```
Note: if you don't have access to the source code locally, you can manually create the YAML config file using the code in the dropdown below.
@ -98,7 +98,7 @@ Note: if you don't have access to the source code locally, you can manually crea
````{admonition} Show code
:class: dropdown
```{literalinclude} ../../../examples/configs/deepseek-r1-deepgemm.yaml
```{literalinclude} ../../../examples/configs/curated/deepseek-r1-deepgemm.yaml
---
language: shell
prepend: |
@ -154,7 +154,7 @@ These options provide control over TensorRT LLM's behavior and are set within th
#### `trust_remote_code`
&emsp;**Description:** Allows TensorRT LLM to download models and tokenizers from Hugging Face. This flag is passed directly to the Hugging Face API.
* **Description:** Allows TensorRT LLM to download models and tokenizers from Hugging Face. This flag is passed directly to the Hugging Face API.
#### `kv_cache_config`
@ -429,3 +429,23 @@ $$
$$
\text{TPS} = \frac{\text{Num Output Tokens}}{T_{last} - T_{first}}
$$
## Preconfigured Recipes
The following tables list recommended configurations from the comprehensive database for different performance profiles.
```{eval-rst}
.. include:: note_sections.rst
:start-after: .. start-note-traffic-patterns
:end-before: .. end-note-traffic-patterns
.. include:: config_table.rst
:start-after: .. start-deepseek-ai/DeepSeek-R1-0528
:end-before: .. end-deepseek-ai/DeepSeek-R1-0528
```
```{eval-rst}
.. include:: config_table.rst
:start-after: .. start-nvidia/DeepSeek-R1-0528-FP4-v2
:end-before: .. end-nvidia/DeepSeek-R1-0528-FP4-v2
```


@ -64,7 +64,7 @@ For low-latency use cases:
```shell
TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment
EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/gpt-oss-120b-latency.yaml
EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/curated/gpt-oss-120b-latency.yaml
```
Note: if you don't have access to the source code locally, you can manually create the YAML config file using the code in the dropdown below.
@ -72,7 +72,7 @@ Note: if you don't have access to the source code locally, you can manually crea
````{admonition} Show code
:class: dropdown
```{literalinclude} ../../../examples/configs/gpt-oss-120b-latency.yaml
```{literalinclude} ../../../examples/configs/curated/gpt-oss-120b-latency.yaml
---
language: shell
prepend: |
@ -88,7 +88,7 @@ For max-throughput use cases:
```shell
TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment
EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/gpt-oss-120b-throughput.yaml
EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/curated/gpt-oss-120b-throughput.yaml
```
Note: if you don't have access to the source code locally, you can manually create the YAML config file using the code in the dropdown below.
@ -96,7 +96,7 @@ Note: if you don't have access to the source code locally, you can manually crea
````{admonition} Show code
:class: dropdown
```{literalinclude} ../../../examples/configs/gpt-oss-120b-throughput.yaml
```{literalinclude} ../../../examples/configs/curated/gpt-oss-120b-throughput.yaml
---
language: shell
prepend: |
@ -377,3 +377,17 @@ $$
$$
\text{TPS} = \frac{\text{Num Output Tokens}}{T_{last} - T_{first}}
$$
## Preconfigured Recipes
The following table lists recommended configurations from the comprehensive database for different performance profiles.
```{eval-rst}
.. include:: note_sections.rst
:start-after: .. start-note-traffic-patterns
:end-before: .. end-note-traffic-patterns
.. include:: config_table.rst
:start-after: .. start-openai/gpt-oss-120b
:end-before: .. end-openai/gpt-oss-120b
```


@ -58,7 +58,7 @@ We maintain YAML configuration files with recommended performance settings in th
```shell
TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment
EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/llama-3.3-70b.yaml
EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/curated/llama-3.3-70b.yaml
```
Note: if you don't have access to the source code locally, you can manually create the YAML config file using the code in the dropdown below.
@ -66,7 +66,7 @@ Note: if you don't have access to the source code locally, you can manually crea
````{admonition} Show code
:class: dropdown
```{literalinclude} ../../../examples/configs/llama-3.3-70b.yaml
```{literalinclude} ../../../examples/configs/curated/llama-3.3-70b.yaml
---
language: shell
prepend: |


@ -57,7 +57,7 @@ We maintain YAML configuration files with recommended performance settings in th
```shell
TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment
EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/llama-4-scout.yaml
EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/curated/llama-4-scout.yaml
```
Note: if you don't have access to the source code locally, you can manually create the YAML config file using the code in the dropdown below.
@ -65,7 +65,7 @@ Note: if you don't have access to the source code locally, you can manually crea
````{admonition} Show code
:class: dropdown
```{literalinclude} ../../../examples/configs/llama-4-scout.yaml
```{literalinclude} ../../../examples/configs/curated/llama-4-scout.yaml
---
language: shell
prepend: |


@ -35,7 +35,7 @@ We maintain YAML configuration files with recommended performance settings in th
```shell
TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment
EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/qwen3-next.yaml
EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/curated/qwen3-next.yaml
```
Note: if you don't have access to the source code locally, you can manually create the YAML config file using the code in the dropdown below.
@ -43,7 +43,7 @@ Note: if you don't have access to the source code locally, you can manually crea
````{admonition} Show code
:class: dropdown
```{literalinclude} ../../../examples/configs/qwen3-next.yaml
```{literalinclude} ../../../examples/configs/curated/qwen3-next.yaml
---
language: shell
prepend: |


@ -40,7 +40,7 @@ We maintain YAML configuration files with recommended performance settings in th
```shell
TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment
EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/qwen3.yaml
EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/curated/qwen3.yaml
```
Note: if you don't have access to the source code locally, you can manually create the YAML config file using the code in the dropdown below.
@ -48,7 +48,7 @@ Note: if you don't have access to the source code locally, you can manually crea
````{admonition} Show code
:class: dropdown
```{literalinclude} ../../../examples/configs/qwen3.yaml
```{literalinclude} ../../../examples/configs/curated/qwen3.yaml
---
language: shell
prepend: |


@ -6,15 +6,20 @@ Quick Start for Popular Models
The table below contains ``trtllm-serve`` commands that can be used to easily deploy popular models including DeepSeek-R1, gpt-oss, Llama 4, Qwen3, and more.
We maintain LLM API configuration files for these models containing recommended performance settings in the `examples/configs <https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/configs>`_ directory. The TensorRT LLM Docker container makes the config files available at ``/app/tensorrt_llm/examples/configs``, but you can customize this as needed:
We maintain LLM API configuration files with recommended performance settings for these models in two locations:
* **Curated Examples**: `examples/configs/curated <https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/configs/curated>`_ - Hand-picked configurations for common scenarios.
* **Comprehensive Database**: `examples/configs/database <https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/configs/database>`_ - A more comprehensive set of known-good configurations for various GPUs and traffic patterns.
The TensorRT LLM Docker container makes these config files available at ``/app/tensorrt_llm/examples/configs/curated`` and ``/app/tensorrt_llm/examples/configs/database`` respectively. You can reference them as needed:
.. code-block:: bash
export TRTLLM_DIR="/app/tensorrt_llm" # path to the TensorRT LLM repo in your local environment
.. note::
The configs here are specifically optimized for a target ISL/OSL (Input/Output Sequence Length) of 1024/1024. If your traffic pattern is different, you may benefit from additional tuning. In the future, we plan to provide more configs for a wider range of traffic patterns.
.. include:: note_sections.rst
:start-after: .. start-note-quick-start-isl-osl
:end-before: .. end-note-quick-start-isl-osl
This table is designed to provide a straightforward starting point; for detailed model-specific deployment guides, check out the guides below.
@ -30,53 +35,53 @@ This table is designed to provide a straightforward starting point; for detailed
* - `DeepSeek-R1 <https://huggingface.co/deepseek-ai/DeepSeek-R1-0528>`_
- H100, H200
- Max Throughput
- `deepseek-r1-throughput.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/deepseek-r1-throughput.yaml>`_
- ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/deepseek-r1-throughput.yaml``
- `deepseek-r1-throughput.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/curated/deepseek-r1-throughput.yaml>`_
- ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/curated/deepseek-r1-throughput.yaml``
* - `DeepSeek-R1 <https://huggingface.co/deepseek-ai/DeepSeek-R1-0528>`_
- B200, GB200
- Max Throughput
- `deepseek-r1-deepgemm.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/deepseek-r1-deepgemm.yaml>`_
- ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/deepseek-r1-deepgemm.yaml``
- `deepseek-r1-deepgemm.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/curated/deepseek-r1-deepgemm.yaml>`_
- ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/curated/deepseek-r1-deepgemm.yaml``
* - `DeepSeek-R1 (NVFP4) <https://huggingface.co/nvidia/DeepSeek-R1-FP4>`_
- B200, GB200
- Max Throughput
- `deepseek-r1-throughput.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/deepseek-r1-throughput.yaml>`_
- ``trtllm-serve nvidia/DeepSeek-R1-FP4 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/deepseek-r1-throughput.yaml``
- `deepseek-r1-throughput.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/curated/deepseek-r1-throughput.yaml>`_
- ``trtllm-serve nvidia/DeepSeek-R1-FP4 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/curated/deepseek-r1-throughput.yaml``
* - `DeepSeek-R1 (NVFP4) <https://huggingface.co/nvidia/DeepSeek-R1-FP4-v2>`_
- B200, GB200
- Min Latency
- `deepseek-r1-latency.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/deepseek-r1-latency.yaml>`_
- ``trtllm-serve nvidia/DeepSeek-R1-FP4-v2 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/deepseek-r1-latency.yaml``
- `deepseek-r1-latency.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/curated/deepseek-r1-latency.yaml>`_
- ``trtllm-serve nvidia/DeepSeek-R1-FP4-v2 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/curated/deepseek-r1-latency.yaml``
* - `gpt-oss-120b <https://huggingface.co/openai/gpt-oss-120b>`_
- Any
- Max Throughput
- `gpt-oss-120b-throughput.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/gpt-oss-120b-throughput.yaml>`_
- ``trtllm-serve openai/gpt-oss-120b --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/gpt-oss-120b-throughput.yaml``
- `gpt-oss-120b-throughput.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/curated/gpt-oss-120b-throughput.yaml>`_
- ``trtllm-serve openai/gpt-oss-120b --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/curated/gpt-oss-120b-throughput.yaml``
* - `gpt-oss-120b <https://huggingface.co/openai/gpt-oss-120b>`_
- Any
- Min Latency
- `gpt-oss-120b-latency.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/gpt-oss-120b-latency.yaml>`_
- ``trtllm-serve openai/gpt-oss-120b --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/gpt-oss-120b-latency.yaml``
- `gpt-oss-120b-latency.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/curated/gpt-oss-120b-latency.yaml>`_
- ``trtllm-serve openai/gpt-oss-120b --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/curated/gpt-oss-120b-latency.yaml``
* - `Qwen3-Next-80B-A3B-Thinking <https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Thinking>`_
- Any
- Max Throughput
- `qwen3-next.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/qwen3-next.yaml>`_
- ``trtllm-serve Qwen/Qwen3-Next-80B-A3B-Thinking --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/qwen3-next.yaml``
- `qwen3-next.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/curated/qwen3-next.yaml>`_
- ``trtllm-serve Qwen/Qwen3-Next-80B-A3B-Thinking --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/curated/qwen3-next.yaml``
* - Qwen3 family (e.g. `Qwen3-30B-A3B <https://huggingface.co/Qwen/Qwen3-30B-A3B>`_)
- Any
- Max Throughput
- `qwen3.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/qwen3.yaml>`_
- ``trtllm-serve Qwen/Qwen3-30B-A3B --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/qwen3.yaml`` (swap to another Qwen3 model name as needed)
- `qwen3.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/curated/qwen3.yaml>`_
- ``trtllm-serve Qwen/Qwen3-30B-A3B --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/curated/qwen3.yaml`` (swap to another Qwen3 model name as needed)
* - `Llama-3.3-70B (FP8) <https://huggingface.co/nvidia/Llama-3.3-70B-Instruct-FP8>`_
- Any
- Max Throughput
- `llama-3.3-70b.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/llama-3.3-70b.yaml>`_
- ``trtllm-serve nvidia/Llama-3.3-70B-Instruct-FP8 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/llama-3.3-70b.yaml``
- `llama-3.3-70b.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/curated/llama-3.3-70b.yaml>`_
- ``trtllm-serve nvidia/Llama-3.3-70B-Instruct-FP8 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/curated/llama-3.3-70b.yaml``
* - `Llama 4 Scout (FP8) <https://huggingface.co/nvidia/Llama-4-Scout-17B-16E-Instruct-FP8>`_
- Any
- Max Throughput
- `llama-4-scout.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/llama-4-scout.yaml>`_
- ``trtllm-serve nvidia/Llama-4-Scout-17B-16E-Instruct-FP8 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/llama-4-scout.yaml``
- `llama-4-scout.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/curated/llama-4-scout.yaml>`_
- ``trtllm-serve nvidia/Llama-4-Scout-17B-16E-Instruct-FP8 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/curated/llama-4-scout.yaml``
Model-Specific Deployment Guides
---------------------------------
@ -94,3 +99,10 @@ The deployment guides below provide more detailed instructions for serving speci
deployment-guide-for-qwen3-on-trtllm.md
deployment-guide-for-qwen3-next-on-trtllm.md
deployment-guide-for-kimi-k2-thinking-on-trtllm.md
Comprehensive Configuration Database
------------------------------------
The table below lists all available pre-configured model scenarios in the TensorRT LLM configuration database. Each row represents a specific model, GPU, and performance profile combination with recommended request settings.
.. include:: config_table.rst


@ -0,0 +1,36 @@
..
Reusable note sections for deployment guides.
Include specific notes using:
.. include:: note_sections.rst
:start-after: .. start-note-<name>
:end-before: .. end-note-<name>
.. start-note-traffic-patterns
.. note::
**Traffic Patterns**: The ISL (Input Sequence Length) and OSL (Output Sequence Length)
values in each configuration represent the **maximum supported values** for that config.
Requests exceeding these limits may result in errors.
To handle requests with input sequences **longer than the configured ISL**, add the following
to your config file:
.. code-block:: yaml
enable_chunked_prefill: true
This enables chunked prefill, which processes long input sequences in chunks rather than
requiring them to fit within a single prefill operation. Note that enabling chunked prefill
does **not** guarantee optimal performance—these configs are tuned for the specified ISL/OSL.
.. end-note-traffic-patterns
.. start-note-quick-start-isl-osl
.. note::
The configs here are specifically optimized for a target ISL/OSL (Input/Output Sequence Length) of 1024/1024. If your traffic pattern is different, refer to the :ref:`Comprehensive Configuration Database` section below, which covers a larger set of traffic patterns and performance profiles.
.. end-note-quick-start-isl-osl


@ -0,0 +1,64 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from pathlib import Path
from typing import Any, Dict, Iterator, List

import yaml
from pydantic import BaseModel, Field, RootModel

# Index of all known-good scenarios, shipped next to this module.
DATABASE_LIST_PATH = Path(__file__).parent / "lookup.yaml"


class RecipeConstraints(BaseModel):
    """Constraints that identify a single recipe in the scenario list."""

    model: str = Field(description="Model name")
    gpu: str = Field(description="GPU name")
    isl: int = Field(description="Input sequence length")
    osl: int = Field(description="Output sequence length")
    concurrency: int = Field(description="Concurrency")
    config_path: str = Field(description="Configuration path")
    num_gpus: int = Field(description="Number of GPUs")

    def load_config(self) -> Dict[str, Any]:
        """Load and return the YAML config at config_path."""
        with open(self.config_path) as f:
            data = yaml.safe_load(f)
        return data if data is not None else {}


class Recipe(BaseModel):
    """Recipe that describes a single scenario."""

    constraints: RecipeConstraints = Field(description="Recipe constraints")
    env_overrides: Dict[str, Any] = Field(description="Environment overrides", default_factory=dict)
    config: Dict[str, Any] = Field(description="Configuration overrides", default_factory=dict)


class RecipeList(RootModel[List[RecipeConstraints]]):
    """Validated list of recipe constraints loaded from a YAML scenario index."""

    @classmethod
    def from_yaml(cls, yaml_path: Path) -> "RecipeList":
        """Load and validate recipe list from YAML file."""
        with open(yaml_path) as f:
            data = yaml.safe_load(f)
        return cls(data)

    def __iter__(self) -> Iterator[RecipeConstraints]:
        return iter(self.root)

    def __len__(self) -> int:
        return len(self.root)
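
The module above reads most naturally with a short usage sketch. The snippet below is illustrative and not part of the commit: it assumes the module is importable as `recipe`, that `lookup.yaml` (see `DATABASE_LIST_PATH`) contains entries whose fields match `RecipeConstraints`, and the model/GPU/ISL/OSL filter values are made up.

```python
# Hypothetical usage sketch: pick a known-good config from the recipe database.
# The import name "recipe" and the filter values below are assumptions.
from recipe import DATABASE_LIST_PATH, RecipeList

recipes = RecipeList.from_yaml(DATABASE_LIST_PATH)
for rc in recipes:
    # Filter on the constraint fields defined by RecipeConstraints.
    if (rc.model, rc.gpu, rc.isl, rc.osl) == ("openai/gpt-oss-120b", "H100", 1024, 1024):
        print(f"{rc.num_gpus}x {rc.gpu}, concurrency {rc.concurrency}: {rc.config_path}")
        extra_llm_api_options = rc.load_config()  # dict of LLM API options for this scenario
```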


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 256
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: DEEPGEMM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 1152
max_seq_len: 2068


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 256
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: DEEPGEMM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 1152
max_seq_len: 2068


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 256
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: DEEPGEMM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 1152
max_seq_len: 2068


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 256
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: DEEPGEMM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 1152
max_seq_len: 2068


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 256
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: DEEPGEMM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 1152
max_seq_len: 2068


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 256
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: DEEPGEMM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 8320
max_seq_len: 9416


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 256
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: DEEPGEMM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 8320
max_seq_len: 9416


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 256
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: DEEPGEMM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 8320
max_seq_len: 9416


@ -0,0 +1,22 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 256
enable_attention_dp: true
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: DEEPGEMM
attention_dp_config:
batching_wait_iters: 0
enable_balance: true
timeout_iters: 60
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 8320
max_seq_len: 9416


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 256
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: DEEPGEMM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 8320
max_seq_len: 9416


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 128
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.75
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: CUTLASS
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 1152
max_seq_len: 2068


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 128
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.75
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: CUTLASS
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 1152
max_seq_len: 2068


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 128
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.75
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: CUTLASS
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 1152
max_seq_len: 2068


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 128
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.75
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: CUTLASS
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 1152
max_seq_len: 2068


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 128
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.75
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: CUTLASS
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 1152
max_seq_len: 2068


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 128
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.75
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: CUTLASS
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 8320
max_seq_len: 9416


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 128
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.75
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: CUTLASS
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 8320
max_seq_len: 9416


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 128
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.75
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: CUTLASS
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 8320
max_seq_len: 9416


@ -0,0 +1,22 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 128
enable_attention_dp: true
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.75
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: CUTLASS
attention_dp_config:
batching_wait_iters: 0
enable_balance: true
timeout_iters: 60
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 8320
max_seq_len: 9416


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 128
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.75
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: CUTLASS
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 8320
max_seq_len: 9416

File diff suppressed because it is too large.


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: TRTLLM
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
backend: pytorch
max_num_tokens: 1216
max_seq_len: 2068


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: TRTLLM
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
backend: pytorch
max_num_tokens: 1152
max_seq_len: 2068


@ -0,0 +1,22 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: true
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: CUTLASS
attention_dp_config:
batching_wait_iters: 0
enable_balance: true
timeout_iters: 60
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
backend: pytorch
max_num_tokens: 1344
max_seq_len: 2068


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: TRTLLM
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
backend: pytorch
max_num_tokens: 1152
max_seq_len: 2068


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: TRTLLM
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
backend: pytorch
max_num_tokens: 1152
max_seq_len: 2068


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: TRTLLM
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
backend: pytorch
max_num_tokens: 1152
max_seq_len: 2068


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: TRTLLM
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
backend: pytorch
max_num_tokens: 1152
max_seq_len: 2068


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: TRTLLM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 1216
max_seq_len: 2068


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: TRTLLM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 1152
max_seq_len: 2068


@ -0,0 +1,22 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: true
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: CUTLASS
attention_dp_config:
batching_wait_iters: 0
enable_balance: true
timeout_iters: 60
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 1344
max_seq_len: 2068


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: TRTLLM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 1152
max_seq_len: 2068


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: TRTLLM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 1152
max_seq_len: 2068


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: TRTLLM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 1152
max_seq_len: 2068


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: TRTLLM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 1152
max_seq_len: 2068


@ -0,0 +1,22 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: true
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: CUTLASS
attention_dp_config:
batching_wait_iters: 0
enable_balance: true
timeout_iters: 60
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
backend: pytorch
max_num_tokens: 8384
max_seq_len: 9416


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: TRTLLM
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
backend: pytorch
max_num_tokens: 8320
max_seq_len: 9416


@ -0,0 +1,22 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: true
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: CUTLASS
attention_dp_config:
batching_wait_iters: 0
enable_balance: true
timeout_iters: 60
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
backend: pytorch
max_num_tokens: 8512
max_seq_len: 9416


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: TRTLLM
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
backend: pytorch
max_num_tokens: 8320
max_seq_len: 9416


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: TRTLLM
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
backend: pytorch
max_num_tokens: 8320
max_seq_len: 9416


@ -0,0 +1,22 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: true
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: CUTLASS
attention_dp_config:
batching_wait_iters: 0
enable_balance: true
timeout_iters: 60
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
backend: pytorch
max_num_tokens: 8320
max_seq_len: 9416


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: TRTLLM
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
backend: pytorch
max_num_tokens: 8320
max_seq_len: 9416


@ -0,0 +1,22 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: true
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: CUTLASS
attention_dp_config:
batching_wait_iters: 0
enable_balance: true
timeout_iters: 60
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 8384
max_seq_len: 9416


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: TRTLLM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 8320
max_seq_len: 9416


@ -0,0 +1,22 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: true
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: CUTLASS
attention_dp_config:
batching_wait_iters: 0
enable_balance: true
timeout_iters: 60
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 8512
max_seq_len: 9416


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: TRTLLM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 8320
max_seq_len: 9416


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: TRTLLM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 8320
max_seq_len: 9416


@ -0,0 +1,22 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: true
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: CUTLASS
attention_dp_config:
batching_wait_iters: 0
enable_balance: true
timeout_iters: 60
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 8320
max_seq_len: 9416


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: TRTLLM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 8320
max_seq_len: 9416


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 16
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 1
moe_expert_parallel_size: 1
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 32
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 1
moe_expert_parallel_size: 1
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 4
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 1
moe_expert_parallel_size: 1
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 64
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 1
moe_expert_parallel_size: 1
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 8
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 1
moe_expert_parallel_size: 1
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 16
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 2
moe_expert_parallel_size: 2
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 32
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 2
moe_expert_parallel_size: 2
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 4
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 2
moe_expert_parallel_size: 2
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 64
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 2
moe_expert_parallel_size: 2
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 8
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 2
moe_expert_parallel_size: 2
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 16
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 32
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 4
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 64
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 8
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 16
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 32
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 4
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 64
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 8
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 16
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 1
moe_expert_parallel_size: 1
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 9236


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 32
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 1
moe_expert_parallel_size: 1
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 9236


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 4
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 1
moe_expert_parallel_size: 1
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 9236


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 64
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 1
moe_expert_parallel_size: 1
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 9236


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 8
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 1
moe_expert_parallel_size: 1
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 9236


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 16
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 2
moe_expert_parallel_size: 2
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 9236


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 32
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 2
moe_expert_parallel_size: 2
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 9236


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 4
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 2
moe_expert_parallel_size: 2
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 9236


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 64
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 2
moe_expert_parallel_size: 2
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 9236


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 8
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 2
moe_expert_parallel_size: 2
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 9236

Some files were not shown because too many files have changed in this diff.