[TRTC-43] [feat] Add config db and docs (#9420)

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
Co-authored-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Authored by Venky on 2025-12-11 12:00:03 -08:00; committed by GitHub
parent 24f92721f2
commit fd1270b9ab
GPG Key ID: B5690EEEBB952194 (no known key found for this signature in database)
195 changed files with 6234 additions and 45 deletions

.gitignore (vendored): 2 changes

@ -55,6 +55,8 @@ tensorrt_llm/scripts
*docs/source/_cpp_gen*
docs/source/**/*.rst
!docs/source/examples/index.rst
!docs/source/deployment-guide/config_table.rst
!docs/source/deployment-guide/note_sections.rst
*.swp
# Testing

File diff suppressed because it is too large.


@ -66,7 +66,7 @@ We maintain YAML configuration files with recommended performance settings in th
```shell
TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment
EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/deepseek-r1-throughput.yaml
EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/curated/deepseek-r1-throughput.yaml
```
Note: if you don't have access to the source code locally, you can manually create the YAML config file using the code in the dropdown below.
@ -74,7 +74,7 @@ Note: if you don't have access to the source code locally, you can manually crea
````{admonition} Show code
:class: dropdown
```{literalinclude} ../../../examples/configs/deepseek-r1-throughput.yaml
```{literalinclude} ../../../examples/configs/curated/deepseek-r1-throughput.yaml
---
language: shell
prepend: |
@ -90,7 +90,7 @@ To use the `DeepGEMM` MOE backend on B200/GB200, use this config instead:
```shell
TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment
EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/deepseek-r1-deepgemm.yaml
EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/curated/deepseek-r1-deepgemm.yaml
```
Note: if you don't have access to the source code locally, you can manually create the YAML config file using the code in the dropdown below.
@ -98,7 +98,7 @@ Note: if you don't have access to the source code locally, you can manually crea
````{admonition} Show code
:class: dropdown
```{literalinclude} ../../../examples/configs/deepseek-r1-deepgemm.yaml
```{literalinclude} ../../../examples/configs/curated/deepseek-r1-deepgemm.yaml
---
language: shell
prepend: |
@ -154,7 +154,7 @@ These options provide control over TensorRT LLM's behavior and are set within th
#### `trust_remote_code`
&emsp;**Description:** Allows TensorRT LLM to download models and tokenizers from Hugging Face. This flag is passed directly to the Hugging Face API.
* **Description:** Allows TensorRT LLM to download models and tokenizers from Hugging Face. This flag is passed directly to the Hugging Face API.
#### `kv_cache_config`
@ -429,3 +429,23 @@ $$
$$
\text{TPS} = \frac{\text{Num Output Tokens}}{T_{last} - T_{first}}
$$
## Preconfigured Recipes
The following tables list recommended configurations from the comprehensive database for different performance profiles.
```{eval-rst}
.. include:: note_sections.rst
:start-after: .. start-note-traffic-patterns
:end-before: .. end-note-traffic-patterns
.. include:: config_table.rst
:start-after: .. start-deepseek-ai/DeepSeek-R1-0528
:end-before: .. end-deepseek-ai/DeepSeek-R1-0528
```
```{eval-rst}
.. include:: config_table.rst
:start-after: .. start-nvidia/DeepSeek-R1-0528-FP4-v2
:end-before: .. end-nvidia/DeepSeek-R1-0528-FP4-v2
```


@ -64,7 +64,7 @@ For low-latency use cases:
```shell
TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment
EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/gpt-oss-120b-latency.yaml
EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/curated/gpt-oss-120b-latency.yaml
```
Note: if you don't have access to the source code locally, you can manually create the YAML config file using the code in the dropdown below.
@ -72,7 +72,7 @@ Note: if you don't have access to the source code locally, you can manually crea
````{admonition} Show code
:class: dropdown
```{literalinclude} ../../../examples/configs/gpt-oss-120b-latency.yaml
```{literalinclude} ../../../examples/configs/curated/gpt-oss-120b-latency.yaml
---
language: shell
prepend: |
@ -88,7 +88,7 @@ For max-throughput use cases:
```shell
TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment
EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/gpt-oss-120b-throughput.yaml
EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/curated/gpt-oss-120b-throughput.yaml
```
Note: if you don't have access to the source code locally, you can manually create the YAML config file using the code in the dropdown below.
@ -96,7 +96,7 @@ Note: if you don't have access to the source code locally, you can manually crea
````{admonition} Show code
:class: dropdown
```{literalinclude} ../../../examples/configs/gpt-oss-120b-throughput.yaml
```{literalinclude} ../../../examples/configs/curated/gpt-oss-120b-throughput.yaml
---
language: shell
prepend: |
@ -377,3 +377,17 @@ $$
$$
\text{TPS} = \frac{\text{Num Output Tokens}}{T_{last} - T_{first}}
$$
## Preconfigured Recipes
The following table lists recommended configurations from the comprehensive database for different performance profiles.
```{eval-rst}
.. include:: note_sections.rst
:start-after: .. start-note-traffic-patterns
:end-before: .. end-note-traffic-patterns
.. include:: config_table.rst
:start-after: .. start-openai/gpt-oss-120b
:end-before: .. end-openai/gpt-oss-120b
```


@ -58,7 +58,7 @@ We maintain YAML configuration files with recommended performance settings in th
```shell
TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment
EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/llama-3.3-70b.yaml
EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/curated/llama-3.3-70b.yaml
```
Note: if you don't have access to the source code locally, you can manually create the YAML config file using the code in the dropdown below.
@ -66,7 +66,7 @@ Note: if you don't have access to the source code locally, you can manually crea
````{admonition} Show code
:class: dropdown
```{literalinclude} ../../../examples/configs/llama-3.3-70b.yaml
```{literalinclude} ../../../examples/configs/curated/llama-3.3-70b.yaml
---
language: shell
prepend: |


@ -57,7 +57,7 @@ We maintain YAML configuration files with recommended performance settings in th
```shell
TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment
EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/llama-4-scout.yaml
EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/curated/llama-4-scout.yaml
```
Note: if you don't have access to the source code locally, you can manually create the YAML config file using the code in the dropdown below.
@ -65,7 +65,7 @@ Note: if you don't have access to the source code locally, you can manually crea
````{admonition} Show code
:class: dropdown
```{literalinclude} ../../../examples/configs/llama-4-scout.yaml
```{literalinclude} ../../../examples/configs/curated/llama-4-scout.yaml
---
language: shell
prepend: |


@ -35,7 +35,7 @@ We maintain YAML configuration files with recommended performance settings in th
```shell
TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment
EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/qwen3-next.yaml
EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/curated/qwen3-next.yaml
```
Note: if you don't have access to the source code locally, you can manually create the YAML config file using the code in the dropdown below.
@ -43,7 +43,7 @@ Note: if you don't have access to the source code locally, you can manually crea
````{admonition} Show code
:class: dropdown
```{literalinclude} ../../../examples/configs/qwen3-next.yaml
```{literalinclude} ../../../examples/configs/curated/qwen3-next.yaml
---
language: shell
prepend: |


@ -40,7 +40,7 @@ We maintain YAML configuration files with recommended performance settings in th
```shell
TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment
EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/qwen3.yaml
EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/curated/qwen3.yaml
```
Note: if you don't have access to the source code locally, you can manually create the YAML config file using the code in the dropdown below.
@ -48,7 +48,7 @@ Note: if you don't have access to the source code locally, you can manually crea
````{admonition} Show code
:class: dropdown
```{literalinclude} ../../../examples/configs/qwen3.yaml
```{literalinclude} ../../../examples/configs/curated/qwen3.yaml
---
language: shell
prepend: |


@ -6,15 +6,20 @@ Quick Start for Popular Models
The table below contains ``trtllm-serve`` commands that can be used to easily deploy popular models including DeepSeek-R1, gpt-oss, Llama 4, Qwen3, and more.
We maintain LLM API configuration files for these models containing recommended performance settings in the `examples/configs <https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/configs>`_ directory. The TensorRT LLM Docker container makes the config files available at ``/app/tensorrt_llm/examples/configs``, but you can customize this as needed:
We maintain LLM API configuration files with recommended performance settings for these models in two locations:
* **Curated Examples**: `examples/configs/curated <https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/configs/curated>`_ - Hand-picked configurations for common scenarios.
* **Comprehensive Database**: `examples/configs/database <https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/configs/database>`_ - A more comprehensive set of known-good configurations for various GPUs and traffic patterns.
The TensorRT LLM Docker container makes these config files available at ``/app/tensorrt_llm/examples/configs/curated`` and ``/app/tensorrt_llm/examples/configs/database`` respectively. You can reference them as needed:
.. code-block:: bash
export TRTLLM_DIR="/app/tensorrt_llm" # path to the TensorRT LLM repo in your local environment
.. note::
The configs here are specifically optimized for a target ISL/OSL (Input/Output Sequence Length) of 1024/1024. If your traffic pattern is different, you may benefit from additional tuning. In the future, we plan to provide more configs for a wider range of traffic patterns.
.. include:: note_sections.rst
:start-after: .. start-note-quick-start-isl-osl
:end-before: .. end-note-quick-start-isl-osl
This table is designed to provide a straightforward starting point; for detailed model-specific deployment guides, check out the guides below.
@ -30,53 +35,53 @@ This table is designed to provide a straightforward starting point; for detailed
* - `DeepSeek-R1 <https://huggingface.co/deepseek-ai/DeepSeek-R1-0528>`_
- H100, H200
- Max Throughput
- `deepseek-r1-throughput.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/deepseek-r1-throughput.yaml>`_
- ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/deepseek-r1-throughput.yaml``
- `deepseek-r1-throughput.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/curated/deepseek-r1-throughput.yaml>`_
- ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/curated/deepseek-r1-throughput.yaml``
* - `DeepSeek-R1 <https://huggingface.co/deepseek-ai/DeepSeek-R1-0528>`_
- B200, GB200
- Max Throughput
- `deepseek-r1-deepgemm.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/deepseek-r1-deepgemm.yaml>`_
- ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/deepseek-r1-deepgemm.yaml``
- `deepseek-r1-deepgemm.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/curated/deepseek-r1-deepgemm.yaml>`_
- ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/curated/deepseek-r1-deepgemm.yaml``
* - `DeepSeek-R1 (NVFP4) <https://huggingface.co/nvidia/DeepSeek-R1-FP4>`_
- B200, GB200
- Max Throughput
- `deepseek-r1-throughput.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/deepseek-r1-throughput.yaml>`_
- ``trtllm-serve nvidia/DeepSeek-R1-FP4 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/deepseek-r1-throughput.yaml``
- `deepseek-r1-throughput.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/curated/deepseek-r1-throughput.yaml>`_
- ``trtllm-serve nvidia/DeepSeek-R1-FP4 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/curated/deepseek-r1-throughput.yaml``
* - `DeepSeek-R1 (NVFP4) <https://huggingface.co/nvidia/DeepSeek-R1-FP4-v2>`_
- B200, GB200
- Min Latency
- `deepseek-r1-latency.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/deepseek-r1-latency.yaml>`_
- ``trtllm-serve nvidia/DeepSeek-R1-FP4-v2 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/deepseek-r1-latency.yaml``
- `deepseek-r1-latency.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/curated/deepseek-r1-latency.yaml>`_
- ``trtllm-serve nvidia/DeepSeek-R1-FP4-v2 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/curated/deepseek-r1-latency.yaml``
* - `gpt-oss-120b <https://huggingface.co/openai/gpt-oss-120b>`_
- Any
- Max Throughput
- `gpt-oss-120b-throughput.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/gpt-oss-120b-throughput.yaml>`_
- ``trtllm-serve openai/gpt-oss-120b --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/gpt-oss-120b-throughput.yaml``
- `gpt-oss-120b-throughput.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/curated/gpt-oss-120b-throughput.yaml>`_
- ``trtllm-serve openai/gpt-oss-120b --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/curated/gpt-oss-120b-throughput.yaml``
* - `gpt-oss-120b <https://huggingface.co/openai/gpt-oss-120b>`_
- Any
- Min Latency
- `gpt-oss-120b-latency.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/gpt-oss-120b-latency.yaml>`_
- ``trtllm-serve openai/gpt-oss-120b --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/gpt-oss-120b-latency.yaml``
- `gpt-oss-120b-latency.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/curated/gpt-oss-120b-latency.yaml>`_
- ``trtllm-serve openai/gpt-oss-120b --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/curated/gpt-oss-120b-latency.yaml``
* - `Qwen3-Next-80B-A3B-Thinking <https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Thinking>`_
- Any
- Max Throughput
- `qwen3-next.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/qwen3-next.yaml>`_
- ``trtllm-serve Qwen/Qwen3-Next-80B-A3B-Thinking --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/qwen3-next.yaml``
- `qwen3-next.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/curated/qwen3-next.yaml>`_
- ``trtllm-serve Qwen/Qwen3-Next-80B-A3B-Thinking --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/curated/qwen3-next.yaml``
* - Qwen3 family (e.g. `Qwen3-30B-A3B <https://huggingface.co/Qwen/Qwen3-30B-A3B>`_)
- Any
- Max Throughput
- `qwen3.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/qwen3.yaml>`_
- ``trtllm-serve Qwen/Qwen3-30B-A3B --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/qwen3.yaml`` (swap to another Qwen3 model name as needed)
- `qwen3.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/curated/qwen3.yaml>`_
- ``trtllm-serve Qwen/Qwen3-30B-A3B --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/curated/qwen3.yaml`` (swap to another Qwen3 model name as needed)
* - `Llama-3.3-70B (FP8) <https://huggingface.co/nvidia/Llama-3.3-70B-Instruct-FP8>`_
- Any
- Max Throughput
- `llama-3.3-70b.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/llama-3.3-70b.yaml>`_
- ``trtllm-serve nvidia/Llama-3.3-70B-Instruct-FP8 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/llama-3.3-70b.yaml``
- `llama-3.3-70b.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/curated/llama-3.3-70b.yaml>`_
- ``trtllm-serve nvidia/Llama-3.3-70B-Instruct-FP8 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/curated/llama-3.3-70b.yaml``
* - `Llama 4 Scout (FP8) <https://huggingface.co/nvidia/Llama-4-Scout-17B-16E-Instruct-FP8>`_
- Any
- Max Throughput
- `llama-4-scout.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/llama-4-scout.yaml>`_
- ``trtllm-serve nvidia/Llama-4-Scout-17B-16E-Instruct-FP8 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/llama-4-scout.yaml``
- `llama-4-scout.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/curated/llama-4-scout.yaml>`_
- ``trtllm-serve nvidia/Llama-4-Scout-17B-16E-Instruct-FP8 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/curated/llama-4-scout.yaml``
Model-Specific Deployment Guides
---------------------------------
@ -94,3 +99,10 @@ The deployment guides below provide more detailed instructions for serving speci
deployment-guide-for-qwen3-on-trtllm.md
deployment-guide-for-qwen3-next-on-trtllm.md
deployment-guide-for-kimi-k2-thinking-on-trtllm.md
Comprehensive Configuration Database
------------------------------------
The table below lists all available pre-configured model scenarios in the TensorRT LLM configuration database. Each row represents a specific model, GPU, and performance profile combination with recommended request settings.
.. include:: config_table.rst


@ -0,0 +1,36 @@
..
Reusable note sections for deployment guides.
Include specific notes using:
.. include:: note_sections.rst
:start-after: .. start-note-<name>
:end-before: .. end-note-<name>
.. start-note-traffic-patterns
.. note::
**Traffic Patterns**: The ISL (Input Sequence Length) and OSL (Output Sequence Length)
values in each configuration represent the **maximum supported values** for that config.
Requests exceeding these limits may result in errors.
To handle requests with input sequences **longer than the configured ISL**, add the following
to your config file:
.. code-block:: yaml
enable_chunked_prefill: true
This enables chunked prefill, which processes long input sequences in chunks rather than
requiring them to fit within a single prefill operation. Note that enabling chunked prefill
does **not** guarantee optimal performance—these configs are tuned for the specified ISL/OSL.
.. end-note-traffic-patterns
.. start-note-quick-start-isl-osl
.. note::
The configs here are specifically optimized for a target ISL/OSL (Input/Output Sequence Length) of 1024/1024. If your traffic pattern is different, refer to the :ref:`Comprehensive Configuration Database` section below, which covers a larger set of traffic patterns and performance profiles.
.. end-note-quick-start-isl-osl


@ -0,0 +1,64 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from pathlib import Path
from typing import Any, Dict, Iterator, List

import yaml
from pydantic import BaseModel, Field, RootModel

# Index of all known-good scenarios, shipped next to this module.
DATABASE_LIST_PATH = Path(__file__).parent / "lookup.yaml"


class RecipeConstraints(BaseModel):
    """Constraints that identify a single recipe in the scenario list."""

    model: str = Field(description="Model name")
    gpu: str = Field(description="GPU name")
    isl: int = Field(description="Input sequence length")
    osl: int = Field(description="Output sequence length")
    concurrency: int = Field(description="Concurrency")
    config_path: str = Field(description="Configuration path")
    num_gpus: int = Field(description="Number of GPUs")

    def load_config(self) -> Dict[str, Any]:
        """Load and return the YAML config at config_path."""
        with open(self.config_path) as f:
            data = yaml.safe_load(f)
        return data if data is not None else {}


class Recipe(BaseModel):
    """Recipe that describes a single scenario."""

    constraints: RecipeConstraints = Field(description="Recipe constraints")
    env_overrides: Dict[str, Any] = Field(description="Environment overrides", default_factory=dict)
    config: Dict[str, Any] = Field(description="Configuration overrides", default_factory=dict)


class RecipeList(RootModel[List[RecipeConstraints]]):
    """Validated list of recipe constraints loaded from a YAML scenario index."""

    @classmethod
    def from_yaml(cls, yaml_path: Path) -> "RecipeList":
        """Load and validate recipe list from YAML file."""
        with open(yaml_path) as f:
            data = yaml.safe_load(f)
        return cls(data)

    def __iter__(self) -> Iterator[RecipeConstraints]:
        return iter(self.root)

    def __len__(self) -> int:
        return len(self.root)
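
The module above reads most naturally with a short usage sketch. The snippet below is illustrative and not part of the commit: it assumes the module is importable as `recipe`, that `lookup.yaml` (see `DATABASE_LIST_PATH`) contains entries whose fields match `RecipeConstraints`, and the model/GPU/ISL/OSL filter values are made up.

```python
# Hypothetical usage sketch: pick a known-good config from the recipe database.
# The import name "recipe" and the filter values below are assumptions.
from recipe import DATABASE_LIST_PATH, RecipeList

recipes = RecipeList.from_yaml(DATABASE_LIST_PATH)
for rc in recipes:
    # Filter on the constraint fields defined by RecipeConstraints.
    if (rc.model, rc.gpu, rc.isl, rc.osl) == ("openai/gpt-oss-120b", "H100", 1024, 1024):
        print(f"{rc.num_gpus}x {rc.gpu}, concurrency {rc.concurrency}: {rc.config_path}")
        extra_llm_api_options = rc.load_config()  # dict of LLM API options for this scenario
```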


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 256
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: DEEPGEMM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 1152
max_seq_len: 2068


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 256
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: DEEPGEMM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 1152
max_seq_len: 2068


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 256
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: DEEPGEMM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 1152
max_seq_len: 2068


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 256
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: DEEPGEMM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 1152
max_seq_len: 2068


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 256
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: DEEPGEMM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 1152
max_seq_len: 2068


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 256
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: DEEPGEMM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 8320
max_seq_len: 9416


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 256
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: DEEPGEMM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 8320
max_seq_len: 9416


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 256
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: DEEPGEMM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 8320
max_seq_len: 9416


@ -0,0 +1,22 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 256
enable_attention_dp: true
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: DEEPGEMM
attention_dp_config:
batching_wait_iters: 0
enable_balance: true
timeout_iters: 60
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 8320
max_seq_len: 9416


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 256
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: DEEPGEMM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 8320
max_seq_len: 9416


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 128
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.75
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: CUTLASS
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 1152
max_seq_len: 2068


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 128
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.75
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: CUTLASS
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 1152
max_seq_len: 2068


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 128
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.75
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: CUTLASS
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 1152
max_seq_len: 2068


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 128
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.75
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: CUTLASS
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 1152
max_seq_len: 2068


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 128
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.75
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: CUTLASS
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 1152
max_seq_len: 2068


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 128
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.75
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: CUTLASS
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 8320
max_seq_len: 9416


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 128
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.75
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: CUTLASS
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 8320
max_seq_len: 9416


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 128
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.75
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: CUTLASS
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 8320
max_seq_len: 9416


@ -0,0 +1,22 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 128
enable_attention_dp: true
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.75
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: CUTLASS
attention_dp_config:
batching_wait_iters: 0
enable_balance: true
timeout_iters: 60
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 8320
max_seq_len: 9416


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 128
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.75
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: CUTLASS
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 8320
max_seq_len: 9416

File diff suppressed because it is too large.


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: TRTLLM
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
backend: pytorch
max_num_tokens: 1216
max_seq_len: 2068


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: TRTLLM
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
backend: pytorch
max_num_tokens: 1152
max_seq_len: 2068


@ -0,0 +1,22 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: true
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: CUTLASS
attention_dp_config:
batching_wait_iters: 0
enable_balance: true
timeout_iters: 60
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
backend: pytorch
max_num_tokens: 1344
max_seq_len: 2068


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: TRTLLM
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
backend: pytorch
max_num_tokens: 1152
max_seq_len: 2068


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: TRTLLM
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
backend: pytorch
max_num_tokens: 1152
max_seq_len: 2068


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: TRTLLM
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
backend: pytorch
max_num_tokens: 1152
max_seq_len: 2068


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: TRTLLM
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
backend: pytorch
max_num_tokens: 1152
max_seq_len: 2068


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: TRTLLM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 1216
max_seq_len: 2068


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: TRTLLM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 1152
max_seq_len: 2068


@ -0,0 +1,22 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: true
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: CUTLASS
attention_dp_config:
batching_wait_iters: 0
enable_balance: true
timeout_iters: 60
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 1344
max_seq_len: 2068


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: TRTLLM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 1152
max_seq_len: 2068


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: TRTLLM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 1152
max_seq_len: 2068


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: TRTLLM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 1152
max_seq_len: 2068


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: TRTLLM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 1152
max_seq_len: 2068


@ -0,0 +1,22 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: true
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: CUTLASS
attention_dp_config:
batching_wait_iters: 0
enable_balance: true
timeout_iters: 60
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
backend: pytorch
max_num_tokens: 8384
max_seq_len: 9416


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: TRTLLM
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
backend: pytorch
max_num_tokens: 8320
max_seq_len: 9416


@ -0,0 +1,22 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: true
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: CUTLASS
attention_dp_config:
batching_wait_iters: 0
enable_balance: true
timeout_iters: 60
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
backend: pytorch
max_num_tokens: 8512
max_seq_len: 9416


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: TRTLLM
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
backend: pytorch
max_num_tokens: 8320
max_seq_len: 9416


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: TRTLLM
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
backend: pytorch
max_num_tokens: 8320
max_seq_len: 9416


@ -0,0 +1,22 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: true
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: CUTLASS
attention_dp_config:
batching_wait_iters: 0
enable_balance: true
timeout_iters: 60
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
backend: pytorch
max_num_tokens: 8320
max_seq_len: 9416


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: TRTLLM
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
backend: pytorch
max_num_tokens: 8320
max_seq_len: 9416


@ -0,0 +1,22 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: true
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: CUTLASS
attention_dp_config:
batching_wait_iters: 0
enable_balance: true
timeout_iters: 60
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 8384
max_seq_len: 9416


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: TRTLLM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 8320
max_seq_len: 9416


@ -0,0 +1,22 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: true
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: CUTLASS
attention_dp_config:
batching_wait_iters: 0
enable_balance: true
timeout_iters: 60
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 8512
max_seq_len: 9416


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: TRTLLM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 8320
max_seq_len: 9416


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: TRTLLM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 8320
max_seq_len: 9416


@ -0,0 +1,22 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: true
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: CUTLASS
attention_dp_config:
batching_wait_iters: 0
enable_balance: true
timeout_iters: 60
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 8320
max_seq_len: 9416


@ -0,0 +1,18 @@
cuda_graph_config:
enable_padding: true
max_batch_size: 512
enable_attention_dp: false
print_iter_log: true
kv_cache_config:
dtype: fp8
free_gpu_memory_fraction: 0.8
enable_block_reuse: false
stream_interval: 10
moe_config:
backend: TRTLLM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 8320
max_seq_len: 9416


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 16
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 1
moe_expert_parallel_size: 1
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 32
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 1
moe_expert_parallel_size: 1
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 4
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 1
moe_expert_parallel_size: 1
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 64
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 1
moe_expert_parallel_size: 1
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 8
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 1
moe_expert_parallel_size: 1
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 16
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 2
moe_expert_parallel_size: 2
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 32
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 2
moe_expert_parallel_size: 2
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 4
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 2
moe_expert_parallel_size: 2
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 64
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 2
moe_expert_parallel_size: 2
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 8
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 2
moe_expert_parallel_size: 2
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 16
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 32
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 4
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 64
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 8
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 16
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 32
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 4
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 64
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 8
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 16
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 1
moe_expert_parallel_size: 1
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 9236


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 32
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 1
moe_expert_parallel_size: 1
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 9236


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 4
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 1
moe_expert_parallel_size: 1
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 9236


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 64
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 1
moe_expert_parallel_size: 1
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 9236


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 8
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 1
moe_expert_parallel_size: 1
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 9236


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 16
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 2
moe_expert_parallel_size: 2
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 9236


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 32
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 2
moe_expert_parallel_size: 2
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 9236


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 4
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 2
moe_expert_parallel_size: 2
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 9236


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 64
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 2
moe_expert_parallel_size: 2
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 9236


@ -0,0 +1,22 @@
env_overrides:
TRTLLM_ENABLE_PDL: 1
NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
enable_padding: true
max_batch_size: 8
enable_attention_dp: false
kv_cache_config:
dtype: fp8
enable_block_reuse: false
free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
tensor_parallel_size: 2
moe_expert_parallel_size: 2
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 9236

Some files were not shown because too many files have changed in this diff.