Mirror of https://github.com/NVIDIA/TensorRT-LLM.git (synced 2026-01-13 22:18:36 +08:00)
[TRTC-43] [feat] Add config db and docs (#9420)
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
Co-authored-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
This commit is contained in:
parent 24f92721f2
commit fd1270b9ab
.gitignore (vendored, 2 lines changed)
@@ -55,6 +55,8 @@ tensorrt_llm/scripts
 *docs/source/_cpp_gen*
 docs/source/**/*.rst
 !docs/source/examples/index.rst
+!docs/source/deployment-guide/config_table.rst
+!docs/source/deployment-guide/note_sections.rst
 *.swp
 
 # Testing
docs/source/deployment-guide/config_table.rst (new file, 1074 lines)
File diff suppressed because it is too large
@@ -66,7 +66,7 @@ We maintain YAML configuration files with recommended performance settings in th
 
 ```shell
 TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment
-EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/deepseek-r1-throughput.yaml
+EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/curated/deepseek-r1-throughput.yaml
 ```
 
 Note: if you don't have access to the source code locally, you can manually create the YAML config file using the code in the dropdown below.
@@ -74,7 +74,7 @@ Note: if you don't have access to the source code locally, you can manually crea
 ````{admonition} Show code
 :class: dropdown
 
-```{literalinclude} ../../../examples/configs/deepseek-r1-throughput.yaml
+```{literalinclude} ../../../examples/configs/curated/deepseek-r1-throughput.yaml
 ---
 language: shell
 prepend: |
@@ -90,7 +90,7 @@ To use the `DeepGEMM` MOE backend on B200/GB200, use this config instead:
 
 ```shell
 TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment
-EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/deepseek-r1-deepgemm.yaml
+EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/curated/deepseek-r1-deepgemm.yaml
 ```
 
 Note: if you don't have access to the source code locally, you can manually create the YAML config file using the code in the dropdown below.
@@ -98,7 +98,7 @@ Note: if you don't have access to the source code locally, you can manually crea
 ````{admonition} Show code
 :class: dropdown
 
-```{literalinclude} ../../../examples/configs/deepseek-r1-deepgemm.yaml
+```{literalinclude} ../../../examples/configs/curated/deepseek-r1-deepgemm.yaml
 ---
 language: shell
 prepend: |
@@ -154,7 +154,7 @@ These options provide control over TensorRT LLM's behavior and are set within th
 
 #### `trust_remote_code`
 
-**Description:** Allows TensorRT LLM to download models and tokenizers from Hugging Face. This flag is passed directly to the Hugging Face API.
+* **Description:** Allows TensorRT LLM to download models and tokenizers from Hugging Face. This flag is passed directly to the Hugging Face API.
 
 #### `kv_cache_config`
 
@@ -429,3 +429,23 @@
 $$
 \text{TPS} = \frac{\text{Num Output Tokens}}{T_{last} - T_{first}}
 $$
+
+## Preconfigured Recipes
+
+The following tables list recommended configurations from the comprehensive database for different performance profiles.
+
+```{eval-rst}
+.. include:: note_sections.rst
+   :start-after: .. start-note-traffic-patterns
+   :end-before: .. end-note-traffic-patterns
+
+.. include:: config_table.rst
+   :start-after: .. start-deepseek-ai/DeepSeek-R1-0528
+   :end-before: .. end-deepseek-ai/DeepSeek-R1-0528
+```
+
+```{eval-rst}
+.. include:: config_table.rst
+   :start-after: .. start-nvidia/DeepSeek-R1-0528-FP4-v2
+   :end-before: .. end-nvidia/DeepSeek-R1-0528-FP4-v2
+```
@@ -64,7 +64,7 @@ For low-latency use cases:
 
 ```shell
 TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment
-EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/gpt-oss-120b-latency.yaml
+EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/curated/gpt-oss-120b-latency.yaml
 ```
 
 Note: if you don't have access to the source code locally, you can manually create the YAML config file using the code in the dropdown below.
@@ -72,7 +72,7 @@ Note: if you don't have access to the source code locally, you can manually crea
 ````{admonition} Show code
 :class: dropdown
 
-```{literalinclude} ../../../examples/configs/gpt-oss-120b-latency.yaml
+```{literalinclude} ../../../examples/configs/curated/gpt-oss-120b-latency.yaml
 ---
 language: shell
 prepend: |
@@ -88,7 +88,7 @@ For max-throughput use cases:
 
 ```shell
 TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment
-EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/gpt-oss-120b-throughput.yaml
+EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/curated/gpt-oss-120b-throughput.yaml
 ```
 
 Note: if you don't have access to the source code locally, you can manually create the YAML config file using the code in the dropdown below.
@@ -96,7 +96,7 @@ Note: if you don't have access to the source code locally, you can manually crea
 ````{admonition} Show code
 :class: dropdown
 
-```{literalinclude} ../../../examples/configs/gpt-oss-120b-throughput.yaml
+```{literalinclude} ../../../examples/configs/curated/gpt-oss-120b-throughput.yaml
 ---
 language: shell
 prepend: |
@@ -377,3 +377,17 @@
 $$
 \text{TPS} = \frac{\text{Num Output Tokens}}{T_{last} - T_{first}}
 $$
+
+## Preconfigured Recipes
+
+The following table lists recommended configurations from the comprehensive database for different performance profiles.
+
+```{eval-rst}
+.. include:: note_sections.rst
+   :start-after: .. start-note-traffic-patterns
+   :end-before: .. end-note-traffic-patterns
+
+.. include:: config_table.rst
+   :start-after: .. start-openai/gpt-oss-120b
+   :end-before: .. end-openai/gpt-oss-120b
+```
@@ -58,7 +58,7 @@ We maintain YAML configuration files with recommended performance settings in th
 
 ```shell
 TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment
-EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/llama-3.3-70b.yaml
+EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/curated/llama-3.3-70b.yaml
 ```
 
 Note: if you don't have access to the source code locally, you can manually create the YAML config file using the code in the dropdown below.
@@ -66,7 +66,7 @@ Note: if you don't have access to the source code locally, you can manually crea
 ````{admonition} Show code
 :class: dropdown
 
-```{literalinclude} ../../../examples/configs/llama-3.3-70b.yaml
+```{literalinclude} ../../../examples/configs/curated/llama-3.3-70b.yaml
 ---
 language: shell
 prepend: |
@@ -57,7 +57,7 @@ We maintain YAML configuration files with recommended performance settings in th
 
 ```shell
 TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment
-EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/llama-4-scout.yaml
+EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/curated/llama-4-scout.yaml
 ```
 
 Note: if you don't have access to the source code locally, you can manually create the YAML config file using the code in the dropdown below.
@@ -65,7 +65,7 @@ Note: if you don't have access to the source code locally, you can manually crea
 ````{admonition} Show code
 :class: dropdown
 
-```{literalinclude} ../../../examples/configs/llama-4-scout.yaml
+```{literalinclude} ../../../examples/configs/curated/llama-4-scout.yaml
 ---
 language: shell
 prepend: |
@@ -35,7 +35,7 @@ We maintain YAML configuration files with recommended performance settings in th
 
 ```shell
 TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment
-EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/qwen3-next.yaml
+EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/curated/qwen3-next.yaml
 ```
 
 Note: if you don't have access to the source code locally, you can manually create the YAML config file using the code in the dropdown below.
@@ -43,7 +43,7 @@ Note: if you don't have access to the source code locally, you can manually crea
 ````{admonition} Show code
 :class: dropdown
 
-```{literalinclude} ../../../examples/configs/qwen3-next.yaml
+```{literalinclude} ../../../examples/configs/curated/qwen3-next.yaml
 ---
 language: shell
 prepend: |
@@ -40,7 +40,7 @@ We maintain YAML configuration files with recommended performance settings in th
 
 ```shell
 TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment
-EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/qwen3.yaml
+EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/curated/qwen3.yaml
 ```
 
 Note: if you don't have access to the source code locally, you can manually create the YAML config file using the code in the dropdown below.
@@ -48,7 +48,7 @@ Note: if you don't have access to the source code locally, you can manually crea
 ````{admonition} Show code
 :class: dropdown
 
-```{literalinclude} ../../../examples/configs/qwen3.yaml
+```{literalinclude} ../../../examples/configs/curated/qwen3.yaml
 ---
 language: shell
 prepend: |
@@ -6,15 +6,20 @@ Quick Start for Popular Models
 
 The table below contains ``trtllm-serve`` commands that can be used to easily deploy popular models including DeepSeek-R1, gpt-oss, Llama 4, Qwen3, and more.
 
-We maintain LLM API configuration files for these models containing recommended performance settings in the `examples/configs <https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/configs>`_ directory. The TensorRT LLM Docker container makes the config files available at ``/app/tensorrt_llm/examples/configs``, but you can customize this as needed:
+We maintain LLM API configuration files for these models containing recommended performance settings in two locations:
+
+* **Curated Examples**: `examples/configs/curated <https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/configs/curated>`_ - Hand-picked configurations for common scenarios.
+* **Comprehensive Database**: `examples/configs/database <https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/configs/database>`_ - A more comprehensive set of known-good configurations for various GPUs and traffic patterns.
+
+The TensorRT LLM Docker container makes these config files available at ``/app/tensorrt_llm/examples/configs/curated`` and ``/app/tensorrt_llm/examples/configs/database`` respectively. You can reference them as needed:
 
 .. code-block:: bash
 
    export TRTLLM_DIR="/app/tensorrt_llm" # path to the TensorRT LLM repo in your local environment
 
-.. note::
-
-   The configs here are specifically optimized for a target ISL/OSL (Input/Output Sequence Length) of 1024/1024. If your traffic pattern is different, you may benefit from additional tuning. In the future, we plan to provide more configs for a wider range of traffic patterns.
+.. include:: note_sections.rst
+   :start-after: .. start-note-quick-start-isl-osl
+   :end-before: .. end-note-quick-start-isl-osl
 
 This table is designed to provide a straightforward starting point; for detailed model-specific deployment guides, check out the guides below.
 
@@ -30,53 +35,53 @@ This table is designed to provide a straightforward starting point; for detailed
    * - `DeepSeek-R1 <https://huggingface.co/deepseek-ai/DeepSeek-R1-0528>`_
      - H100, H200
      - Max Throughput
-     - `deepseek-r1-throughput.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/deepseek-r1-throughput.yaml>`_
-     - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/deepseek-r1-throughput.yaml``
+     - `deepseek-r1-throughput.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/curated/deepseek-r1-throughput.yaml>`_
+     - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/curated/deepseek-r1-throughput.yaml``
    * - `DeepSeek-R1 <https://huggingface.co/deepseek-ai/DeepSeek-R1-0528>`_
      - B200, GB200
     - Max Throughput
-     - `deepseek-r1-deepgemm.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/deepseek-r1-deepgemm.yaml>`_
-     - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/deepseek-r1-deepgemm.yaml``
+     - `deepseek-r1-deepgemm.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/curated/deepseek-r1-deepgemm.yaml>`_
+     - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/curated/deepseek-r1-deepgemm.yaml``
   * - `DeepSeek-R1 (NVFP4) <https://huggingface.co/nvidia/DeepSeek-R1-FP4>`_
     - B200, GB200
     - Max Throughput
-     - `deepseek-r1-throughput.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/deepseek-r1-throughput.yaml>`_
-     - ``trtllm-serve nvidia/DeepSeek-R1-FP4 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/deepseek-r1-throughput.yaml``
+     - `deepseek-r1-throughput.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/curated/deepseek-r1-throughput.yaml>`_
+     - ``trtllm-serve nvidia/DeepSeek-R1-FP4 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/curated/deepseek-r1-throughput.yaml``
   * - `DeepSeek-R1 (NVFP4) <https://huggingface.co/nvidia/DeepSeek-R1-FP4-v2>`_
    - B200, GB200
     - Min Latency
-     - `deepseek-r1-latency.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/deepseek-r1-latency.yaml>`_
-     - ``trtllm-serve nvidia/DeepSeek-R1-FP4-v2 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/deepseek-r1-latency.yaml``
+     - `deepseek-r1-latency.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/curated/deepseek-r1-latency.yaml>`_
+     - ``trtllm-serve nvidia/DeepSeek-R1-FP4-v2 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/curated/deepseek-r1-latency.yaml``
   * - `gpt-oss-120b <https://huggingface.co/openai/gpt-oss-120b>`_
    - Any
     - Max Throughput
-     - `gpt-oss-120b-throughput.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/gpt-oss-120b-throughput.yaml>`_
-     - ``trtllm-serve openai/gpt-oss-120b --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/gpt-oss-120b-throughput.yaml``
+     - `gpt-oss-120b-throughput.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/curated/gpt-oss-120b-throughput.yaml>`_
+     - ``trtllm-serve openai/gpt-oss-120b --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/curated/gpt-oss-120b-throughput.yaml``
   * - `gpt-oss-120b <https://huggingface.co/openai/gpt-oss-120b>`_
    - Any
     - Min Latency
-     - `gpt-oss-120b-latency.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/gpt-oss-120b-latency.yaml>`_
-     - ``trtllm-serve openai/gpt-oss-120b --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/gpt-oss-120b-latency.yaml``
+     - `gpt-oss-120b-latency.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/curated/gpt-oss-120b-latency.yaml>`_
+     - ``trtllm-serve openai/gpt-oss-120b --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/curated/gpt-oss-120b-latency.yaml``
   * - `Qwen3-Next-80B-A3B-Thinking <https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Thinking>`_
    - Any
     - Max Throughput
-     - `qwen3-next.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/qwen3-next.yaml>`_
-     - ``trtllm-serve Qwen/Qwen3-Next-80B-A3B-Thinking --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/qwen3-next.yaml``
+     - `qwen3-next.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/curated/qwen3-next.yaml>`_
+     - ``trtllm-serve Qwen/Qwen3-Next-80B-A3B-Thinking --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/curated/qwen3-next.yaml``
  * - Qwen3 family (e.g. `Qwen3-30B-A3B <https://huggingface.co/Qwen/Qwen3-30B-A3B>`_)
    - Any
     - Max Throughput
-     - `qwen3.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/qwen3.yaml>`_
-     - ``trtllm-serve Qwen/Qwen3-30B-A3B --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/qwen3.yaml`` (swap to another Qwen3 model name as needed)
+     - `qwen3.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/curated/qwen3.yaml>`_
+     - ``trtllm-serve Qwen/Qwen3-30B-A3B --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/curated/qwen3.yaml`` (swap to another Qwen3 model name as needed)
  * - `Llama-3.3-70B (FP8) <https://huggingface.co/nvidia/Llama-3.3-70B-Instruct-FP8>`_
    - Any
     - Max Throughput
-     - `llama-3.3-70b.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/llama-3.3-70b.yaml>`_
-     - ``trtllm-serve nvidia/Llama-3.3-70B-Instruct-FP8 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/llama-3.3-70b.yaml``
+     - `llama-3.3-70b.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/curated/llama-3.3-70b.yaml>`_
+     - ``trtllm-serve nvidia/Llama-3.3-70B-Instruct-FP8 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/curated/llama-3.3-70b.yaml``
  * - `Llama 4 Scout (FP8) <https://huggingface.co/nvidia/Llama-4-Scout-17B-16E-Instruct-FP8>`_
    - Any
     - Max Throughput
-     - `llama-4-scout.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/llama-4-scout.yaml>`_
-     - ``trtllm-serve nvidia/Llama-4-Scout-17B-16E-Instruct-FP8 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/llama-4-scout.yaml``
+     - `llama-4-scout.yaml <https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/curated/llama-4-scout.yaml>`_
+     - ``trtllm-serve nvidia/Llama-4-Scout-17B-16E-Instruct-FP8 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/curated/llama-4-scout.yaml``
 
 Model-Specific Deployment Guides
 ---------------------------------
@@ -94,3 +99,10 @@ The deployment guides below provide more detailed instructions for serving speci
    deployment-guide-for-qwen3-on-trtllm.md
    deployment-guide-for-qwen3-next-on-trtllm.md
    deployment-guide-for-kimi-k2-thinking-on-trtllm.md
+
+Comprehensive Configuration Database
+------------------------------------
+
+The table below lists all available pre-configured model scenarios in the TensorRT LLM configuration database. Each row represents a specific model, GPU, and performance profile combination with recommended request settings.
+
+.. include:: config_table.rst
docs/source/deployment-guide/note_sections.rst (new file, 36 lines)
@@ -0,0 +1,36 @@
+..
+   Reusable note sections for deployment guides.
+   Include specific notes using:
+
+   .. include:: note_sections.rst
+      :start-after: .. start-note-<name>
+      :end-before: .. end-note-<name>
+
+.. start-note-traffic-patterns
+
+.. note::
+
+   **Traffic Patterns**: The ISL (Input Sequence Length) and OSL (Output Sequence Length)
+   values in each configuration represent the **maximum supported values** for that config.
+   Requests exceeding these limits may result in errors.
+
+   To handle requests with input sequences **longer than the configured ISL**, add the following
+   to your config file:
+
+   .. code-block:: yaml
+
+      enable_chunked_prefill: true
+
+   This enables chunked prefill, which processes long input sequences in chunks rather than
+   requiring them to fit within a single prefill operation. Note that enabling chunked prefill
+   does **not** guarantee optimal performance; these configs are tuned for the specified ISL/OSL.
+
+.. end-note-traffic-patterns
+
+.. start-note-quick-start-isl-osl
+
+.. note::
+
+   The configs here are specifically optimized for a target ISL/OSL (Input/Output Sequence Length) of 1024/1024. If your traffic pattern is different, refer to the :ref:`Comprehensive Configuration Database` section below which covers a larger set of traffic patterns and performance profiles.
+
+.. end-note-quick-start-isl-osl
examples/configs/database/database.py (new file, 64 lines)
@@ -0,0 +1,64 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+from pathlib import Path
+from typing import Any, Dict, Iterator, List
+
+import yaml
+from pydantic import BaseModel, Field, RootModel
+
+DATABASE_LIST_PATH = Path(__file__).parent / "lookup.yaml"
+
+
+class RecipeConstraints(BaseModel):
+    """Recipe record for scenario list."""
+
+    model: str = Field(description="Model name")
+    gpu: str = Field(description="GPU name")
+    isl: int = Field(description="Input sequence length")
+    osl: int = Field(description="Output sequence length")
+    concurrency: int = Field(description="Concurrency")
+    config_path: str = Field(description="Configuration path")
+    num_gpus: int = Field(description="Number of GPUs")
+
+    def load_config(self) -> Dict[str, Any]:
+        """Load and return the YAML config at config_path."""
+        with open(self.config_path) as f:
+            data = yaml.safe_load(f)
+        return data if data is not None else {}
+
+
+class Recipe(BaseModel):
+    """Recipe that describes a single scenario."""
+
+    constraints: RecipeConstraints = Field(description="Recipe constraints")
+    env_overrides: Dict[str, Any] = Field(description="Environment overrides", default_factory=dict)
+    config: Dict[str, Any] = Field(description="Configuration overrides", default_factory=dict)
+
+
+class RecipeList(RootModel[List[RecipeConstraints]]):
+    @classmethod
+    def from_yaml(cls, yaml_path: Path) -> "RecipeList":
+        """Load and validate recipe list from YAML file."""
+        with open(yaml_path) as f:
+            data = yaml.safe_load(f)
+        return cls(data)
+
+    def __iter__(self) -> Iterator[RecipeConstraints]:
+        return iter(self.root)
+
+    def __len__(self) -> int:
+        return len(self.root)
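A minimal usage sketch of the `RecipeList`/`RecipeConstraints` API added above. It assumes `database.py` is importable alongside `lookup.yaml`; the filter values are illustrative assumptions, since `lookup.yaml`'s contents are suppressed in this view:

```python
# Hedged sketch: exercises only the API defined in database.py above.
from database import DATABASE_LIST_PATH, RecipeList

recipes = RecipeList.from_yaml(DATABASE_LIST_PATH)  # loads and validates lookup.yaml
print(f"Loaded {len(recipes)} recipes from {DATABASE_LIST_PATH}")

# Illustrative filter values; not confirmed entries of the suppressed lookup.yaml.
for rc in recipes:
    if rc.gpu == "H200" and rc.isl == 1024 and rc.osl == 1024:
        cfg = rc.load_config()  # parses the YAML file at rc.config_path
        print(rc.model, rc.num_gpus, cfg.get("max_num_tokens"))
```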
@@ -0,0 +1,18 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 256
+enable_attention_dp: false
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.8
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: DEEPGEMM
+tensor_parallel_size: 8
+moe_expert_parallel_size: 8
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 1152
+max_seq_len: 2068
@@ -0,0 +1,18 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 256
+enable_attention_dp: false
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.8
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: DEEPGEMM
+tensor_parallel_size: 8
+moe_expert_parallel_size: 8
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 1152
+max_seq_len: 2068
@@ -0,0 +1,18 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 256
+enable_attention_dp: false
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.8
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: DEEPGEMM
+tensor_parallel_size: 8
+moe_expert_parallel_size: 8
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 1152
+max_seq_len: 2068
@@ -0,0 +1,18 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 256
+enable_attention_dp: false
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.8
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: DEEPGEMM
+tensor_parallel_size: 8
+moe_expert_parallel_size: 8
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 1152
+max_seq_len: 2068
@@ -0,0 +1,18 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 256
+enable_attention_dp: false
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.8
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: DEEPGEMM
+tensor_parallel_size: 8
+moe_expert_parallel_size: 8
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 1152
+max_seq_len: 2068
@@ -0,0 +1,18 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 256
+enable_attention_dp: false
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.8
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: DEEPGEMM
+tensor_parallel_size: 8
+moe_expert_parallel_size: 8
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 8320
+max_seq_len: 9416
@@ -0,0 +1,18 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 256
+enable_attention_dp: false
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.8
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: DEEPGEMM
+tensor_parallel_size: 8
+moe_expert_parallel_size: 8
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 8320
+max_seq_len: 9416
@@ -0,0 +1,18 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 256
+enable_attention_dp: false
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.8
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: DEEPGEMM
+tensor_parallel_size: 8
+moe_expert_parallel_size: 8
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 8320
+max_seq_len: 9416
@@ -0,0 +1,22 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 256
+enable_attention_dp: true
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.8
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: DEEPGEMM
+attention_dp_config:
+  batching_wait_iters: 0
+  enable_balance: true
+  timeout_iters: 60
+tensor_parallel_size: 8
+moe_expert_parallel_size: 8
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 8320
+max_seq_len: 9416
@@ -0,0 +1,18 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 256
+enable_attention_dp: false
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.8
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: DEEPGEMM
+tensor_parallel_size: 8
+moe_expert_parallel_size: 8
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 8320
+max_seq_len: 9416
@@ -0,0 +1,18 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 128
+enable_attention_dp: false
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.75
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: CUTLASS
+tensor_parallel_size: 8
+moe_expert_parallel_size: 8
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 1152
+max_seq_len: 2068
@@ -0,0 +1,18 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 128
+enable_attention_dp: false
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.75
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: CUTLASS
+tensor_parallel_size: 8
+moe_expert_parallel_size: 8
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 1152
+max_seq_len: 2068
@@ -0,0 +1,18 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 128
+enable_attention_dp: false
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.75
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: CUTLASS
+tensor_parallel_size: 8
+moe_expert_parallel_size: 8
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 1152
+max_seq_len: 2068
@@ -0,0 +1,18 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 128
+enable_attention_dp: false
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.75
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: CUTLASS
+tensor_parallel_size: 8
+moe_expert_parallel_size: 8
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 1152
+max_seq_len: 2068
@@ -0,0 +1,18 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 128
+enable_attention_dp: false
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.75
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: CUTLASS
+tensor_parallel_size: 8
+moe_expert_parallel_size: 8
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 1152
+max_seq_len: 2068
@@ -0,0 +1,18 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 128
+enable_attention_dp: false
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.75
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: CUTLASS
+tensor_parallel_size: 8
+moe_expert_parallel_size: 8
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 8320
+max_seq_len: 9416
@@ -0,0 +1,18 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 128
+enable_attention_dp: false
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.75
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: CUTLASS
+tensor_parallel_size: 8
+moe_expert_parallel_size: 8
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 8320
+max_seq_len: 9416
@@ -0,0 +1,18 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 128
+enable_attention_dp: false
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.75
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: CUTLASS
+tensor_parallel_size: 8
+moe_expert_parallel_size: 8
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 8320
+max_seq_len: 9416
@@ -0,0 +1,22 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 128
+enable_attention_dp: true
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.75
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: CUTLASS
+attention_dp_config:
+  batching_wait_iters: 0
+  enable_balance: true
+  timeout_iters: 60
+tensor_parallel_size: 8
+moe_expert_parallel_size: 8
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 8320
+max_seq_len: 9416
@@ -0,0 +1,18 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 128
+enable_attention_dp: false
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.75
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: CUTLASS
+tensor_parallel_size: 8
+moe_expert_parallel_size: 8
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 8320
+max_seq_len: 9416
examples/configs/database/lookup.yaml (new file, 1176 lines)
File diff suppressed because it is too large
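The `lookup.yaml` diff is suppressed above, but its schema follows the `RecipeConstraints` model from `database.py`: a top-level list of records with `model`, `gpu`, `isl`, `osl`, `concurrency`, `config_path`, and `num_gpus` fields. A hedged sketch of validating one such record; every value below is illustrative rather than read from the actual file:

```python
# Field names come from RecipeConstraints in database.py; values are assumptions.
from database import RecipeConstraints

entry = RecipeConstraints(
    model="openai/gpt-oss-120b",  # assumed model identifier
    gpu="H100",                   # assumed GPU name
    isl=1024,
    osl=1024,
    concurrency=64,
    config_path="examples/configs/database/example.yaml",  # hypothetical path
    num_gpus=2,
)
print(entry.model_dump())  # pydantic v2 dict form of the validated record
```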
@@ -0,0 +1,18 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 512
+enable_attention_dp: false
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.8
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: TRTLLM
+tensor_parallel_size: 4
+moe_expert_parallel_size: 4
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 1216
+max_seq_len: 2068
@@ -0,0 +1,18 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 512
+enable_attention_dp: false
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.8
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: TRTLLM
+tensor_parallel_size: 4
+moe_expert_parallel_size: 4
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 1152
+max_seq_len: 2068
@@ -0,0 +1,22 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 512
+enable_attention_dp: true
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.8
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: CUTLASS
+attention_dp_config:
+  batching_wait_iters: 0
+  enable_balance: true
+  timeout_iters: 60
+tensor_parallel_size: 4
+moe_expert_parallel_size: 4
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 1344
+max_seq_len: 2068
@@ -0,0 +1,18 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 512
+enable_attention_dp: false
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.8
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: TRTLLM
+tensor_parallel_size: 4
+moe_expert_parallel_size: 4
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 1152
+max_seq_len: 2068
@@ -0,0 +1,18 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 512
+enable_attention_dp: false
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.8
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: TRTLLM
+tensor_parallel_size: 4
+moe_expert_parallel_size: 4
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 1152
+max_seq_len: 2068
@@ -0,0 +1,18 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 512
+enable_attention_dp: false
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.8
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: TRTLLM
+tensor_parallel_size: 4
+moe_expert_parallel_size: 4
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 1152
+max_seq_len: 2068
@@ -0,0 +1,18 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 512
+enable_attention_dp: false
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.8
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: TRTLLM
+tensor_parallel_size: 4
+moe_expert_parallel_size: 4
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 1152
+max_seq_len: 2068
@@ -0,0 +1,18 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 512
+enable_attention_dp: false
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.8
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: TRTLLM
+tensor_parallel_size: 8
+moe_expert_parallel_size: 8
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 1216
+max_seq_len: 2068
@@ -0,0 +1,18 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 512
+enable_attention_dp: false
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.8
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: TRTLLM
+tensor_parallel_size: 8
+moe_expert_parallel_size: 8
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 1152
+max_seq_len: 2068
@@ -0,0 +1,22 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 512
+enable_attention_dp: true
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.8
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: CUTLASS
+attention_dp_config:
+  batching_wait_iters: 0
+  enable_balance: true
+  timeout_iters: 60
+tensor_parallel_size: 8
+moe_expert_parallel_size: 8
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 1344
+max_seq_len: 2068
@@ -0,0 +1,18 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 512
+enable_attention_dp: false
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.8
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: TRTLLM
+tensor_parallel_size: 8
+moe_expert_parallel_size: 8
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 1152
+max_seq_len: 2068
@@ -0,0 +1,18 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 512
+enable_attention_dp: false
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.8
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: TRTLLM
+tensor_parallel_size: 8
+moe_expert_parallel_size: 8
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 1152
+max_seq_len: 2068
@@ -0,0 +1,18 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 512
+enable_attention_dp: false
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.8
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: TRTLLM
+tensor_parallel_size: 8
+moe_expert_parallel_size: 8
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 1152
+max_seq_len: 2068
@@ -0,0 +1,18 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 512
+enable_attention_dp: false
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.8
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: TRTLLM
+tensor_parallel_size: 8
+moe_expert_parallel_size: 8
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 1152
+max_seq_len: 2068
@@ -0,0 +1,22 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 512
+enable_attention_dp: true
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.8
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: CUTLASS
+attention_dp_config:
+  batching_wait_iters: 0
+  enable_balance: true
+  timeout_iters: 60
+tensor_parallel_size: 4
+moe_expert_parallel_size: 4
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 8384
+max_seq_len: 9416
@@ -0,0 +1,18 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 512
+enable_attention_dp: false
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.8
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: TRTLLM
+tensor_parallel_size: 4
+moe_expert_parallel_size: 4
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 8320
+max_seq_len: 9416
@@ -0,0 +1,22 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 512
+enable_attention_dp: true
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.8
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: CUTLASS
+attention_dp_config:
+  batching_wait_iters: 0
+  enable_balance: true
+  timeout_iters: 60
+tensor_parallel_size: 4
+moe_expert_parallel_size: 4
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 8512
+max_seq_len: 9416
@@ -0,0 +1,18 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 512
+enable_attention_dp: false
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.8
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: TRTLLM
+tensor_parallel_size: 4
+moe_expert_parallel_size: 4
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 8320
+max_seq_len: 9416
@@ -0,0 +1,18 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 512
+enable_attention_dp: false
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.8
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: TRTLLM
+tensor_parallel_size: 4
+moe_expert_parallel_size: 4
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 8320
+max_seq_len: 9416
@@ -0,0 +1,22 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 512
+enable_attention_dp: true
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.8
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: CUTLASS
+attention_dp_config:
+  batching_wait_iters: 0
+  enable_balance: true
+  timeout_iters: 60
+tensor_parallel_size: 4
+moe_expert_parallel_size: 4
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 8320
+max_seq_len: 9416
@@ -0,0 +1,18 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 512
+enable_attention_dp: false
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.8
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: TRTLLM
+tensor_parallel_size: 4
+moe_expert_parallel_size: 4
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 8320
+max_seq_len: 9416
@@ -0,0 +1,22 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 512
+enable_attention_dp: true
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.8
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: CUTLASS
+attention_dp_config:
+  batching_wait_iters: 0
+  enable_balance: true
+  timeout_iters: 60
+tensor_parallel_size: 8
+moe_expert_parallel_size: 8
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 8384
+max_seq_len: 9416
@@ -0,0 +1,18 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 512
+enable_attention_dp: false
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.8
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: TRTLLM
+tensor_parallel_size: 8
+moe_expert_parallel_size: 8
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 8320
+max_seq_len: 9416
@@ -0,0 +1,22 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 512
+enable_attention_dp: true
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.8
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: CUTLASS
+attention_dp_config:
+  batching_wait_iters: 0
+  enable_balance: true
+  timeout_iters: 60
+tensor_parallel_size: 8
+moe_expert_parallel_size: 8
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 8512
+max_seq_len: 9416
@@ -0,0 +1,18 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 512
+enable_attention_dp: false
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.8
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: TRTLLM
+tensor_parallel_size: 8
+moe_expert_parallel_size: 8
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 8320
+max_seq_len: 9416
@@ -0,0 +1,18 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 512
+enable_attention_dp: false
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.8
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: TRTLLM
+tensor_parallel_size: 8
+moe_expert_parallel_size: 8
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 8320
+max_seq_len: 9416
@@ -0,0 +1,22 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 512
+enable_attention_dp: true
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.8
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: CUTLASS
+attention_dp_config:
+  batching_wait_iters: 0
+  enable_balance: true
+  timeout_iters: 60
+tensor_parallel_size: 8
+moe_expert_parallel_size: 8
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 8320
+max_seq_len: 9416
@@ -0,0 +1,18 @@
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 512
+enable_attention_dp: false
+print_iter_log: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.8
+  enable_block_reuse: false
+stream_interval: 10
+moe_config:
+  backend: TRTLLM
+tensor_parallel_size: 8
+moe_expert_parallel_size: 8
+trust_remote_code: true
+backend: pytorch
+max_num_tokens: 8320
+max_seq_len: 9416
@ -0,0 +1,22 @@
|
||||
env_overrides:
|
||||
TRTLLM_ENABLE_PDL: 1
|
||||
NCCL_GRAPH_REGISTER: 0
|
||||
cuda_graph_config:
|
||||
enable_padding: true
|
||||
max_batch_size: 16
|
||||
enable_attention_dp: false
|
||||
kv_cache_config:
|
||||
dtype: fp8
|
||||
enable_block_reuse: false
|
||||
free_gpu_memory_fraction: 0.85
|
||||
print_iter_log: true
|
||||
stream_interval: 20
|
||||
num_postprocess_workers: 4
|
||||
moe_config:
|
||||
backend: TRTLLM
|
||||
tensor_parallel_size: 1
|
||||
moe_expert_parallel_size: 1
|
||||
trust_remote_code: true
|
||||
backend: pytorch
|
||||
max_num_tokens: 20000
|
||||
max_seq_len: 2068
|
||||
@ -0,0 +1,22 @@
|
||||
env_overrides:
|
||||
TRTLLM_ENABLE_PDL: 1
|
||||
NCCL_GRAPH_REGISTER: 0
|
||||
cuda_graph_config:
|
||||
enable_padding: true
|
||||
max_batch_size: 32
|
||||
enable_attention_dp: false
|
||||
kv_cache_config:
|
||||
dtype: fp8
|
||||
enable_block_reuse: false
|
||||
free_gpu_memory_fraction: 0.85
|
||||
print_iter_log: true
|
||||
stream_interval: 20
|
||||
num_postprocess_workers: 4
|
||||
moe_config:
|
||||
backend: TRTLLM
|
||||
tensor_parallel_size: 1
|
||||
moe_expert_parallel_size: 1
|
||||
trust_remote_code: true
|
||||
backend: pytorch
|
||||
max_num_tokens: 20000
|
||||
max_seq_len: 2068
|
||||
@ -0,0 +1,22 @@
|
||||
env_overrides:
|
||||
TRTLLM_ENABLE_PDL: 1
|
||||
NCCL_GRAPH_REGISTER: 0
|
||||
cuda_graph_config:
|
||||
enable_padding: true
|
||||
max_batch_size: 4
|
||||
enable_attention_dp: false
|
||||
kv_cache_config:
|
||||
dtype: fp8
|
||||
enable_block_reuse: false
|
||||
free_gpu_memory_fraction: 0.85
|
||||
print_iter_log: true
|
||||
stream_interval: 20
|
||||
num_postprocess_workers: 4
|
||||
moe_config:
|
||||
backend: TRTLLM
|
||||
tensor_parallel_size: 1
|
||||
moe_expert_parallel_size: 1
|
||||
trust_remote_code: true
|
||||
backend: pytorch
|
||||
max_num_tokens: 20000
|
||||
max_seq_len: 2068
|
||||
@ -0,0 +1,22 @@
|
||||
env_overrides:
|
||||
TRTLLM_ENABLE_PDL: 1
|
||||
NCCL_GRAPH_REGISTER: 0
|
||||
cuda_graph_config:
|
||||
enable_padding: true
|
||||
max_batch_size: 64
|
||||
enable_attention_dp: false
|
||||
kv_cache_config:
|
||||
dtype: fp8
|
||||
enable_block_reuse: false
|
||||
free_gpu_memory_fraction: 0.85
|
||||
print_iter_log: true
|
||||
stream_interval: 20
|
||||
num_postprocess_workers: 4
|
||||
moe_config:
|
||||
backend: TRTLLM
|
||||
tensor_parallel_size: 1
|
||||
moe_expert_parallel_size: 1
|
||||
trust_remote_code: true
|
||||
backend: pytorch
|
||||
max_num_tokens: 20000
|
||||
max_seq_len: 2068
|
||||
@ -0,0 +1,22 @@
|
||||
env_overrides:
|
||||
TRTLLM_ENABLE_PDL: 1
|
||||
NCCL_GRAPH_REGISTER: 0
|
||||
cuda_graph_config:
|
||||
enable_padding: true
|
||||
max_batch_size: 8
|
||||
enable_attention_dp: false
|
||||
kv_cache_config:
|
||||
dtype: fp8
|
||||
enable_block_reuse: false
|
||||
free_gpu_memory_fraction: 0.85
|
||||
print_iter_log: true
|
||||
stream_interval: 20
|
||||
num_postprocess_workers: 4
|
||||
moe_config:
|
||||
backend: TRTLLM
|
||||
tensor_parallel_size: 1
|
||||
moe_expert_parallel_size: 1
|
||||
trust_remote_code: true
|
||||
backend: pytorch
|
||||
max_num_tokens: 20000
|
||||
max_seq_len: 2068
|
||||
@ -0,0 +1,22 @@
|
||||
env_overrides:
|
||||
TRTLLM_ENABLE_PDL: 1
|
||||
NCCL_GRAPH_REGISTER: 0
|
||||
cuda_graph_config:
|
||||
enable_padding: true
|
||||
max_batch_size: 16
|
||||
enable_attention_dp: false
|
||||
kv_cache_config:
|
||||
dtype: fp8
|
||||
enable_block_reuse: false
|
||||
free_gpu_memory_fraction: 0.85
|
||||
print_iter_log: true
|
||||
stream_interval: 20
|
||||
num_postprocess_workers: 4
|
||||
moe_config:
|
||||
backend: TRTLLM
|
||||
tensor_parallel_size: 2
|
||||
moe_expert_parallel_size: 2
|
||||
trust_remote_code: true
|
||||
backend: pytorch
|
||||
max_num_tokens: 20000
|
||||
max_seq_len: 2068
|
||||
@ -0,0 +1,22 @@
|
||||
env_overrides:
|
||||
TRTLLM_ENABLE_PDL: 1
|
||||
NCCL_GRAPH_REGISTER: 0
|
||||
cuda_graph_config:
|
||||
enable_padding: true
|
||||
max_batch_size: 32
|
||||
enable_attention_dp: false
|
||||
kv_cache_config:
|
||||
dtype: fp8
|
||||
enable_block_reuse: false
|
||||
free_gpu_memory_fraction: 0.85
|
||||
print_iter_log: true
|
||||
stream_interval: 20
|
||||
num_postprocess_workers: 4
|
||||
moe_config:
|
||||
backend: TRTLLM
|
||||
tensor_parallel_size: 2
|
||||
moe_expert_parallel_size: 2
|
||||
trust_remote_code: true
|
||||
backend: pytorch
|
||||
max_num_tokens: 20000
|
||||
max_seq_len: 2068
|
||||
@ -0,0 +1,22 @@
|
||||
env_overrides:
|
||||
TRTLLM_ENABLE_PDL: 1
|
||||
NCCL_GRAPH_REGISTER: 0
|
||||
cuda_graph_config:
|
||||
enable_padding: true
|
||||
max_batch_size: 4
|
||||
enable_attention_dp: false
|
||||
kv_cache_config:
|
||||
dtype: fp8
|
||||
enable_block_reuse: false
|
||||
free_gpu_memory_fraction: 0.85
|
||||
print_iter_log: true
|
||||
stream_interval: 20
|
||||
num_postprocess_workers: 4
|
||||
moe_config:
|
||||
backend: TRTLLM
|
||||
tensor_parallel_size: 2
|
||||
moe_expert_parallel_size: 2
|
||||
trust_remote_code: true
|
||||
backend: pytorch
|
||||
max_num_tokens: 20000
|
||||
max_seq_len: 2068
|
||||
@ -0,0 +1,22 @@
|
||||
env_overrides:
|
||||
TRTLLM_ENABLE_PDL: 1
|
||||
NCCL_GRAPH_REGISTER: 0
|
||||
cuda_graph_config:
|
||||
enable_padding: true
|
||||
max_batch_size: 64
|
||||
enable_attention_dp: false
|
||||
kv_cache_config:
|
||||
dtype: fp8
|
||||
enable_block_reuse: false
|
||||
free_gpu_memory_fraction: 0.85
|
||||
print_iter_log: true
|
||||
stream_interval: 20
|
||||
num_postprocess_workers: 4
|
||||
moe_config:
|
||||
backend: TRTLLM
|
||||
tensor_parallel_size: 2
|
||||
moe_expert_parallel_size: 2
|
||||
trust_remote_code: true
|
||||
backend: pytorch
|
||||
max_num_tokens: 20000
|
||||
max_seq_len: 2068
|
||||
@ -0,0 +1,22 @@
env_overrides:
  TRTLLM_ENABLE_PDL: 1
  NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
  enable_padding: true
  max_batch_size: 8
enable_attention_dp: false
kv_cache_config:
  dtype: fp8
  enable_block_reuse: false
  free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
  backend: TRTLLM
tensor_parallel_size: 2
moe_expert_parallel_size: 2
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068
@ -0,0 +1,22 @@
env_overrides:
  TRTLLM_ENABLE_PDL: 1
  NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
  enable_padding: true
  max_batch_size: 16
enable_attention_dp: false
kv_cache_config:
  dtype: fp8
  enable_block_reuse: false
  free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
  backend: TRTLLM
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068
@ -0,0 +1,22 @@
env_overrides:
  TRTLLM_ENABLE_PDL: 1
  NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
  enable_padding: true
  max_batch_size: 32
enable_attention_dp: false
kv_cache_config:
  dtype: fp8
  enable_block_reuse: false
  free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
  backend: TRTLLM
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068
@ -0,0 +1,22 @@
env_overrides:
  TRTLLM_ENABLE_PDL: 1
  NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
  enable_padding: true
  max_batch_size: 4
enable_attention_dp: false
kv_cache_config:
  dtype: fp8
  enable_block_reuse: false
  free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
  backend: TRTLLM
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068
@ -0,0 +1,22 @@
env_overrides:
  TRTLLM_ENABLE_PDL: 1
  NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
  enable_padding: true
  max_batch_size: 64
enable_attention_dp: false
kv_cache_config:
  dtype: fp8
  enable_block_reuse: false
  free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
  backend: TRTLLM
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068
@ -0,0 +1,22 @@
env_overrides:
  TRTLLM_ENABLE_PDL: 1
  NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
  enable_padding: true
  max_batch_size: 8
enable_attention_dp: false
kv_cache_config:
  dtype: fp8
  enable_block_reuse: false
  free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
  backend: TRTLLM
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068
@ -0,0 +1,22 @@
env_overrides:
  TRTLLM_ENABLE_PDL: 1
  NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
  enable_padding: true
  max_batch_size: 16
enable_attention_dp: false
kv_cache_config:
  dtype: fp8
  enable_block_reuse: false
  free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
  backend: TRTLLM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068
@ -0,0 +1,22 @@
env_overrides:
  TRTLLM_ENABLE_PDL: 1
  NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
  enable_padding: true
  max_batch_size: 32
enable_attention_dp: false
kv_cache_config:
  dtype: fp8
  enable_block_reuse: false
  free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
  backend: TRTLLM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068
@ -0,0 +1,22 @@
env_overrides:
  TRTLLM_ENABLE_PDL: 1
  NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
  enable_padding: true
  max_batch_size: 4
enable_attention_dp: false
kv_cache_config:
  dtype: fp8
  enable_block_reuse: false
  free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
  backend: TRTLLM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068
@ -0,0 +1,22 @@
env_overrides:
  TRTLLM_ENABLE_PDL: 1
  NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
  enable_padding: true
  max_batch_size: 64
enable_attention_dp: false
kv_cache_config:
  dtype: fp8
  enable_block_reuse: false
  free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
  backend: TRTLLM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068
@ -0,0 +1,22 @@
env_overrides:
  TRTLLM_ENABLE_PDL: 1
  NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
  enable_padding: true
  max_batch_size: 8
enable_attention_dp: false
kv_cache_config:
  dtype: fp8
  enable_block_reuse: false
  free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
  backend: TRTLLM
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 2068
@ -0,0 +1,22 @@
env_overrides:
  TRTLLM_ENABLE_PDL: 1
  NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
  enable_padding: true
  max_batch_size: 16
enable_attention_dp: false
kv_cache_config:
  dtype: fp8
  enable_block_reuse: false
  free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
  backend: TRTLLM
tensor_parallel_size: 1
moe_expert_parallel_size: 1
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 9236
@ -0,0 +1,22 @@
env_overrides:
  TRTLLM_ENABLE_PDL: 1
  NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
  enable_padding: true
  max_batch_size: 32
enable_attention_dp: false
kv_cache_config:
  dtype: fp8
  enable_block_reuse: false
  free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
  backend: TRTLLM
tensor_parallel_size: 1
moe_expert_parallel_size: 1
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 9236
@ -0,0 +1,22 @@
env_overrides:
  TRTLLM_ENABLE_PDL: 1
  NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
  enable_padding: true
  max_batch_size: 4
enable_attention_dp: false
kv_cache_config:
  dtype: fp8
  enable_block_reuse: false
  free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
  backend: TRTLLM
tensor_parallel_size: 1
moe_expert_parallel_size: 1
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 9236
@ -0,0 +1,22 @@
env_overrides:
  TRTLLM_ENABLE_PDL: 1
  NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
  enable_padding: true
  max_batch_size: 64
enable_attention_dp: false
kv_cache_config:
  dtype: fp8
  enable_block_reuse: false
  free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
  backend: TRTLLM
tensor_parallel_size: 1
moe_expert_parallel_size: 1
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 9236
@ -0,0 +1,22 @@
env_overrides:
  TRTLLM_ENABLE_PDL: 1
  NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
  enable_padding: true
  max_batch_size: 8
enable_attention_dp: false
kv_cache_config:
  dtype: fp8
  enable_block_reuse: false
  free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
  backend: TRTLLM
tensor_parallel_size: 1
moe_expert_parallel_size: 1
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 9236
@ -0,0 +1,22 @@
env_overrides:
  TRTLLM_ENABLE_PDL: 1
  NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
  enable_padding: true
  max_batch_size: 16
enable_attention_dp: false
kv_cache_config:
  dtype: fp8
  enable_block_reuse: false
  free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
  backend: TRTLLM
tensor_parallel_size: 2
moe_expert_parallel_size: 2
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 9236
@ -0,0 +1,22 @@
env_overrides:
  TRTLLM_ENABLE_PDL: 1
  NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
  enable_padding: true
  max_batch_size: 32
enable_attention_dp: false
kv_cache_config:
  dtype: fp8
  enable_block_reuse: false
  free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
  backend: TRTLLM
tensor_parallel_size: 2
moe_expert_parallel_size: 2
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 9236
@ -0,0 +1,22 @@
env_overrides:
  TRTLLM_ENABLE_PDL: 1
  NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
  enable_padding: true
  max_batch_size: 4
enable_attention_dp: false
kv_cache_config:
  dtype: fp8
  enable_block_reuse: false
  free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
  backend: TRTLLM
tensor_parallel_size: 2
moe_expert_parallel_size: 2
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 9236
@ -0,0 +1,22 @@
env_overrides:
  TRTLLM_ENABLE_PDL: 1
  NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
  enable_padding: true
  max_batch_size: 64
enable_attention_dp: false
kv_cache_config:
  dtype: fp8
  enable_block_reuse: false
  free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
  backend: TRTLLM
tensor_parallel_size: 2
moe_expert_parallel_size: 2
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 9236
@ -0,0 +1,22 @@
env_overrides:
  TRTLLM_ENABLE_PDL: 1
  NCCL_GRAPH_REGISTER: 0
cuda_graph_config:
  enable_padding: true
  max_batch_size: 8
enable_attention_dp: false
kv_cache_config:
  dtype: fp8
  enable_block_reuse: false
  free_gpu_memory_fraction: 0.85
print_iter_log: true
stream_interval: 20
num_postprocess_workers: 4
moe_config:
  backend: TRTLLM
tensor_parallel_size: 2
moe_expert_parallel_size: 2
trust_remote_code: true
backend: pytorch
max_num_tokens: 20000
max_seq_len: 9236
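Each of these curated files is a complete extra-options payload for the server. A minimal usage sketch, assuming one of them has been saved locally as `config.yaml` and that the model identifier below is a placeholder for the checkpoint you are actually serving:

```shell
# Hypothetical file path and model name; substitute your own.
trtllm-serve nvidia/DeepSeek-R1-0528-FP4 \
    --extra_llm_api_options ./config.yaml
```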
Some files were not shown because too many files have changed in this diff.