[https://nvbugs/5729847][doc] fix broken links to modelopt (#9868)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
commit 67ffa90d62 (parent 8550abf142)
@@ -438,10 +438,10 @@ checkpoint. For the Llama-3.1 models, TensorRT LLM provides the following checkp
 - [`nvidia/Llama-3.1-70B-Instruct-FP8`](https://huggingface.co/nvidia/Llama-3.1-70B-Instruct-FP8)
 - [`nvidia/Llama-3.1-405B-Instruct-FP8`](https://huggingface.co/nvidia/Llama-3.1-405B-Instruct-FP8)

-To understand more about how to quantize your own checkpoints, refer to ModelOpt [documentation](https://nvidia.github.io/TensorRT-Model-Optimizer/deployment/1_tensorrt_llm.html).
+To understand more about how to quantize your own checkpoints, refer to ModelOpt [documentation](https://nvidia.github.io/Model-Optimizer/deployment/3_unified_hf.html).

 `trtllm-bench` utilizes the `hf_quant_config.json` file present in the pre-quantized checkpoints above. The configuration
-file is present in checkpoints quantized with [TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer)
+file is present in checkpoints quantized with [Model Optimizer](https://github.com/NVIDIA/Model-Optimizer)
 and describes the compute and KV cache quantization that checkpoint was compiled with. For example, from the checkpoints
 above:
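The `hf_quant_config.json` example referenced at the end of the hunk above falls outside the diff context. As a rough, hypothetical illustration of what `trtllm-bench` reads from such a checkpoint (the path and field names follow the layout commonly produced by ModelOpt FP8 exports and are not taken from this commit):

```python
import json

# Illustrative only: inspect the quantization recipe that trtllm-bench consumes from a
# pre-quantized checkpoint directory. The path and keys are assumptions, e.g.
# {"quantization": {"quant_algo": "FP8", "kv_cache_quant_algo": "FP8"}}.
with open("Llama-3.1-70B-Instruct-FP8/hf_quant_config.json") as f:
    hf_quant_config = json.load(f)

quantization = hf_quant_config.get("quantization", {})
print("compute quantization:", quantization.get("quant_algo"))
print("KV cache quantization:", quantization.get("kv_cache_quant_algo"))
```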
@@ -7,7 +7,7 @@ This document summarizes performance measurements of TensorRT-LLM on a number of
 The data in the following tables is provided as a reference point to help users validate observed performance.
 It should *not* be considered as the peak performance that can be delivered by TensorRT-LLM.

 Not all configurations were tested for all GPUs.

 We attempted to keep commands as simple as possible to ease reproducibility and left many options at their default settings.
 Tuning batch sizes, parallelism configurations, and other options may lead to improved performance depending on your situation.
@@ -24,9 +24,9 @@ and shows the throughput scenario under maximum load. The reported metric is `Ou
 The performance numbers below were collected using the steps described in this document.

-Testing was performed on models with weights quantized using [ModelOpt](https://nvidia.github.io/TensorRT-Model-Optimizer/#) and published by NVIDIA on the [Model Optimizer HuggingFace Collection](https://huggingface.co/collections/nvidia/model-optimizer-66aa84f7966b3150262481a4).
+Testing was performed on models with weights quantized using [ModelOpt](https://nvidia.github.io/Model-Optimizer/) and published by NVIDIA on the [Model Optimizer HuggingFace Collection](https://huggingface.co/collections/nvidia/model-optimizer-66aa84f7966b3150262481a4).

 RTX 6000 Pro Blackwell Server Edition data is now included in the perf overview. RTX 6000 systems can benefit from enabling pipeline parallelism (PP) in LLM workloads, so we included several new benchmarks for this GPU at various TP x PP combinations. That data is presented in a separate table for each network.

 ### Hardware
@@ -64,7 +64,7 @@ nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8
 All performance values are measured in `output tokens per second per GPU`, where `output tokens` includes the first and all subsequent generated tokens (input tokens are not included).

 Data in these tables is taken from the `Per GPU Output Throughput (tps/gpu)` metric reported by `trtllm-bench`.
 The calculations for metrics reported by trtllm-bench can be found in the dataclasses [reporting.py](../../../tensorrt_llm/bench/dataclasses/reporting.py#L570) and [statistics.py](../../../tensorrt_llm/bench/dataclasses/statistics.py#L188)
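As a back-of-the-envelope illustration of the `Per GPU Output Throughput (tps/gpu)` metric described in the hunk above (all numbers are made up for illustration, not benchmark results):

```python
# Hypothetical run: 3,000 requests x 1,000 output tokens generated in 120 s of
# wall time on 8 GPUs. Output tokens include the first and all subsequent tokens;
# input tokens are not counted.
total_output_tokens = 3_000 * 1_000
wall_time_s = 120.0
num_gpus = 8

per_gpu_output_throughput = total_output_tokens / wall_time_s / num_gpus
print(f"{per_gpu_output_throughput:.1f} tps/gpu")  # -> 3125.0 tps/gpu
```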
@@ -23,7 +23,7 @@ The default PyTorch backend supports FP4 and FP8 quantization on the latest Blac
 ### Running Pre-quantized Models

-TensorRT LLM can directly run [pre-quantized models](https://huggingface.co/collections/nvidia/model-optimizer-66aa84f7966b3150262481a4) generated with the [NVIDIA TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer).
+TensorRT LLM can directly run [pre-quantized models](https://huggingface.co/collections/nvidia/model-optimizer-66aa84f7966b3150262481a4) generated with the [NVIDIA Model Optimizer](https://github.com/NVIDIA/Model-Optimizer).

 ```python
 from tensorrt_llm import LLM
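The Python snippet in the hunk above is truncated at the hunk boundary. A minimal sketch of how a pre-quantized checkpoint is typically loaded through that `LLM` API follows; the specific model ID is an assumption, not taken from this diff:

```python
from tensorrt_llm import LLM

# Hypothetical example: point the LLM API at one of the pre-quantized FP8 checkpoints
# from the Model Optimizer collection. The quantization settings are picked up from
# the checkpoint's hf_quant_config.json rather than passed explicitly.
llm = LLM(model="nvidia/Llama-3.1-8B-Instruct-FP8")
```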
@@ -91,7 +91,7 @@ The language component decides which quantization methods are supported by a giv
 ```

 ## Hardware Support Matrix

 | Model | NVFP4 | MXFP4 | FP8(per tensor)| FP8(block scaling) | FP8(rowwise) | FP8 KV Cache |W4A8 AWQ | W4A16 AWQ | W4A8 GPTQ | W4A16 GPTQ |
 | :------------- | :---: | :---: | :---: | :---: | :---: | :---: | :-------: | :-------: | :--------: | :--------: |
@@ -109,4 +109,4 @@ FP8 block wise scaling GEMM kernels for sm100/103 are using MXFP8 recipe (E4M3 a
 ## Quick Links

 - [Pre-quantized Models by ModelOpt](https://huggingface.co/collections/nvidia/model-optimizer-66aa84f7966b3150262481a4)
-- [ModelOpt Support Matrix](https://nvidia.github.io/TensorRT-Model-Optimizer/guides/0_support_matrix.html)
+- [ModelOpt Support Matrix](https://nvidia.github.io/Model-Optimizer/guides/0_support_matrix.html)
@@ -92,14 +92,14 @@ python lm_eval_ad.py \
 ### Mixed-precision Quantization using TensorRT Model Optimizer

-TensorRT Model Optimizer [AutoQuantize](https://nvidia.github.io/TensorRT-Model-Optimizer/reference/generated/modelopt.torch.quantization.model_quant.html#modelopt.torch.quantization.model_quant.auto_quantize) algorithm is a PTQ algorithm from ModelOpt which quantizes a model by searching for the best quantization format per-layer while meeting the performance constraint specified by the user. This way, `AutoQuantize` enables to trade-off model accuracy for performance.
+TensorRT Model Optimizer [AutoQuantize](https://nvidia.github.io/Model-Optimizer/reference/generated/modelopt.torch.quantization.model_quant.html#modelopt.torch.quantization.model_quant.auto_quantize) algorithm is a PTQ algorithm from ModelOpt which quantizes a model by searching for the best quantization format per-layer while meeting the performance constraint specified by the user. This way, `AutoQuantize` enables to trade-off model accuracy for performance.

 Currently `AutoQuantize` supports only `effective_bits` as the performance constraint (for both weight-only quantization and weight & activation quantization). See
-[AutoQuantize documentation](https://nvidia.github.io/TensorRT-Model-Optimizer/reference/generated/modelopt.torch.quantization.model_quant.html#modelopt.torch.quantization.model_quant.auto_quantize) for more details.
+[AutoQuantize documentation](https://nvidia.github.io/Model-Optimizer/reference/generated/modelopt.torch.quantization.model_quant.html#modelopt.torch.quantization.model_quant.auto_quantize) for more details.

 #### 1. Quantize a model with ModelOpt

-Refer to [NVIDIA TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/examples/llm_autodeploy/README.md) for generating quantized model checkpoint.
+Refer to [NVIDIA Model Optimizer](https://github.com/NVIDIA/Model-Optimizer/blob/main/examples/llm_autodeploy/README.md) for generating quantized model checkpoint.

 #### 2. Deploy the quantized model with AutoDeploy
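For context on the `AutoQuantize` flow referenced in the final hunk, a rough sketch of invoking it through ModelOpt is shown below. Argument names, config constants, and the calibration callables are assumptions based on the linked ModelOpt reference and may differ between versions; this is not code from this commit:

```python
import modelopt.torch.quantization as mtq

# Hedged sketch: search for a per-layer mix of quantization formats under an
# effective-bits budget. `model` and `calib_loader` are placeholders supplied by the
# caller; consult the AutoQuantize reference linked above for the exact signature.
def run_auto_quantize(model, calib_loader):
    quantized_model, search_state = mtq.auto_quantize(
        model,
        constraints={"effective_bits": 4.8},          # currently the only supported constraint
        quantization_formats=[mtq.NVFP4_DEFAULT_CFG,  # candidate per-layer formats (assumed)
                              mtq.FP8_DEFAULT_CFG],
        data_loader=calib_loader,
        forward_step=lambda m, batch: m(**batch),     # one calibration forward pass (assumed dict batches)
        loss_func=lambda output, batch: output.loss,  # score used by the search (assumed HF-style output)
    )
    return quantized_model
```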