# Quantization
The PyTorch backend supports FP8 and NVFP4 quantization. You can pass quantized models from the HF model hub, which are generated by [TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer).

```python
from tensorrt_llm._torch import LLM

# Load a pre-quantized FP8 checkpoint directly from the HF model hub.
llm = LLM(model='nvidia/Llama-3.1-8B-Instruct-FP8')
llm.generate("Hello, my name is")
```
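
`generate` returns one result per prompt. A minimal sketch of batch generation that prints the completions, assuming the LLM API's standard request-output structure (`output.prompt`, `output.outputs[0].text`); the prompts here are illustrative:

```python
from tensorrt_llm._torch import LLM

llm = LLM(model='nvidia/Llama-3.1-8B-Instruct-FP8')

# Illustrative prompts; generate() accepts a single string or a list.
prompts = ["Hello, my name is", "The capital of France is"]
for output in llm.generate(prompts):
    # Each result carries its prompt and the generated completions.
    print(f"{output.prompt!r} -> {output.outputs[0].text!r}")
```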
Alternatively, you can produce a quantized model yourself with the following commands:

```bash
# Clone TensorRT Model Optimizer and switch to its post-training quantization (PTQ) example.
git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git
cd TensorRT-Model-Optimizer/examples/llm_ptq

# Quantize <huggingface_model_card> to FP8 and export the checkpoint in HF format.
scripts/huggingface_example.sh --model <huggingface_model_card> --quant fp8 --export_fmt hf
```
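
The exported checkpoint can then be passed to `LLM` just like a hub model id, by pointing at the local directory. A minimal sketch; `./quantized_model` is a placeholder for wherever the script wrote its HF-format export:

```python
from tensorrt_llm._torch import LLM

# Placeholder path: substitute the directory produced by the export step above.
llm = LLM(model="./quantized_model")
llm.generate("Hello, my name is")
```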