# Quantization
The PyTorch backend supports FP8 and NVFP4 quantization. You can pass quantized models from the HF model hub,
which are generated by [TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer).
```python
from tensorrt_llm._torch import LLM

# Load a pre-quantized FP8 checkpoint from the Hugging Face model hub
llm = LLM(model='nvidia/Llama-3.1-8B-Instruct-FP8')
llm.generate("Hello, my name is")
```
Alternatively, you can quantize a model yourself with the following commands:
```bash
# Clone TensorRT Model Optimizer and run its post-training quantization example
git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git
cd TensorRT-Model-Optimizer/examples/llm_ptq

# Quantize the model to FP8 and export a checkpoint in Hugging Face format
scripts/huggingface_example.sh --model <huggingface_model_card> --quant fp8 --export_fmt hf
```
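Once the script finishes, the exported checkpoint can be passed to the same LLM API by pointing `model` at the local output directory. The sketch below assumes an illustrative path; substitute the directory your run of `huggingface_example.sh` actually exports to.
```python
from tensorrt_llm._torch import LLM

# Hypothetical local path; replace with the HF-format directory exported by the script
llm = LLM(model='/path/to/exported_fp8_checkpoint')
llm.generate("Hello, my name is")
```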