[DOC] Add INT8 W4A8 docs and Arm's supported quantization schemes (#34894)

Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
2026-06-06 00:16:14 +00:00 · 2026-06-04 17:27:17 +01:00
parent 06f94633e7
commit 3da29aa4a5
7 changed files with 366 additions and 149 deletions
@@ -3,7 +3,7 @@
 Quantization trades off model precision for smaller memory footprint, allowing large models to be run on a wider range of devices.

 !!! tip
-    To get started with quantization, see [LLM Compressor](llm_compressor.md), a library for optimizing models for deployment with vLLM that supports FP8, INT8, INT4, and other quantization formats.
+    To get started with quantization, see [LLM Compressor](llm_compressor/README.md), a library for optimizing models for deployment with vLLM that supports FP8, INT8, INT4, and other quantization formats.

 The following are the supported quantization formats for vLLM:

@@ -12,9 +12,11 @@ The following are the supported quantization formats for vLLM:
 - [GGUF](gguf.md)
 - [GPTQModel](gptqmodel.md)
 - [Intel Neural Compressor](inc.md)
- [INT4 W4A16](int4.md)
- [INT8 W8A8](int8.md)
- [FP8 W8A8](fp8.md)
+- [LLM Compressor](llm_compressor/README.md)
+    - [FP8 W8A8](llm_compressor/fp8.md)
+    - [INT4 W4A16](llm_compressor/int4.md)
+    - [INT8 W4A8](llm_compressor/int8_w4a8.md)
+    - [INT8 W8A8](llm_compressor/int8_w8a8.md)
 - [NVIDIA Model Optimizer](modelopt.md)
 - [Online Quantization](online.md)
 - [AMD Quark](quark.md)
@@ -46,16 +48,17 @@ th:not(:first-child) {
 }
 </style>

-| Implementation            | Volta | Turing | Ampere | Ada | Hopper | AMD GPU | Intel GPU | x86 CPU |
-| ------------------------- | ----- | ------ | ------ | --- | ------ | ------- | --------- | ------- |
-| AWQ                       | ❌    | ✅︎     | ✅︎     | ✅︎  | ✅︎     | ❌      | ✅︎        | ✅︎      |
-| GPTQ                      | ✅︎    | ✅︎     | ✅︎     | ✅︎  | ✅︎     | ❌      | ✅︎        | ✅︎      |
-| Marlin (GPTQ/AWQ/FP8/FP4) | ❌    | ✅︎*    | ✅︎     | ✅︎  | ✅︎     | ❌      | ❌        | ❌      |
-| INT8 (W8A8)               | ❌    | ✅︎     | ✅︎     | ✅︎  | ✅︎     | ❌      | ❌        | ✅︎      |
-| FP8 (W8A8)                | ❌    | ❌     | ❌     | ✅︎  | ✅︎     | ✅︎      | ❌        | ❌      |
-| bitsandbytes              | ✅︎    | ✅︎     | ✅︎     | ✅︎  | ✅︎     | ❌      | ❌        | ❌      |
-| DeepSpeedFP               | ✅︎    | ✅︎     | ✅︎     | ✅︎  | ✅︎     | ❌      | ❌        | ❌      |
-| GGUF                      | ✅︎    | ✅︎     | ✅︎     | ✅︎  | ✅︎     | ✅︎      | ❌        | ❌      |
+| Implementation             | Volta | Turing | Ampere | Ada | Hopper | AMD GPU | Intel GPU | x86 CPU | Arm CPU |
+| -------------------------- | ----- | ------ | ------ | --- | ------ | ------- | --------- | ------- | ------- |
+| AWQ                        | ❌    | ✅︎     | ✅︎     | ✅︎  | ✅︎     | ❌      | ✅︎        | ✅︎      | ❌      |
+| GPTQ                       | ✅︎    | ✅︎     | ✅︎     | ✅︎  | ✅︎     | ❌      | ✅︎        | ✅︎      | ❌      |
+| Marlin (GPTQ/AWQ/FP8/FP4)  | ❌    | ✅︎*    | ✅︎     | ✅︎  | ✅︎     | ❌      | ❌        | ❌      | ❌      |
+| llm-compressor INT8 (W8A8) | ❌    | ✅︎     | ✅︎     | ✅︎  | ✅︎     | ❌      | ❌        | ✅︎      | ✅︎      |
+| llm-compressor INT8 (W4A8) | ❌    | ❌     | ❌     | ❌  | ❌     | ❌      | ❌        | ❌      | ✅︎      |
+| llm-compressor FP8 (W8A8)  | ❌    | ❌     | ❌     | ✅︎  | ✅︎     | ✅︎      | ❌        | ❌      | ❌      |
+| bitsandbytes               | ✅︎    | ✅︎     | ✅︎     | ✅︎  | ✅︎     | ❌      | ❌        | ❌      | ❌      |
+| DeepSpeedFP                | ✅︎    | ✅︎     | ✅︎     | ✅︎  | ✅︎     | ❌      | ❌        | ❌      | ❌      |
+| GGUF                       | ✅︎    | ✅︎     | ✅︎     | ✅︎  | ✅︎     | ✅︎      | ❌        | ❌      | ❌      |

 - Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
 - ✅︎ indicates that the quantization method is supported on the specified hardware.
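
As a reading aid for the updated compatibility table, the lookup below encodes the SM-to-architecture legend and the three llm-compressor rows as plain Python data. It is a minimal sketch rather than a vLLM API: `SM_TO_ARCH`, `SUPPORT`, `is_supported`, and `schemes_for` are hypothetical names, and the rows are hand-transcribed from this diff, so they will drift if the table changes.

```python
# Architecture names keyed by CUDA compute capability (SM), per the legend.
SM_TO_ARCH = {
    (7, 0): "Volta",
    (7, 5): "Turing",
    (8, 0): "Ampere",
    (8, 6): "Ampere",
    (8, 9): "Ada",
    (9, 0): "Hopper",
}

# Column order of the updated table, including the new "Arm CPU" column.
COLUMNS = [
    "Volta", "Turing", "Ampere", "Ada", "Hopper",
    "AMD GPU", "Intel GPU", "x86 CPU", "Arm CPU",
]

# Hand-transcribed from the three llm-compressor rows (assumption: no typos).
SUPPORT = {
    "llm-compressor INT8 (W8A8)": [False, True, True, True, True, False, False, True, True],
    "llm-compressor INT8 (W4A8)": [False, False, False, False, False, False, False, False, True],
    "llm-compressor FP8 (W8A8)": [False, False, False, True, True, True, False, False, False],
}


def is_supported(scheme: str, hardware: str) -> bool:
    """Return the table cell for one scheme/hardware pair."""
    return SUPPORT[scheme][COLUMNS.index(hardware)]


def schemes_for(hardware: str) -> list[str]:
    """List every transcribed scheme marked as supported on the given hardware."""
    return [scheme for scheme in SUPPORT if is_supported(scheme, hardware)]


print(schemes_for("Arm CPU"))
print(is_supported("llm-compressor FP8 (W8A8)", SM_TO_ARCH[(8, 9)]))
```

Here `schemes_for("Arm CPU")` returns the W8A8 and W4A8 INT8 rows, which matches the Arm CPU column this commit adds.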
@@ -21,9 +21,17 @@ The FP8 types typically supported in hardware have two distinct representations,
 To produce performant FP8 quantized models with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:

 ```bash
-pip install llmcompressor
+(venv-llm-compressor) pip install llmcompressor
 ```

+Additionally, install `vllm` and `lm-evaluation-harness` for evaluation:
+
+```bash
+(venv-vllm) pip install vllm "lm-eval[api]>=0.4.12"
+```
+
+Install vLLM and llm-compressor in separate virtual environments, as shown above, because the two packages might not work together in the same environment.
+
 ## Quantization Process

 The quantization process involves three main steps:
@@ -57,36 +65,28 @@ For FP8 quantization, we can recover accuracy with simple RTN quantization. We r

 Since simple RTN does not require data for weight quantization and the activations are quantized dynamically, we do not need any calibration data for this quantization flow.

-??? code
+```python
+from llmcompressor import oneshot
+from llmcompressor.modifiers.quantization import QuantizationModifier

-    ```python
-    from llmcompressor import oneshot
-    from llmcompressor.modifiers.quantization import QuantizationModifier
+# Configure the simple PTQ quantization
+recipe = QuantizationModifier(
+    targets="Linear",
+    scheme="FP8_DYNAMIC",
+    ignore=["lm_head"],
+)

-    # Configure the simple PTQ quantization
-    recipe = QuantizationModifier(
-        targets="Linear",
-        scheme="FP8_DYNAMIC",
-        ignore=["lm_head"],
-    )
+# Apply the quantization algorithm.
+oneshot(model=model, recipe=recipe)

-    # Apply the quantization algorithm.
-    oneshot(model=model, recipe=recipe)
-
-    # Save the model: Meta-Llama-3-8B-Instruct-FP8-Dynamic
-    SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
-    model.save_pretrained(SAVE_DIR)
-    tokenizer.save_pretrained(SAVE_DIR)
-    ```
+# Save the model: Meta-Llama-3-8B-Instruct-FP8-Dynamic
+SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
+model.save_pretrained(SAVE_DIR)
+tokenizer.save_pretrained(SAVE_DIR)
+```

 ### 3. Evaluating Accuracy

-Install `vllm` and `lm-evaluation-harness` for evaluation:
-
-```bash
-pip install vllm "lm-eval[api]>=0.4.12"
-```
-
 Load and run the model in `vllm`:

 ```python
@@ -12,15 +12,17 @@ Please visit the HF collection of [quantized INT4 checkpoints of popular LLMs re
 To use INT4 quantization with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:

 ```bash
-pip install llmcompressor
+(venv-llm-compressor) pip install llmcompressor
 ```

 Additionally, install `vllm` and `lm-evaluation-harness` for evaluation:

 ```bash
-pip install vllm "lm-eval[api]>=0.4.12"
+(venv-vllm) pip install vllm "lm-eval[api]>=0.4.12"
 ```

+Install vLLM and llm-compressor in separate virtual environments, as shown above, because the two packages might not work together in the same environment.
+
 ## Quantization Process

 The quantization process involves four main steps:
@@ -52,55 +54,51 @@ When quantizing weights to INT4, you need sample data to estimate the weight upd
 It's best to use calibration data that closely matches your deployment data.
 For a general-purpose instruction-tuned model, you can use a dataset like `ultrachat`:

-??? code
+```python
+from datasets import load_dataset

-    ```python
-    from datasets import load_dataset
+NUM_CALIBRATION_SAMPLES = 512
+MAX_SEQUENCE_LENGTH = 2048

-    NUM_CALIBRATION_SAMPLES = 512
-    MAX_SEQUENCE_LENGTH = 2048
+# Load and preprocess the dataset
+ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
+ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

-    # Load and preprocess the dataset
-    ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
-    ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
+def preprocess(example):
+    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}
+ds = ds.map(preprocess)

-    def preprocess(example):
-        return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}
-    ds = ds.map(preprocess)
-
-    def tokenize(sample):
-        return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
-    ds = ds.map(tokenize, remove_columns=ds.column_names)
-    ```
+def tokenize(sample):
+    return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
+ds = ds.map(tokenize, remove_columns=ds.column_names)
+```

 ### 3. Applying Quantization

 Now, apply the quantization algorithms:

-??? code
+```python
+from llmcompressor import oneshot
+from llmcompressor.modifiers.quantization import GPTQModifier
+from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

-    ```python
-    from llmcompressor import oneshot
-    from llmcompressor.modifiers.quantization import GPTQModifier
-    from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
+# Configure the quantization algorithms
+recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

-    # Configure the quantization algorithms
-    recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])
+# Apply quantization
+oneshot(
+    model=model,
+    dataset=ds,
+    recipe=recipe,
+    max_seq_length=MAX_SEQUENCE_LENGTH,
+    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
+)

-    # Apply quantization
-    oneshot(
-        model=model,
-        dataset=ds,
-        recipe=recipe,
-        max_seq_length=MAX_SEQUENCE_LENGTH,
-        num_calibration_samples=NUM_CALIBRATION_SAMPLES,
-    )
-
-    # Save the compressed model: Meta-Llama-3-8B-Instruct-W4A16-G128
-    SAVE_DIR = MODEL_ID.split("/")[1] + "-W4A16-G128"
-    model.save_pretrained(SAVE_DIR, save_compressed=True)
-    tokenizer.save_pretrained(SAVE_DIR)
-    ```
+# Save the compressed model: Meta-Llama-3-8B-Instruct-W4A16-G128
+SAVE_DIR = MODEL_ID.split("/")[1] + "-W4A16-G128"
+model.save_pretrained(SAVE_DIR, save_compressed=True)
+tokenizer.save_pretrained(SAVE_DIR)
+```

 This process creates a W4A16 model with weights quantized to 4-bit integers.

@@ -141,36 +139,34 @@ lm_eval --model vllm \

 The following is an example of an expanded quantization recipe you can tune to your own use case:

-??? code
-
-    ```python
-    from compressed_tensors.quantization import (
-        QuantizationArgs,
-        QuantizationScheme,
-        QuantizationStrategy,
-        QuantizationType,
-    ) 
-    recipe = GPTQModifier(
-        targets="Linear",
-        config_groups={
-            "config_group": QuantizationScheme(
-                targets=["Linear"],
-                weights=QuantizationArgs(
-                    num_bits=4,
-                    type=QuantizationType.INT,
-                    strategy=QuantizationStrategy.GROUP,
-                    group_size=128,
-                    symmetric=True,
-                    dynamic=False,
-                    actorder="weight",
-                ),
+```python
+from compressed_tensors.quantization import (
+    QuantizationArgs,
+    QuantizationScheme,
+    QuantizationStrategy,
+    QuantizationType,
+)
+recipe = GPTQModifier(
+    targets="Linear",
+    config_groups={
+        "config_group": QuantizationScheme(
+            targets=["Linear"],
+            weights=QuantizationArgs(
+                num_bits=4,
+                type=QuantizationType.INT,
+                strategy=QuantizationStrategy.GROUP,
+                group_size=128,
+                symmetric=True,
+                dynamic=False,
+                actorder="weight",
            ),
-        },
-        ignore=["lm_head"],
-        update_size=NUM_CALIBRATION_SAMPLES,
-        dampening_frac=0.01,
-    )
-    ```
+        ),
+    },
+    ignore=["lm_head"],
+    update_size=NUM_CALIBRATION_SAMPLES,
+    dampening_frac=0.01,
+)
+```

 ## Troubleshooting and Support

@@ -0,0 +1,217 @@
+# INT8 W4A8
+
+vLLM supports quantizing weights to INT4 and activations to INT8 for memory savings and inference acceleration.
+This quantization method is particularly useful for reducing model size while maintaining good performance.
+
+## Prerequisites
+
+To use INT8 W4A8 quantization with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:
+
+```bash
+(venv-llm-compressor) pip install llmcompressor
+```
+
+Additionally, install `vllm` and `lm-evaluation-harness` for evaluation:
+
+```bash
+(venv-vllm) pip install vllm "lm-eval[api]>=0.4.12"
+```
+
+Install vLLM and llm-compressor in separate virtual environments, as shown above, because the two packages might not work together in the same environment.
+
+## Quantization Process
+
+The quantization process involves four main steps:
+
+1. Loading the model
+2. Preparing calibration data
+3. Applying quantization
+4. Evaluating accuracy in vLLM
+
+### 1. Loading the Model
+
+Load your model and tokenizer using the standard `transformers` AutoModel classes:
+
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+
+MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
+model = AutoModelForCausalLM.from_pretrained(
+    MODEL_ID,
+    dtype="auto",
+)
+tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+```
+
+### 2. Preparing Calibration Data
+
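+When quantizing weights to INT4 with GPTQ, you need sample data to estimate the weight updates. The INT8 activations in the recipes below are quantized dynamically per token at run time, so no activation scales are calibrated from this data.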
+It's best to use calibration data that closely matches your deployment data.
+For a general-purpose instruction-tuned model, you can use a dataset like `ultrachat`:
+
+```python
+from datasets import load_dataset
+
+NUM_CALIBRATION_SAMPLES = 512
+MAX_SEQUENCE_LENGTH = 2048
+
+# Load and preprocess the dataset
+ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
+ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
+
+def preprocess(example):
+    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}
+ds = ds.map(preprocess)
+
+def tokenize(sample):
+    return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
+ds = ds.map(tokenize, remove_columns=ds.column_names)
+```
+
+### 3. Applying Quantization
+
+Now, apply the quantization algorithms.
+
+The following recipes create W4A8 models (INT4 weights, INT8 activations). On Arm® CPUs, inference with these models is accelerated through [KleidiAI](https://github.com/ARM-software/kleidiai).
+
+Use groupwise quantization for the best accuracy and channelwise quantization for the best inference performance.
+
+=== "Groupwise"
+
+    ```python
+    from llmcompressor import oneshot
+    from llmcompressor.modifiers.quantization import GPTQModifier
+
+    # Configure the quantization algorithms
+    recipe = [
+        GPTQModifier(
+            targets="Linear",
+            scheme="W4A8",
+            ignore=["lm_head"],
+            dampening_frac=0.01
+        ),
+    ]
+
+    # Apply quantization
+    oneshot(
+        model=model,
+        dataset=ds,
+        recipe=recipe,
+        max_seq_length=MAX_SEQUENCE_LENGTH,
+        num_calibration_samples=NUM_CALIBRATION_SAMPLES,
+    )
+
+    # Save the compressed model: Meta-Llama-3-8B-Instruct-W4A8-G128-Dynamic-Per-Token
+    SAVE_DIR = MODEL_ID.split("/")[1] + "-W4A8-G128-Dynamic-Per-Token"
+    model.save_pretrained(SAVE_DIR, save_compressed=True)
+    tokenizer.save_pretrained(SAVE_DIR)
+    ```
+
+=== "Channelwise"
+
+    ```python
+    from llmcompressor import oneshot
+    from llmcompressor.modifiers.quantization import GPTQModifier
+    from compressed_tensors.quantization import QuantizationStrategy, QuantizationType
+
+    scheme = {
+        "targets": ["Linear"],
+        "weights": {
+            "num_bits": 4,
+            "type": QuantizationType.INT,
+            "strategy": QuantizationStrategy.CHANNEL,
+            "symmetric": True,
+            "dynamic": False,
+            "group_size": None,
+        },
+        "input_activations": {
+            "num_bits": 8,
+            "type": QuantizationType.INT,
+            "strategy": QuantizationStrategy.TOKEN,
+            "dynamic": True,
+            "symmetric": False,
+            "observer": None,
+        },
+        "output_activations": None,
+    }
+
+    recipe = [
+        GPTQModifier(
+            targets="Linear",
+            config_groups={"group_0": scheme},
+            ignore=["lm_head"],
+            dampening_frac=0.01,
+        ),
+    ]
+
+    oneshot(
+        model=model,
+        dataset=ds,
+        recipe=recipe,
+        max_seq_length=MAX_SEQUENCE_LENGTH,
+        num_calibration_samples=NUM_CALIBRATION_SAMPLES,
+    )
+
+    # Save the compressed model: Meta-Llama-3-8B-Instruct-W4A8-Channelwise-Dynamic-Per-Token
+    SAVE_DIR = MODEL_ID.split("/")[1] + "-W4A8-Channelwise-Dynamic-Per-Token"
+    model.save_pretrained(SAVE_DIR, save_compressed=True)
+    tokenizer.save_pretrained(SAVE_DIR)
+    ```
+
+### 4. Evaluating Accuracy
+
+=== "Groupwise"
+
+    After quantization, you can load and run the model in vLLM:
+
+    ```python
+    from vllm import LLM
+
+    llm = LLM("./Meta-Llama-3-8B-Instruct-W4A8-G128-Dynamic-Per-Token")
+    ```
+
+    To evaluate accuracy, you can use `lm_eval`:
+
+    ```bash
+    lm_eval --model vllm \
+        --model_args pretrained="./Meta-Llama-3-8B-Instruct-W4A8-G128-Dynamic-Per-Token",add_bos_token=true \
+        --tasks gsm8k \
+        --num_fewshot 5 \
+        --limit 250 \
+        --batch_size 'auto'
+    ```
+
+=== "Channelwise"
+
+    After quantization, you can load and run the model in vLLM:
+
+    ```python
+    from vllm import LLM
+
+    llm = LLM("./Meta-Llama-3-8B-Instruct-W4A8-Channelwise-Dynamic-Per-Token")
+    ```
+
+    To evaluate accuracy, you can use `lm_eval`:
+
+    ```bash
+    lm_eval --model vllm \
+        --model_args pretrained="./Meta-Llama-3-8B-Instruct-W4A8-Channelwise-Dynamic-Per-Token",add_bos_token=true \
+        --tasks gsm8k \
+        --num_fewshot 5 \
+        --limit 250 \
+        --batch_size 'auto'
+    ```
+
+!!! note
+    Quantized models can be sensitive to the presence of the `bos` token. Make sure to pass `add_bos_token=true` in `--model_args`, as in the commands above, when running evaluations.
+
+## Best Practices
+
+- Start with 512 samples for calibration data (increase if accuracy drops)
+- Use a sequence length of 2048 as a starting point
+- Employ the chat template or instruction template that the model was trained with
+- If you've fine-tuned a model, consider using a sample of your training data for calibration
+
+## Troubleshooting and Support
+
+If you encounter any issues or have feature requests, please open an issue on the [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor/issues) GitHub repository.
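
The groupwise and channelwise recipes in step 3 of the new INT8 W4A8 page differ in how many weights share one scale: groups of 128 (as the `G128` save directory suggests) versus one scale per output channel. The pure-Python sketch below illustrates only the accuracy side of that trade-off, using plain symmetric round-to-nearest INT4. It does not call llm-compressor or GPTQ, `quantize_symmetric` and `roundtrip_error` are hypothetical helper names, and the injected outlier is an assumption chosen to make the effect visible.

```python
import random


def quantize_symmetric(values, num_bits=4):
    """Symmetric round-to-nearest: one shared scale, zero-point fixed at 0."""
    qmax = 2 ** (num_bits - 1) - 1   # 7 for INT4
    qmin = -(2 ** (num_bits - 1))    # -8 for INT4
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [min(qmax, max(qmin, round(v / scale))) for v in values]
    return q, scale


def roundtrip_error(row, group_size=None):
    """Mean absolute error after quantizing and dequantizing one weight row.

    group_size=None uses a single scale for the whole row (channelwise);
    otherwise each consecutive block of group_size weights gets its own scale.
    """
    size = group_size or len(row)
    restored = []
    for start in range(0, len(row), size):
        q, scale = quantize_symmetric(row[start:start + size])
        restored.extend(x * scale for x in q)
    return sum(abs(a - b) for a, b in zip(row, restored)) / len(row)


random.seed(0)
# One output channel with 512 small weights and a single large outlier.
row = [random.gauss(0.0, 0.02) for _ in range(512)]
row[3] = 0.5

channelwise_err = roundtrip_error(row)
groupwise_err = roundtrip_error(row, group_size=128)
print(f"channelwise: {channelwise_err:.5f}  groupwise: {groupwise_err:.5f}")
```

With a single scale for the whole row, the outlier stretches the quantization step for all 512 weights. With groups of 128 it only affects the group that contains it, so the groupwise error is lower for this input. The inference-performance side of the trade-off is not modeled here.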
@@ -17,15 +17,17 @@ Please visit the HF collection of [quantized INT8 checkpoints of popular LLMs re
 To use INT8 quantization with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:

 ```bash
-pip install llmcompressor
+(venv-llm-compressor) pip install llmcompressor
 ```

 Additionally, install `vllm` and `lm-evaluation-harness` for evaluation:

 ```bash
-pip install vllm "lm-eval[api]>=0.4.12"
+(venv-vllm) pip install vllm "lm-eval[api]>=0.4.12"
 ```

+Install vLLM and llm-compressor in separate virtual environments, as shown above, because the two packages might not work together in the same environment.
+
 ## Quantization Process

 The quantization process involves four main steps:
@@ -57,26 +59,24 @@ When quantizing activations to INT8, you need sample data to estimate the activa
 It's best to use calibration data that closely matches your deployment data.
 For a general-purpose instruction-tuned model, you can use a dataset like `ultrachat`:

-??? code
+```python
+from datasets import load_dataset

-    ```python
-    from datasets import load_dataset
+NUM_CALIBRATION_SAMPLES = 512
+MAX_SEQUENCE_LENGTH = 2048

-    NUM_CALIBRATION_SAMPLES = 512
-    MAX_SEQUENCE_LENGTH = 2048
+# Load and preprocess the dataset
+ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
+ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

-    # Load and preprocess the dataset
-    ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
-    ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
+def preprocess(example):
+    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}
+ds = ds.map(preprocess)

-    def preprocess(example):
-        return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}
-    ds = ds.map(preprocess)
-
-    def tokenize(sample):
-        return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
-    ds = ds.map(tokenize, remove_columns=ds.column_names)
-    ```
+def tokenize(sample):
+    return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
+ds = ds.map(tokenize, remove_columns=ds.column_names)
+```

 </details>

@@ -84,33 +84,31 @@ For a general-purpose instruction-tuned model, you can use a dataset like `ultra

 Now, apply the quantization algorithms:

-??? code
+```python
+from llmcompressor import oneshot
+from llmcompressor.modifiers.quantization import GPTQModifier
+from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

-    ```python
-    from llmcompressor import oneshot
-    from llmcompressor.modifiers.quantization import GPTQModifier
-    from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
+# Configure the quantization algorithms
+recipe = [
+    SmoothQuantModifier(smoothing_strength=0.8),
+    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
+]

-    # Configure the quantization algorithms
-    recipe = [
-        SmoothQuantModifier(smoothing_strength=0.8),
-        GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
-    ]
+# Apply quantization
+oneshot(
+    model=model,
+    dataset=ds,
+    recipe=recipe,
+    max_seq_length=MAX_SEQUENCE_LENGTH,
+    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
+)

-    # Apply quantization
-    oneshot(
-        model=model,
-        dataset=ds,
-        recipe=recipe,
-        max_seq_length=MAX_SEQUENCE_LENGTH,
-        num_calibration_samples=NUM_CALIBRATION_SAMPLES,
-    )
-
-    # Save the compressed model: Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token
-    SAVE_DIR = MODEL_ID.split("/")[1] + "-W8A8-Dynamic-Per-Token"
-    model.save_pretrained(SAVE_DIR, save_compressed=True)
-    tokenizer.save_pretrained(SAVE_DIR)
-    ```
+# Save the compressed model: Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token
+SAVE_DIR = MODEL_ID.split("/")[1] + "-W8A8-Dynamic-Per-Token"
+model.save_pretrained(SAVE_DIR, save_compressed=True)
+tokenizer.save_pretrained(SAVE_DIR)
+```

 This process creates a W8A8 model with weights and activations quantized to 8-bit integers.
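
The `Dynamic-Per-Token` suffix in the save directory above refers to how activations are scaled at run time. As a rough illustration, the sketch below runs a tiny linear layer with INT8 weights (one static scale per output channel) and INT8 activations (one scale per token, computed on the fly), then compares the result with the float computation. It assumes symmetric quantization on both sides and leaves out SmoothQuant and GPTQ entirely; `quantize_int8` and `w8a8_linear` are hypothetical names, not llm-compressor or vLLM functions.

```python
def quantize_int8(values):
    """Symmetric INT8 round-to-nearest with a single scale for the vector."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [max(-127, min(127, round(v / scale))) for v in values], scale


def w8a8_linear(x, weight):
    """x: [tokens][in_features], weight: [out_features][in_features]."""
    # Weights: quantized once, ahead of time, one scale per output channel.
    qw = [quantize_int8(row) for row in weight]
    out = []
    for token in x:
        # Activations: quantized at run time, one scale per token.
        qx, sx = quantize_int8(token)
        out.append([
            # Integer dot product, then rescale back to floating point.
            sum(a * b for a, b in zip(qx, q_row)) * sx * sw
            for q_row, sw in qw
        ])
    return out


x = [[0.6, -1.0, 0.25], [8.0, 2.0, -3.0]]
weight = [[0.1, 0.2, -0.3], [-0.05, 0.4, 0.0]]

reference = [[sum(a * b for a, b in zip(t, w)) for w in weight] for t in x]
approx = w8a8_linear(x, weight)
max_err = max(
    abs(a - r)
    for row_a, row_r in zip(approx, reference)
    for a, r in zip(row_a, row_r)
)
print(f"max abs error vs float: {max_err:.4f}")
```

Because each token gets its own scale, the second token's larger magnitudes do not coarsen the quantization of the first token.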

@@ -110,6 +110,9 @@ plugins:
      redirect_maps:
        features/spec_decode/README.md: features/speculative_decoding/README.md
        features/spec_decode/speculators.md: features/speculative_decoding/speculators.md
+        features/quantization/fp8.md: features/quantization/llm_compressor/fp8.md
+        features/quantization/int4.md: features/quantization/llm_compressor/int4.md
+        features/quantization/int8.md: features/quantization/llm_compressor/int8_w8a8.md
        serving/openai_compatible_server.md: serving/online_serving/README.md

 markdown_extensions: