diff --git a/docs/features/quantization/README.md b/docs/features/quantization/README.md
index 6c4aa7d8aaa..2be357d8860 100644
--- a/docs/features/quantization/README.md
+++ b/docs/features/quantization/README.md
@@ -3,7 +3,7 @@
 Quantization trades off model precision for smaller memory footprint, allowing large models to be run on a wider range of devices.

 !!! tip
-    To get started with quantization, see [LLM Compressor](llm_compressor.md), a library for optimizing models for deployment with vLLM that supports FP8, INT8, INT4, and other quantization formats.
+    To get started with quantization, see [LLM Compressor](llm_compressor/README.md), a library for optimizing models for deployment with vLLM that supports FP8, INT8, INT4, and other quantization formats.

 The following are the supported quantization formats for vLLM:

@@ -12,9 +12,11 @@ The following are the supported quantization formats for vLLM:
 - [GGUF](gguf.md)
 - [GPTQModel](gptqmodel.md)
 - [Intel Neural Compressor](inc.md)
-- [INT4 W4A16](int4.md)
-- [INT8 W8A8](int8.md)
-- [FP8 W8A8](fp8.md)
+- [LLM Compressor](llm_compressor/README.md)
+    - [FP8 W8A8](llm_compressor/fp8.md)
+    - [INT4 W4A16](llm_compressor/int4.md)
+    - [INT8 W4A8](llm_compressor/int8_w4a8.md)
+    - [INT8 W8A8](llm_compressor/int8_w8a8.md)
 - [NVIDIA Model Optimizer](modelopt.md)
 - [Online Quantization](online.md)
 - [AMD Quark](quark.md)
@@ -46,16 +48,17 @@ th:not(:first-child) {
 }

-| Implementation | Volta | Turing | Ampere | Ada | Hopper | AMD GPU | Intel GPU | x86 CPU |
-| ------------------------- | ----- | ------ | ------ | --- | ------ | ------- | --------- | ------- |
-| AWQ | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ✅︎ |
-| GPTQ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ✅︎ |
-| Marlin (GPTQ/AWQ/FP8/FP4) | ❌ | ✅︎* | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ |
-| INT8 (W8A8) | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ✅︎ |
-| FP8 (W8A8) | ❌ | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ |
-| bitsandbytes | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ |
-| DeepSpeedFP | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ |
-| GGUF | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ |
+| Implementation | Volta | Turing | Ampere | Ada | Hopper | AMD GPU | Intel GPU | x86 CPU | Arm CPU |
+| ------------------------- | ----- | ------ | ------ | --- | ------ | ------- | --------- | ------- | ------- |
+| AWQ | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ✅︎ | ❌ |
+| GPTQ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ✅︎ | ❌ |
+| Marlin (GPTQ/AWQ/FP8/FP4) | ❌ | ✅︎* | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ |
+| llm-compressor INT8 (W8A8)| ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ✅︎ | ✅︎ |
+| llm-compressor INT8 (W4A8)| ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅︎ |
+| llm-compressor FP8 (W8A8) | ❌ | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ |
+| bitsandbytes | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ |
+| DeepSpeedFP | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ |
+| GGUF | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ |

 - Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
 - ✅︎ indicates that the quantization method is supported on the specified hardware.
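As a reading aid for the updated table, here is a minimal sketch of a lookup over the three llm-compressor rows. The SM-to-architecture mapping comes from the legend above and the support sets are transcribed from the table; the names `SM_TO_ARCH`, `SUPPORT`, and `supported_schemes` are hypothetical and are not part of vLLM.

```python
# Hypothetical helper, not a vLLM API: data transcribed from the table above.
SM_TO_ARCH = {
    (7, 0): "Volta",
    (7, 5): "Turing",
    (8, 0): "Ampere",
    (8, 6): "Ampere",
    (8, 9): "Ada",
    (9, 0): "Hopper",
}

# Only the llm-compressor rows of the support matrix.
SUPPORT = {
    "llm-compressor INT8 (W8A8)": {"Turing", "Ampere", "Ada", "Hopper", "x86 CPU", "Arm CPU"},
    "llm-compressor INT8 (W4A8)": {"Arm CPU"},
    "llm-compressor FP8 (W8A8)": {"Ada", "Hopper", "AMD GPU"},
}


def supported_schemes(target):
    """Return the schemes marked as supported for a target.

    `target` is either an (major, minor) SM tuple or a platform column
    name such as "Arm CPU".
    """
    name = SM_TO_ARCH.get(target, target) if isinstance(target, tuple) else target
    return sorted(scheme for scheme, hardware in SUPPORT.items() if name in hardware)


print(supported_schemes((8, 9)))
print(supported_schemes("Arm CPU"))
```

An unknown SM tuple or a platform with no supported row simply yields an empty list.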
diff --git a/docs/features/quantization/llm_compressor.md b/docs/features/quantization/llm_compressor/README.md
similarity index 100%
rename from docs/features/quantization/llm_compressor.md
rename to docs/features/quantization/llm_compressor/README.md
diff --git a/docs/features/quantization/fp8.md b/docs/features/quantization/llm_compressor/fp8.md
similarity index 86%
rename from docs/features/quantization/fp8.md
rename to docs/features/quantization/llm_compressor/fp8.md
index 2de71ce8da1..5dc1a7d43a0 100644
--- a/docs/features/quantization/fp8.md
+++ b/docs/features/quantization/llm_compressor/fp8.md
@@ -21,9 +21,17 @@ The FP8 types typically supported in hardware have two distinct representations,
 To produce performant FP8 quantized models with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:

 ```bash
-pip install llmcompressor
+(venv-llm-compressor) pip install llmcompressor
 ```

+Additionally, install `vllm` and `lm-evaluation-harness` for evaluation:
+
+```bash
+(venv-vllm) pip install vllm "lm-eval[api]>=0.4.12"
+```
+
+Please use separate environments for vLLM and llm-compressor as they might not work together.
+
 ## Quantization Process

 The quantization process involves three main steps:
@@ -57,36 +65,28 @@ For FP8 quantization, we can recover accuracy with simple RTN quantization. We r

 Since simple RTN does not require data for weight quantization and the activations are quantized dynamically, we do not need any calibration data for this quantization flow.

-??? code
+```python
+from llmcompressor import oneshot
+from llmcompressor.modifiers.quantization import QuantizationModifier

-    ```python
-    from llmcompressor import oneshot
-    from llmcompressor.modifiers.quantization import QuantizationModifier
+# Configure the simple PTQ quantization
+recipe = QuantizationModifier(
+    targets="Linear",
+    scheme="FP8_DYNAMIC",
+    ignore=["lm_head"],
+)

-    # Configure the simple PTQ quantization
-    recipe = QuantizationModifier(
-        targets="Linear",
-        scheme="FP8_DYNAMIC",
-        ignore=["lm_head"],
-    )
+# Apply the quantization algorithm.
+oneshot(model=model, recipe=recipe)

-    # Apply the quantization algorithm.
-    oneshot(model=model, recipe=recipe)
-
-    # Save the model: Meta-Llama-3-8B-Instruct-FP8-Dynamic
-    SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
-    model.save_pretrained(SAVE_DIR)
-    tokenizer.save_pretrained(SAVE_DIR)
-    ```
+# Save the model: Meta-Llama-3-8B-Instruct-FP8-Dynamic
+SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
+model.save_pretrained(SAVE_DIR)
+tokenizer.save_pretrained(SAVE_DIR)
+```

 ### 3. Evaluating Accuracy

-Install `vllm` and `lm-evaluation-harness` for evaluation:
-
-```bash
-pip install vllm "lm-eval[api]>=0.4.12"
-```
-
 Load and run the model in `vllm`:

 ```python
diff --git a/docs/features/quantization/int4.md b/docs/features/quantization/llm_compressor/int4.md
similarity index 62%
rename from docs/features/quantization/int4.md
rename to docs/features/quantization/llm_compressor/int4.md
index 41c4b40574f..0e54797397a 100644
--- a/docs/features/quantization/int4.md
+++ b/docs/features/quantization/llm_compressor/int4.md
@@ -12,15 +12,17 @@ Please visit the HF collection of [quantized INT4 checkpoints of popular LLMs re
 To use INT4 quantization with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:

 ```bash
-pip install llmcompressor
+(venv-llm-compressor) pip install llmcompressor
 ```

 Additionally, install `vllm` and `lm-evaluation-harness` for evaluation:

 ```bash
-pip install vllm "lm-eval[api]>=0.4.12"
+(venv-vllm) pip install vllm "lm-eval[api]>=0.4.12"
 ```

+Please use separate environments for vLLM and llm-compressor as they might not work together.
+
 ## Quantization Process

 The quantization process involves four main steps:
@@ -52,55 +54,51 @@ When quantizing weights to INT4, you need sample data to estimate the weight upd
 It's best to use calibration data that closely matches your deployment data.
 For a general-purpose instruction-tuned model, you can use a dataset like `ultrachat`:

-??? code
+```python
+from datasets import load_dataset

-    ```python
-    from datasets import load_dataset
+NUM_CALIBRATION_SAMPLES = 512
+MAX_SEQUENCE_LENGTH = 2048

-    NUM_CALIBRATION_SAMPLES = 512
-    MAX_SEQUENCE_LENGTH = 2048
+# Load and preprocess the dataset
+ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
+ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

-    # Load and preprocess the dataset
-    ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
-    ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
+def preprocess(example):
+    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}
+ds = ds.map(preprocess)

-    def preprocess(example):
-        return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}
-    ds = ds.map(preprocess)
-
-    def tokenize(sample):
-        return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
-    ds = ds.map(tokenize, remove_columns=ds.column_names)
-    ```
+def tokenize(sample):
+    return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
+ds = ds.map(tokenize, remove_columns=ds.column_names)
+```

 ### 3. Applying Quantization

 Now, apply the quantization algorithms:

-??? code
+```python
+from llmcompressor import oneshot
+from llmcompressor.modifiers.quantization import GPTQModifier
+from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

-    ```python
-    from llmcompressor import oneshot
-    from llmcompressor.modifiers.quantization import GPTQModifier
-    from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
+# Configure the quantization algorithms
+recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

-    # Configure the quantization algorithms
-    recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])
+# Apply quantization
+oneshot(
+    model=model,
+    dataset=ds,
+    recipe=recipe,
+    max_seq_length=MAX_SEQUENCE_LENGTH,
+    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
+)

-    # Apply quantization
-    oneshot(
-        model=model,
-        dataset=ds,
-        recipe=recipe,
-        max_seq_length=MAX_SEQUENCE_LENGTH,
-        num_calibration_samples=NUM_CALIBRATION_SAMPLES,
-    )
-
-    # Save the compressed model: Meta-Llama-3-8B-Instruct-W4A16-G128
-    SAVE_DIR = MODEL_ID.split("/")[1] + "-W4A16-G128"
-    model.save_pretrained(SAVE_DIR, save_compressed=True)
-    tokenizer.save_pretrained(SAVE_DIR)
-    ```
+# Save the compressed model: Meta-Llama-3-8B-Instruct-W4A16-G128
+SAVE_DIR = MODEL_ID.split("/")[1] + "-W4A16-G128"
+model.save_pretrained(SAVE_DIR, save_compressed=True)
+tokenizer.save_pretrained(SAVE_DIR)
+```

 This process creates a W4A16 model with weights quantized to 4-bit integers.

@@ -141,36 +139,34 @@ lm_eval --model vllm \

 The following is an example of an expanded quantization recipe you can tune to your own use case:

-??? code
-
-    ```python
-    from compressed_tensors.quantization import (
-        QuantizationArgs,
-        QuantizationScheme,
-        QuantizationStrategy,
-        QuantizationType,
-    )
-    recipe = GPTQModifier(
-        targets="Linear",
-        config_groups={
-            "config_group": QuantizationScheme(
-                targets=["Linear"],
-                weights=QuantizationArgs(
-                    num_bits=4,
-                    type=QuantizationType.INT,
-                    strategy=QuantizationStrategy.GROUP,
-                    group_size=128,
-                    symmetric=True,
-                    dynamic=False,
-                    actorder="weight",
-                ),
+```python
+from compressed_tensors.quantization import (
+    QuantizationArgs,
+    QuantizationScheme,
+    QuantizationStrategy,
+    QuantizationType,
+)
+recipe = GPTQModifier(
+    targets="Linear",
+    config_groups={
+        "config_group": QuantizationScheme(
+            targets=["Linear"],
+            weights=QuantizationArgs(
+                num_bits=4,
+                type=QuantizationType.INT,
+                strategy=QuantizationStrategy.GROUP,
+                group_size=128,
+                symmetric=True,
+                dynamic=False,
+                actorder="weight",
             ),
-        },
-        ignore=["lm_head"],
-        update_size=NUM_CALIBRATION_SAMPLES,
-        dampening_frac=0.01,
-    )
-    ```
+        ),
+    },
+    ignore=["lm_head"],
+    update_size=NUM_CALIBRATION_SAMPLES,
+    dampening_frac=0.01,
+)
+```

 ## Troubleshooting and Support

diff --git a/docs/features/quantization/llm_compressor/int8_w4a8.md b/docs/features/quantization/llm_compressor/int8_w4a8.md
new file mode 100644
index 00000000000..cc6a0982832
--- /dev/null
+++ b/docs/features/quantization/llm_compressor/int8_w4a8.md
@@ -0,0 +1,217 @@
+# INT8 W4A8
+
+vLLM supports quantizing weights to INT4 and activations to INT8 for memory savings and inference acceleration.
+This quantization method is particularly useful for reducing model size while maintaining good performance.
+
+## Prerequisites
+
+To use INT8 W4A8 quantization with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library.
+
+```bash
+(venv-llm-compressor) pip install llmcompressor
+```
+
+Additionally, install `vllm` and `lm-evaluation-harness` for evaluation:
+
+```bash
+(venv-vllm) pip install vllm "lm-eval[api]>=0.4.12"
+```
+
+Please use separate environments for vLLM and llm-compressor as they might not work together.
+
+## Quantization Process
+
+The quantization process involves four main steps:
+
+1. Loading the model
+2. Preparing calibration data
+3. Applying quantization
+4. Evaluating accuracy in vLLM
+
+### 1. Loading the Model
+
+Load your model and tokenizer using the standard `transformers` AutoModel classes:
+
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+
+MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
+model = AutoModelForCausalLM.from_pretrained(
+    MODEL_ID,
+    dtype="auto",
+)
+tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+```
+
+### 2. Preparing Calibration Data
+
+When quantizing activations to INT8 and weights to INT4, you need sample data to estimate the activation scales.
+It's best to use calibration data that closely matches your deployment data.
+For a general-purpose instruction-tuned model, you can use a dataset like `ultrachat`:
+
+```python
+from datasets import load_dataset
+
+NUM_CALIBRATION_SAMPLES = 512
+MAX_SEQUENCE_LENGTH = 2048
+
+# Load and preprocess the dataset
+ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
+ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
+
+def preprocess(example):
+    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}
+ds = ds.map(preprocess)
+
+def tokenize(sample):
+    return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
+ds = ds.map(tokenize, remove_columns=ds.column_names)
+```
+
+### 3. Applying Quantization
+
+Now, apply the quantization algorithms.
+
+The following recipes create W4A8 models (int4 weights, int8 activations). On Arm® CPUs, this is accelerated through [KleidiAI](https://github.com/ARM-software/kleidiai).
+
+Use groupwise for best accuracy, and channelwise for best inference performance.
+
+=== "Groupwise"
+
+    ```python
+    from llmcompressor import oneshot
+    from llmcompressor.modifiers.quantization import GPTQModifier
+
+    # Configure the quantization algorithms
+    recipe = [
+        GPTQModifier(
+            targets="Linear",
+            scheme="W4A8",
+            ignore=["lm_head"],
+            dampening_frac=0.01
+        ),
+    ]
+
+    # Apply quantization
+    oneshot(
+        model=model,
+        dataset=ds,
+        recipe=recipe,
+        max_seq_length=MAX_SEQUENCE_LENGTH,
+        num_calibration_samples=NUM_CALIBRATION_SAMPLES,
+    )
+
+    # Save the compressed model: Meta-Llama-3-8B-Instruct-W4A8-G128-Dynamic-Per-Token
+    SAVE_DIR = MODEL_ID.split("/")[1] + "-W4A8-G128-Dynamic-Per-Token"
+    model.save_pretrained(SAVE_DIR, save_compressed=True)
+    tokenizer.save_pretrained(SAVE_DIR)
+    ```
+
+=== "Channelwise"
+
+    ```python
+    from llmcompressor import oneshot
+    from llmcompressor.modifiers.quantization import GPTQModifier
+    from compressed_tensors.quantization import QuantizationStrategy, QuantizationType
+
+    scheme = {
+        "targets": ["Linear"],
+        "weights": {
+            "num_bits": 4,
+            "type": QuantizationType.INT,
+            "strategy": QuantizationStrategy.CHANNEL,
+            "symmetric": True,
+            "dynamic": False,
+            "group_size": None,
+        },
+        "input_activations": {
+            "num_bits": 8,
+            "type": QuantizationType.INT,
+            "strategy": QuantizationStrategy.TOKEN,
+            "dynamic": True,
+            "symmetric": False,
+            "observer": None,
+        },
+        "output_activations": None,
+    }
+
+    recipe = [
+        GPTQModifier(
+            targets="Linear",
+            config_groups={"group_0": scheme},
+            ignore=["lm_head"],
+            dampening_frac=0.01,
+        ),
+    ]
+
+    oneshot(
+        model=model,
+        dataset=ds,
+        recipe=recipe,
+        max_seq_length=MAX_SEQUENCE_LENGTH,
+        num_calibration_samples=NUM_CALIBRATION_SAMPLES,
+    )
+
+    # Save the compressed model: Meta-Llama-3-8B-Instruct-W4A8-Channelwise-Dynamic-Per-Token
+    SAVE_DIR = MODEL_ID.split("/")[1] + "-W4A8-Channelwise-Dynamic-Per-Token"
+    model.save_pretrained(SAVE_DIR, save_compressed=True)
+    tokenizer.save_pretrained(SAVE_DIR)
+    ```
+
+### 4. Evaluating Accuracy
+
+=== "Groupwise"
+
+    After quantization, you can load and run the model in vLLM:
+
+    ```python
+    from vllm import LLM
+
+    llm = LLM("./Meta-Llama-3-8B-Instruct-W4A8-G128-Dynamic-Per-Token")
+    ```
+
+    To evaluate accuracy, you can use `lm_eval`:
+
+    ```bash
+    lm_eval --model vllm \
+      --model_args pretrained="./Meta-Llama-3-8B-Instruct-W4A8-G128-Dynamic-Per-Token",add_bos_token=true \
+      --tasks gsm8k \
+      --num_fewshot 5 \
+      --limit 250 \
+      --batch_size 'auto'
+    ```
+
+=== "Channelwise"
+
+    After quantization, you can load and run the model in vLLM:
+
+    ```python
+    from vllm import LLM
+
+    llm = LLM("./Meta-Llama-3-8B-Instruct-W4A8-Channelwise-Dynamic-Per-Token")
+    ```
+
+    To evaluate accuracy, you can use `lm_eval`:
+
+    ```bash
+    lm_eval --model vllm \
+      --model_args pretrained="./Meta-Llama-3-8B-Instruct-W4A8-Channelwise-Dynamic-Per-Token",add_bos_token=true \
+      --tasks gsm8k \
+      --num_fewshot 5 \
+      --limit 250 \
+      --batch_size 'auto'
+    ```
+
+!!! note
+    Quantized models can be sensitive to the presence of the `bos` token. Make sure to include the `add_bos_token=True` argument when running evaluations.
+
+## Best Practices
+
+- Start with 512 samples for calibration data (increase if accuracy drops)
+- Use a sequence length of 2048 as a starting point
+- Employ the chat template or instruction template that the model was trained with
+- If you've fine-tuned a model, consider using a sample of your training data for calibration
+
+## Troubleshooting and Support
+
+If you encounter any issues or have feature requests, please open an issue on the [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor/issues) GitHub repository.
diff --git a/docs/features/quantization/int8.md b/docs/features/quantization/llm_compressor/int8_w8a8.md
similarity index 66%
rename from docs/features/quantization/int8.md
rename to docs/features/quantization/llm_compressor/int8_w8a8.md
index 547eb5aedc2..21ed00d1393 100644
--- a/docs/features/quantization/int8.md
+++ b/docs/features/quantization/llm_compressor/int8_w8a8.md
@@ -17,15 +17,17 @@ Please visit the HF collection of [quantized INT8 checkpoints of popular LLMs re
 To use INT8 quantization with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:

 ```bash
-pip install llmcompressor
+(venv-llm-compressor) pip install llmcompressor
 ```

 Additionally, install `vllm` and `lm-evaluation-harness` for evaluation:

 ```bash
-pip install vllm "lm-eval[api]>=0.4.12"
+(venv-vllm) pip install vllm "lm-eval[api]>=0.4.12"
 ```

+Please use separate environments for vLLM and llm-compressor as they might not work together.
+
 ## Quantization Process

 The quantization process involves four main steps:
@@ -57,26 +59,24 @@ When quantizing activations to INT8, you need sample data to estimate the activa
 It's best to use calibration data that closely matches your deployment data.
 For a general-purpose instruction-tuned model, you can use a dataset like `ultrachat`:

-??? code
+```python
+from datasets import load_dataset

-    ```python
-    from datasets import load_dataset
+NUM_CALIBRATION_SAMPLES = 512
+MAX_SEQUENCE_LENGTH = 2048

-    NUM_CALIBRATION_SAMPLES = 512
-    MAX_SEQUENCE_LENGTH = 2048
+# Load and preprocess the dataset
+ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
+ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

-    # Load and preprocess the dataset
-    ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
-    ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
+def preprocess(example):
+    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}
+ds = ds.map(preprocess)

-    def preprocess(example):
-        return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}
-    ds = ds.map(preprocess)
-
-    def tokenize(sample):
-        return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
-    ds = ds.map(tokenize, remove_columns=ds.column_names)
-    ```
+def tokenize(sample):
+    return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
+ds = ds.map(tokenize, remove_columns=ds.column_names)
+```

@@ -84,33 +84,31 @@ For a general-purpose instruction-tuned model, you can use a dataset like `ultra

 Now, apply the quantization algorithms:

-??? code
+```python
+from llmcompressor import oneshot
+from llmcompressor.modifiers.quantization import GPTQModifier
+from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

-    ```python
-    from llmcompressor import oneshot
-    from llmcompressor.modifiers.quantization import GPTQModifier
-    from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
+# Configure the quantization algorithms
+recipe = [
+    SmoothQuantModifier(smoothing_strength=0.8),
+    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
+]

-    # Configure the quantization algorithms
-    recipe = [
-        SmoothQuantModifier(smoothing_strength=0.8),
-        GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
-    ]
+# Apply quantization
+oneshot(
+    model=model,
+    dataset=ds,
+    recipe=recipe,
+    max_seq_length=MAX_SEQUENCE_LENGTH,
+    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
+)

-    # Apply quantization
-    oneshot(
-        model=model,
-        dataset=ds,
-        recipe=recipe,
-        max_seq_length=MAX_SEQUENCE_LENGTH,
-        num_calibration_samples=NUM_CALIBRATION_SAMPLES,
-    )
-
-    # Save the compressed model: Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token
-    SAVE_DIR = MODEL_ID.split("/")[1] + "-W8A8-Dynamic-Per-Token"
-    model.save_pretrained(SAVE_DIR, save_compressed=True)
-    tokenizer.save_pretrained(SAVE_DIR)
-    ```
+# Save the compressed model: Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token
+SAVE_DIR = MODEL_ID.split("/")[1] + "-W8A8-Dynamic-Per-Token"
+model.save_pretrained(SAVE_DIR, save_compressed=True)
+tokenizer.save_pretrained(SAVE_DIR)
+```

 This process creates a W8A8 model with weights and activations quantized to 8-bit integers.
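The W8A8 recipe above runs `SmoothQuantModifier(smoothing_strength=0.8)` before GPTQ. As a rough illustration of the idea, the sketch below applies the per-channel scale from the SmoothQuant paper, `s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)`, to a tiny hand-made example. It assumes `smoothing_strength` plays the role of `alpha`; the matrices and helper names are hypothetical, and llm-compressor's actual implementation may differ in detail.

```python
def smooth_scales(act_absmax, weight_absmax, alpha=0.8):
    """Per-input-channel smoothing scales, following the SmoothQuant formula."""
    return [
        (a ** alpha) / (w ** (1 - alpha))
        for a, w in zip(act_absmax, weight_absmax)
    ]


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Illustrative values: input channel 1 carries activation outliers.
X = [[0.1, 20.0], [-0.2, -30.0]]   # tokens x input channels
W = [[0.5, -0.4], [0.02, 0.01]]    # input channels x output channels

num_channels = len(W)
act_absmax = [max(abs(row[j]) for row in X) for j in range(num_channels)]
weight_absmax = [max(abs(v) for v in W[j]) for j in range(num_channels)]
s = smooth_scales(act_absmax, weight_absmax, alpha=0.8)

# Divide activations by s and multiply the matching weight rows by s.
X_smooth = [[v / s[j] for j, v in enumerate(row)] for row in X]
W_smooth = [[v * s[j] for v in W[j]] for j in range(num_channels)]

original = matmul(X, W)
smoothed = matmul(X_smooth, W_smooth)
print(max(abs(v) for row in X for v in row))         # activation range before
print(max(abs(v) for row in X_smooth for v in row))  # activation range after
```

The product is unchanged up to floating-point error, while the activation outliers shrink and part of the dynamic range moves into the weights, which are easier to quantize offline.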
diff --git a/mkdocs.yaml b/mkdocs.yaml
index 097f7497fb2..1fee824f3b2 100644
--- a/mkdocs.yaml
+++ b/mkdocs.yaml
@@ -110,6 +110,9 @@ plugins:
       redirect_maps:
         features/spec_decode/README.md: features/speculative_decoding/README.md
         features/spec_decode/speculators.md: features/speculative_decoding/speculators.md
+        features/quantization/fp8.md: features/quantization/llm_compressor/fp8.md
+        features/quantization/int4.md: features/quantization/llm_compressor/int4.md
+        features/quantization/int8.md: features/quantization/llm_compressor/int8_w8a8.md
         serving/openai_compatible_server.md: serving/online_serving/README.md

 markdown_extensions: