1 change: 1 addition & 0 deletions .cspell/compound.txt
Original file line number Diff line number Diff line change
Expand Up @@ -4,3 +4,4 @@ knative
kserve
xinference
servicemeshv1
ipynb
5 changes: 4 additions & 1 deletion .gitignore
Expand Up @@ -9,4 +9,7 @@
**/public/_remotes
.idea

.DS_Store
.DS_Store

.claude
CLAUDE.md
4 changes: 2 additions & 2 deletions docs/en/installation/tools.mdx
Expand Up @@ -136,8 +136,8 @@ data:
"type": "item",
"link": "/model-repo/training",
"i18nKey": "nav_pre_train",
"text": "预训练",
"en": "PreTraining",
"text": "训练",
"en": "Training",
"icon": ""
},
{
184 changes: 184 additions & 0 deletions docs/en/llm-compressor/how_to/compressor_by_workbench.mdx
@@ -0,0 +1,184 @@
---
weight: 30
---

# LLM Compressor with Alauda AI

This document describes how to use the LLM Compressor integration with the Alauda AI platform to perform model compression workflows. The Alauda AI integration of LLM Compressor provides two example workflows:

- A workbench image and the [data-free-compressor.ipynb](/data-free-compressor.ipynb) that demonstrate how to compress a model.
- A workbench image and the [calibration-compressor.ipynb](/calibration-compressor.ipynb) that demonstrate how to compress a model using a calibration dataset.

<a href="/data-free-compressor.ipynb" download="data-free-compressor.ipynb" rel="noopener noreferrer">notebook</a>

⚠️ Potential issue | 🟡 Minor

Stray download link — appears to be a leftover/debug element.

This bare <a> download link on its own line looks out of place. It duplicates the notebook link already provided on line 9 without any surrounding context or explanation. Consider removing it or integrating it into the workflow instructions if a download link is intentionally needed.

🤖 Prompt for AI Agents
In `@docs/en/llm-compressor/how_to/compressor_by_workbench.mdx` at line 12, Remove
the stray bare anchor tag linking "data-free-compressor.ipynb" (the <a
...>notebook</a> element) that duplicates the existing notebook link and has no
surrounding context; either delete this lone download link or merge it into the
earlier notebook reference on the page and provide contextual text so the link
is not left as an orphaned/debug element.


## Supported Model Compression Workflows

On the Alauda AI platform, you can use the Workbench feature to run LLM Compressor on models stored in your model repository. The following workflow outlines the typical steps for compressing a model.

### Create a Workbench

Follow the instructions in [Create Workbench](../../workbench/how_to/create_workbench.mdx) to create a new Workbench instance. Note that model compression is currently supported only within **JupyterLab**.

### Create a Model Repository and Upload Models

Refer to [Upload Models Using Notebook](../../model_inference/model_management/how_to/upload_models_using_notebook.mdx) for detailed steps on creating a model repository and uploading your model files. The example notebooks in this guide use the [TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0) model.

```python title=data-free-compressor.ipynb
from llmcompressor.modifiers.quantization import QuantizationModifier

model_id = "./TinyLlama-1.1B-Chat-v1.0" #[!code callout]
recipe = QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]) #[!code callout]
```

<Callouts>
1. Model to compress. **You can modify this line if you want to use your own model**.
2. This recipe will quantize all Linear layers except those in the `lm_head`,
which is often sensitive to quantization. The `W4A16` scheme compresses
weights to `4-bit` integers while retaining `16-bit` activations.
</Callouts>
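
To make the `W4A16` scheme more concrete, the following toy sketch quantizes a list of weights to 4-bit integers using one scale per group, then dequantizes them again. It is only an illustration of the idea, not how LLM Compressor implements the scheme; the function names and the `group_size` value are assumptions made for this example.

```python
def quantize_int4_symmetric(weights, group_size=4):
    """Toy symmetric 4-bit quantization with one scale per group of weights."""
    quantized, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude in the group to the integer 7.
        scale = max_abs / 7 if max_abs > 0 else 1.0
        scales.append(scale)
        # Signed 4-bit integers cover the range [-8, 7].
        quantized.extend(max(-8, min(7, round(w / scale))) for w in group)
    return quantized, scales


def dequantize_int4(quantized, scales, group_size=4):
    """Recover approximate weights from 4-bit integers and per-group scales."""
    return [q * scales[i // group_size] for i, q in enumerate(quantized)]


weights = [0.7, -0.35, 0.0, 0.1, 1.4, -1.4, 0.2, 0.0]
q, s = quantize_int4_symmetric(weights)
print(q, s)
print(dequantize_int4(q, s))
```

In this sketch only the weights are stored in low precision; activations are untouched, which corresponds to the "A16" part of the scheme name.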


### (Optional) Prepare and Upload a Dataset

:::note
If you plan to use the **data-free compressor notebook**, you can skip this step.
:::

To use the **calibration compressor notebook**, you must prepare and upload a calibration dataset. Prepare your dataset using the same process described in *Upload Models Using Notebook*. The example calibration notebook uses the [ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) dataset.

```python title=calibration-compressor.ipynb
from datasets import load_dataset

dataset_id = "./ultrachat_200k" #[!code callout]

num_calibration_samples = 512 if use_gpu else 4 #[!code callout]
max_sequence_length = 2048 if use_gpu else 16

ds = load_dataset(dataset_id, split="train_sft") #[!code callout]
ds = ds.shuffle(seed=42).select(range(num_calibration_samples)) #[!code callout]

def preprocess(example): #[!code callout]
    text = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,
    )
    return tokenizer(
        text,
        padding=False,
        max_length=max_sequence_length,
        truncation=True,
        add_special_tokens=False,
    )

ds = ds.map(preprocess, remove_columns=ds.column_names)
```

<Callouts>
1. Create the calibration dataset using the Hugging Face `datasets` API. **You can modify this line if you want to use your own dataset**.
2. Select the number of calibration samples. 512 samples is a good starting point; increasing the number of samples can improve accuracy.
3. Load the dataset.
4. Shuffle the dataset and keep only the number of samples needed.
5. Preprocess and tokenize the samples into the format the model uses.
</Callouts>

### (Optional) Upload Dataset into S3 Storage

To upload datasets to S3, first install the `boto3` package, and then run the following code in JupyterLab.

```bash
~/.venv/bin/python -m pip install boto3 -i https://pypi.tuna.tsinghua.edu.cn/simple
```

```python
import os
from boto3.s3.transfer import TransferConfig
import boto3

local_folder = "./ultrachat_200k" #[!code callout]
bucket_name = "datasets"

Comment on lines +92 to +99

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n docs/en/llm-compressor/how_to/compressor_by_workbench.mdx | sed -n '85,110p'

Repository: alauda/aml-docs

Length of output: 977


🏁 Script executed:

cat -n docs/en/llm-compressor/how_to/compressor_by_workbench.mdx | sed -n '110,130p'

Repository: alauda/aml-docs

Length of output: 1016


Define the S3 client before calling s3.upload_file() on line 112.

The snippet uses s3.upload_file() without instantiating the s3 client, causing a NameError. Additionally, line 109 has a typo: os path.join should be os.path.join, which will cause a SyntaxError.

🐛 Suggested fix
 import os
 from boto3.s3.transfer import TransferConfig
 import boto3
 
+ s3 = boto3.client(
+     "s3",
+     endpoint_url="http://minio.minio-system.svc.cluster.local:80",
+ )

Also fix line 109:

- local_path = os path.join(root, filename)
+ local_path = os.path.join(root, filename)
🤖 Prompt for AI Agents
In `@docs/en/llm-compressor/how_to/compressor_by_workbench.mdx` around lines 92 -
99, The snippet fails to instantiate an S3 client and has a typo in
os.path.join; before calling s3.upload_file() (referenced as s3.upload_file)
create the client (e.g., assign s3 = boto3.client("s3") or boto3.resource("s3"))
so s3 is defined, and fix the path construction by replacing the incorrect os
path.join usage with os.path.join when building file paths (references:
s3.upload_file, boto3, os.path.join, local_folder, bucket_name).

config = TransferConfig(
    multipart_threshold=100*1024*1024,
    max_concurrency=10,
    multipart_chunksize=100*1024*1024,
    use_threads=True
) #[!code callout]

for root, dirs, files in os.walk(local_folder):
    for filename in files:
        local_path = os path.join(root, filename)
        relative_path = os.path.relpath(local_path, local_folder)
        s3_key = f"ultrachat_200k/{relative_path.replace(os.sep, '/')}"
        s3.upload_file(local_path, bucket_name, s3_key, Config=config)
        print(f"Uploaded {local_path} -> {s3_key}")
Comment on lines +107 to +113

⚠️ Potential issue | 🔴 Critical

Syntax error: missing dot in os.path.join.

Line 107 has os path.join which should be os.path.join. This will cause a SyntaxError when users run this code.

🐛 Proposed fix
 for root, dirs, files in os.walk(local_folder):
     for filename in files:
-        local_path = os path.join(root, filename)
+        local_path = os.path.join(root, filename)
         relative_path = os.path.relpath(local_path, local_folder)
         s3_key = f"ultrachat_200k/{relative_path.replace(os.sep, '/')}"
         s3.upload_file(local_path, bucket_name, s3_key, Config=config)
         print(f"Uploaded {local_path} -> {s3_key}")
🤖 Prompt for AI Agents
In `@docs/en/llm-compressor/how_to/compressor_by_workbench.mdx` around lines 105 -
111, The snippet uses a broken reference `os path.join`; fix by replacing it
with the correct attribute access `os.path.join` where the code builds
local_path (inside the loop that calls os.walk); verify other os references
(e.g., os.walk, os.sep) are correct and keep the s3.upload_file call
(s3.upload_file(local_path, bucket_name, s3_key, Config=config)) unchanged so
uploads use the fixed local_path.

```

<Callouts>
1. **You can modify this line if you want to use your own dataset**.
2. Configure multipart upload with 100 MB chunks and a maximum of 10 concurrent threads.
</Callouts>
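
The two problems raised in the review of this snippet (the undefined `s3` client and the `os path.join` typo) can be addressed together. The following is a minimal sketch of one way to do that, not the version committed in this pull request. The helper names `iter_upload_targets`, `upload_folder`, and `upload_ultrachat_example` are hypothetical; the endpoint URL is the MinIO address used elsewhere on this page, and credentials are assumed to be available to `boto3` through its usual environment variables.

```python
import os


def iter_upload_targets(local_folder, key_prefix):
    """Yield (local_path, s3_key) pairs for every file under local_folder."""
    for root, _dirs, files in os.walk(local_folder):
        for filename in files:
            local_path = os.path.join(root, filename)
            relative_path = os.path.relpath(local_path, local_folder)
            # S3 keys use forward slashes regardless of the local OS.
            s3_key = f"{key_prefix}/{relative_path.replace(os.sep, '/')}"
            yield local_path, s3_key


def upload_folder(s3_client, local_folder, bucket_name, key_prefix, config=None):
    """Upload a folder through any client that exposes boto3's upload_file()."""
    uploaded = []
    for local_path, s3_key in iter_upload_targets(local_folder, key_prefix):
        s3_client.upload_file(local_path, bucket_name, s3_key, Config=config)
        uploaded.append(s3_key)
    return uploaded


def upload_ultrachat_example():
    """Example wiring with boto3; imports are local so the helpers stay testable."""
    import boto3
    from boto3.s3.transfer import TransferConfig

    # Assumption: the MinIO endpoint shown elsewhere on this page.
    s3 = boto3.client(
        "s3",
        endpoint_url="http://minio.minio-system.svc.cluster.local:80",
    )
    config = TransferConfig(
        multipart_threshold=100 * 1024 * 1024,
        max_concurrency=10,
        multipart_chunksize=100 * 1024 * 1024,
        use_threads=True,
    )
    return upload_folder(s3, "./ultrachat_200k", "datasets", "ultrachat_200k", config)
```

Passing the client in as an argument keeps the path-to-key logic independent of `boto3`, so it can be checked with a stand-in client before any data is sent. Call `upload_ultrachat_example()` from a notebook cell to run the actual upload.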

### (Optional) Use Dataset in S3 Storage

To load datasets from S3, first install the `s3fs` package, and then modify the dataset loading section of the example notebook as shown in the code below.

```bash
~/.venv/bin/python -m pip install s3fs -i https://pypi.tuna.tsinghua.edu.cn/simple
```

```python title=calibration-compressor.ipynb
import os
from datasets import load_dataset

os.environ["AWS_ACCESS_KEY_ID"] = "@7Apples@" #[!code callout]
os.environ["AWS_SECRET_ACCESS_KEY"] = "07Apples@"

storage_options = {
"key": "07Apples@",
"secret": "O7Apples@",
"client_kwargs": {
"endpoint_url": "http://minio.minio-system.svc.cluster.local:80" #[!code callout]
}
}
Comment on lines +133 to +142

⚠️ Potential issue | 🟡 Minor

Use clearer placeholder values for credentials.

The current values (@7Apples@, 07Apples@, O7Apples@) are inconsistent and could confuse users. Consider using explicit placeholder format to make it clear these need to be replaced.

📝 Suggested fix
-os.environ["AWS_ACCESS_KEY_ID"] = "@7Apples@" #[!code callout]
-os.environ["AWS_SECRET_ACCESS_KEY"] = "07Apples@"
+os.environ["AWS_ACCESS_KEY_ID"] = "<YOUR_ACCESS_KEY_ID>" #[!code callout]
+os.environ["AWS_SECRET_ACCESS_KEY"] = "<YOUR_SECRET_ACCESS_KEY>"

 storage_options = {
-  "key": "07Apples@",
-  "secret": "O7Apples@",
+  "key": "<YOUR_ACCESS_KEY_ID>",
+  "secret": "<YOUR_SECRET_ACCESS_KEY>",
   "client_kwargs": {
     "endpoint_url": "http://minio.minio-system.svc.cluster.local:80" #[!code callout]
   }
 }
🤖 Prompt for AI Agents
In `@docs/en/llm-compressor/how_to/compressor_by_workbench.mdx` around lines 131 -
140, Replace the confusing example credential strings with explicit, consistent
placeholders so users know to replace them: update os.environ assignments for
"AWS_ACCESS_KEY_ID" and "AWS_SECRET_ACCESS_KEY" and the storage_options values
"key", "secret", and "client_kwargs" -> "endpoint_url" to use clear placeholders
(e.g., "<AWS_ACCESS_KEY_ID>", "<AWS_SECRET_ACCESS_KEY>", "<S3_KEY>",
"<S3_SECRET>", "<S3_ENDPOINT_URL>") and ensure the placeholder format is
consistent across os.environ and storage_options entries such as
os.environ["AWS_ACCESS_KEY_ID"], os.environ["AWS_SECRET_ACCESS_KEY"], and the
storage_options dict.


ds = load_dataset(
'parquet',
data_files='s3://datasets/ultrachat_200k/data/train_sft-*.parquet', #[!code callout]
storage_options=storage_options, #[!code callout]
split="train"
)
```

<Callouts>
1. Set the credentials as environment variables as a fallback; some underlying components read them.
2. Define the storage configuration; you must explicitly specify the `endpoint_url` to connect to MinIO.
3. If the dataset is split, this is equivalent to `split="train_sft"` in the example.
</Callouts>
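
One way to keep credentials out of the notebook itself is to set `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` outside the notebook and build `storage_options` from them. The sketch below assumes exactly that setup; `build_storage_options` is a hypothetical helper, and the dictionary keys simply mirror the `storage_options` structure in the example above.

```python
import os


def build_storage_options(endpoint_url, env=None):
    """Build an s3fs-style storage_options dict from environment variables."""
    env = os.environ if env is None else env
    key = env.get("AWS_ACCESS_KEY_ID")
    secret = env.get("AWS_SECRET_ACCESS_KEY")
    if not key or not secret:
        raise RuntimeError(
            "Set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY before loading the dataset."
        )
    return {
        "key": key,
        "secret": secret,
        # MinIO needs an explicit endpoint; AWS S3 itself would not.
        "client_kwargs": {"endpoint_url": endpoint_url},
    }
```

The result can then be passed as `storage_options=build_storage_options("http://minio.minio-system.svc.cluster.local:80")` in the `load_dataset` call.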

### Clone Models and Datasets in JupyterLab

In the JupyterLab terminal, use `git clone` to download the model repository (and dataset, if applicable) to your workspace. The data-free compressor notebook does not require a dataset.

### Create and Run Compression Notebooks

Download the appropriate example notebook for your use case: the **calibration compressor notebook** if you are using a dataset, or the **data-free compressor notebook** otherwise. Click the upward arrow button on the JupyterLab page to upload the downloaded notebook file.

### Upload the Compressed Model to the Repository

Once compression is complete, upload the compressed model back to the model repository using the steps outlined in *Upload Models Using Notebook*.

```python
model_dir = "./" + model_id.split("/")[-1] + "-W4A16" #[!code callout]
model.save_pretrained(model_dir)
tokenizer.save_pretrained(model_dir);
```

<Callouts>
1. Save the model and tokenizer. **You can modify this line if you want to change the name of the output directory**.
</Callouts>

### Deploy and Use the Compressed Model for Inference

Quantized and sparse models that you create with LLM Compressor are saved using the `compressed-tensors` library (an extension of [Safetensors](https://huggingface.co/docs/safetensors/en/index)).
The compression format matches the model's quantization or sparsity type. These formats are natively supported in vLLM, so models deployed with the Alauda AI Inference Server benefit from fast inference through optimized deployment kernels.
Follow the instructions in [create inference service](../../model_inference/inference_service/functions/inference_service.mdx#create-inference-service) to complete this step.
7 changes: 7 additions & 0 deletions docs/en/llm-compressor/how_to/index.mdx
@@ -0,0 +1,7 @@
---
weight: 60
---

# How To

<Overview />
7 changes: 7 additions & 0 deletions docs/en/llm-compressor/index.mdx
@@ -0,0 +1,7 @@
---
weight: 82
---

# LLM Compressor

<Overview />
35 changes: 35 additions & 0 deletions docs/en/llm-compressor/intro.mdx
@@ -0,0 +1,35 @@
---
weight: 10
---

# Introduction

## Preface

[LLM Compressor](https://github.com/vllm-project/llm-compressor), part of [the vLLM project](https://docs.vllm.ai/en/latest/) for efficient serving of LLMs, integrates the latest model compression research into a single open-source library enabling the generation of efficient, compressed models with minimal effort.

The framework allows users to apply some of the most recent research on model compression techniques to improve generative AI (gen AI) models' efficiency, scalability and performance while maintaining accuracy. With native support for Hugging Face and vLLM, the compressed models can be integrated into deployment pipelines, delivering faster and more cost-effective inference at scale.

LLM Compressor allows you to apply model optimization techniques such as quantization, sparsity, and compression to reduce memory use and model size and to speed up inference, with minimal impact on the accuracy of model responses. The following compression methodologies are supported by LLM Compressor:

- **Quantization**: Converts model weights and activations to lower-bit formats such as int8, reducing memory usage.
- **Sparsity**: Sets a portion of model weights to zero, often in fixed patterns, allowing for more efficient computation.
- **Compression**: Shrinks the saved model file size, ideally with minimal impact on performance.

Use these methods together to deploy models more efficiently on resource-limited hardware.

## LLM Compressor supports a wide variety of compression techniques:

- Weight-only quantization (W4A16) compresses model weights to 4-bit precision, valuable for AI applications with limited hardware resources or high sensitivity to latency.
- Weight and activation quantization (W8A8) compresses both weights and activations to 8-bit precision, targeting general server scenarios for integer and floating-point formats.

## LLM Compressor supports several compression algorithms:

- AWQ: Weight-only `INT4` quantization
- GPTQ: Weight-only `INT4` quantization
- FP8: Dynamic per-token quantization
- SparseGPT: Post-training sparsity
- SmoothQuant: Activation quantization

For more information about compression algorithms and formats, please refer to the [documentation](https://docs.vllm.ai/projects/llm-compressor/en/latest/guides/compression_schemes/) and examples in the [llmcompressor](https://github.com/vllm-project/llm-compressor) repository.
Each of these compression methods computes optimal scales and zero-points for weights and activations. Optimized scales can be per tensor, channel, group, or token. The final result is a compressed model saved with all its applied quantization parameters.
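
As a rough illustration of what a scale and a zero-point are, the following sketch applies plain min/max asymmetric 8-bit quantization to a small matrix, once with a single scale for the whole tensor and once with one scale per row (channel). This is a simplified assumption for teaching purposes; the algorithms listed above choose their parameters in more sophisticated ways, and the function names here are hypothetical.

```python
def quantize_asymmetric_uint8(values):
    """Quantize floats to unsigned 8-bit integers with a scale and zero-point."""
    # Extend the range so that 0.0 is always representable.
    low = min(min(values), 0.0)
    high = max(max(values), 0.0)
    scale = (high - low) / 255
    if scale == 0:
        scale = 1.0
    # The zero-point is the integer that represents the real value 0.0.
    zero_point = round(-low / scale)
    quantized = [max(0, min(255, round(v / scale) + zero_point)) for v in values]
    return quantized, scale, zero_point


def dequantize_uint8(quantized, scale, zero_point):
    """Map 8-bit integers back to approximate float values."""
    return [(q - zero_point) * scale for q in quantized]


def per_channel_scales(matrix):
    """Compute one scale per row instead of one for the whole tensor."""
    return [quantize_asymmetric_uint8(row)[1] for row in matrix]


matrix = [[0.0, 0.5, 1.0], [-100.0, 0.0, 100.0]]
flat = [v for row in matrix for v in row]
_, tensor_scale, _ = quantize_asymmetric_uint8(flat)
print("per-tensor scale:", tensor_scale)
print("per-channel scales:", per_channel_scales(matrix))
```

With a per-tensor scale, the row with small values shares the coarse step size dictated by the row with large values; per-channel scales give each row a step size suited to its own range, which is why finer granularity tends to preserve accuracy better.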