---
title: Setup vLLM
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## What is vLLM

[vLLM](https://docs.vllm.ai/en/latest/) is an open-source, high-throughput inference and serving engine for large language models (LLMs). It’s designed to maximise hardware efficiency, making LLM inference faster, more memory-efficient, and scalable.
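
In this Learning Path you will mostly interact with vLLM through its OpenAI-compatible server, but the engine also exposes an offline Python API. As a quick illustration of what vLLM does, here is a minimal offline-inference sketch (runnable only after the installation steps below; the prompt and sampling values are arbitrary):

```python
from vllm import LLM, SamplingParams

# Load the model into the vLLM engine (weights are downloaded on first use)
llm = LLM(model="meta-llama/Llama-3.1-8B")

# Sampling controls: limit length and set generation temperature
params = SamplingParams(temperature=0.8, max_tokens=64)

# Generate completions for a batch of prompts in a single call
outputs = llm.generate(["Explain Big O notation in one sentence."], params)
print(outputs[0].outputs[0].text)
```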

## Understanding the models

Llama 3.1 8B is an open-weight, text-only LLM with 8 billion parameters that can understand and generate text. You can view the model card at https://huggingface.co/meta-llama/Llama-3.1-8B.

Whisper large V3 is an automatic speech recognition (ASR) and speech translation model. It has 1.55 billion parameters and can both transcribe many languages and translate them to English. You can find the model card at https://huggingface.co/openai/whisper-large-v3.

## Set up your environment

Before you begin, make sure your environment meets these requirements:

- Python 3.12 on Ubuntu 22.04 LTS or newer
- At least 32 vCPUs, 96 GB RAM, and 64 GB of free disk space

This Learning Path was tested on a 96-core machine with 128-bit SVE, 192 GB of RAM, and 500 GB of attached storage.

## Install build dependencies

Install the following packages required for running inference with vLLM on Arm64:
```bash
sudo apt-get update -y
sudo apt-get install -y python3.12-venv python3.12-dev
```

Now install tcmalloc, a fast memory allocator from Google’s gperftools, which improves performance under high concurrency:
```bash
sudo apt-get install -y libtcmalloc-minimal4
```
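
If you want to confirm the allocator is present before continuing, one optional check (not part of the official setup) is to load the shared library from Python:

```python
import ctypes

# Raises OSError if libtcmalloc-minimal4 is not installed
ctypes.CDLL("libtcmalloc_minimal.so.4")
print("tcmalloc is available")
```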

## Create and activate a Python virtual environment

It’s best practice to install vLLM inside an isolated environment to prevent conflicts between system and project dependencies:
```bash
python3.12 -m venv vllm_env
source vllm_env/bin/activate
python -m pip install --upgrade pip
```

## Install vLLM for CPU

Install a recent CPU-specific build of vLLM:
```bash
export VLLM_VERSION=0.20.0
pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cpu-cp38-abi3-manylinux_2_35_aarch64.whl --extra-index-url https://download.pytorch.org/whl/cpu
```
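
As a quick sanity check that the wheel installed correctly, you can print the version from inside the virtual environment; it should match the VLLM_VERSION you exported:

```python
# Run inside the activated vllm_env virtual environment
import vllm

print(vllm.__version__)
```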

If you wish to build vLLM from source, you can follow the instructions in the [Build and Run vLLM on Arm Servers Learning Path](/learning-paths/servers-and-cloud-computing/vllm/vllm-setup/).


## Set up access to Llama3.1-8B models

To access the Llama models hosted on Hugging Face, you need the Hugging Face CLI so that you can authenticate and download the models this Learning Path requires. Create an account at https://huggingface.co/ and follow the instructions [in the Hugging Face CLI guide](https://huggingface.co/docs/huggingface_hub/en/guides/cli) to set up your access token. You can then install the CLI and log in:
```bash
curl -LsSf https://hf.co/cli/install.sh | bash
hf auth login
```

Paste your access token into the terminal when prompted. To access Llama3.1-8B, you also need to request access on the Hugging Face website: visit https://huggingface.co/meta-llama/Llama-3.1-8B, select "Expand to review and access", and complete the form. You should be granted access within a matter of minutes.
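
Once access has been granted, you can optionally confirm that your token works before starting a long model download. The sketch below uses the huggingface_hub library, which is installed as a dependency of vLLM; it raises an error if your access request is still pending:

```python
from huggingface_hub import hf_hub_download

# Fetches a small file from the gated repository; fails if your
# token does not yet have access to Llama 3.1 8B
path = hf_hub_download(repo_id="meta-llama/Llama-3.1-8B", filename="config.json")
print(f"Access confirmed, config cached at {path}")
```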

Your environment is now set up to run inference with vLLM. Next, we'll review model quantisation and then you'll use vLLM to run inference on both quantised and non-quantised Llama and Whisper models.
---
title: Quantisation Recipe
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Understanding quantisation

Quantised models have their weights converted to a lower-precision data type, which reduces the memory requirements of the model and can improve performance significantly. The [Run vLLM inference with INT4 quantization on Arm servers](/learning-paths/servers-and-cloud-computing/vllm-acceleration/) Learning Path covers how to quantise a model yourself. There are also many publicly available quantised versions of popular models, such as https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8 and https://huggingface.co/RedHatAI/whisper-large-v3-quantized.w8a8, which are the models used in this Learning Path.

The notation w8a8 means that the weights have been quantised to 8-bit integers and the activations (the input data) are dynamically quantised to the same precision. This allows the kernels to use Arm's 8-bit integer matrix multiply feature, I8MM. You can learn more in the [KleidiAI and matrix multiplication](/learning-paths/cross-platform/kleidiai-explainer/) Learning Path.

The w8a8 models used in this Learning Path only apply quantisation to the weights and activations in the linear layers of the transformer blocks. Activations are quantised per-token and weights per-channel: each output channel has its own scaling factor for mapping between the INT8 and BF16 representations.
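
To make the per-channel scheme concrete, here is a small NumPy sketch of symmetric INT8 quantisation with one scale per output channel. This is illustrative only, not the implementation used by the quantisation tooling below:

```python
import numpy as np

# A toy weight matrix with shape (output_channels, input_channels)
w = np.random.randn(4, 8).astype(np.float32)

# One scale per output channel: map the largest magnitude in each row to 127
scales = np.abs(w).max(axis=1, keepdims=True) / 127.0

# Quantise to INT8, then dequantise to inspect the rounding error
w_int8 = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
w_dequant = w_int8.astype(np.float32) * scales

print("max abs error:", np.abs(w - w_dequant).max())
```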

## Quantising your own models

If you would prefer to generate your own w8a8 quantised models, the recipe below is provided as an example. This is an optional activity and not a core part of this Learning Path, as it can take several hours to run.

Install the packages required by the quantisation script:
```bash
pip install compressed-tensors==0.14.0.1
pip install llmcompressor==0.10.0.1
pip install datasets==4.6.0
```

Next, create a file named w8a8_quant.py containing:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from compressed_tensors.quantization import QuantizationType, QuantizationStrategy

model_id = "meta-llama/Meta-Llama-3.1-8B"

# Calibration settings
num_samples = 256
max_seq_len = 4096

tokenizer = AutoTokenizer.from_pretrained(model_id)

def preprocess_fn(example):
    return {"text": example["text"]}

# Load and subsample the calibration dataset
ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.shuffle().select(range(num_samples))
ds = ds.map(preprocess_fn)

# w8a8 scheme: static per-channel INT8 weights, dynamic per-token INT8 activations
scheme = {
    "targets": ["Linear"],
    "weights": {
        "num_bits": 8,
        "type": QuantizationType.INT,
        "strategy": QuantizationStrategy.CHANNEL,
        "symmetric": True,
        "dynamic": False,
        "group_size": None,
    },
    "input_activations": {
        "num_bits": 8,
        "type": QuantizationType.INT,
        "strategy": QuantizationStrategy.TOKEN,
        "dynamic": True,
        "symmetric": False,
        "observer": None,
    },
    "output_activations": None,
}

# Apply GPTQ quantisation to all Linear layers except the output head
recipe = GPTQModifier(
    targets="Linear",
    config_groups={"group_0": scheme},
    ignore=["lm_head"],
    dampening_frac=0.01,
    block_size=512,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)

# Run one-shot calibration and quantisation, then save the result
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
)
model.save_pretrained("Meta-Llama-3.1-8B-quantized.w8a8")
```

Run the quantisation script:
```bash
python w8a8_quant.py
```

When the script has completed, copy the tokeniser-specific files from the original model into the quantised model directory before running inference:

```bash
cp ~/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B/snapshots/*/*token* Meta-Llama-3.1-8B-quantized.w8a8/
```
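
Alternatively, you can let transformers write the tokeniser files for you instead of copying them from the cache. A minimal sketch, assuming the same model ID used in the quantisation script:

```python
from transformers import AutoTokenizer

# Save the original model's tokeniser files alongside the quantised weights
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
tokenizer.save_pretrained("Meta-Llama-3.1-8B-quantized.w8a8")
```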
---
title: Run inference with vLLM
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Run inference on Llama3.1-8B

We will use vLLM to serve an OpenAI-compatible API that we can use to run inference on Llama3.1-8B. This will demonstrate that the local environment is set up correctly.

Start vLLM’s OpenAI-compatible API server using Llama3.1-8B:
```bash
vllm serve meta-llama/Llama-3.1-8B
```

Then we can create a test script that sends a request to the server using the OpenAI library. Copy the code below into a file named llama_test.py.

```python
import time
from openai import OpenAI
from transformers import AutoTokenizer

# vLLM's OpenAI-compatible server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

model = "meta-llama/Llama-3.1-8B"  # vLLM server model

# Define a chat template for the model
llama3_template = "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.first and message['role'] != 'system' %}{{ '<|start_header_id|>system<|end_header_id|>\n\n'+ 'You are a helpful assistant.' + '<|eot_id|>' }}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}"

# Define your prompt
message = [{"role": "user", "content": "Explain Big O notation with two examples."}]

def run(prompt):
    resp = client.completions.create(
        model=model,
        prompt=prompt,
        max_tokens=128,  # the maximum number of tokens generated in the completion
    )
    return resp.choices[0].text

def main():
    t0 = time.time()

    # Apply the chat template to the message before sending it to the server
    tokenizer = AutoTokenizer.from_pretrained(model)
    tokenizer.chat_template = llama3_template
    prompt = tokenizer.apply_chat_template(message, tokenize=False)
    result = run(prompt)

    print(f"\n=== Output ===\n{result}\n")
    print(f"Batch completed in: {time.time() - t0:.2f}s")

if __name__ == "__main__":
    main()
```

Now run the script with:
```bash
python llama_test.py
```

This returns the text generated by the model from your prompt. In the server logs, you can see the throughput measured in tokens per second.
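
If you also want a rough client-side throughput figure, you can count the generated tokens with the model's tokenizer and divide by wall time. This helper is an optional addition to llama_test.py, not part of the original script, and the server logs remain the authoritative measurement:

```python
from transformers import AutoTokenizer

def estimate_tokens_per_second(text: str, elapsed_s: float, model_id: str) -> float:
    # Approximate throughput: generated tokens divided by total wall time,
    # which includes request overhead as well as generation itself
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    n_tokens = len(tokenizer.encode(text, add_special_tokens=False))
    return n_tokens / elapsed_s
```

Calling `estimate_tokens_per_second(result, time.time() - t0, model)` at the end of `main()` gives a figure you can compare against the server logs.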

You can do the same for the quantised model. Start the server:
```bash
vllm serve RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8
```

Update your test script to use the quantised model:
```python
model = "RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8"
```

Run inference on the quantised model:
```bash
python llama_test.py
```

You have now run inference using both the non-quantised and quantised Llama3.1-8B models.

## Run inference on Whisper

We will use a similar approach to run inference on the Whisper models. Install the required vLLM audio dependencies, then start vLLM’s OpenAI-compatible API server using Whisper-large-v3:
```bash
pip install "vllm[audio]"

vllm serve openai/whisper-large-v3
```

Then we can create a test script that sends a request with an audio file to the server using the OpenAI library. Copy the code below into a file named whisper_test.py.

```python
import time
from openai import OpenAI
from vllm.assets.audio import AudioAsset

# vLLM's OpenAI-compatible server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

model = "openai/whisper-large-v3"  # vLLM server model

# You can update the below with an audio file of your choosing
audio_filepath = str(AudioAsset("winning_call").get_local_path())

def transcribe_audio():
    # Send the audio file to the transcription endpoint
    with open(audio_filepath, "rb") as audio:
        transcription = client.audio.transcriptions.create(
            model=model,
            file=audio,
            language="en",
            response_format="json",
            temperature=0.0,
        )
    return transcription.text

def main():
    t0 = time.time()
    out = transcribe_audio()
    print(f"\n=== Output ===\n{out}\n")
    print(f"Batch completed in: {time.time() - t0:.2f}s")

if __name__ == "__main__":
    main()
```

Now run the script with:
```bash
python whisper_test.py
```

You can do the same for the quantised model. Start the server:
```bash
vllm serve RedHatAI/whisper-large-v3-quantized.w8a8
```

Update your test script to use the quantised model:
```python
model = "RedHatAI/whisper-large-v3-quantized.w8a8"
```

Run inference on the quantised model:
```bash
python whisper_test.py
```

You now have the quantised and non-quantised Llama and Whisper models on your local machine. You have installed vLLM and demonstrated that you can run inference on your models. Now you can move on to benchmarking the Llama models and comparing their performance.