---
title: Setup vLLM
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## What is vLLM

[vLLM](https://docs.vllm.ai/en/latest/) is an open-source, high-throughput inference and serving engine for large language models (LLMs). It’s designed to maximise hardware efficiency, making LLM inference faster, more memory-efficient, and scalable.
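
In this Learning Path you will mostly interact with vLLM through its OpenAI-compatible server, but the engine also exposes an offline Python API. As a quick illustration of what vLLM does, here is a minimal offline-inference sketch (runnable only after the installation steps below; the prompt and sampling values are arbitrary):

```python
from vllm import LLM, SamplingParams

# Load the model into the vLLM engine (weights are downloaded on first use)
llm = LLM(model="meta-llama/Llama-3.1-8B")

# Sampling controls: limit length and set generation temperature
params = SamplingParams(temperature=0.8, max_tokens=64)

# Generate completions for a batch of prompts in a single call
outputs = llm.generate(["Explain Big O notation in one sentence."], params)
print(outputs[0].outputs[0].text)
```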

## Understanding the models

Llama 3.1 8B is an open-weight, text-only LLM with 8 billion parameters that can understand and generate text. You can view the model card at https://huggingface.co/meta-llama/Llama-3.1-8B.

Whisper large V3 is an automatic speech recognition (ASR) and speech translation model. It has 1.55 billion parameters and can both transcribe many languages and translate them to English. You can find the model card at https://huggingface.co/openai/whisper-large-v3.

## Set up your environment

Before you begin, make sure your environment meets these requirements:

- Python 3.12 on Ubuntu 22.04 LTS or newer
- At least 32 vCPUs, 96 GB RAM, and 64 GB of free disk space

This Learning Path was tested on a 96-core machine with 128-bit SVE, 192 GB of RAM, and 500 GB of attached storage.

## Install build dependencies

Install the following packages required for running inference with vLLM on Arm64:
```bash
sudo apt-get update -y
sudo apt-get install -y python3.12-venv python3.12-dev
```

Now install tcmalloc, a fast memory allocator from Google’s gperftools, which improves performance under high concurrency:
```bash
sudo apt-get install -y libtcmalloc-minimal4
```
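
If you want to confirm the allocator is present before continuing, one optional check (not part of the official setup) is to load the shared library from Python:

```python
import ctypes

# Raises OSError if libtcmalloc-minimal4 is not installed
ctypes.CDLL("libtcmalloc_minimal.so.4")
print("tcmalloc is available")
```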

## Create and activate a Python virtual environment

It’s best practice to install vLLM inside an isolated environment to prevent conflicts between system and project dependencies:
```bash
python3.12 -m venv vllm_env
source vllm_env/bin/activate
python -m pip install --upgrade pip
```

## Install vLLM for CPU

Install a recent CPU-specific build of vLLM:
```bash
export VLLM_VERSION=0.20.0
pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cpu-cp38-abi3-manylinux_2_35_aarch64.whl --extra-index-url https://download.pytorch.org/whl/cpu
```
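
As a quick sanity check that the wheel installed correctly, you can print the version from inside the virtual environment; it should match the VLLM_VERSION you exported:

```python
# Run inside the activated vllm_env virtual environment
import vllm

print(vllm.__version__)
```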

If you wish to build vLLM from source, you can follow the instructions in the [Build and Run vLLM on Arm Servers Learning Path](/learning-paths/servers-and-cloud-computing/vllm/vllm-setup/).


## Set up access to Llama3.1-8B models

To access the Llama models hosted on Hugging Face, you need the Hugging Face CLI so that you can authenticate and download the models this Learning Path requires. Create an account at https://huggingface.co/ and follow the instructions [in the Hugging Face CLI guide](https://huggingface.co/docs/huggingface_hub/en/guides/cli) to set up your access token. You can then install the CLI and log in:
```bash
curl -LsSf https://hf.co/cli/install.sh | bash
hf auth login
```

Paste your access token into the terminal when prompted. To access Llama3.1-8B, you also need to request access on the Hugging Face website: visit https://huggingface.co/meta-llama/Llama-3.1-8B, select "Expand to review and access", and complete the form. You should be granted access within a matter of minutes.
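
Once access has been granted, you can optionally confirm that your token works before starting a long model download. The sketch below uses the huggingface_hub library, which is installed as a dependency of vLLM; it raises an error if your access request is still pending:

```python
from huggingface_hub import hf_hub_download

# Fetches a small file from the gated repository; fails if your
# token does not yet have access to Llama 3.1 8B
path = hf_hub_download(repo_id="meta-llama/Llama-3.1-8B", filename="config.json")
print(f"Access confirmed, config cached at {path}")
```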

Your environment is now set up to run inference with vLLM. Next, we'll review model quantisation and then you'll use vLLM to run inference on both quantised and non-quantised Llama and Whisper models.
---
title: Quantisation Recipe
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Understanding quantisation

Quantised models have their weights converted to a lower-precision data type, which reduces the memory requirements of the model and can improve performance significantly. The [Run vLLM inference with INT4 quantization on Arm servers](/learning-paths/servers-and-cloud-computing/vllm-acceleration/) Learning Path covers how to quantise a model yourself. There are also many publicly available quantised versions of popular models, such as https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8 and https://huggingface.co/RedHatAI/whisper-large-v3-quantized.w8a8, which are the models used in this Learning Path.

The notation w8a8 means that the weights have been quantised to 8-bit integers and the activations (the input data) are dynamically quantised to the same precision. This allows the kernels to use Arm's 8-bit integer matrix multiply feature, I8MM. You can learn more in the [KleidiAI and matrix multiplication](/learning-paths/cross-platform/kleidiai-explainer/) Learning Path.

The w8a8 models used in this Learning Path only apply quantisation to the weights and activations in the linear layers of the transformer blocks. Activations are quantised per-token and weights per-channel: each output channel has its own scaling factor for mapping between the INT8 and BF16 representations.
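
To make the per-channel scheme concrete, here is a small NumPy sketch of symmetric INT8 quantisation with one scale per output channel. This is illustrative only, not the implementation used by the quantisation tooling below:

```python
import numpy as np

# A toy weight matrix with shape (output_channels, input_channels)
w = np.random.randn(4, 8).astype(np.float32)

# One scale per output channel: map the largest magnitude in each row to 127
scales = np.abs(w).max(axis=1, keepdims=True) / 127.0

# Quantise to INT8, then dequantise to inspect the rounding error
w_int8 = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
w_dequant = w_int8.astype(np.float32) * scales

print("max abs error:", np.abs(w - w_dequant).max())
```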

## Quantising your own models

If you would prefer to generate your own w8a8 quantised models, the recipe below is provided as an example. This is an optional activity and not a core part of this Learning Path, as it can take several hours to run.

Install the packages required by the quantisation script:
```bash
pip install compressed-tensors==0.14.0.1
pip install llmcompressor==0.10.0.1
pip install datasets==4.6.0
```

Next, create a file named w8a8_quant.py containing:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from compressed_tensors.quantization import QuantizationType, QuantizationStrategy

model_id = "meta-llama/Meta-Llama-3.1-8B"

# Calibration settings
num_samples = 256
max_seq_len = 4096

tokenizer = AutoTokenizer.from_pretrained(model_id)

def preprocess_fn(example):
    return {"text": example["text"]}

# Load and subsample the calibration dataset
ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.shuffle().select(range(num_samples))
ds = ds.map(preprocess_fn)

# w8a8 scheme: static per-channel INT8 weights, dynamic per-token INT8 activations
scheme = {
    "targets": ["Linear"],
    "weights": {
        "num_bits": 8,
        "type": QuantizationType.INT,
        "strategy": QuantizationStrategy.CHANNEL,
        "symmetric": True,
        "dynamic": False,
        "group_size": None,
    },
    "input_activations": {
        "num_bits": 8,
        "type": QuantizationType.INT,
        "strategy": QuantizationStrategy.TOKEN,
        "dynamic": True,
        "symmetric": False,
        "observer": None,
    },
    "output_activations": None,
}

# Apply GPTQ quantisation to all Linear layers except the output head
recipe = GPTQModifier(
    targets="Linear",
    config_groups={"group_0": scheme},
    ignore=["lm_head"],
    dampening_frac=0.01,
    block_size=512,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)

# Run one-shot calibration and quantisation, then save the result
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
)
model.save_pretrained("Meta-Llama-3.1-8B-quantized.w8a8")
```

Run the quantisation script:
```bash
python w8a8_quant.py
```

When the script has completed, copy the tokeniser-specific files from the original model into the quantised model directory before running inference:

```bash
cp ~/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B/snapshots/*/*token* Meta-Llama-3.1-8B-quantized.w8a8/
```
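
Alternatively, you can let transformers write the tokeniser files for you instead of copying them from the cache. A minimal sketch, assuming the same model ID used in the quantisation script:

```python
from transformers import AutoTokenizer

# Save the original model's tokeniser files alongside the quantised weights
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
tokenizer.save_pretrained("Meta-Llama-3.1-8B-quantized.w8a8")
```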
---
title: Run inference with vLLM
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Run inference on Llama3.1-8B

We will use vLLM to serve an OpenAI-compatible API that we can use to run inference on Llama3.1-8B. This will demonstrate that the local environment is set up correctly.

Start vLLM’s OpenAI-compatible API server using Llama3.1-8B:
```bash
vllm serve meta-llama/Llama-3.1-8B
```

Then we can create a test script that sends a request to the server using the OpenAI library. Copy the code below into a file named llama_test.py.

```python
import time
from openai import OpenAI
from transformers import AutoTokenizer

# vLLM's OpenAI-compatible server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

model = "meta-llama/Llama-3.1-8B"  # vLLM server model

# Define a chat template for the model
llama3_template = "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.first and message['role'] != 'system' %}{{ '<|start_header_id|>system<|end_header_id|>\n\n'+ 'You are a helpful assistant.' + '<|eot_id|>' }}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}"

# Define your prompt
message = [{"role": "user", "content": "Explain Big O notation with two examples."}]

def run(prompt):
    resp = client.completions.create(
        model=model,
        prompt=prompt,
        max_tokens=128,  # the maximum number of tokens generated in the completion
    )
    return resp.choices[0].text

def main():
    t0 = time.time()

    # Apply the chat template to the message before sending it to the server
    tokenizer = AutoTokenizer.from_pretrained(model)
    tokenizer.chat_template = llama3_template
    prompt = tokenizer.apply_chat_template(message, tokenize=False)
    result = run(prompt)

    print(f"\n=== Output ===\n{result}\n")
    print(f"Batch completed in: {time.time() - t0:.2f}s")

if __name__ == "__main__":
    main()
```

Now run the script with:
```bash
python llama_test.py
```

This returns the text generated by the model from your prompt. In the server logs, you can see the throughput measured in tokens per second.
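
If you also want a rough client-side throughput figure, you can count the generated tokens with the model's tokenizer and divide by wall time. This helper is an optional addition to llama_test.py, not part of the original script, and the server logs remain the authoritative measurement:

```python
from transformers import AutoTokenizer

def estimate_tokens_per_second(text: str, elapsed_s: float, model_id: str) -> float:
    # Approximate throughput: generated tokens divided by total wall time,
    # which includes request overhead as well as generation itself
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    n_tokens = len(tokenizer.encode(text, add_special_tokens=False))
    return n_tokens / elapsed_s
```

Calling `estimate_tokens_per_second(result, time.time() - t0, model)` at the end of `main()` gives a figure you can compare against the server logs.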

You can do the same for the quantised model. Start the server:
```bash
vllm serve RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8
```

Update your test script to use the quantised model:
```python
model = "RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8"
```

Run inference on the quantised model:
```bash
python llama_test.py
```

You have now run inference using both the non-quantised and quantised Llama3.1-8B models.

## Run inference on Whisper

We will use a similar approach to run inference on the Whisper models. Install the required vLLM audio dependencies, then start vLLM’s OpenAI-compatible API server using Whisper-large-v3:
```bash
pip install "vllm[audio]"

vllm serve openai/whisper-large-v3
```

Then we can create a test script that sends a request with an audio file to the server using the OpenAI library. Copy the code below into a file named whisper_test.py.

```python
import time
from openai import OpenAI
from vllm.assets.audio import AudioAsset

# vLLM's OpenAI-compatible server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

model = "openai/whisper-large-v3"  # vLLM server model

# You can update the below with an audio file of your choosing
audio_filepath = str(AudioAsset("winning_call").get_local_path())

def transcribe_audio():
    # Send the audio file to the transcription endpoint
    with open(audio_filepath, "rb") as audio:
        transcription = client.audio.transcriptions.create(
            model=model,
            file=audio,
            language="en",
            response_format="json",
            temperature=0.0,
        )
    return transcription.text

def main():
    t0 = time.time()
    out = transcribe_audio()
    print(f"\n=== Output ===\n{out}\n")
    print(f"Batch completed in: {time.time() - t0:.2f}s")

if __name__ == "__main__":
    main()
```

Now run the script with:
```bash
python whisper_test.py
```

You can do the same for the quantised model. Start the server:
```bash
vllm serve RedHatAI/whisper-large-v3-quantized.w8a8
```

Update your test script to use the quantised model:
```python
model = "RedHatAI/whisper-large-v3-quantized.w8a8"
```

Run inference on the quantised model:
```bash
python whisper_test.py
```

You now have the quantised and non-quantised Llama and Whisper models on your local machine. You have installed vLLM and demonstrated that you can run inference on your models. Now you can move on to benchmarking the Llama models and comparing their performance.