LP: vllm benchmarking with quantised models #3207
Open: almayne wants to merge 5 commits into ArmDeveloperEcosystem:main from almayne:vllm_bench_quantised (base: main).

Changes from all commits (5, all by almayne):

- 13546a9 LP: vllm benchmarking with quantised models
- 4d45ef9 Minor updates from review feedback.
- 147df8f More minor updates from review feedback.
- 5202295 Locked quantisation lib versions. Added more info in w8a8.
- 2fc3bcb Reordered pages and switched quant recipe to w8a8. Updated inference …

...servers-and-cloud-computing/vllm-benchmark-quantisation/1-overview-and-setup.md (71 additions, 0 deletions)

---
title: Set up vLLM
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## What is vLLM?

[vLLM](https://docs.vllm.ai/en/latest/) is an open-source, high-throughput inference and serving engine for large language models (LLMs). It's designed to maximise hardware efficiency, making LLM inference faster, more memory-efficient, and scalable.

## Understanding the models

Llama 3.1 8B is an open-weight, text-only LLM with 8 billion parameters that can understand and generate text. You can view the model card at https://huggingface.co/meta-llama/Llama-3.1-8B.

Whisper large-v3 is an automatic speech recognition (ASR) and speech translation model. It has 1.55 billion parameters and can both transcribe many languages and translate them to English. You can find the model card at https://huggingface.co/openai/whisper-large-v3.

## Set up your environment

Before you begin, make sure your environment meets these requirements:

- Python 3.12 on Ubuntu 22.04 LTS or newer
- At least 32 vCPUs, 96 GB RAM, and 64 GB of free disk space

This Learning Path was tested on a 96-core machine with 128-bit SVE, 192 GB of RAM, and 500 GB of attached storage.

## Install build dependencies

Install the packages required for running inference with vLLM on Arm64:

```bash
sudo apt-get update -y
sudo apt-get install -y python3.12-venv python3.12-dev
```

Now install tcmalloc, a fast memory allocator from Google's gperftools, which improves performance under high concurrency:

```bash
sudo apt-get install -y libtcmalloc-minimal4
```
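
tcmalloc only helps once the allocator is actually loaded. One common way to do this (a sketch, not a required step) is to preload it via the dynamic linker; locate the library first, since the path varies by distribution:

```bash
# Locate the installed tcmalloc library (path varies by distribution)
find /usr/lib -name "libtcmalloc_minimal*"

# Preload it for the current shell session; this example path is typical
# for Arm64 Ubuntu - adjust it to match the find output above
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4
```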

## Create and activate a Python virtual environment

It's best practice to install vLLM inside an isolated environment to prevent conflicts between system and project dependencies:

```bash
python3.12 -m venv vllm_env
source vllm_env/bin/activate
python -m pip install --upgrade pip
```

## Install vLLM for CPU

Install a recent CPU-specific build of vLLM:

```bash
export VLLM_VERSION=0.20.0
pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cpu-cp38-abi3-manylinux_2_35_aarch64.whl --extra-index-url https://download.pytorch.org/whl/cpu
```
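
As a quick sanity check, you can confirm the wheel imports cleanly and print its version:

```bash
python -c "import vllm; print(vllm.__version__)"
```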

If you wish to build vLLM from source, you can follow the instructions in the [Build and Run vLLM on Arm Servers Learning Path](/learning-paths/servers-and-cloud-computing/vllm/vllm-setup/).

## Set up access to Llama 3.1 8B models

To access the Llama models hosted on Hugging Face, you need to install the Hugging Face CLI so that you can authenticate yourself and the harness can download what it needs. Create an account on https://huggingface.co/ and follow the instructions in the [Hugging Face CLI guide](https://huggingface.co/docs/huggingface_hub/en/guides/cli) to set up your access token. You can then install the CLI and log in:

```bash
curl -LsSf https://hf.co/cli/install.sh | bash
hf auth login
```
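
To confirm that authentication succeeded before moving on, you can ask the CLI which account you are logged in as (assuming the installed hf CLI provides the auth whoami subcommand):

```bash
hf auth whoami
```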

Paste your access token into the terminal when prompted. To access Llama 3.1 8B you need to request access on the Hugging Face website. Visit https://huggingface.co/meta-llama/Llama-3.1-8B and select "Expand to review and access". Complete the form and you should be granted access in a matter of minutes.

Your environment is now set up to run inference with vLLM. Next, you'll review model quantisation and then use vLLM to run inference on both quantised and non-quantised Llama and Whisper models.

...ervers-and-cloud-computing/vllm-benchmark-quantisation/2-quantisation-recipe.md (104 additions, 0 deletions)

---
title: Quantisation Recipe
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Understanding quantisation

Quantised models have their weights converted to a lower-precision data type, which reduces the memory requirements of the model and can improve performance significantly. The [Run vLLM inference with INT4 quantization on Arm servers](/learning-paths/servers-and-cloud-computing/vllm-acceleration/) Learning Path covers how to quantise a model yourself. There are also many publicly available quantised versions of popular models, such as https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8 and https://huggingface.co/RedHatAI/whisper-large-v3-quantized.w8a8, which you will use in this Learning Path.

The notation w8a8 means that the weights have been quantised to 8-bit integers and the activations (the input data) are dynamically quantised to the same precision. This allows the kernels to use Arm's 8-bit integer matrix multiply feature, I8MM. You can learn more about this in the [KleidiAI and matrix multiplication](/learning-paths/cross-platform/kleidiai-explainer/) Learning Path.

The w8a8 models used in this Learning Path apply quantisation only to the weights and activations in the linear layers of the transformer blocks. Activation quantisation is applied per token and the weights are quantised per channel: each output channel has its own scaling factor between the INT8 and BF16 representations.
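
To make the per-channel scheme concrete, here is a minimal NumPy sketch. It is illustrative only, not the code llmcompressor runs: each row (output channel) of a toy weight matrix gets its own scale, mapping the row's largest absolute value to 127.

```python
import numpy as np

def quantise_per_channel(w: np.ndarray):
    """Symmetric INT8 quantisation with one scale per output channel (row)."""
    scales = np.abs(w).max(axis=1, keepdims=True) / 127.0  # one scale per row
    q = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
    return q, scales

w = np.random.randn(4, 8).astype(np.float32)  # toy weight matrix
q, scales = quantise_per_channel(w)
w_hat = q.astype(np.float32) * scales         # dequantised approximation
print("max abs error:", np.abs(w - w_hat).max())
```

Per-token activation quantisation applies the same idea along the token dimension, but the scales are computed at run time, which is why it is described as dynamic.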

## Quantising your own models

If you would prefer to generate your own w8a8 quantised models, the recipe below is provided as an example. This is an optional activity and not a core part of this Learning Path, as it can take several hours to run.

Install the required packages before running the quantisation script:

```bash
pip install compressed-tensors==0.14.0.1
pip install llmcompressor==0.10.0.1
pip install datasets==4.6.0
```

Then run the quantisation script:

```bash
python w8a8_quant.py
```

Where w8a8_quant.py contains:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from compressed_tensors.quantization import QuantizationType, QuantizationStrategy

model_id = "meta-llama/Meta-Llama-3.1-8B"

num_samples = 256
max_seq_len = 4096

# Download the tokenizer now; its files are copied alongside the
# quantised weights after the run completes
tokenizer = AutoTokenizer.from_pretrained(model_id)

def preprocess_fn(example):
    return {"text": example["text"]}

# Calibration data used by GPTQ to estimate quantisation error
ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.shuffle().select(range(num_samples))
ds = ds.map(preprocess_fn)

# w8a8 scheme: INT8 weights quantised per output channel (static, symmetric),
# INT8 activations quantised per token (dynamic, asymmetric)
scheme = {
    "targets": ["Linear"],
    "weights": {
        "num_bits": 8,
        "type": QuantizationType.INT,
        "strategy": QuantizationStrategy.CHANNEL,
        "symmetric": True,
        "dynamic": False,
        "group_size": None,
    },
    "input_activations": {
        "num_bits": 8,
        "type": QuantizationType.INT,
        "strategy": QuantizationStrategy.TOKEN,
        "dynamic": True,
        "symmetric": False,
        "observer": None,
    },
    "output_activations": None,
}

# GPTQ adjusts the quantised weights to minimise error on the calibration
# set; the lm_head output projection is left in full precision
recipe = GPTQModifier(
    targets="Linear",
    config_groups={"group_0": scheme},
    ignore=["lm_head"],
    dampening_frac=0.01,
    block_size=512,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
)
model.save_pretrained("Meta-Llama-3.1-8B-quantized.w8a8")
```

When this has completed, copy the tokeniser-specific files from the original model before running inference on your quantised model:

```bash
cp ~/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B/snapshots/*/*token* Meta-Llama-3.1-8B-quantized.w8a8/
```
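
As a quick sanity check that the exported directory is complete, you can serve it directly from the local path (the directory name matches the save_pretrained call above):

```bash
vllm serve ./Meta-Llama-3.1-8B-quantized.w8a8
```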

...aths/servers-and-cloud-computing/vllm-benchmark-quantisation/3-run-inference.md (148 additions, 0 deletions)

---
title: Run inference with vLLM
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Run inference on Llama 3.1 8B

You will use vLLM to serve an OpenAI-compatible API and run inference on Llama 3.1 8B. This demonstrates that the local environment is set up correctly.

Start vLLM's OpenAI-compatible API server using Llama 3.1 8B:

```bash
vllm serve meta-llama/Llama-3.1-8B
```
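
Leave the server running. From a second terminal, you can confirm it is ready to accept requests by listing the models it serves through the standard OpenAI model-listing route:

```bash
curl http://localhost:8000/v1/models
```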

Next, create a test script that sends a request to the server using the OpenAI library. Copy the code below to a file named llama_test.py:
```python
import time
from openai import OpenAI
from transformers import AutoTokenizer

# vLLM's OpenAI-compatible server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

model = "meta-llama/Llama-3.1-8B"  # model name the vLLM server was started with

# Define a chat template for the model
llama3_template = "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.first and message['role'] != 'system' %}{{ '<|start_header_id|>system<|end_header_id|>\n\n'+ 'You are a helpful assistant.' + '<|eot_id|>' }}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}"

# Define your prompt
message = [{"role": "user", "content": "Explain Big O notation with two examples."}]

def run(prompt):
    resp = client.completions.create(
        model=model,
        prompt=prompt,
        max_tokens=128,  # the maximum number of tokens to generate in the completion
    )
    return resp.choices[0].text

def main():
    t0 = time.time()

    # Apply the chat template so the base model sees a properly formatted prompt
    tokenizer = AutoTokenizer.from_pretrained(model)
    tokenizer.chat_template = llama3_template
    prompt = tokenizer.apply_chat_template(message, tokenize=False)
    result = run(prompt)

    print(f"\n=== Output ===\n{result}\n")
    print(f"Completed in: {time.time() - t0:.2f}s")

if __name__ == "__main__":
    main()
```

Now run the script:

```bash
python llama_test.py
```

This returns the text generated by the model from your prompt. In the server logs you can see the throughput measured in tokens per second.
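
You can also estimate generation throughput on the client side from the token counts returned with each response. The standalone sketch below assumes the server from the previous step is still running; vLLM populates the usage field for completion requests:

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

t0 = time.time()
resp = client.completions.create(
    model="meta-llama/Llama-3.1-8B",
    prompt="Explain Big O notation.",
    max_tokens=128,
)
elapsed = time.time() - t0

# usage reports prompt and completion token counts for the request
print(f"{resp.usage.completion_tokens / elapsed:.1f} generated tokens/s")
```

Running the same snippet against the quantised server gives a rough single-request comparison; proper measurement is covered when you benchmark the models later.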

You can do the same for the quantised model. Start the server:

```bash
vllm serve RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8
```

Update your test script to use the quantised model:

```python
model = "RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8"
```

Run inference on the quantised model:

```bash
python llama_test.py
```

You have now run inference using both the non-quantised and quantised Llama 3.1 8B models.

## Run inference on Whisper

You will use a similar approach to run inference on the Whisper models. Install the vLLM audio dependencies, then start vLLM's OpenAI-compatible API server using whisper-large-v3:

```bash
pip install "vllm[audio]"

vllm serve openai/whisper-large-v3
```

Next, create a test script that sends a request with an audio file to the server using the OpenAI library. Copy the code below to a file named whisper_test.py:
```python
import time
from openai import OpenAI
from vllm.assets.audio import AudioAsset

# vLLM's OpenAI-compatible server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

model = "openai/whisper-large-v3"  # model name the vLLM server was started with

# You can replace this with an audio file of your choosing
audio_filepath = str(AudioAsset("winning_call").get_local_path())

def transcribe_audio():
    with open(audio_filepath, "rb") as audio:
        transcription = client.audio.transcriptions.create(
            model=model,
            file=audio,
            language="en",
            response_format="json",
            temperature=0.0,
        )
    return transcription.text

def main():
    t0 = time.time()
    out = transcribe_audio()
    print(f"\n=== Output ===\n{out}\n")
    print(f"Completed in: {time.time() - t0:.2f}s")

if __name__ == "__main__":
    main()
```

Now run the script:

```bash
python whisper_test.py
```
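
Because Whisper can also translate non-English speech to English, you can exercise that path through the matching translations endpoint. The sketch below assumes your vLLM version exposes the OpenAI-compatible translations route:

```python
from openai import OpenAI
from vllm.assets.audio import AudioAsset

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
audio_filepath = str(AudioAsset("winning_call").get_local_path())

# Translate speech to English: same request shape as transcription,
# but a different endpoint
with open(audio_filepath, "rb") as audio:
    translation = client.audio.translations.create(
        model="openai/whisper-large-v3",
        file=audio,
        response_format="json",
    )
print(translation.text)
```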

You can do the same for the quantised model. Start the server:

```bash
vllm serve RedHatAI/whisper-large-v3-quantized.w8a8
```

Update your test script to use the quantised model:

```python
model = "RedHatAI/whisper-large-v3-quantized.w8a8"
```

Run inference on the quantised model:

```bash
python whisper_test.py
```

You now have the quantised and non-quantised Llama and Whisper models on your local machine. You have installed vLLM and demonstrated that you can run inference on your models. Now you can move on to benchmarking the Llama models and comparing their performance.