
LP: vllm benchmarking with quantised models #3207

Open
almayne wants to merge 5 commits into ArmDeveloperEcosystem:main from almayne:vllm_bench_quantised

Conversation


@almayne almayne commented Apr 24, 2026

Before submitting a pull request for a new Learning Path, please review Create a Learning Path

  • I have reviewed Create a Learning Path

Please do not include any confidential information in your contribution. This includes confidential microarchitecture details and unannounced product information.

  • I have checked my contribution for confidential information

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the Creative Commons Attribution 4.0 International License.

Signed-off-by: Anna Mayne <anna.mayne@arm.com>

@fadara01 fadara01 left a comment


Thank you for your work!

I've added some initial comments.

```
lm_eval --model vllm --model_args pretrained=RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8,dtype=bfloat16,max_model_len=4096 --tasks gsm8k --batch_size 4 --limit 10
```

We would expect to see slightly lower accuracy with INT8.


We should expect to see numbers similar to the ones reported for INT8 here: https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8


## Set up access to LLama3.1-8B models

To access the Llama models hosted on Hugging Face, you will need to install the Hugging Face CLI so that you can authenticate yourself and the harness can download what it needs. Create an account at https://huggingface.co/ and follow the instructions [in the Hugging Face CLI guide](https://huggingface.co/docs/huggingface_hub/en/guides/cli) to set up your access token. You can then install the CLI and log in:


Is it worth adding an instruction that you should also sign the licence agreement for the meta-llama model?

Author


Requesting access to the model is covered in the paragraph below. Is there an additional step I've forgotten?

* Accuracy: `--limit mmlu=10,gsm8k=500`

### Throughput ratios: INT8/BF16
| Requests/s | Total Tokens/s | Output Tokens/s |
|---|---|---|
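As an illustrative sketch of how an INT8/BF16 ratio such as Requests/s could be derived from two benchmark runs (the figures below are made-up placeholders, not results from this PR):

```shell
# Hypothetical requests-per-second figures from separate BF16 and INT8 runs
BF16_REQ_S=1.20
INT8_REQ_S=1.50

# Ratio of INT8 to BF16 throughput, rounded to two decimal places
awk -v int8="$INT8_REQ_S" -v bf16="$BF16_REQ_S" 'BEGIN { printf "%.2f\n", int8 / bf16 }'
```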


Given that we ran a serving benchmark, I think we should report latency here too.

Contributor

@nikhil-arm nikhil-arm left a comment


I think we need to redo the inference and benchmarking pages from scratch.
Also, I did not find any mention of Whisper, which was one of the requirements if I understand correctly.

@nSircombe

> I think we need to redo the inference and benchmarking pages from scratch. Also, I did not find any mention of Whisper, which was one of the requirements if I understand correctly.

Llama and/or Whisper, I think.

almayne added 3 commits April 27, 2026 08:30
Signed-off-by: Anna Mayne <anna.mayne@arm.com>
Signed-off-by: Anna Mayne <anna.mayne@arm.com>
Signed-off-by: Anna Mayne <anna.mayne@arm.com>
…page to use custom scripts and added whisper inference back in. Accuracy results in benchmarking page are now full runs.

Signed-off-by: Anna Mayne <anna.mayne@arm.com>


4 participants