Labels: bug (Something isn't working)
Description
Before submitting an issue, please make sure it hasn't been already addressed by searching through the existing and past issues.
Describe the bug
- I followed the notebook https://github.com/NVIDIA/Model-Optimizer/blob/main/examples/llm_qat/notebooks/QAT_QAD_Walkthrough.ipynb to perform NVFP4 QAT (and FP8 QAT). When both models are deployed with vLLM and evaluated on the IFEval benchmark, they show a large drop in accuracy compared with the corresponding PTQ-quantized models. It's my understanding that QAT models should improve on the accuracy of PTQ models, not regress.
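For reference, this is roughly how I ran the IFEval evaluation against the deployed checkpoints. It is a minimal sketch using the lm-evaluation-harness Python API with the vLLM backend; the checkpoint path and `tensor_parallel_size` are placeholders, and exact argument names may vary by harness version.

```python
# Hedged sketch of the evaluation step (not the exact command from the notebook).
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    # /path/to/quantized_checkpoint is a placeholder for the exported QAT/PTQ model.
    model_args="pretrained=/path/to/quantized_checkpoint,tensor_parallel_size=1",
    tasks=["ifeval"],
)
print(results["results"]["ifeval"])
```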
Steps/Code to reproduce bug
- Calibration size: 512
- Model used: meta-llama/Llama-3.1-8B-Instruct
- Executed notebook: https://github.com/elizabetht/language-modeling-from-scratch/blob/main/quantization/model-optimizer/qat/output_executed_nvfp4_qat.ipynb (a minimal sketch of the quantization step is included below)
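The quantization/QAT step I followed is essentially the ModelOpt flow from the walkthrough notebook. The sketch below is an approximation under my setup: `calib_dataloader` is an assumed dataloader built from the 512 calibration samples, and the config name (`NVFP4_DEFAULT_CFG`, or `FP8_DEFAULT_CFG` for the FP8 run) may differ across ModelOpt versions.

```python
# Hedged sketch of the QAT preparation step, adapted from the ModelOpt QAT/QAD walkthrough.
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="bfloat16", device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id)

def forward_loop(model):
    # Run the ~512 calibration samples through the model to collect activation statistics.
    for batch in calib_dataloader:  # calib_dataloader: assumed, built from the calibration set
        model(**batch)

# Insert NVFP4 fake-quant ops and calibrate them (FP8_DEFAULT_CFG was used for the FP8 run).
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)

# QAT: fine-tune the fake-quantized model with the usual training loop from the notebook,
# then export the checkpoint and deploy it with vLLM for evaluation.
```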
Expected behavior
Accuracy evaluations of the QAT models should improve when compared with PTQ models quantized to the same format (i.e., NVFP4/FP8).
Who can help?
- ?
System information
- Container used (if applicable): ?
- OS (e.g., Ubuntu 22.04, CentOS 7, Windows 10): ?
- CPU architecture (x86_64, aarch64): ?
- GPU name (e.g. H100, A100, L40S): ?
- GPU memory size: ?
- Number of GPUs: ?
- Library versions (if applicable):
- Python: ?
- ModelOpt version or commit hash: ?
- CUDA: ?
- PyTorch: ?
- Transformers: ?
- TensorRT-LLM: ?
- ONNXRuntime: ?
- TensorRT: ?
- Any other details that may help: ?