[Do Not merge]Draft:Add NVFP4 four-over-six (4o6) adaptive activation quantization#1050
Implements the adaptive per-block scale selection strategy from arxiv:2512.02010.
Each 16-element activation block independently chooses between a 4-bit or 6-bit
FP8 block scale (MSE/MAE/abs_max criteria), reducing quantization error vs. uniform
scale encoding without requiring Blackwell hardware.
New public API:
- `nvfp4_4o6_fake_quant` in `modelopt.torch.quantization.calib.fouroversix`
- `NVFP4_4O6_W4A4_CFG` config (standard NVFP4 weights + 4o6 activations)
- `"nvfp4_4o6"` qformat in `hf_ptq.py`
Integration uses the existing `TensorQuantizer` backend mechanism
(`register_quant_backend("nvfp4_4o6", ...)`), with dynamic per-inference amax.
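The per-block selection described above can be sketched as follows. This is an illustrative re-implementation, not the PR's code: `quantize_to_grid` and the symmetric 4-bit value grid are simple stand-ins for the actual NVFP4 (E2M1) and FP8-derived 4/6-bit scale encodings.

```python
import numpy as np

def quantize_to_grid(x, bits):
    # Stand-in for the 4-bit vs. 6-bit block-scale encodings:
    # a uniform grid whose resolution grows with the bit width.
    levels = 2 ** bits - 1
    return np.round(x * levels) / levels

def fake_quant_block(block, scale):
    # Stand-in for NVFP4 value quantization: symmetric 4-bit
    # integer grid scaled by the (quantized) block scale.
    if scale == 0:
        return np.zeros_like(block)
    q = np.clip(np.round(block / scale * 7), -7, 7)
    return q / 7 * scale

def four_over_six_select(x, criterion="mse"):
    """For each 16-element block, try both scale encodings and keep the
    one with lower reconstruction error (MSE or MAE)."""
    out = np.empty_like(x)
    for i in range(0, x.size, 16):
        block = x[i:i + 16]
        amax = np.abs(block).max()  # dynamic per-inference amax
        best_err, best = np.inf, None
        for bits in (4, 6):  # the "four over six" choice
            scale = quantize_to_grid(amax, bits)
            deq = fake_quant_block(block, scale)
            if criterion == "mse":
                err = np.mean((block - deq) ** 2)
            else:
                err = np.mean(np.abs(block - deq))
            if err < best_err:
                best_err, best = err, deq
        out[i:i + 16] = best
    return out
```

By construction the adaptive choice is never worse than committing to the 4-bit scale for every block, which is the intuition behind the reported error reduction.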
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##              main    #1050      +/-   ##
==========================================
+ Coverage    70.10%   70.16%   +0.06%
==========================================
  Files          221      222       +1
  Lines        25541    25606      +65
==========================================
+ Hits         17905    17967      +62
- Misses        7636     7639       +3
What does this PR do?
Type of change: ?
Usage

# Add a code snippet demonstrating how to use this

Testing
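A minimal usage sketch, assuming the names this PR proposes (`NVFP4_4O6_W4A4_CFG`, the `nvfp4_4o6` qformat); `calibrate_loop` and `calib_data` are placeholders you supply:

```python
import modelopt.torch.quantization as mtq

# Hypothetical once this PR lands: NVFP4_4O6_W4A4_CFG pairs standard
# NVFP4 weights with 4o6 activations (dynamic per-inference amax).
def calibrate_loop(model):
    # Run a few representative batches so static quantizer
    # parameters can be calibrated.
    for batch in calib_data:
        model(batch)

model = mtq.quantize(model, mtq.NVFP4_4O6_W4A4_CFG, forward_loop=calibrate_loop)
```

Alternatively, via the PTQ example script added in this PR: `python hf_ptq.py --qformat nvfp4_4o6 ...` (remaining flags as in the existing NVFP4 flow).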
Before your PR is "Ready for review"

- Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).
- CONTRIBUTING.md: ✅ / ❌ / N/A

Additional Information