Remove _moe_count_expert_calib_tokens flag; tie token counting to moe_calib_experts_ratio #1062
Conversation
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough
MoE calibration control is consolidated around `moe_calib_experts_ratio`.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CLI as CLI/Args
    participant Config as QuantizeConfig
    participant Plugin as HuggingFacePlugin
    participant MoE as MoE Layer
    participant Experts as Expert Modules
    CLI->>Config: set moe_calib_experts_ratio (None or ratio)
    Config->>Plugin: provide config (ratio present or None)
    Plugin->>MoE: forward(input)
    alt ratio is set (calibration)
        Plugin->>Plugin: init token counting if ratio < 1.0
        Plugin->>MoE: adjust routing (top_k scaled by ratio)
        MoE->>Experts: route tokens (wider routing)
        Experts-->>MoE: return activations (collect per-expert stats)
        Plugin->>Plugin: reset token counting flag
        Plugin->>Plugin: skip layer_sync_moe_local_experts_amax
    else ratio is None (normal)
        MoE->>Experts: route tokens (normal routing)
        Plugin->>Plugin: perform layer_sync_moe_local_experts_amax as usual
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Edwardf0t1 left a comment:

Review
Clean simplification — removing moe_count_expert_calib_tokens and tying token counting to moe_calib_experts_ratio is a nice net reduction (+25/-42). A few suggestions:
Use `is not None` instead of truthiness for `_moe_calib_experts_ratio`
In huggingface.py, the guard `if not self._moe_calib_experts_ratio` and the check in `layer_sync_moe_local_experts_amax` both rely on truthiness to distinguish None from a valid float. This works today but is fragile; e.g. 0.0 is falsy and would be silently treated as None. Suggest explicit checks:
```python
# forward guard
if self._moe_calib_experts_ratio is None:
    return super().forward(hidden_states)

# layer_sync_moe_local_experts_amax
if self._moe_calib_experts_ratio is not None:
    return
```

Restore the range assertion (or add a Pydantic validator)
The old code had assert 0 < self._moe_calib_experts_ratio <= 1. This was removed, and only the CLI (hf_ptq.py) validates the range now. If someone uses the Python API directly via QuantizeAlgorithmConfig, there's no validation. Consider keeping the assertion or adding a field validator on the config.
Minor: token counting init runs outside calibration
_init_token_counting() is now called whenever _moe_calib_experts_ratio is set, even on non-calibration forward passes. It only runs once (guarded by _token_counting_initialized), but it will allocate the counting tensor even if calibration mode is never entered. Not a bug, just a minor overhead change from the previous behavior.
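As an illustration of the distinction, a toy stand-in (hypothetical names mirroring the review text, not the real `_QuantSparseMoe`) can defer the lazy init to the calibration branch so the counting tensor is only allocated when calibration actually runs:

```python
class MoELayerSketch:
    """Toy stand-in for the MoE wrapper; attribute names follow the review text."""

    def __init__(self, moe_calib_experts_ratio=None):
        self._moe_calib_experts_ratio = moe_calib_experts_ratio
        self._token_counting_initialized = False
        self.expert_token_count = None

    def _init_token_counting(self):
        # Allocate the per-expert counter only once (8 experts assumed here).
        self.expert_token_count = [0] * 8
        self._token_counting_initialized = True

    def forward(self, is_calib):
        if self._moe_calib_experts_ratio is None:
            return "normal"
        # Deferring init to the calibration branch avoids allocating the
        # counting state on non-calibration forward passes.
        if is_calib and not self._token_counting_initialized:
            self._init_token_counting()
        return "calib" if is_calib else "normal"
```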
Otherwise LGTM — the config removal, propagation cleanup, default change to None, and test updates are all well done.
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/torch/quantization/config.py (1)
1056-1066: ⚠️ Potential issue | 🟠 Major
Add Pydantic range constraints to `moe_calib_experts_ratio`.
The field currently accepts any `float | None` without validation at the config layer, despite other numeric fields in the same file using `gt`, `ge`, `lt`, or `le` constraints. The CLI example enforces (0.0, 1.0], and downstream code in `mode.py` asserts the same range, but per the coding guidelines for `modelopt/torch/**/config.py`, validation should be enforced at the typed-config layer using Pydantic. Programmatic callers of `QuantizeAlgorithmConfig` can bypass these downstream checks.

🛠️ Suggested change

```diff
 moe_calib_experts_ratio: float | None = ModeloptField(
     default=None,
+    gt=0.0,
+    le=1.0,
     title="% of experts to calibrate during forward pass.",
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/config.py` around lines 1056 - 1066, Add Pydantic range constraints to the moe_calib_experts_ratio field so config-level validation enforces (0.0, 1.0] instead of allowing any float; specifically, update the ModeloptField call for moe_calib_experts_ratio to include gt=0.0 and le=1.0 (while keeping the existing default=None and the float | None type) so QuantizeAlgorithmConfig (or whatever class contains moe_calib_experts_ratio) validates values at parse time rather than relying on downstream asserts.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/quantization/plugins/huggingface.py`:
- Around line 447-450: The code currently disables expert-token counting when
_moe_calib_experts_ratio == 1.0 which prevents expert_token_count from being
populated; change the gating logic so _count_expert_tokens is enabled whenever
_moe_calib_experts_ratio is configured > 0 (i.e., remove the special-case that
turns off counting for 1.0). Locate the assignment/condition that sets
_count_expert_tokens (and any related initialization of expert_token_count) and
make it depend only on _moe_calib_experts_ratio > 0 so counting remains active
for ratio == 1.0.
---
Outside diff comments:
In `@modelopt/torch/quantization/config.py`:
- Around line 1056-1066: Add Pydantic range constraints to the
moe_calib_experts_ratio field so config-level validation enforces (0.0, 1.0]
instead of allowing any float; specifically, update the ModeloptField call for
moe_calib_experts_ratio to include gt=0.0 and le=1.0 (while keeping the existing
default=None and the float | None type) so QuantizeAlgorithmConfig (or whatever
class contains moe_calib_experts_ratio) validates values at parse time rather
than relying on downstream asserts.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 338f43ff-e096-4077-8f3e-5154a129c349
📒 Files selected for processing (5)
- examples/llm_ptq/hf_ptq.py
- modelopt/torch/quantization/config.py
- modelopt/torch/quantization/mode.py
- modelopt/torch/quantization/plugins/huggingface.py
- tests/unit/torch/quantization/plugins/test_sparse_moe.py
💤 Files with no reviewable changes (1)
- modelopt/torch/quantization/mode.py
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pyproject.toml`:
- Line 252: The commented-out pytest addopts line in pyproject.toml (the addopts
entry under [tool.pytest.ini_options]) removes global pytest defaults (coverage,
verbose, instafail, durations, strict-markers) and will change behavior for all
test invocations (including those driven by tox.ini and CI workflows); either
revert the comment to restore the original addopts value, or if this change is
intentional, add a brief rationale in the PR and explicitly pass the required
pytest flags from tox.ini and CI job steps (or add equivalent entries in
pyproject.toml) so unit/GPU/example/partial-unit test jobs retain the expected
options.
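For context, a pytest `addopts` entry of the kind described might look like the following sketch; the flags are illustrative, since the project's original value is not shown in this review:

```toml
[tool.pytest.ini_options]
# Illustrative defaults only; the project's actual pre-change flags may differ.
addopts = "-v --cov --instafail --durations=20 --strict-markers"
```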
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: d9e524ef-709d-4d1a-838b-426e24ab345a
📒 Files selected for processing (3)
- CHANGELOG.rst
- modelopt/torch/quantization/plugins/huggingface.py
- pyproject.toml
✅ Files skipped from review due to trivial changes (1)
- CHANGELOG.rst
🚧 Files skipped from review as they are similar to previous changes (1)
- modelopt/torch/quantization/plugins/huggingface.py
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/torch/quantization/plugins/huggingface.py (1)
513-548: ⚠️ Potential issue | 🟠 Major
Make temporary calibration state restoration exception-safe.
If `super().forward(hidden_states)` fails in this block, `top_k` and `_count_expert_tokens` may remain modified, affecting later passes.

Suggested fix

```diff
 if is_calib:
-    self._count_expert_tokens = True
-    if TRANSFORMERS_VERSION_GE_5_0:
-        assert hasattr(self, "gate") and hasattr(self.gate, "top_k")
-        original_top_k = self.gate.top_k
-        self.gate.top_k = max(
-            original_top_k, round(self.gate.num_experts * self._moe_calib_experts_ratio)
-        )
-        super().forward(hidden_states)
-        self.gate.top_k = original_top_k
-    else:
-        # Path for transformers < 5.0
-        if hasattr(self, "gate") and hasattr(self.gate, "top_k"):
-            top_k_owner = self.gate
-        else:
-            top_k_owner = self
-        original_top_k = top_k_owner.top_k
-        if hasattr(self, "num_experts"):
-            top_k_owner.top_k = max(
-                original_top_k, round(self.num_experts * self._moe_calib_experts_ratio)
-            )
-        elif hasattr(self, "experts"):
-            num_experts = (
-                self.experts.num_experts
-                if hasattr(self.experts, "num_experts")
-                else len(self.experts)
-            )
-            top_k_owner.top_k = max(
-                original_top_k,
-                round(num_experts * self._moe_calib_experts_ratio),
-            )
-        else:
-            raise ValueError(f"Could not find num_experts in module {self}")
-        super().forward(hidden_states)
-        top_k_owner.top_k = original_top_k
-    self._count_expert_tokens = False
+    self._count_expert_tokens = True
+    try:
+        if TRANSFORMERS_VERSION_GE_5_0:
+            assert hasattr(self, "gate") and hasattr(self.gate, "top_k")
+            original_top_k = self.gate.top_k
+            self.gate.top_k = max(
+                original_top_k, round(self.gate.num_experts * self._moe_calib_experts_ratio)
+            )
+            super().forward(hidden_states)
+            self.gate.top_k = original_top_k
+        else:
+            # Path for transformers < 5.0
+            if hasattr(self, "gate") and hasattr(self.gate, "top_k"):
+                top_k_owner = self.gate
+            else:
+                top_k_owner = self
+            original_top_k = top_k_owner.top_k
+            if hasattr(self, "num_experts"):
+                top_k_owner.top_k = max(
+                    original_top_k, round(self.num_experts * self._moe_calib_experts_ratio)
+                )
+            elif hasattr(self, "experts"):
+                num_experts = (
+                    self.experts.num_experts
+                    if hasattr(self.experts, "num_experts")
+                    else len(self.experts)
+                )
+                top_k_owner.top_k = max(
+                    original_top_k,
+                    round(num_experts * self._moe_calib_experts_ratio),
+                )
+            else:
+                raise ValueError(f"Could not find num_experts in module {self}")
+            super().forward(hidden_states)
+            top_k_owner.top_k = original_top_k
+    finally:
+        self._count_expert_tokens = False
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 513 - 548, The calibration path mutates self._count_expert_tokens and top_k (either self.gate.top_k or top_k_owner.top_k) before calling super().forward(hidden_states) and doesn't restore them if an exception is raised; make this exception-safe by wrapping the call to super().forward(hidden_states) in a try/finally block that always restores top_k to original_top_k and sets self._count_expert_tokens back to False, applying the same try/finally pattern in both the TRANSFORMERS_VERSION_GE_5_0 branch (use original_top_k = self.gate.top_k) and the else branch (use original_top_k = top_k_owner.top_k and restore top_k_owner.top_k), keeping the logic that computes the temporary top_k based on self._moe_calib_experts_ratio and num_experts.
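The failure mode being flagged is easy to demonstrate in isolation; a minimal toy (not the real module) shows the restore-in-finally pattern the suggestion asks for:

```python
class ToyGate:
    def __init__(self):
        self.top_k = 2

class ToyMoE:
    """Toy model of the temporary top_k widening described above."""

    def __init__(self, fail=False):
        self.gate = ToyGate()
        self._count_expert_tokens = False
        self._fail = fail

    def _inner_forward(self):
        # Stands in for super().forward(hidden_states), which may raise.
        if self._fail:
            raise RuntimeError("calibration forward failed")

    def calib_forward(self, widened_top_k):
        self._count_expert_tokens = True
        original_top_k = self.gate.top_k
        self.gate.top_k = widened_top_k
        try:
            self._inner_forward()
        finally:
            # Restored even if the forward raises, so later non-calibration
            # passes see the original routing width and counting flag.
            self.gate.top_k = original_top_k
            self._count_expert_tokens = False

moe = ToyMoE(fail=True)
try:
    moe.calib_forward(widened_top_k=8)
except RuntimeError:
    pass
print(moe.gate.top_k, moe._count_expert_tokens)  # state restored: 2 False
```

Without the `finally`, the failing forward would leave `top_k == 8` and the counting flag set, which is exactly the leak both review passes call out.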
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@modelopt/torch/quantization/plugins/huggingface.py`:
- Around line 513-548: The calibration path mutates self._count_expert_tokens
and top_k (either self.gate.top_k or top_k_owner.top_k) before calling
super().forward(hidden_states) and doesn't restore them if an exception is
raised; make this exception-safe by wrapping the call to
super().forward(hidden_states) in a try/finally block that always restores top_k
to original_top_k and sets self._count_expert_tokens back to False, applying the
same try/finally pattern in both the TRANSFORMERS_VERSION_GE_5_0 branch (use
original_top_k = self.gate.top_k) and the else branch (use original_top_k =
top_k_owner.top_k and restore top_k_owner.top_k), keeping the logic that
computes the temporary top_k based on self._moe_calib_experts_ratio and
num_experts.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 59fdda96-ec93-4f7a-a987-ee8ff7e91498
📒 Files selected for processing (2)
modelopt/torch/quantization/config.pymodelopt/torch/quantization/plugins/huggingface.py
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/torch/quantization/plugins/huggingface.py (1)
510-550: ⚠️ Potential issue | 🟠 Major
Make calibration state restoration exception-safe.
If the calibration `super().forward(hidden_states)` raises, `top_k` and `_count_expert_tokens` may remain modified and affect later forwards. Wrap the temporary mutation scope in try/finally.

Proposed fix

```diff
 if is_calib:
     if not self._token_counting_initialized:
         self._init_token_counting()
-    self._count_expert_tokens = True
-    if TRANSFORMERS_VERSION_GE_5_0:
-        assert hasattr(self, "gate") and hasattr(self.gate, "top_k")
-        original_top_k = self.gate.top_k
-        self.gate.top_k = max(
-            original_top_k, round(self.gate.num_experts * self._moe_calib_experts_ratio)
-        )
-        super().forward(hidden_states)
-        self.gate.top_k = original_top_k
-    else:
-        # Path for transformers < 5.0
-        if hasattr(self, "gate") and hasattr(self.gate, "top_k"):
-            top_k_owner = self.gate
-        else:
-            top_k_owner = self
-        original_top_k = top_k_owner.top_k
-        if hasattr(self, "num_experts"):
-            top_k_owner.top_k = max(
-                original_top_k, round(self.num_experts * self._moe_calib_experts_ratio)
-            )
-        elif hasattr(self, "experts"):
-            num_experts = (
-                self.experts.num_experts
-                if hasattr(self.experts, "num_experts")
-                else len(self.experts)
-            )
-            top_k_owner.top_k = max(
-                original_top_k,
-                round(num_experts * self._moe_calib_experts_ratio),
-            )
-        else:
-            raise ValueError(f"Could not find num_experts in module {self}")
-        super().forward(hidden_states)
-        top_k_owner.top_k = original_top_k
-    self._count_expert_tokens = False
+    self._count_expert_tokens = True
+    try:
+        if TRANSFORMERS_VERSION_GE_5_0:
+            assert hasattr(self, "gate") and hasattr(self.gate, "top_k")
+            original_top_k = self.gate.top_k
+            self.gate.top_k = max(
+                original_top_k, round(self.gate.num_experts * self._moe_calib_experts_ratio)
+            )
+            try:
+                super().forward(hidden_states)
+            finally:
+                self.gate.top_k = original_top_k
+        else:
+            # Path for transformers < 5.0
+            if hasattr(self, "gate") and hasattr(self.gate, "top_k"):
+                top_k_owner = self.gate
+            else:
+                top_k_owner = self
+            original_top_k = top_k_owner.top_k
+            if hasattr(self, "num_experts"):
+                top_k_owner.top_k = max(
+                    original_top_k, round(self.num_experts * self._moe_calib_experts_ratio)
+                )
+            elif hasattr(self, "experts"):
+                num_experts = (
+                    self.experts.num_experts
+                    if hasattr(self.experts, "num_experts")
+                    else len(self.experts)
+                )
+                top_k_owner.top_k = max(
+                    original_top_k,
+                    round(num_experts * self._moe_calib_experts_ratio),
+                )
+            else:
+                raise ValueError(f"Could not find num_experts in module {self}")
+            try:
+                super().forward(hidden_states)
+            finally:
+                top_k_owner.top_k = original_top_k
+    finally:
+        self._count_expert_tokens = False
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 510 - 550, Wrap the calibration-only mutation of self._count_expert_tokens and any temporary change to top_k (whether via self.gate.top_k or top_k_owner.top_k) in a try/finally so that original_top_k is restored and self._count_expert_tokens is reset even if super().forward(hidden_states) raises; specifically, after setting self._count_expert_tokens = True and computing original_top_k and assigning the increased top_k (in both the TRANSFORMERS_VERSION_GE_5_0 branch and the legacy branch that sets top_k_owner.top_k), call super().forward(hidden_states) inside a try and restore top_k_owner.top_k/original_top_k and set self._count_expert_tokens = False in the finally block (also ensure the initial _init_token_counting call and reading of _moe_calib_experts_ratio remain unchanged).
🧹 Nitpick comments (1)
tests/unit/torch/quantization/plugins/test_sparse_moe.py (1)
272-276: Set `_if_calib` on an actual expert module for clearer intent.
Using `next(converted.experts.modules())` usually returns the container first. Prefer explicitly selecting a child expert to keep this robust and self-explanatory.

Proposed tweak

```diff
-next(converted.experts.modules())._if_calib = True
+first_expert = next(iter(converted.experts))
+first_expert._if_calib = True
```
Verify each finding against the current code and only fix it if needed. In `@tests/unit/torch/quantization/plugins/test_sparse_moe.py` around lines 272 - 276, The line setting _if_calib uses next(converted.experts.modules()) which may return the container; instead select an actual expert child so intent is clear and robust—use converted.experts.children() (e.g., next(converted.experts.children()) ._if_calib = True) or index into the expert container if it is a ModuleList (e.g., converted.experts[0]._if_calib = True) to ensure you set _if_calib on a real expert module.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@modelopt/torch/quantization/plugins/huggingface.py`:
- Around line 510-550: Wrap the calibration-only mutation of
self._count_expert_tokens and any temporary change to top_k (whether via
self.gate.top_k or top_k_owner.top_k) in a try/finally so that original_top_k is
restored and self._count_expert_tokens is reset even if
super().forward(hidden_states) raises; specifically, after setting
self._count_expert_tokens = True and computing original_top_k and assigning the
increased top_k (in both the TRANSFORMERS_VERSION_GE_5_0 branch and the legacy
branch that sets top_k_owner.top_k), call super().forward(hidden_states) inside
a try and restore top_k_owner.top_k/original_top_k and set
self._count_expert_tokens = False in the finally block (also ensure the initial
_init_token_counting call and reading of _moe_calib_experts_ratio remain
unchanged).
---
Nitpick comments:
In `@tests/unit/torch/quantization/plugins/test_sparse_moe.py`:
- Around line 272-276: The line setting _if_calib uses
next(converted.experts.modules()) which may return the container; instead select
an actual expert child so intent is clear and robust—use
converted.experts.children() (e.g., next(converted.experts.children())
._if_calib = True) or index into the expert container if it is a ModuleList
(e.g., converted.experts[0]._if_calib = True) to ensure you set _if_calib on a
real expert module.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8ac10ed3-831b-4160-b73d-e9a0408d0826
📒 Files selected for processing (2)
- modelopt/torch/quantization/plugins/huggingface.py
- tests/unit/torch/quantization/plugins/test_sparse_moe.py
Edwardf0t1
left a comment
One small thing: _count_expert_tokens is now set to True unconditionally inside the if is_calib block, even when _moe_calib_experts_ratio == 1.0. Previously the code had self._count_expert_tokens = self._moe_calib_experts_ratio < 1.0 to skip counting when all experts are calibrated (since the count is less useful in that case). Is this intentional?
Codecov Report
✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@           Coverage Diff            @@
##             main    #1062    +/-   ##
========================================
+ Coverage   70.07%   70.27%   +0.20%
========================================
  Files         221      227       +6
  Lines       25499    25854     +355
========================================
+ Hits        17869    18170     +301
- Misses       7630     7684      +54
```

☔ View full report in Codecov by Sentry.
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/unit/torch/quantization/plugins/test_sparse_moe.py (1)
260-275: Test logic is correct; minor comment clarification.
The test correctly uses ratio 0.5 to enable token counting and properly simulates calibration mode. However, the comment at lines 273-274 is slightly misleading:

```python
# Set _if_calib on an expert sub-module (not set by default since only the MoE
# block was converted, not the full model).
next(converted.experts.modules())._if_calib = True
```

`modules()` includes the container itself first, so this actually sets `_if_calib` on `converted.experts` directly, not a "sub-module." The test still works because the `is_calib` check in `forward()` iterates over all modules including the container.

📝 Optional: More precise comment or explicit target

```diff
 # Simulate calibration mode so lazy-init triggers during forward
-# Set _if_calib on an expert sub-module (not set by default since only the MoE
-# block was converted, not the full model).
-next(converted.experts.modules())._if_calib = True
+# Set _if_calib on the experts container (not set by default since only the MoE
+# block was converted, not the full model). The is_calib check iterates over all
+# modules in experts, including the container itself.
+converted.experts._if_calib = True
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/torch/quantization/plugins/test_sparse_moe.py` around lines 260 - 275, The comment is misleading: next(converted.experts.modules()) sets _if_calib on the experts container itself, not a nested sub-module; change the code to set _if_calib on an actual expert submodule and update the comment accordingly — replace next(converted.experts.modules())._if_calib = True with next(converted.experts.children())._if_calib = True (or explicitly index an expert like converted.experts[0]._if_calib = True) and update the nearby comment to state you’re setting _if_calib on an expert sub-module.
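The `modules()`-vs-`children()` behavior these comments rely on is easy to check directly with a small PyTorch snippet (assuming `torch` is installed):

```python
import torch.nn as nn

experts = nn.ModuleList([nn.Linear(4, 4) for _ in range(3)])

# modules() does a pre-order walk that yields the container itself first...
first_from_modules = next(experts.modules())
# ...while children() yields only the immediate submodules.
first_from_children = next(experts.children())

print(type(first_from_modules).__name__)   # ModuleList
print(type(first_from_children).__name__)  # Linear
```

This is why `next(converted.experts.modules())._if_calib = True` tags the container rather than an expert, and why `children()` (or indexing) targets an actual expert.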
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/quantization/plugins/huggingface.py`:
- Around line 447-452: Update the docstring to accurately reflect when token
counting per expert is enabled: the config key _moe_calib_experts_ratio does not
enable per-expert token counting for any value > 0 — counting is only activated
when _moe_calib_experts_ratio < 1.0 (the code path that checks ratio < 1.0
enables counting), so change the sentence to state that token counting is
enabled only for ratios strictly less than 1.0 and clarify that ratio == 1.0
results in a direct pass-through without counting.
---
Nitpick comments:
In `@tests/unit/torch/quantization/plugins/test_sparse_moe.py`:
- Around line 260-275: The comment is misleading:
next(converted.experts.modules()) sets _if_calib on the experts container
itself, not a nested sub-module; change the code to set _if_calib on an actual
expert submodule and update the comment accordingly — replace
next(converted.experts.modules())._if_calib = True with
next(converted.experts.children())._if_calib = True (or explicitly index an
expert like converted.experts[0]._if_calib = True) and update the nearby comment
to state you’re setting _if_calib on an expert sub-module.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: ab8ab7af-34ba-4e41-9a5c-1f85fce20206
📒 Files selected for processing (2)
- modelopt/torch/quantization/plugins/huggingface.py
- tests/unit/torch/quantization/plugins/test_sparse_moe.py
```diff
 Optionally supports config-driven features (disabled by default):
 - ``_moe_calib_experts_ratio``: force-forward tokens to more experts during calibration.
-- ``_moe_count_expert_calib_tokens``: count tokens routed to each expert during calibration.
+  When set to a value > 0, also enables token counting per expert.

-When both are disabled, forward is a direct pass-through with zero overhead.
+When disabled, forward is a direct pass-through with zero overhead.
 """
```
Docstring inaccuracy: token counting is not enabled when ratio == 1.0
Line 449 states "When set to a value > 0, also enables token counting per expert," but line 512 only enables counting when ratio < 1.0. The docstring should clarify this edge case.
📝 Suggested docstring fix

```diff
 Optionally supports config-driven features (disabled by default):
 - ``_moe_calib_experts_ratio``: force-forward tokens to more experts during calibration.
-  When set to a value > 0, also enables token counting per expert.
+  When set to a value in (0, 1), also enables token counting per expert.
+  At ratio == 1.0, all experts are calibrated so counting is skipped.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 447 - 452,
Update the docstring to accurately reflect when token counting per expert is
enabled: the config key _moe_calib_experts_ratio does not enable per-expert
token counting for any value > 0 — counting is only activated when
_moe_calib_experts_ratio < 1.0 (the code path that checks ratio < 1.0 enables
counting), so change the sentence to state that token counting is enabled only
for ratios strictly less than 1.0 and clarify that ratio == 1.0 results in a
direct pass-through without counting.
Summary
- Remove the `moe_count_expert_calib_tokens` config field and the `_moe_count_expert_calib_tokens` internal flag. Token counting is now implicitly enabled when `moe_calib_experts_ratio` is set, removing a redundant knob.
- Change the `--moe_calib_experts_ratio` default to `None` in `hf_ptq.py` (was `1.0`). Previously all experts were force-calibrated by default; now the feature is opt-in and non-MoE models are unaffected without any flag.
- Skip `layer_sync_moe_local_experts_amax` when `moe_calib_experts_ratio` is set, since each expert is calibrated independently with sufficient token coverage in that mode.
- `_QuantSparseMoe.forward`: remove redundant truthy checks on `_moe_calib_experts_ratio` inside the branch that already assumes it is set.

Changed files
- `modelopt/torch/quantization/config.py`: remove the `moe_count_expert_calib_tokens` field; update the `moe_calib_experts_ratio` description to document amax sync behavior
- `modelopt/torch/quantization/mode.py`: remove `moe_count_expert_calib_tokens` propagation in `wrapped_calib_func`
- `modelopt/torch/quantization/plugins/huggingface.py`: remove `_moe_count_expert_calib_tokens` from `_QuantSparseMoe`; simplify `forward`; skip `layer_sync_moe_local_experts_amax` when ratio is set
- `examples/llm_ptq/hf_ptq.py`: default `--moe_calib_experts_ratio` to `None`; guard validation
- `tests/unit/.../test_sparse_moe.py`: use `_moe_calib_experts_ratio` instead of removed flag

Test plan
- `hf_ptq.py` works without `--moe_calib_experts_ratio` (non-MoE model, default `None`)

🤖 Generated with Claude Code
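The opt-in default plus guarded validation described for `hf_ptq.py` can be sketched with argparse (illustrative wiring only; the real script's argument setup is not shown in this PR excerpt):

```python
import argparse

parser = argparse.ArgumentParser()
# Default None keeps the feature opt-in, so non-MoE models are unaffected
# when the flag is omitted.
parser.add_argument("--moe_calib_experts_ratio", type=float, default=None)
args = parser.parse_args([])  # no flag passed

# Validate only when the user opted in, enforcing (0.0, 1.0].
if args.moe_calib_experts_ratio is not None:
    assert 0.0 < args.moe_calib_experts_ratio <= 1.0

print(args.moe_calib_experts_ratio)  # None
```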
Summary by CodeRabbit
Configuration Changes
Refactor
Documentation