
Remove _moe_count_expert_calib_tokens flag; tie token counting to moe_calib_experts_ratio#1062

Open
cjluo-nv wants to merge 5 commits into main from chenjiel/fix_sparse_moe

Conversation


@cjluo-nv cjluo-nv commented Mar 17, 2026

Summary

  • Remove moe_count_expert_calib_tokens config field and the _moe_count_expert_calib_tokens internal flag. Token counting is now implicitly enabled when moe_calib_experts_ratio is set, removing a redundant knob.
  • Change --moe_calib_experts_ratio default to None in hf_ptq.py (was 1.0). Previously all experts were force-calibrated by default; now the feature is opt-in and non-MoE models are unaffected without any flag.
  • Disable layer_sync_moe_local_experts_amax when moe_calib_experts_ratio is set, since each expert is calibrated independently with sufficient token coverage in that mode.
  • Simplify _QuantSparseMoe.forward: remove redundant truthy checks on _moe_calib_experts_ratio inside the branch that already assumes it is set.
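The consolidated gating described above can be sketched roughly as follows. This is a hypothetical simplification for illustration only; the real `_QuantSparseMoe` logic lives in `modelopt/torch/quantization/plugins/huggingface.py` and the class and method names below are stand-ins:

```python
# Sketch of the consolidated gating: token counting and amax-sync behavior
# are now implied by moe_calib_experts_ratio being set, instead of a
# separate _moe_count_expert_calib_tokens boolean flag.
class SparseMoeCalibSketch:
    def __init__(self, moe_calib_experts_ratio=None):
        # None means the feature is disabled (the new CLI default),
        # so non-MoE models are unaffected without any flag.
        self._moe_calib_experts_ratio = moe_calib_experts_ratio

    def should_count_tokens(self):
        # Old behavior: governed by a separate boolean flag.
        # New behavior: counting is tied to the ratio being set.
        return self._moe_calib_experts_ratio is not None

    def should_sync_local_experts_amax(self):
        # layer_sync_moe_local_experts_amax is skipped when the ratio is
        # set, since each expert is then calibrated independently.
        return self._moe_calib_experts_ratio is None
```

With the default of `None`, both behaviors stay off; setting any ratio turns on token counting and turns off the per-layer amax sync.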

Changed files

File Change
modelopt/torch/quantization/config.py Remove moe_count_expert_calib_tokens field; update moe_calib_experts_ratio description to document amax sync behavior
modelopt/torch/quantization/mode.py Remove moe_count_expert_calib_tokens propagation in wrapped_calib_func
modelopt/torch/quantization/plugins/huggingface.py Remove _moe_count_expert_calib_tokens from _QuantSparseMoe; simplify forward; skip layer_sync_moe_local_experts_amax when ratio is set
examples/llm_ptq/hf_ptq.py Default --moe_calib_experts_ratio to None; guard validation
tests/unit/.../test_sparse_moe.py Update tests to use _moe_calib_experts_ratio instead of removed flag

Test plan

  • Verify hf_ptq.py works without --moe_calib_experts_ratio (non-MoE model, default None)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Configuration Changes

    • moe_calib_experts_ratio now defaults to None (disabled) instead of 1.0; validation only occurs when a value is provided.
  • Refactor

    • Simplified MoE calibration flow and token-counting behavior; removed a deprecated expert-calibration configuration field.
  • Documentation

    • Changelog and docstrings updated to reflect the new default and calibration behavior.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

copy-pr-bot bot commented Mar 17, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.


coderabbitai bot commented Mar 17, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

MoE calibration control is consolidated around moe_calib_experts_ratio: the boolean moe_count_expert_calib_tokens was removed, the CLI default for the ratio changed from 1.0 to None, validators were added, and runtime token-counting and sync logic now activate only when the ratio is set.

Changes

Cohort / File(s) Summary
Configuration & CLI
modelopt/torch/quantization/config.py, examples/llm_ptq/hf_ptq.py
Removed moe_count_expert_calib_tokens from QuantizeAlgorithmConfig. Added gt=0.0, le=1.0 validators to moe_calib_experts_ratio. CLI default changed from 1.0 to None; argument validation now runs only when the value is not None.
Runtime / Mode
modelopt/torch/quantization/mode.py
Removed propagation of moe_count_expert_calib_tokens into module kwargs and related pop logic.
Plugin (HuggingFace) Implementation
modelopt/torch/quantization/plugins/huggingface.py
Removed initialization/exposure of _moe_count_expert_calib_tokens. Forward/calibration gating now depends on _moe_calib_experts_ratio; token-counting is lazily initialized only when ratio < 1.0 during calibration. Adjusts top_k routing during calibration and skips layer sync when ratio is set. Docstrings updated.
Tests
tests/unit/torch/quantization/plugins/test_sparse_moe.py
Replaced boolean-based calibration flags with ratio-based activation in tests; assertions and setups updated to use moe_calib_experts_ratio values.
Changelog
CHANGELOG.rst
Updated entry noting CLI default for --moe_calib_experts_ratio changed to None (disabled) instead of defaulting to all experts.

Sequence Diagram(s)

sequenceDiagram
    participant CLI as CLI/Args
    participant Config as QuantizeConfig
    participant Plugin as HuggingFacePlugin
    participant MoE as MoE Layer
    participant Experts as Expert Modules

    CLI->>Config: set moe_calib_experts_ratio (None or ratio)
    Config->>Plugin: provide config (ratio present or None)
    Plugin->>MoE: forward(input)
    alt ratio is set (calibration)
        Plugin->>Plugin: init token counting if ratio < 1.0
        Plugin->>MoE: adjust routing (top_k scaled by ratio)
        MoE->>Experts: route tokens (wider routing)
        Experts-->>MoE: return activations (collect per-expert stats)
        Plugin->>Plugin: reset token counting flag
        Plugin->>Plugin: skip layer_sync_moe_local_experts_amax
    else ratio is None (normal)
        MoE->>Experts: route tokens (normal routing)
        Plugin->>Plugin: perform layer_sync_moe_local_experts_amax as usual
    end
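The "top_k scaled by ratio" step in the diagram amounts to temporarily widening the router during calibration, matching the `max(original_top_k, round(num_experts * ratio))` expression in the diff. A standalone sketch (the helper name is illustrative):

```python
# Widen routing during calibration so roughly `ratio` of the experts
# receive tokens, without ever narrowing below the model's own top_k.
def calib_top_k(original_top_k: int, num_experts: int, ratio: float) -> int:
    return max(original_top_k, round(num_experts * ratio))

# 64 experts, top_k=8, ratio=0.5 -> route to 32 experts during calibration
assert calib_top_k(8, 64, 0.5) == 32
# A ratio below the native top_k fraction leaves routing unchanged
assert calib_top_k(8, 64, 0.1) == 8
```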

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Docstring Coverage — ⚠️ Warning. Docstring coverage is 58.33%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Description Check — ✅ Passed (check skipped; CodeRabbit's high-level summary is enabled).
Title Check — ✅ Passed. The title clearly and concisely summarizes the main change: removing the _moe_count_expert_calib_tokens flag and tying token counting to moe_calib_experts_ratio, which is the primary objective across all modified files.
Security Anti-Patterns — ✅ Passed. The pull request contains only configuration and calibration logic changes, with no security anti-patterns detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@cjluo-nv changed the title from "Do not forward tokens to all experts by default" to "Remove _moe_count_expert_calib_tokens flag; tie token counting to moe_calib_experts_ratio" on Mar 17, 2026
@cjluo-nv cjluo-nv marked this pull request as ready for review March 17, 2026 19:55
@cjluo-nv cjluo-nv requested review from a team as code owners March 17, 2026 19:55

@Edwardf0t1 Edwardf0t1 left a comment


Review

Clean simplification — removing moe_count_expert_calib_tokens and tying token counting to moe_calib_experts_ratio is a nice net reduction (+25/-42). A few suggestions:

Use is not None instead of truthiness for _moe_calib_experts_ratio

In huggingface.py, the guard if not self._moe_calib_experts_ratio and the check in layer_sync_moe_local_experts_amax both rely on truthiness to distinguish None from a valid float. This works today but is fragile — e.g. 0.0 is falsy and would be silently treated as None. Suggest explicit checks:

# forward guard
if self._moe_calib_experts_ratio is None:
    return super().forward(hidden_states)

# layer_sync_moe_local_experts_amax
if self._moe_calib_experts_ratio is not None:
    return

Restore the range assertion (or add a Pydantic validator)

The old code had assert 0 < self._moe_calib_experts_ratio <= 1. This was removed, and only the CLI (hf_ptq.py) validates the range now. If someone uses the Python API directly via QuantizeAlgorithmConfig, there's no validation. Consider keeping the assertion or adding a field validator on the config.
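A config-layer range check along these lines could look as follows. This is a plain-Python stand-in for the Pydantic `gt=0.0, le=1.0` field constraints; the project's `ModeloptField` wrapper is not reproduced here:

```python
# Stand-in for a Pydantic field constraint enforcing ratio in (0.0, 1.0].
def validate_moe_calib_experts_ratio(value):
    if value is None:
        return value  # feature disabled; nothing to validate
    if not (0.0 < value <= 1.0):
        raise ValueError(
            f"moe_calib_experts_ratio must be in (0.0, 1.0], got {value!r}"
        )
    return value
```

Validating at the typed-config layer means programmatic callers get the same guarantee as CLI users, instead of relying on downstream asserts.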

Minor: token counting init runs outside calibration

_init_token_counting() is now called whenever _moe_calib_experts_ratio is set, even on non-calibration forward passes. It only runs once (guarded by _token_counting_initialized), but it will allocate the counting tensor even if calibration mode is never entered. Not a bug, just a minor overhead change from the previous behavior.
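The one-shot lazy-init pattern in question is roughly the following (hypothetical names based on this review comment; a plain list stands in for the counting tensor):

```python
class TokenCountingSketch:
    """Illustrates one-shot lazy initialization guarded by a flag."""

    def __init__(self, num_experts: int):
        self.num_experts = num_experts
        self._token_counting_initialized = False
        self.expert_token_count = None

    def _init_token_counting(self):
        # Runs at most once; subsequent calls are no-ops, so existing
        # counts are never clobbered.
        if self._token_counting_initialized:
            return
        self.expert_token_count = [0] * self.num_experts
        self._token_counting_initialized = True
```

The guard makes repeated calls cheap, but as noted, the first call still allocates the counter even if calibration mode is never entered.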


Otherwise LGTM — the config removal, propagation cleanup, default change to None, and test updates are all well done.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
modelopt/torch/quantization/config.py (1)

1056-1066: ⚠️ Potential issue | 🟠 Major

Add Pydantic range constraints to moe_calib_experts_ratio.

The field currently accepts any float | None without validation at the config layer, despite other numeric fields in the same file using gt, ge, lt, or le constraints. The CLI example enforces (0.0, 1.0], and downstream code in mode.py asserts the same range, but per the coding guidelines for modelopt/torch/**/config.py, validation should be enforced at the typed-config layer using Pydantic. Programmatic callers of QuantizeAlgorithmConfig can bypass these downstream checks.

🛠️ Suggested change
     moe_calib_experts_ratio: float | None = ModeloptField(
         default=None,
+        gt=0.0,
+        le=1.0,
         title="% of experts to calibrate during forward pass.",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/config.py` around lines 1056 - 1066, Add Pydantic
range constraints to the moe_calib_experts_ratio field so config-level
validation enforces (0.0, 1.0] instead of allowing any float; specifically,
update the ModeloptField call for moe_calib_experts_ratio to include gt=0.0 and
le=1.0 (while keeping the existing default=None and the float | None type) so
QuantizeAlgorithmConfig (or whatever class contains moe_calib_experts_ratio)
validates values at parse time rather than relying on downstream asserts.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/torch/quantization/plugins/huggingface.py`:
- Around line 447-450: The code currently disables expert-token counting when
_moe_calib_experts_ratio == 1.0 which prevents expert_token_count from being
populated; change the gating logic so _count_expert_tokens is enabled whenever
_moe_calib_experts_ratio is configured > 0 (i.e., remove the special-case that
turns off counting for 1.0). Locate the assignment/condition that sets
_count_expert_tokens (and any related initialization of expert_token_count) and
make it depend only on _moe_calib_experts_ratio > 0 so counting remains active
for ratio == 1.0.

---

Outside diff comments:
In `@modelopt/torch/quantization/config.py`:
- Around line 1056-1066: Add Pydantic range constraints to the
moe_calib_experts_ratio field so config-level validation enforces (0.0, 1.0]
instead of allowing any float; specifically, update the ModeloptField call for
moe_calib_experts_ratio to include gt=0.0 and le=1.0 (while keeping the existing
default=None and the float | None type) so QuantizeAlgorithmConfig (or whatever
class contains moe_calib_experts_ratio) validates values at parse time rather
than relying on downstream asserts.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 338f43ff-e096-4077-8f3e-5154a129c349

📥 Commits

Reviewing files that changed from the base of the PR and between 00fa5bd and 0e6f3d1.

📒 Files selected for processing (5)
  • examples/llm_ptq/hf_ptq.py
  • modelopt/torch/quantization/config.py
  • modelopt/torch/quantization/mode.py
  • modelopt/torch/quantization/plugins/huggingface.py
  • tests/unit/torch/quantization/plugins/test_sparse_moe.py
💤 Files with no reviewable changes (1)
  • modelopt/torch/quantization/mode.py

@cjluo-nv cjluo-nv requested a review from a team as a code owner March 17, 2026 20:06
@cjluo-nv cjluo-nv requested a review from kevalmorabia97 March 17, 2026 20:06

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pyproject.toml`:
- Line 252: The commented-out pytest addopts line in pyproject.toml (the addopts
entry under [tool.pytest.ini_options]) removes global pytest defaults (coverage,
verbose, instafail, durations, strict-markers) and will change behavior for all
test invocations (including those driven by tox.ini and CI workflows); either
revert the comment to restore the original addopts value, or if this change is
intentional, add a brief rationale in the PR and explicitly pass the required
pytest flags from tox.ini and CI job steps (or add equivalent entries in
pyproject.toml) so unit/GPU/example/partial-unit test jobs retain the expected
options.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d9e524ef-709d-4d1a-838b-426e24ab345a

📥 Commits

Reviewing files that changed from the base of the PR and between 0e6f3d1 and d7b6f2a.

📒 Files selected for processing (3)
  • CHANGELOG.rst
  • modelopt/torch/quantization/plugins/huggingface.py
  • pyproject.toml
✅ Files skipped from review due to trivial changes (1)
  • CHANGELOG.rst
🚧 Files skipped from review as they are similar to previous changes (1)
  • modelopt/torch/quantization/plugins/huggingface.py

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
modelopt/torch/quantization/plugins/huggingface.py (1)

513-548: ⚠️ Potential issue | 🟠 Major

Make temporary calibration state restoration exception-safe.

If super().forward(hidden_states) fails in this block, top_k and _count_expert_tokens may remain modified, affecting later passes.

Suggested fix
         if is_calib:
-            self._count_expert_tokens = True
-            if TRANSFORMERS_VERSION_GE_5_0:
-                assert hasattr(self, "gate") and hasattr(self.gate, "top_k")
-                original_top_k = self.gate.top_k
-                self.gate.top_k = max(
-                    original_top_k, round(self.gate.num_experts * self._moe_calib_experts_ratio)
-                )
-                super().forward(hidden_states)
-                self.gate.top_k = original_top_k
-            else:
-                # Path for transformers < 5.0
-                if hasattr(self, "gate") and hasattr(self.gate, "top_k"):
-                    top_k_owner = self.gate
-                else:
-                    top_k_owner = self
-                original_top_k = top_k_owner.top_k
-                if hasattr(self, "num_experts"):
-                    top_k_owner.top_k = max(
-                        original_top_k, round(self.num_experts * self._moe_calib_experts_ratio)
-                    )
-                elif hasattr(self, "experts"):
-                    num_experts = (
-                        self.experts.num_experts
-                        if hasattr(self.experts, "num_experts")
-                        else len(self.experts)
-                    )
-                    top_k_owner.top_k = max(
-                        original_top_k,
-                        round(num_experts * self._moe_calib_experts_ratio),
-                    )
-                else:
-                    raise ValueError(f"Could not find num_experts in module {self}")
-                super().forward(hidden_states)
-                top_k_owner.top_k = original_top_k
-            self._count_expert_tokens = False
+            self._count_expert_tokens = True
+            try:
+                if TRANSFORMERS_VERSION_GE_5_0:
+                    assert hasattr(self, "gate") and hasattr(self.gate, "top_k")
+                    original_top_k = self.gate.top_k
+                    self.gate.top_k = max(
+                        original_top_k, round(self.gate.num_experts * self._moe_calib_experts_ratio)
+                    )
+                    super().forward(hidden_states)
+                    self.gate.top_k = original_top_k
+                else:
+                    # Path for transformers < 5.0
+                    if hasattr(self, "gate") and hasattr(self.gate, "top_k"):
+                        top_k_owner = self.gate
+                    else:
+                        top_k_owner = self
+                    original_top_k = top_k_owner.top_k
+                    if hasattr(self, "num_experts"):
+                        top_k_owner.top_k = max(
+                            original_top_k, round(self.num_experts * self._moe_calib_experts_ratio)
+                        )
+                    elif hasattr(self, "experts"):
+                        num_experts = (
+                            self.experts.num_experts
+                            if hasattr(self.experts, "num_experts")
+                            else len(self.experts)
+                        )
+                        top_k_owner.top_k = max(
+                            original_top_k,
+                            round(num_experts * self._moe_calib_experts_ratio),
+                        )
+                    else:
+                        raise ValueError(f"Could not find num_experts in module {self}")
+                    super().forward(hidden_states)
+                    top_k_owner.top_k = original_top_k
+            finally:
+                self._count_expert_tokens = False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 513 - 548,
The calibration path mutates self._count_expert_tokens and top_k (either
self.gate.top_k or top_k_owner.top_k) before calling
super().forward(hidden_states) and doesn't restore them if an exception is
raised; make this exception-safe by wrapping the call to
super().forward(hidden_states) in a try/finally block that always restores top_k
to original_top_k and sets self._count_expert_tokens back to False, applying the
same try/finally pattern in both the TRANSFORMERS_VERSION_GE_5_0 branch (use
original_top_k = self.gate.top_k) and the else branch (use original_top_k =
top_k_owner.top_k and restore top_k_owner.top_k), keeping the logic that
computes the temporary top_k based on self._moe_calib_experts_ratio and
num_experts.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@modelopt/torch/quantization/plugins/huggingface.py`:
- Around line 513-548: The calibration path mutates self._count_expert_tokens
and top_k (either self.gate.top_k or top_k_owner.top_k) before calling
super().forward(hidden_states) and doesn't restore them if an exception is
raised; make this exception-safe by wrapping the call to
super().forward(hidden_states) in a try/finally block that always restores top_k
to original_top_k and sets self._count_expert_tokens back to False, applying the
same try/finally pattern in both the TRANSFORMERS_VERSION_GE_5_0 branch (use
original_top_k = self.gate.top_k) and the else branch (use original_top_k =
top_k_owner.top_k and restore top_k_owner.top_k), keeping the logic that
computes the temporary top_k based on self._moe_calib_experts_ratio and
num_experts.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 59fdda96-ec93-4f7a-a987-ee8ff7e91498

📥 Commits

Reviewing files that changed from the base of the PR and between d7b6f2a and 096bb73.

📒 Files selected for processing (2)
  • modelopt/torch/quantization/config.py
  • modelopt/torch/quantization/plugins/huggingface.py

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
modelopt/torch/quantization/plugins/huggingface.py (1)

510-550: ⚠️ Potential issue | 🟠 Major

Make calibration state restoration exception-safe.

If the calibration super().forward(hidden_states) raises, top_k and _count_expert_tokens may remain modified and affect later forwards. Wrap the temporary mutation scope in try/finally.

Proposed fix
         if is_calib:
             if not self._token_counting_initialized:
                 self._init_token_counting()
-            self._count_expert_tokens = True
-            if TRANSFORMERS_VERSION_GE_5_0:
-                assert hasattr(self, "gate") and hasattr(self.gate, "top_k")
-                original_top_k = self.gate.top_k
-                self.gate.top_k = max(
-                    original_top_k, round(self.gate.num_experts * self._moe_calib_experts_ratio)
-                )
-                super().forward(hidden_states)
-                self.gate.top_k = original_top_k
-            else:
-                # Path for transformers < 5.0
-                if hasattr(self, "gate") and hasattr(self.gate, "top_k"):
-                    top_k_owner = self.gate
-                else:
-                    top_k_owner = self
-                original_top_k = top_k_owner.top_k
-                if hasattr(self, "num_experts"):
-                    top_k_owner.top_k = max(
-                        original_top_k, round(self.num_experts * self._moe_calib_experts_ratio)
-                    )
-                elif hasattr(self, "experts"):
-                    num_experts = (
-                        self.experts.num_experts
-                        if hasattr(self.experts, "num_experts")
-                        else len(self.experts)
-                    )
-                    top_k_owner.top_k = max(
-                        original_top_k,
-                        round(num_experts * self._moe_calib_experts_ratio),
-                    )
-                else:
-                    raise ValueError(f"Could not find num_experts in module {self}")
-                super().forward(hidden_states)
-                top_k_owner.top_k = original_top_k
-            self._count_expert_tokens = False
+            self._count_expert_tokens = True
+            try:
+                if TRANSFORMERS_VERSION_GE_5_0:
+                    assert hasattr(self, "gate") and hasattr(self.gate, "top_k")
+                    original_top_k = self.gate.top_k
+                    self.gate.top_k = max(
+                        original_top_k, round(self.gate.num_experts * self._moe_calib_experts_ratio)
+                    )
+                    try:
+                        super().forward(hidden_states)
+                    finally:
+                        self.gate.top_k = original_top_k
+                else:
+                    # Path for transformers < 5.0
+                    if hasattr(self, "gate") and hasattr(self.gate, "top_k"):
+                        top_k_owner = self.gate
+                    else:
+                        top_k_owner = self
+                    original_top_k = top_k_owner.top_k
+                    if hasattr(self, "num_experts"):
+                        top_k_owner.top_k = max(
+                            original_top_k, round(self.num_experts * self._moe_calib_experts_ratio)
+                        )
+                    elif hasattr(self, "experts"):
+                        num_experts = (
+                            self.experts.num_experts
+                            if hasattr(self.experts, "num_experts")
+                            else len(self.experts)
+                        )
+                        top_k_owner.top_k = max(
+                            original_top_k,
+                            round(num_experts * self._moe_calib_experts_ratio),
+                        )
+                    else:
+                        raise ValueError(f"Could not find num_experts in module {self}")
+                    try:
+                        super().forward(hidden_states)
+                    finally:
+                        top_k_owner.top_k = original_top_k
+            finally:
+                self._count_expert_tokens = False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 510 - 550,
Wrap the calibration-only mutation of self._count_expert_tokens and any
temporary change to top_k (whether via self.gate.top_k or top_k_owner.top_k) in
a try/finally so that original_top_k is restored and self._count_expert_tokens
is reset even if super().forward(hidden_states) raises; specifically, after
setting self._count_expert_tokens = True and computing original_top_k and
assigning the increased top_k (in both the TRANSFORMERS_VERSION_GE_5_0 branch
and the legacy branch that sets top_k_owner.top_k), call
super().forward(hidden_states) inside a try and restore
top_k_owner.top_k/original_top_k and set self._count_expert_tokens = False in
the finally block (also ensure the initial _init_token_counting call and reading
of _moe_calib_experts_ratio remain unchanged).
🧹 Nitpick comments (1)
tests/unit/torch/quantization/plugins/test_sparse_moe.py (1)

272-276: Set _if_calib on an actual expert module for clearer intent.

Using next(converted.experts.modules()) usually returns the container first. Prefer explicitly selecting a child expert to keep this robust and self-explanatory.

Proposed tweak
-        next(converted.experts.modules())._if_calib = True
+        first_expert = next(iter(converted.experts))
+        first_expert._if_calib = True
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/torch/quantization/plugins/test_sparse_moe.py` around lines 272 -
276, The line setting _if_calib uses next(converted.experts.modules()) which may
return the container; instead select an actual expert child so intent is clear
and robust—use converted.experts.children() (e.g.,
next(converted.experts.children()) ._if_calib = True) or index into the expert
container if it is a ModuleList (e.g., converted.experts[0]._if_calib = True) to
ensure you set _if_calib on a real expert module.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@modelopt/torch/quantization/plugins/huggingface.py`:
- Around line 510-550: Wrap the calibration-only mutation of
self._count_expert_tokens and any temporary change to top_k (whether via
self.gate.top_k or top_k_owner.top_k) in a try/finally so that original_top_k is
restored and self._count_expert_tokens is reset even if
super().forward(hidden_states) raises; specifically, after setting
self._count_expert_tokens = True and computing original_top_k and assigning the
increased top_k (in both the TRANSFORMERS_VERSION_GE_5_0 branch and the legacy
branch that sets top_k_owner.top_k), call super().forward(hidden_states) inside
a try and restore top_k_owner.top_k/original_top_k and set
self._count_expert_tokens = False in the finally block (also ensure the initial
_init_token_counting call and reading of _moe_calib_experts_ratio remain
unchanged).

---

Nitpick comments:
In `@tests/unit/torch/quantization/plugins/test_sparse_moe.py`:
- Around line 272-276: The line setting _if_calib uses
next(converted.experts.modules()) which may return the container; instead select
an actual expert child so intent is clear and robust—use
converted.experts.children() (e.g., next(converted.experts.children())
._if_calib = True) or index into the expert container if it is a ModuleList
(e.g., converted.experts[0]._if_calib = True) to ensure you set _if_calib on a
real expert module.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8ac10ed3-831b-4160-b73d-e9a0408d0826

📥 Commits

Reviewing files that changed from the base of the PR and between 096bb73 and 13cd1a7.

📒 Files selected for processing (2)
  • modelopt/torch/quantization/plugins/huggingface.py
  • tests/unit/torch/quantization/plugins/test_sparse_moe.py


@Edwardf0t1 Edwardf0t1 left a comment


One small thing: _count_expert_tokens is now set to True unconditionally inside the if is_calib block, even when _moe_calib_experts_ratio == 1.0. Previously the code had self._count_expert_tokens = self._moe_calib_experts_ratio < 1.0 to skip counting when all experts are calibrated (since the count is less useful in that case). Is this intentional?


codecov bot commented Mar 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.27%. Comparing base (58417e5) to head (a4fd506).
⚠️ Report is 13 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1062      +/-   ##
==========================================
+ Coverage   70.07%   70.27%   +0.20%     
==========================================
  Files         221      227       +6     
  Lines       25499    25854     +355     
==========================================
+ Hits        17869    18170     +301     
- Misses       7630     7684      +54     

☔ View full report in Codecov by Sentry.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/unit/torch/quantization/plugins/test_sparse_moe.py (1)

260-275: Test logic is correct; minor comment clarification.

The test correctly uses ratio 0.5 to enable token counting and properly simulates calibration mode. However, the comment at lines 273-274 is slightly misleading:

# Set _if_calib on an expert sub-module (not set by default since only the MoE
# block was converted, not the full model).
next(converted.experts.modules())._if_calib = True

modules() includes the container itself first, so this actually sets _if_calib on converted.experts directly, not a "sub-module." The test still works because the is_calib check in forward() iterates over all modules including the container.

📝 Optional: More precise comment or explicit target
         # Simulate calibration mode so lazy-init triggers during forward
-        # Set _if_calib on an expert sub-module (not set by default since only the MoE
-        # block was converted, not the full model).
-        next(converted.experts.modules())._if_calib = True
+        # Set _if_calib on the experts container (not set by default since only the MoE
+        # block was converted, not the full model). The is_calib check iterates over all
+        # modules in experts, including the container itself.
+        converted.experts._if_calib = True
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/torch/quantization/plugins/test_sparse_moe.py` around lines 260 -
275, The comment is misleading: next(converted.experts.modules()) sets _if_calib
on the experts container itself, not a nested sub-module; change the code to set
_if_calib on an actual expert submodule and update the comment accordingly —
replace next(converted.experts.modules())._if_calib = True with
next(converted.experts.children())._if_calib = True (or explicitly index an
expert like converted.experts[0]._if_calib = True) and update the nearby comment
to state you’re setting _if_calib on an expert sub-module.
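The distinction this suggestion relies on is PyTorch's traversal semantics: `nn.Module.modules()` yields the module itself first, then all descendants, while `children()` yields only immediate submodules. A stdlib-only stand-in (class name hypothetical, mimicking those semantics) makes the difference concrete:

```python
# Minimal mock mimicking torch.nn.Module traversal semantics:
# modules() yields the module itself first, then all descendants;
# children() yields only the immediate submodules.
class MiniModule:
    def __init__(self, submodules=()):
        self._submodules = list(submodules)

    def children(self):
        yield from self._submodules

    def modules(self):
        yield self
        for sub in self._submodules:
            yield from sub.modules()


experts = MiniModule(submodules=[MiniModule(), MiniModule()])

assert next(experts.modules()) is experts                   # the container itself
assert next(experts.children()) is experts._submodules[0]   # an actual expert
```

This is why `next(converted.experts.modules())._if_calib = True` marks the container, and swapping to `children()` (or indexing an expert) would mark a real expert submodule instead.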


📥 Commits

Reviewing files that changed from the base of the PR and between 13cd1a7 and a4fd506.

📒 Files selected for processing (2)
  • modelopt/torch/quantization/plugins/huggingface.py
  • tests/unit/torch/quantization/plugins/test_sparse_moe.py

Comment on lines +447 to 452
      Optionally supports config-driven features (disabled by default):
      - ``_moe_calib_experts_ratio``: force-forward tokens to more experts during calibration.
-     - ``_moe_count_expert_calib_tokens``: count tokens routed to each expert during calibration.
+       When set to a value > 0, also enables token counting per expert.

-     When both are disabled, forward is a direct pass-through with zero overhead.
+     When disabled, forward is a direct pass-through with zero overhead.
      """

⚠️ Potential issue | 🟡 Minor

Docstring inaccuracy: token counting is not enabled when ratio == 1.0

Line 449 states "When set to a value > 0, also enables token counting per expert," but line 512 only enables counting when ratio < 1.0. The docstring should clarify this edge case.

📝 Suggested docstring fix
     Optionally supports config-driven features (disabled by default):
     - ``_moe_calib_experts_ratio``: force-forward tokens to more experts during calibration.
-      When set to a value > 0, also enables token counting per expert.
+      When set to a value in (0, 1), also enables token counting per expert.
+      At ratio == 1.0, all experts are calibrated so counting is skipped.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 447 - 452,
Update the docstring to accurately reflect when token counting per expert is
enabled: the config key _moe_calib_experts_ratio does not enable per-expert
token counting for any value > 0 — counting is only activated when
_moe_calib_experts_ratio < 1.0 (the code path that checks ratio < 1.0 enables
counting), so change the sentence to state that token counting is enabled only
for ratios strictly less than 1.0 and clarify that ratio == 1.0 results in a
direct pass-through without counting.
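For context on what the ratio controls, the force-calibration routing can be sketched as follows. This is a simplified illustration only (function and parameter names are hypothetical, not the actual `_QuantSparseMoe.forward` code): during calibration, the expert selection is widened so at least `ceil(ratio * num_experts)` experts receive tokens and collect amax statistics.

```python
import math


def calib_top_k(router_scores, top_k, calib_experts_ratio=None):
    """Illustrative sketch: widen expert selection during calibration.

    When calib_experts_ratio is set, route each token to at least
    ceil(ratio * num_experts) experts so every selected expert gathers
    amax statistics; otherwise keep the model's normal top_k.
    """
    num_experts = len(router_scores)
    k = top_k
    if calib_experts_ratio is not None:
        k = max(top_k, math.ceil(calib_experts_ratio * num_experts))
    # Rank experts by router score, highest first, and take the top k.
    ranked = sorted(range(num_experts), key=lambda e: router_scores[e], reverse=True)
    return ranked[:k]
```

At `ratio == 1.0` this selects every expert, which is why the docstring fix above matters: counting tokens per expert in that case adds nothing, since coverage is already complete.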
