
Remove _moe_count_expert_calib_tokens flag; tie token counting to moe_calib_experts_ratio#1062

Open
cjluo-nv wants to merge 5 commits into main from chenjiel/fix_sparse_moe

Conversation


@cjluo-nv cjluo-nv commented Mar 17, 2026

Summary

  • Remove moe_count_expert_calib_tokens config field and the _moe_count_expert_calib_tokens internal flag. Token counting is now implicitly enabled when moe_calib_experts_ratio is set, removing a redundant knob.
  • Change --moe_calib_experts_ratio default to None in hf_ptq.py (was 1.0). Previously all experts were force-calibrated by default; now the feature is opt-in and non-MoE models are unaffected without any flag.
  • Disable layer_sync_moe_local_experts_amax when moe_calib_experts_ratio is set, since each expert is calibrated independently with sufficient token coverage in that mode.
  • Simplify _QuantSparseMoe.forward: remove redundant truthy checks on _moe_calib_experts_ratio inside the branch that already assumes it is set.
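The consolidated gating described above can be sketched roughly as follows. This is a hypothetical simplification for illustration only; the real `_QuantSparseMoe` logic lives in `modelopt/torch/quantization/plugins/huggingface.py` and the class and method names below are stand-ins:

```python
# Sketch of the consolidated gating: token counting and amax-sync behavior
# are now implied by moe_calib_experts_ratio being set, instead of a
# separate _moe_count_expert_calib_tokens boolean flag.
class SparseMoeCalibSketch:
    def __init__(self, moe_calib_experts_ratio=None):
        # None means the feature is disabled (the new CLI default),
        # so non-MoE models are unaffected without any flag.
        self._moe_calib_experts_ratio = moe_calib_experts_ratio

    def should_count_tokens(self):
        # Old behavior: governed by a separate boolean flag.
        # New behavior: counting is tied to the ratio being set.
        return self._moe_calib_experts_ratio is not None

    def should_sync_local_experts_amax(self):
        # layer_sync_moe_local_experts_amax is skipped when the ratio is
        # set, since each expert is then calibrated independently.
        return self._moe_calib_experts_ratio is None
```

With the default of `None`, both behaviors stay off; setting any ratio turns on token counting and turns off the per-layer amax sync.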

Changed files

File Change
modelopt/torch/quantization/config.py Remove moe_count_expert_calib_tokens field; update moe_calib_experts_ratio description to document amax sync behavior
modelopt/torch/quantization/mode.py Remove moe_count_expert_calib_tokens propagation in wrapped_calib_func
modelopt/torch/quantization/plugins/huggingface.py Remove _moe_count_expert_calib_tokens from _QuantSparseMoe; simplify forward; skip layer_sync_moe_local_experts_amax when ratio is set
examples/llm_ptq/hf_ptq.py Default --moe_calib_experts_ratio to None; guard validation
tests/unit/.../test_sparse_moe.py Update tests to use _moe_calib_experts_ratio instead of removed flag

Test plan

  • Verify hf_ptq.py works without --moe_calib_experts_ratio (non-MoE model, default None)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Configuration Changes

    • moe_calib_experts_ratio now defaults to None (disabled) instead of 1.0; validation only occurs when a value is provided.
  • Refactor

    • Simplified MoE calibration flow and token-counting behavior; removed a deprecated expert-calibration configuration field.
  • Documentation

    • Changelog and docstrings updated to reflect the new default and calibration behavior.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

copy-pr-bot bot commented Mar 17, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.


coderabbitai bot commented Mar 17, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

MoE calibration control is consolidated around moe_calib_experts_ratio: the boolean moe_count_expert_calib_tokens was removed, the CLI default for the ratio changed from 1.0 to None, validators were added, and runtime token-counting and sync logic now activate only when the ratio is set.

Changes

Cohort / File(s) Summary
Configuration & CLI
modelopt/torch/quantization/config.py, examples/llm_ptq/hf_ptq.py
Removed moe_count_expert_calib_tokens from QuantizeAlgorithmConfig. Added gt=0.0, le=1.0 validators to moe_calib_experts_ratio. CLI default changed from 1.0 to None; argument validation now runs only when the value is not None.
Runtime / Mode
modelopt/torch/quantization/mode.py
Removed propagation of moe_count_expert_calib_tokens into module kwargs and related pop logic.
Plugin (HuggingFace) Implementation
modelopt/torch/quantization/plugins/huggingface.py
Removed initialization/exposure of _moe_count_expert_calib_tokens. Forward/calibration gating now depends on _moe_calib_experts_ratio; token-counting is lazily initialized only when ratio < 1.0 during calibration. Adjusts top_k routing during calibration and skips layer sync when ratio is set. Docstrings updated.
Tests
tests/unit/torch/quantization/plugins/test_sparse_moe.py
Replaced boolean-based calibration flags with ratio-based activation in tests; assertions and setups updated to use moe_calib_experts_ratio values.
Changelog
CHANGELOG.rst
Updated entry noting CLI default for --moe_calib_experts_ratio changed to None (disabled) instead of defaulting to all experts.

Sequence Diagram(s)

sequenceDiagram
    participant CLI as CLI/Args
    participant Config as QuantizeConfig
    participant Plugin as HuggingFacePlugin
    participant MoE as MoE Layer
    participant Experts as Expert Modules

    CLI->>Config: set moe_calib_experts_ratio (None or ratio)
    Config->>Plugin: provide config (ratio present or None)
    Plugin->>MoE: forward(input)
    alt ratio is set (calibration)
        Plugin->>Plugin: init token counting if ratio < 1.0
        Plugin->>MoE: adjust routing (top_k scaled by ratio)
        MoE->>Experts: route tokens (wider routing)
        Experts-->>MoE: return activations (collect per-expert stats)
        Plugin->>Plugin: reset token counting flag
        Plugin->>Plugin: skip layer_sync_moe_local_experts_amax
    else ratio is None (normal)
        MoE->>Experts: route tokens (normal routing)
        Plugin->>Plugin: perform layer_sync_moe_local_experts_amax as usual
    end
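The "top_k scaled by ratio" step in the diagram amounts to temporarily widening the router during calibration, matching the `max(original_top_k, round(num_experts * ratio))` expression in the diff. A standalone sketch (the helper name is illustrative):

```python
# Widen routing during calibration so roughly `ratio` of the experts
# receive tokens, without ever narrowing below the model's own top_k.
def calib_top_k(original_top_k: int, num_experts: int, ratio: float) -> int:
    return max(original_top_k, round(num_experts * ratio))

# 64 experts, top_k=8, ratio=0.5 -> route to 32 experts during calibration
assert calib_top_k(8, 64, 0.5) == 32
# A ratio below the native top_k fraction leaves routing unchanged
assert calib_top_k(8, 64, 0.1) == 8
```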

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Docstring Coverage — ⚠️ Warning. Docstring coverage is 58.33%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Description Check — ✅ Passed (check skipped; CodeRabbit's high-level summary is enabled).
Title Check — ✅ Passed. The title clearly and concisely summarizes the main change: removing the _moe_count_expert_calib_tokens flag and tying token counting to moe_calib_experts_ratio, which is the primary objective across all modified files.
Security Anti-Patterns — ✅ Passed. The pull request contains only configuration and calibration logic changes, with no security anti-patterns detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@cjluo-nv changed the title from "Do not forward tokens to all experts by default" to "Remove _moe_count_expert_calib_tokens flag; tie token counting to moe_calib_experts_ratio" on Mar 17, 2026
@cjluo-nv cjluo-nv marked this pull request as ready for review March 17, 2026 19:55
@cjluo-nv cjluo-nv requested review from a team as code owners March 17, 2026 19:55

@Edwardf0t1 Edwardf0t1 left a comment


Review

Clean simplification — removing moe_count_expert_calib_tokens and tying token counting to moe_calib_experts_ratio is a nice net reduction (+25/-42). A few suggestions:

Use is not None instead of truthiness for _moe_calib_experts_ratio

In huggingface.py, the guard if not self._moe_calib_experts_ratio and the check in layer_sync_moe_local_experts_amax both rely on truthiness to distinguish None from a valid float. This works today but is fragile — e.g. 0.0 is falsy and would be silently treated as None. Suggest explicit checks:

# forward guard
if self._moe_calib_experts_ratio is None:
    return super().forward(hidden_states)

# layer_sync_moe_local_experts_amax
if self._moe_calib_experts_ratio is not None:
    return

Restore the range assertion (or add a Pydantic validator)

The old code had assert 0 < self._moe_calib_experts_ratio <= 1. This was removed, and only the CLI (hf_ptq.py) validates the range now. If someone uses the Python API directly via QuantizeAlgorithmConfig, there's no validation. Consider keeping the assertion or adding a field validator on the config.
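A config-layer range check along these lines could look as follows. This is a plain-Python stand-in for the Pydantic `gt=0.0, le=1.0` field constraints; the project's `ModeloptField` wrapper is not reproduced here:

```python
# Stand-in for a Pydantic field constraint enforcing ratio in (0.0, 1.0].
def validate_moe_calib_experts_ratio(value):
    if value is None:
        return value  # feature disabled; nothing to validate
    if not (0.0 < value <= 1.0):
        raise ValueError(
            f"moe_calib_experts_ratio must be in (0.0, 1.0], got {value!r}"
        )
    return value
```

Validating at the typed-config layer means programmatic callers get the same guarantee as CLI users, instead of relying on downstream asserts.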

Minor: token counting init runs outside calibration

_init_token_counting() is now called whenever _moe_calib_experts_ratio is set, even on non-calibration forward passes. It only runs once (guarded by _token_counting_initialized), but it will allocate the counting tensor even if calibration mode is never entered. Not a bug, just a minor overhead change from the previous behavior.
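The one-shot lazy-init pattern in question is roughly the following (hypothetical names based on this review comment; a plain list stands in for the counting tensor):

```python
class TokenCountingSketch:
    """Illustrates one-shot lazy initialization guarded by a flag."""

    def __init__(self, num_experts: int):
        self.num_experts = num_experts
        self._token_counting_initialized = False
        self.expert_token_count = None

    def _init_token_counting(self):
        # Runs at most once; subsequent calls are no-ops, so existing
        # counts are never clobbered.
        if self._token_counting_initialized:
            return
        self.expert_token_count = [0] * self.num_experts
        self._token_counting_initialized = True
```

The guard makes repeated calls cheap, but as noted, the first call still allocates the counter even if calibration mode is never entered.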


Otherwise LGTM — the config removal, propagation cleanup, default change to None, and test updates are all well done.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
modelopt/torch/quantization/config.py (1)

1056-1066: ⚠️ Potential issue | 🟠 Major

Add Pydantic range constraints to moe_calib_experts_ratio.

The field currently accepts any float | None without validation at the config layer, despite other numeric fields in the same file using gt, ge, lt, or le constraints. The CLI example enforces (0.0, 1.0], and downstream code in mode.py asserts the same range, but per the coding guidelines for modelopt/torch/**/config.py, validation should be enforced at the typed-config layer using Pydantic. Programmatic callers of QuantizeAlgorithmConfig can bypass these downstream checks.

🛠️ Suggested change
     moe_calib_experts_ratio: float | None = ModeloptField(
         default=None,
+        gt=0.0,
+        le=1.0,
         title="% of experts to calibrate during forward pass.",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/config.py` around lines 1056 - 1066, Add Pydantic
range constraints to the moe_calib_experts_ratio field so config-level
validation enforces (0.0, 1.0] instead of allowing any float; specifically,
update the ModeloptField call for moe_calib_experts_ratio to include gt=0.0 and
le=1.0 (while keeping the existing default=None and the float | None type) so
QuantizeAlgorithmConfig (or whatever class contains moe_calib_experts_ratio)
validates values at parse time rather than relying on downstream asserts.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/torch/quantization/plugins/huggingface.py`:
- Around line 447-450: The code currently disables expert-token counting when
_moe_calib_experts_ratio == 1.0 which prevents expert_token_count from being
populated; change the gating logic so _count_expert_tokens is enabled whenever
_moe_calib_experts_ratio is configured > 0 (i.e., remove the special-case that
turns off counting for 1.0). Locate the assignment/condition that sets
_count_expert_tokens (and any related initialization of expert_token_count) and
make it depend only on _moe_calib_experts_ratio > 0 so counting remains active
for ratio == 1.0.

---

Outside diff comments:
In `@modelopt/torch/quantization/config.py`:
- Around line 1056-1066: Add Pydantic range constraints to the
moe_calib_experts_ratio field so config-level validation enforces (0.0, 1.0]
instead of allowing any float; specifically, update the ModeloptField call for
moe_calib_experts_ratio to include gt=0.0 and le=1.0 (while keeping the existing
default=None and the float | None type) so QuantizeAlgorithmConfig (or whatever
class contains moe_calib_experts_ratio) validates values at parse time rather
than relying on downstream asserts.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 338f43ff-e096-4077-8f3e-5154a129c349

📥 Commits

Reviewing files that changed from the base of the PR and between 00fa5bd and 0e6f3d1.

📒 Files selected for processing (5)
  • examples/llm_ptq/hf_ptq.py
  • modelopt/torch/quantization/config.py
  • modelopt/torch/quantization/mode.py
  • modelopt/torch/quantization/plugins/huggingface.py
  • tests/unit/torch/quantization/plugins/test_sparse_moe.py
💤 Files with no reviewable changes (1)
  • modelopt/torch/quantization/mode.py

@cjluo-nv cjluo-nv requested a review from a team as a code owner March 17, 2026 20:06
@cjluo-nv cjluo-nv requested a review from kevalmorabia97 March 17, 2026 20:06

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pyproject.toml`:
- Line 252: The commented-out pytest addopts line in pyproject.toml (the addopts
entry under [tool.pytest.ini_options]) removes global pytest defaults (coverage,
verbose, instafail, durations, strict-markers) and will change behavior for all
test invocations (including those driven by tox.ini and CI workflows); either
revert the comment to restore the original addopts value, or if this change is
intentional, add a brief rationale in the PR and explicitly pass the required
pytest flags from tox.ini and CI job steps (or add equivalent entries in
pyproject.toml) so unit/GPU/example/partial-unit test jobs retain the expected
options.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d9e524ef-709d-4d1a-838b-426e24ab345a

📥 Commits

Reviewing files that changed from the base of the PR and between 0e6f3d1 and d7b6f2a.

📒 Files selected for processing (3)
  • CHANGELOG.rst
  • modelopt/torch/quantization/plugins/huggingface.py
  • pyproject.toml
✅ Files skipped from review due to trivial changes (1)
  • CHANGELOG.rst
🚧 Files skipped from review as they are similar to previous changes (1)
  • modelopt/torch/quantization/plugins/huggingface.py

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
modelopt/torch/quantization/plugins/huggingface.py (1)

513-548: ⚠️ Potential issue | 🟠 Major

Make temporary calibration state restoration exception-safe.

If super().forward(hidden_states) fails in this block, top_k and _count_expert_tokens may remain modified, affecting later passes.

Suggested fix
         if is_calib:
-            self._count_expert_tokens = True
-            if TRANSFORMERS_VERSION_GE_5_0:
-                assert hasattr(self, "gate") and hasattr(self.gate, "top_k")
-                original_top_k = self.gate.top_k
-                self.gate.top_k = max(
-                    original_top_k, round(self.gate.num_experts * self._moe_calib_experts_ratio)
-                )
-                super().forward(hidden_states)
-                self.gate.top_k = original_top_k
-            else:
-                # Path for transformers < 5.0
-                if hasattr(self, "gate") and hasattr(self.gate, "top_k"):
-                    top_k_owner = self.gate
-                else:
-                    top_k_owner = self
-                original_top_k = top_k_owner.top_k
-                if hasattr(self, "num_experts"):
-                    top_k_owner.top_k = max(
-                        original_top_k, round(self.num_experts * self._moe_calib_experts_ratio)
-                    )
-                elif hasattr(self, "experts"):
-                    num_experts = (
-                        self.experts.num_experts
-                        if hasattr(self.experts, "num_experts")
-                        else len(self.experts)
-                    )
-                    top_k_owner.top_k = max(
-                        original_top_k,
-                        round(num_experts * self._moe_calib_experts_ratio),
-                    )
-                else:
-                    raise ValueError(f"Could not find num_experts in module {self}")
-                super().forward(hidden_states)
-                top_k_owner.top_k = original_top_k
-            self._count_expert_tokens = False
+            self._count_expert_tokens = True
+            try:
+                if TRANSFORMERS_VERSION_GE_5_0:
+                    assert hasattr(self, "gate") and hasattr(self.gate, "top_k")
+                    original_top_k = self.gate.top_k
+                    self.gate.top_k = max(
+                        original_top_k, round(self.gate.num_experts * self._moe_calib_experts_ratio)
+                    )
+                    super().forward(hidden_states)
+                    self.gate.top_k = original_top_k
+                else:
+                    # Path for transformers < 5.0
+                    if hasattr(self, "gate") and hasattr(self.gate, "top_k"):
+                        top_k_owner = self.gate
+                    else:
+                        top_k_owner = self
+                    original_top_k = top_k_owner.top_k
+                    if hasattr(self, "num_experts"):
+                        top_k_owner.top_k = max(
+                            original_top_k, round(self.num_experts * self._moe_calib_experts_ratio)
+                        )
+                    elif hasattr(self, "experts"):
+                        num_experts = (
+                            self.experts.num_experts
+                            if hasattr(self.experts, "num_experts")
+                            else len(self.experts)
+                        )
+                        top_k_owner.top_k = max(
+                            original_top_k,
+                            round(num_experts * self._moe_calib_experts_ratio),
+                        )
+                    else:
+                        raise ValueError(f"Could not find num_experts in module {self}")
+                    super().forward(hidden_states)
+                    top_k_owner.top_k = original_top_k
+            finally:
+                self._count_expert_tokens = False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 513 - 548,
The calibration path mutates self._count_expert_tokens and top_k (either
self.gate.top_k or top_k_owner.top_k) before calling
super().forward(hidden_states) and doesn't restore them if an exception is
raised; make this exception-safe by wrapping the call to
super().forward(hidden_states) in a try/finally block that always restores top_k
to original_top_k and sets self._count_expert_tokens back to False, applying the
same try/finally pattern in both the TRANSFORMERS_VERSION_GE_5_0 branch (use
original_top_k = self.gate.top_k) and the else branch (use original_top_k =
top_k_owner.top_k and restore top_k_owner.top_k), keeping the logic that
computes the temporary top_k based on self._moe_calib_experts_ratio and
num_experts.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@modelopt/torch/quantization/plugins/huggingface.py`:
- Around line 513-548: The calibration path mutates self._count_expert_tokens
and top_k (either self.gate.top_k or top_k_owner.top_k) before calling
super().forward(hidden_states) and doesn't restore them if an exception is
raised; make this exception-safe by wrapping the call to
super().forward(hidden_states) in a try/finally block that always restores top_k
to original_top_k and sets self._count_expert_tokens back to False, applying the
same try/finally pattern in both the TRANSFORMERS_VERSION_GE_5_0 branch (use
original_top_k = self.gate.top_k) and the else branch (use original_top_k =
top_k_owner.top_k and restore top_k_owner.top_k), keeping the logic that
computes the temporary top_k based on self._moe_calib_experts_ratio and
num_experts.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 59fdda96-ec93-4f7a-a987-ee8ff7e91498

📥 Commits

Reviewing files that changed from the base of the PR and between d7b6f2a and 096bb73.

📒 Files selected for processing (2)
  • modelopt/torch/quantization/config.py
  • modelopt/torch/quantization/plugins/huggingface.py

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
modelopt/torch/quantization/plugins/huggingface.py (1)

510-550: ⚠️ Potential issue | 🟠 Major

Make calibration state restoration exception-safe.

If the calibration super().forward(hidden_states) raises, top_k and _count_expert_tokens may remain modified and affect later forwards. Wrap the temporary mutation scope in try/finally.

Proposed fix
         if is_calib:
             if not self._token_counting_initialized:
                 self._init_token_counting()
-            self._count_expert_tokens = True
-            if TRANSFORMERS_VERSION_GE_5_0:
-                assert hasattr(self, "gate") and hasattr(self.gate, "top_k")
-                original_top_k = self.gate.top_k
-                self.gate.top_k = max(
-                    original_top_k, round(self.gate.num_experts * self._moe_calib_experts_ratio)
-                )
-                super().forward(hidden_states)
-                self.gate.top_k = original_top_k
-            else:
-                # Path for transformers < 5.0
-                if hasattr(self, "gate") and hasattr(self.gate, "top_k"):
-                    top_k_owner = self.gate
-                else:
-                    top_k_owner = self
-                original_top_k = top_k_owner.top_k
-                if hasattr(self, "num_experts"):
-                    top_k_owner.top_k = max(
-                        original_top_k, round(self.num_experts * self._moe_calib_experts_ratio)
-                    )
-                elif hasattr(self, "experts"):
-                    num_experts = (
-                        self.experts.num_experts
-                        if hasattr(self.experts, "num_experts")
-                        else len(self.experts)
-                    )
-                    top_k_owner.top_k = max(
-                        original_top_k,
-                        round(num_experts * self._moe_calib_experts_ratio),
-                    )
-                else:
-                    raise ValueError(f"Could not find num_experts in module {self}")
-                super().forward(hidden_states)
-                top_k_owner.top_k = original_top_k
-            self._count_expert_tokens = False
+            self._count_expert_tokens = True
+            try:
+                if TRANSFORMERS_VERSION_GE_5_0:
+                    assert hasattr(self, "gate") and hasattr(self.gate, "top_k")
+                    original_top_k = self.gate.top_k
+                    self.gate.top_k = max(
+                        original_top_k, round(self.gate.num_experts * self._moe_calib_experts_ratio)
+                    )
+                    try:
+                        super().forward(hidden_states)
+                    finally:
+                        self.gate.top_k = original_top_k
+                else:
+                    # Path for transformers < 5.0
+                    if hasattr(self, "gate") and hasattr(self.gate, "top_k"):
+                        top_k_owner = self.gate
+                    else:
+                        top_k_owner = self
+                    original_top_k = top_k_owner.top_k
+                    if hasattr(self, "num_experts"):
+                        top_k_owner.top_k = max(
+                            original_top_k, round(self.num_experts * self._moe_calib_experts_ratio)
+                        )
+                    elif hasattr(self, "experts"):
+                        num_experts = (
+                            self.experts.num_experts
+                            if hasattr(self.experts, "num_experts")
+                            else len(self.experts)
+                        )
+                        top_k_owner.top_k = max(
+                            original_top_k,
+                            round(num_experts * self._moe_calib_experts_ratio),
+                        )
+                    else:
+                        raise ValueError(f"Could not find num_experts in module {self}")
+                    try:
+                        super().forward(hidden_states)
+                    finally:
+                        top_k_owner.top_k = original_top_k
+            finally:
+                self._count_expert_tokens = False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 510 - 550,
Wrap the calibration-only mutation of self._count_expert_tokens and any
temporary change to top_k (whether via self.gate.top_k or top_k_owner.top_k) in
a try/finally so that original_top_k is restored and self._count_expert_tokens
is reset even if super().forward(hidden_states) raises; specifically, after
setting self._count_expert_tokens = True and computing original_top_k and
assigning the increased top_k (in both the TRANSFORMERS_VERSION_GE_5_0 branch
and the legacy branch that sets top_k_owner.top_k), call
super().forward(hidden_states) inside a try and restore
top_k_owner.top_k/original_top_k and set self._count_expert_tokens = False in
the finally block (also ensure the initial _init_token_counting call and reading
of _moe_calib_experts_ratio remain unchanged).
🧹 Nitpick comments (1)
tests/unit/torch/quantization/plugins/test_sparse_moe.py (1)

272-276: Set _if_calib on an actual expert module for clearer intent.

Using next(converted.experts.modules()) usually returns the container first. Prefer explicitly selecting a child expert to keep this robust and self-explanatory.

Proposed tweak
-        next(converted.experts.modules())._if_calib = True
+        first_expert = next(iter(converted.experts))
+        first_expert._if_calib = True
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/torch/quantization/plugins/test_sparse_moe.py` around lines 272 -
276, The line setting _if_calib uses next(converted.experts.modules()) which may
return the container; instead select an actual expert child so intent is clear
and robust—use converted.experts.children() (e.g.,
next(converted.experts.children()) ._if_calib = True) or index into the expert
container if it is a ModuleList (e.g., converted.experts[0]._if_calib = True) to
ensure you set _if_calib on a real expert module.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@modelopt/torch/quantization/plugins/huggingface.py`:
- Around line 510-550: Wrap the calibration-only mutation of
self._count_expert_tokens and any temporary change to top_k (whether via
self.gate.top_k or top_k_owner.top_k) in a try/finally so that original_top_k is
restored and self._count_expert_tokens is reset even if
super().forward(hidden_states) raises; specifically, after setting
self._count_expert_tokens = True and computing original_top_k and assigning the
increased top_k (in both the TRANSFORMERS_VERSION_GE_5_0 branch and the legacy
branch that sets top_k_owner.top_k), call super().forward(hidden_states) inside
a try and restore top_k_owner.top_k/original_top_k and set
self._count_expert_tokens = False in the finally block (also ensure the initial
_init_token_counting call and reading of _moe_calib_experts_ratio remain
unchanged).

---

Nitpick comments:
In `@tests/unit/torch/quantization/plugins/test_sparse_moe.py`:
- Around line 272-276: The line setting _if_calib uses
next(converted.experts.modules()) which may return the container; instead select
an actual expert child so intent is clear and robust—use
converted.experts.children() (e.g., next(converted.experts.children())
._if_calib = True) or index into the expert container if it is a ModuleList
(e.g., converted.experts[0]._if_calib = True) to ensure you set _if_calib on a
real expert module.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8ac10ed3-831b-4160-b73d-e9a0408d0826

📥 Commits

Reviewing files that changed from the base of the PR and between 096bb73 and 13cd1a7.

📒 Files selected for processing (2)
  • modelopt/torch/quantization/plugins/huggingface.py
  • tests/unit/torch/quantization/plugins/test_sparse_moe.py


@Edwardf0t1 Edwardf0t1 left a comment


One small thing: _count_expert_tokens is now set to True unconditionally inside the if is_calib block, even when _moe_calib_experts_ratio == 1.0. Previously the code had self._count_expert_tokens = self._moe_calib_experts_ratio < 1.0 to skip counting when all experts are calibrated (since the count is less useful in that case). Is this intentional?


codecov bot commented Mar 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.27%. Comparing base (58417e5) to head (a4fd506).
⚠️ Report is 13 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1062      +/-   ##
==========================================
+ Coverage   70.07%   70.27%   +0.20%     
==========================================
  Files         221      227       +6     
  Lines       25499    25854     +355     
==========================================
+ Hits        17869    18170     +301     
- Misses       7630     7684      +54     

☔ View full report in Codecov by Sentry.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/unit/torch/quantization/plugins/test_sparse_moe.py (1)

260-275: Test logic is correct; minor comment clarification.

The test correctly uses ratio 0.5 to enable token counting and properly simulates calibration mode. However, the comment at lines 273-274 is slightly misleading:

# Set _if_calib on an expert sub-module (not set by default since only the MoE
# block was converted, not the full model).
next(converted.experts.modules())._if_calib = True

modules() includes the container itself first, so this actually sets _if_calib on converted.experts directly, not a "sub-module." The test still works because the is_calib check in forward() iterates over all modules including the container.

📝 Optional: More precise comment or explicit target
         # Simulate calibration mode so lazy-init triggers during forward
-        # Set _if_calib on an expert sub-module (not set by default since only the MoE
-        # block was converted, not the full model).
-        next(converted.experts.modules())._if_calib = True
+        # Set _if_calib on the experts container (not set by default since only the MoE
+        # block was converted, not the full model). The is_calib check iterates over all
+        # modules in experts, including the container itself.
+        converted.experts._if_calib = True
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/torch/quantization/plugins/test_sparse_moe.py` around lines 260 -
275, The comment is misleading: next(converted.experts.modules()) sets _if_calib
on the experts container itself, not a nested sub-module; change the code to set
_if_calib on an actual expert submodule and update the comment accordingly —
replace next(converted.experts.modules())._if_calib = True with
next(converted.experts.children())._if_calib = True (or explicitly index an
expert like converted.experts[0]._if_calib = True) and update the nearby comment
to state you’re setting _if_calib on an expert sub-module.
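The distinction this suggestion relies on is PyTorch's traversal semantics: `nn.Module.modules()` yields the module itself first, then all descendants, while `children()` yields only immediate submodules. A stdlib-only stand-in (class name hypothetical, mimicking those semantics) makes the difference concrete:

```python
# Minimal mock mimicking torch.nn.Module traversal semantics:
# modules() yields the module itself first, then all descendants;
# children() yields only the immediate submodules.
class MiniModule:
    def __init__(self, submodules=()):
        self._submodules = list(submodules)

    def children(self):
        yield from self._submodules

    def modules(self):
        yield self
        for sub in self._submodules:
            yield from sub.modules()


experts = MiniModule(submodules=[MiniModule(), MiniModule()])

assert next(experts.modules()) is experts                   # the container itself
assert next(experts.children()) is experts._submodules[0]   # an actual expert
```

This is why `next(converted.experts.modules())._if_calib = True` marks the container, and swapping to `children()` (or indexing an expert) would mark a real expert submodule instead.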


📥 Commits

Reviewing files that changed from the base of the PR and between 13cd1a7 and a4fd506.

📒 Files selected for processing (2)
  • modelopt/torch/quantization/plugins/huggingface.py
  • tests/unit/torch/quantization/plugins/test_sparse_moe.py

Comment on lines +447 to 452
      Optionally supports config-driven features (disabled by default):
      - ``_moe_calib_experts_ratio``: force-forward tokens to more experts during calibration.
-     - ``_moe_count_expert_calib_tokens``: count tokens routed to each expert during calibration.
+       When set to a value > 0, also enables token counting per expert.

-     When both are disabled, forward is a direct pass-through with zero overhead.
+     When disabled, forward is a direct pass-through with zero overhead.
      """

⚠️ Potential issue | 🟡 Minor

Docstring inaccuracy: token counting is not enabled when ratio == 1.0

Line 449 states "When set to a value > 0, also enables token counting per expert," but line 512 only enables counting when ratio < 1.0. The docstring should clarify this edge case.

📝 Suggested docstring fix
     Optionally supports config-driven features (disabled by default):
     - ``_moe_calib_experts_ratio``: force-forward tokens to more experts during calibration.
-      When set to a value > 0, also enables token counting per expert.
+      When set to a value in (0, 1), also enables token counting per expert.
+      At ratio == 1.0, all experts are calibrated so counting is skipped.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 447 - 452,
Update the docstring to accurately reflect when token counting per expert is
enabled: the config key _moe_calib_experts_ratio does not enable per-expert
token counting for any value > 0 — counting is only activated when
_moe_calib_experts_ratio < 1.0 (the code path that checks ratio < 1.0 enables
counting), so change the sentence to state that token counting is enabled only
for ratios strictly less than 1.0 and clarify that ratio == 1.0 results in a
direct pass-through without counting.
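For context on what the ratio controls, the force-calibration routing can be sketched as follows. This is a simplified illustration only (function and parameter names are hypothetical, not the actual `_QuantSparseMoe.forward` code): during calibration, the expert selection is widened so at least `ceil(ratio * num_experts)` experts receive tokens and collect amax statistics.

```python
import math


def calib_top_k(router_scores, top_k, calib_experts_ratio=None):
    """Illustrative sketch: widen expert selection during calibration.

    When calib_experts_ratio is set, route each token to at least
    ceil(ratio * num_experts) experts so every selected expert gathers
    amax statistics; otherwise keep the model's normal top_k.
    """
    num_experts = len(router_scores)
    k = top_k
    if calib_experts_ratio is not None:
        k = max(top_k, math.ceil(calib_experts_ratio * num_experts))
    # Rank experts by router score, highest first, and take the top k.
    ranked = sorted(range(num_experts), key=lambda e: router_scores[e], reverse=True)
    return ranked[:k]
```

At `ratio == 1.0` this selects every expert, which is why the docstring fix above matters: counting tokens per expert in that case adds nothing, since coverage is already complete.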
