Enable default MoE expert replay for Megatron train/inference parity #701
Draft
FurtherAI wants to merge 274 commits into
Qwen/Qwen3.5-35B-A3B full workflow passes, including correctness and sensitivity.
…atch # Conflicts: # pyproject.toml # src/art/preprocessing/tokenize.py # tests/integration/megatron/model_support/test_provider_support.py # uv.lock
Summary
Enable vLLM-to-Megatron MoE expert replay by default for ART Megatron backends. The goal is to make the policy scored and updated by Megatron use the same routed experts that vLLM used during rollout generation.
Users normally do not need to configure anything: `enable_expert_replay=True` is the default. To disable it explicitly, pass `enable_expert_replay=False` to `MegatronBackend`.

Replay only activates for registered MoE model-support handlers. Dense models and non-Megatron backends are unaffected.
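The activation rule above can be sketched as a small predicate. This is a minimal illustration, not ART's code: `ReplayContext`, its fields, and `replay_is_active` are hypothetical names; only the `enable_expert_replay` flag and its default come from this description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplayContext:
    """Hypothetical stand-in for the facts the backend consults (not ART's real types)."""

    backend: str                       # e.g. "megatron" or "local" (assumed labels)
    has_moe_handler: bool              # True when a registered MoE model-support handler exists
    enable_expert_replay: bool = True  # mirrors the documented default


def replay_is_active(ctx: ReplayContext) -> bool:
    """Replay needs all three: flag on, Megatron backend, registered MoE handler."""
    return (
        ctx.enable_expert_replay
        and ctx.backend == "megatron"
        and ctx.has_moe_handler
    )
```

With the default flag, a dense model or a non-Megatron backend yields `False` without any user configuration, which matches the "unaffected" behavior described above.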
How it works
For supported MoE Megatron models, ART now takes this path:
- `MegatronBackend(enable_expert_replay=True)` stores the backend-level replay setting.
- `engine_args["enable_return_routed_experts"] = True`
- `engine_args["async_scheduling"] = False`
- … `RouterReplay`, and each training job receives `moe_routing_replay_path`.

This is intentionally strict: if replay is enabled for a MoE path and a trajectory is missing aligned route metadata, packing raises instead of silently training without replay.
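A minimal sketch of two of the steps above: forcing the vLLM engine arguments, and failing loudly at packing time. The two `engine_args` keys are the ones named in this description. Everything else is an assumption for illustration: the function names, the exception type, and the reading of "aligned" as one route entry per token.

```python
from typing import Any, Mapping, Optional, Sequence


def with_replay_engine_args(engine_args: Mapping[str, Any]) -> dict[str, Any]:
    """Return a copy of engine_args with the two replay-related overrides applied."""
    out = dict(engine_args)
    out["enable_return_routed_experts"] = True  # ask vLLM to return routed experts
    out["async_scheduling"] = False             # forced off on the current pinned runtime
    return out


class MissingRouteMetadataError(ValueError):
    """Hypothetical error type; the real packing code may raise something else."""


def check_routes_aligned(
    token_ids: Sequence[int],
    routed_experts: Optional[Sequence[Any]],
    replay_active: bool,
) -> None:
    """Raise instead of silently packing a trajectory that cannot be replayed."""
    if not replay_active:
        return  # dense models and replay-off runs are not checked
    if routed_experts is None:
        raise MissingRouteMetadataError("trajectory has no routed-expert metadata")
    # Assumption: "aligned" means exactly one route entry per token.
    if len(routed_experts) != len(token_ids):
        raise MissingRouteMetadataError(
            f"{len(routed_experts)} route entries for {len(token_ids)} tokens"
        )
```

Raising here trades convenience for safety: a run that cannot replay stops immediately rather than producing gradients computed under different routing than the rollout used.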
vLLM prefix cache and async scheduling
Current ART still keeps the vLLM runtime on a pre-#39568 build, so routed-expert capture forces `async_scheduling=False` for now. vLLM #39568 moves routed-expert transport through `ModelRunnerOutput` plus scheduler-side route storage, which should support routed-experts capture with async scheduling and local prefix-cache route recovery: vllm-project/vllm#39568
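The version dependence can be expressed as a small gate. This is a hedged sketch, not ART's logic: the helper names are hypothetical, and the threshold assumes, as this description states, that vLLM `v0.22.0` is the first release containing #39568.

```python
import re

# Assumption taken from this PR description: v0.22.0 includes vLLM #39568.
FIRST_RELEASE_WITH_39568 = (0, 22, 0)


def release_tuple(version: str) -> tuple:
    """Extract (major, minor, patch) from strings like '0.20.2rc1.dev168+gecd0b60aa'."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized vLLM version: {version!r}")
    return tuple(int(part) for part in match.groups())


def must_disable_async_scheduling(vllm_version: str) -> bool:
    """True when the runtime predates #39568, so capture must force async scheduling off.

    Caveat: pre-release or dev builds of 0.22.0 share its release tuple and are
    treated as new enough here, which may not hold for every build.
    """
    return release_tuple(vllm_version) < FIRST_RELEASE_WITH_39568
```

For the pinned runtime `0.20.2rc1.dev168+gecd0b60aa` this returns `True`; for `v0.22.0` it returns `False`.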
This restriction should be temporary. ART’s current pinned vLLM runtime is `0.20.2rc1.dev168+gecd0b60aa`, which predates #39568. vLLM `v0.22.0` includes #39568, so after upgrading ART’s vLLM runtime to `v0.22.0` and validating the real ART rollout path with prefix caching, we will be able to remove the forced async-scheduling disable.

Correctness
The train/inference mismatch test now exercises the realistic path: vLLM generates rollouts, ART applies its normal tokenization and shared-prefix packing, and Megatron scores the packed tensors.
Current bf16 gates (on logprobs with Megatron as the candidate and vLLM as the target):
The numbers vary a bit from run to run. I tried deterministic vLLM sampling, but that resulted in higher mismatch, so I dropped it. The test also reports top-1/top-20 overlap and base/LoRA deltas for inspection.
Here are some example numbers for Qwen3.5 35B MoE for replay-on vs off:
Routing replay is quite helpful here: replay-on gives roughly 1/4 the KL and 1/2 the mean absolute percent logprob difference of replay-off.
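For readers who want to reproduce this kind of comparison, here is a sketch of the two headline metrics computed from per-token logprobs. The exact definitions used by the test are not given in this description, so both formulas are assumptions: KL is taken as the sampled-token estimate of KL(target || candidate) with vLLM as the target, and the percent difference is taken relative to the target logprob.

```python
from typing import Sequence


def _check(target_lp: Sequence[float], candidate_lp: Sequence[float]) -> None:
    if len(target_lp) == 0 or len(target_lp) != len(candidate_lp):
        raise ValueError("logprob sequences must be non-empty and the same length")


def sampled_token_kl(target_lp: Sequence[float], candidate_lp: Sequence[float]) -> float:
    """Monte Carlo estimate of KL(target || candidate) over tokens sampled by the target.

    Averages log p_target(x) - log p_candidate(x); it can be negative on small samples.
    """
    _check(target_lp, candidate_lp)
    return sum(t - c for t, c in zip(target_lp, candidate_lp)) / len(target_lp)


def mean_abs_pct_logprob_diff(
    target_lp: Sequence[float], candidate_lp: Sequence[float], eps: float = 1e-8
) -> float:
    """Mean of |candidate - target| / |target|, in percent (eps guards logprob == 0)."""
    _check(target_lp, candidate_lp)
    total = sum(
        abs(c - t) / max(abs(t), eps) for t, c in zip(target_lp, candidate_lp)
    )
    return 100.0 * total / len(target_lp)


# Toy values only, not measurements from this PR.
vllm_lp = [-1.0, -2.0]      # target: logprobs reported by vLLM
megatron_lp = [-1.1, -1.8]  # candidate: logprobs rescored by Megatron
kl = sampled_token_kl(vllm_lp, megatron_lp)
pct = mean_abs_pct_logprob_diff(vllm_lp, megatron_lp)
```

Running both functions on replay-on and replay-off scores for the same rollouts gives a like-for-like comparison of the kind summarized above.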
Throughput
Megatron training-side replay overhead was checked with `isolated_backend_train.py` using the fair capture-then-replay benchmark:

This does not measure the vLLM inference-side cost of disabling async scheduling. Expected ART pipeline-RL impact should usually be small, but in saturated vLLM-only serving, especially short decode/high concurrency, disabling async scheduling can cost more because async scheduling reduces host-side scheduling gaps.