Enable default MoE expert replay for Megatron train/inference parity#701

Draft
FurtherAI wants to merge 274 commits into
mainfrom
austin/train_inf_mismatch


@FurtherAI FurtherAI commented May 29, 2026

Summary

Enable vLLM-to-Megatron MoE expert replay by default for ART Megatron backends. The goal is to make the policy scored and updated by Megatron use the same routed experts that vLLM used during rollout generation.

Users normally do not need to configure anything:

backend = art.MegatronBackend()

enable_expert_replay=True is the default. To disable it explicitly:

backend = art.MegatronBackend(enable_expert_replay=False)

Replay only activates for registered MoE model-support handlers. Dense models and non-Megatron backends are unaffected.

How it works

For supported MoE Megatron models, ART now takes this path:

  1. MegatronBackend(enable_expert_replay=True) stores the backend-level replay setting.
  2. Before vLLM startup, ART detects MoE model support and sets:
    • engine_args["enable_return_routed_experts"] = True
    • engine_args["async_scheduling"] = False
  3. vLLM returns prompt and completion routed experts in the OpenAI response payload.
  4. ART attaches that metadata to choices, then normal tokenization/packing aligns routes against the exact packed training tokens.
  5. Packing builds a strict MoE routing replay bundle for the packed Megatron batch.
  6. Megatron startup enables Megatron-Core RouterReplay, and each training job receives moe_routing_replay_path.
  7. Megatron forward uses the replayed expert IDs instead of live router-selected expert IDs.
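As a rough illustration of steps 2 and 7, here is a minimal sketch. The function names `apply_replay_engine_args` and `select_experts` are hypothetical and are not ART or Megatron-Core APIs; only the two `engine_args` keys come from this PR. Real routing happens per layer on tensors, while this toy works on one token with plain Python lists.

```python
from __future__ import annotations

from typing import Any


def apply_replay_engine_args(
    engine_args: dict[str, Any],
    is_moe: bool,
    enable_expert_replay: bool = True,
) -> dict[str, Any]:
    """Hypothetical helper: return a copy of engine_args with the step-2 flags set."""
    updated = dict(engine_args)
    if enable_expert_replay and is_moe:
        # The two keys below are the ones named in this PR.
        updated["enable_return_routed_experts"] = True
        updated["async_scheduling"] = False
    return updated


def select_experts(
    router_logits: list[float],
    top_k: int,
    replayed_ids: list[int] | None = None,
) -> list[int]:
    """Hypothetical helper: choose expert IDs for a single token.

    With replayed_ids, the recorded rollout experts are used as-is (step 7).
    Without them, fall back to live top-k selection over the router logits.
    """
    if replayed_ids is not None:
        return list(replayed_ids)
    ranked = sorted(
        range(len(router_logits)),
        key=lambda i: router_logits[i],
        reverse=True,
    )
    return ranked[:top_k]


moe_args = apply_replay_engine_args({"async_scheduling": True}, is_moe=True)
dense_args = apply_replay_engine_args({"async_scheduling": True}, is_moe=False)

logits = [0.1, 0.9, 0.5, 0.3]
live_ids = select_experts(logits, top_k=2)
replay_ids = select_experts(logits, top_k=2, replayed_ids=[1, 3])
```

The point of the second function is that the training-side router can disagree with the rollout-side router (here live top-k picks experts 1 and 2, while the recorded rollout used 1 and 3), and replay removes that disagreement by construction.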

This is intentionally strict: if replay is enabled for a MoE path and a trajectory is missing aligned route metadata, packing raises instead of silently training without replay.
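The strict behavior can be sketched as follows. This is an assumption-laden illustration, not the ART packing code: `PackedTrajectory`, `MissingRouteMetadataError`, and `check_routes_aligned` are hypothetical names, and the route metadata is simplified to one list of expert IDs per token.

```python
from __future__ import annotations

from dataclasses import dataclass


class MissingRouteMetadataError(ValueError):
    """Hypothetical error for a trajectory without usable route metadata."""


@dataclass
class PackedTrajectory:
    """Hypothetical stand-in for a tokenized trajectory."""

    token_ids: list[int]
    # One list of expert IDs per token, or None if vLLM returned no routes.
    routed_experts: list[list[int]] | None


def check_routes_aligned(
    trajectories: list[PackedTrajectory],
    replay_enabled: bool,
) -> None:
    """Raise instead of silently training without replay."""
    if not replay_enabled:
        return
    for index, trajectory in enumerate(trajectories):
        if trajectory.routed_experts is None:
            raise MissingRouteMetadataError(
                f"trajectory {index} has no routed-expert metadata"
            )
        if len(trajectory.routed_experts) != len(trajectory.token_ids):
            raise MissingRouteMetadataError(
                f"trajectory {index} has {len(trajectory.routed_experts)} routes "
                f"for {len(trajectory.token_ids)} tokens"
            )


aligned = PackedTrajectory(token_ids=[11, 12], routed_experts=[[0, 3], [1, 2]])
check_routes_aligned([aligned], replay_enabled=True)  # passes silently
```

Failing loudly here trades convenience for a guarantee: a batch either trains with replay on every trajectory or does not train at all.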

vLLM prefix cache and async scheduling

Current ART still keeps the vLLM runtime on a pre-#39568 build, so routed-expert capture forces async_scheduling=False for now. vLLM #39568 moves routed-expert transport through ModelRunnerOutput plus scheduler-side route storage, which should support routed-experts capture with async scheduling and local prefix-cache route recovery:

vllm-project/vllm#39568

This restriction should be temporary. ART’s current pinned vLLM runtime is 0.20.2rc1.dev168+gecd0b60aa, which predates #39568. vLLM v0.22.0 includes #39568, so after upgrading ART’s vLLM runtime to v0.22.0 and validating the real ART rollout path with prefix caching, we will be able to remove the forced async-scheduling disable.
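One way the restriction could eventually be gated on the runtime version is sketched below. This is not part of the PR: `must_disable_async_scheduling` is a hypothetical helper, and it only compares the leading `major.minor.patch` release numbers, so a 0.22.0 pre-release or dev build would need separate checking against #39568.

```python
import re

# First release stated in this PR to include vllm-project/vllm#39568.
FIRST_RELEASE_WITH_ROUTE_TRANSPORT = (0, 22, 0)


def release_tuple(version: str) -> tuple[int, int, int]:
    """Extract (major, minor, patch) from a version string, ignoring suffixes."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def must_disable_async_scheduling(vllm_version: str) -> bool:
    """Hypothetical check: True when the runtime predates the v0.22.0 release."""
    return release_tuple(vllm_version) < FIRST_RELEASE_WITH_ROUTE_TRANSPORT


pinned = must_disable_async_scheduling("0.20.2rc1.dev168+gecd0b60aa")
upgraded = must_disable_async_scheduling("0.22.0")
```

A version check alone would not replace the validation step described above; it would only decide whether the forced disable is still needed.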

Correctness

The train/inference mismatch test now exercises the realistic path: vLLM generates rollouts, ART applies its normal tokenization and shared-prefix packing, and Megatron scores the packed tensors.

Current bf16 gates (on logprobs with Megatron as the candidate and vLLM as the target):

  • default dense mean_abs_pct <= 4%
  • Qwen3 MoE mean_abs_pct <= 7%
  • Qwen3.5/3.6 MoE mean_abs_pct <= 5%
  • top20 KL(Megatron || vLLM) <= 0.002

The numbers vary a bit from run to run. I tried deterministic vLLM sampling, but it produced higher mismatch, so I dropped it. The test also reports top1/top20 overlap and base/lora deltas for inspection.

Here are some example numbers for Qwen3.5 35B MoE for replay-on vs off:

| run | mean_abs_pct vs vLLM | MAE | top1 | top20 | KL(Megatron \|\| vLLM) |
| --- | --- | --- | --- | --- | --- |
| source replay-on artifact | 3.923% | 0.0216 | 0.859 | 0.977 | 0.00151 |
| replay-off | 8.761% | 0.0482 | 0.875 | 0.940 | 0.00643 |

So routing replay is quite helpful: it cuts the KL to about 1/4 (0.00151 vs 0.00643) and the mean abs pct logprob diff to just under 1/2 (3.923% vs 8.761%).
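To make the comparison concrete, here is a small sketch that checks the two Qwen3.5 35B MoE runs against the bf16 gates and computes the ratios. The thresholds and run numbers are the ones quoted in this PR; the `passes_gates` helper, the family keys, and the assumption that the top20 KL gate applies to every model family are mine.

```python
# mean_abs_pct gates in percent, keyed by hypothetical family names.
MEAN_ABS_PCT_GATES = {
    "default_dense": 4.0,
    "qwen3_moe": 7.0,
    "qwen3_5_moe": 5.0,  # Qwen3.5/3.6 MoE
}
# Assumption: the top20 KL(Megatron || vLLM) gate is shared by all families.
TOP20_KL_GATE = 0.002


def passes_gates(family: str, mean_abs_pct: float, top20_kl: float) -> bool:
    """Hypothetical helper: True when both metrics are within their gates."""
    return (
        mean_abs_pct <= MEAN_ABS_PCT_GATES[family]
        and top20_kl <= TOP20_KL_GATE
    )


replay_on = {"mean_abs_pct": 3.923, "top20_kl": 0.00151}
replay_off = {"mean_abs_pct": 8.761, "top20_kl": 0.00643}

on_ok = passes_gates("qwen3_5_moe", **replay_on)
off_ok = passes_gates("qwen3_5_moe", **replay_off)

# Replay-on metric as a fraction of the replay-off metric.
kl_ratio = replay_on["top20_kl"] / replay_off["top20_kl"]  # about 0.23
pct_ratio = replay_on["mean_abs_pct"] / replay_off["mean_abs_pct"]  # about 0.45
```

Under these assumptions the replay-on run clears both gates, while the replay-off run misses both.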

Throughput

Megatron training-side replay overhead was checked with isolated_backend_train.py using the fair capture-then-replay benchmark:

  • baseline: 9,720.1 tok/s
  • replay: 9,480.68 tok/s
  • delta: -2.46%
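The delta is the relative change of the replay throughput against the baseline, which can be checked directly from the two figures above:

```python
baseline_tok_s = 9720.1
replay_tok_s = 9480.68

# Relative throughput change, in percent, with the baseline as the reference.
delta_pct = (replay_tok_s - baseline_tok_s) / baseline_tok_s * 100
print(f"{delta_pct:.2f}%")  # -2.46%
```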

This does not measure the vLLM inference-side cost of disabling async scheduling. The impact on ART pipeline-RL should usually be small, but in saturated vLLM-only serving, especially with short decodes and high concurrency, disabling async scheduling can cost more, because async scheduling reduces host-side scheduling gaps.

Qwen/Qwen3.5-35B-A3B full workflow passes, including correctness and sensitivity.
FurtherAI added 30 commits May 23, 2026 17:37
…atch

# Conflicts:
#	pyproject.toml
#	src/art/preprocessing/tokenize.py
#	tests/integration/megatron/model_support/test_provider_support.py
#	uv.lock