Enable default MoE expert replay for Megatron train/inference parity by FurtherAI · Pull Request #701 · OpenPipe/ART

FurtherAI · 2026-05-29T20:16:18Z

Summary

Enable vLLM-to-Megatron MoE expert replay by default for ART Megatron backends. The goal is to make the policy scored and updated by Megatron use the same routed experts that vLLM used during rollout generation.

Users normally do not need to configure anything:

backend = art.MegatronBackend()

enable_expert_replay=True is the default. To disable it explicitly:

backend = art.MegatronBackend(enable_expert_replay=False)

Replay only activates for registered MoE model-support handlers. Dense models and non-Megatron backends are unaffected.

How it works

For supported MoE Megatron models, ART now takes this path:

1. MegatronBackend(enable_expert_replay=True) stores the backend-level replay setting.
2. Before vLLM startup, ART detects MoE model support and sets:
   - engine_args["enable_return_routed_experts"] = True
   - engine_args["async_scheduling"] = False
3. vLLM returns prompt and completion routed experts in the OpenAI response payload.
4. ART attaches that metadata to choices, then normal tokenization/packing aligns routes against the exact packed training tokens.
5. Packing builds a strict MoE routing replay bundle for the packed Megatron batch.
6. Megatron startup enables Megatron-Core RouterReplay, and each training job receives moe_routing_replay_path.
7. Megatron forward uses the replayed expert IDs instead of live router-selected expert IDs.
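The engine-argument step above (the two engine_args assignments) can be sketched as follows. This is a minimal illustration, not ART's actual code: the helper name `apply_expert_replay_engine_args` and the `is_supported_moe` flag are hypothetical stand-ins for however ART detects a registered MoE model-support handler. Only the two keys and their values come from the description above.

```python
from typing import Any


def apply_expert_replay_engine_args(
    engine_args: dict[str, Any],
    *,
    enable_expert_replay: bool,
    is_supported_moe: bool,
) -> dict[str, Any]:
    """Return a copy of engine_args with the replay keys set when replay applies.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    updated = dict(engine_args)  # leave the caller's dict untouched
    if enable_expert_replay and is_supported_moe:
        # Ask vLLM to return routed experts alongside the generated tokens.
        updated["enable_return_routed_experts"] = True
        # Forced off for now; see the async scheduling note in this PR.
        updated["async_scheduling"] = False
    return updated


moe_args = apply_expert_replay_engine_args(
    {"max_model_len": 4096}, enable_expert_replay=True, is_supported_moe=True
)
dense_args = apply_expert_replay_engine_args(
    {"max_model_len": 4096}, enable_expert_replay=True, is_supported_moe=False
)
```

In this sketch a dense model keeps its engine arguments unchanged, which matches the statement that dense models are unaffected.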

This is intentionally strict: if replay is enabled for a MoE path and a trajectory is missing aligned route metadata, packing raises instead of silently training without replay.
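A rough sketch of what the strict behavior means in practice is below. The `Trajectory` dataclass, the `MissingRouteMetadataError` exception, and `check_routes_present` are all hypothetical names invented for this illustration; the real packing code and its data layout may differ.

```python
from dataclasses import dataclass
from typing import Optional


class MissingRouteMetadataError(ValueError):
    """Hypothetical error type: replay is on but routes are absent or misaligned."""


@dataclass
class Trajectory:
    token_ids: list[int]
    # One entry per token; each entry lists the expert ids vLLM routed that token to.
    # The per-token layout is an assumption made for this sketch.
    routed_experts: Optional[list[list[int]]] = None


def check_routes_present(trajectories: list[Trajectory], *, replay_enabled: bool) -> None:
    """Raise instead of silently falling back to live routing."""
    if not replay_enabled:
        return
    for index, traj in enumerate(trajectories):
        if traj.routed_experts is None:
            raise MissingRouteMetadataError(f"trajectory {index} has no routed experts")
        if len(traj.routed_experts) != len(traj.token_ids):
            raise MissingRouteMetadataError(
                f"trajectory {index}: {len(traj.routed_experts)} routes "
                f"for {len(traj.token_ids)} tokens"
            )


# An aligned trajectory passes; a trajectory without routes would raise.
check_routes_present(
    [Trajectory(token_ids=[1, 2], routed_experts=[[0, 3], [1, 2]])],
    replay_enabled=True,
)
```

Failing loudly here means a missing route shows up as a packing error rather than as an unexplained rise in train/inference mismatch.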

vLLM prefix cache and async scheduling

Current ART still keeps the vLLM runtime on a pre-#39568 build, so routed-expert capture forces async_scheduling=False for now. vLLM #39568 moves routed-expert transport through ModelRunnerOutput plus scheduler-side route storage, which should support routed-experts capture with async scheduling and local prefix-cache route recovery:

vllm-project/vllm#39568

This restriction should be temporary. ART’s current pinned vLLM runtime is 0.20.2rc1.dev168+gecd0b60aa, which predates #39568. vLLM v0.22.0 includes #39568, so after upgrading ART’s vLLM runtime to v0.22.0 and validating the real ART rollout path with prefix caching, we will be able to remove the forced async-scheduling disable.

Correctness

The train/inference mismatch test now exercises the realistic path: vLLM generates rollouts, ART applies its normal tokenization and shared-prefix packing, and Megatron scores the packed tensors.

Current bf16 gates (on logprobs with Megatron as the candidate and vLLM as the target):

- default dense: mean_abs_pct <= 4%
- Qwen3 MoE: mean_abs_pct <= 7%
- Qwen3.5/3.6 MoE: mean_abs_pct <= 5%
- top20 KL(Megatron || vLLM) <= 0.002

The numbers vary a bit run-to-run. I tried deterministic vLLM sampling, but that resulted in higher mismatch, so I dropped it. The test also reports top1/top20 overlap and base/LoRA deltas for inspection.
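For readers who want a concrete picture of the gated quantities, here is a minimal sketch. The exact definitions used by the test are not stated in this description, so both formulas are assumptions: `mean_abs_pct` is taken as the mean absolute logprob difference divided by the mean absolute target logprob, and `topk_kl` renormalizes both distributions over the candidate's top-k tokens before computing KL(candidate || target). The function names are hypothetical.

```python
import math


def mean_abs_pct(candidate: list[float], target: list[float]) -> float:
    """Assumed definition: 100 * mean|candidate - target| / mean|target|."""
    mae = sum(abs(c - t) for c, t in zip(candidate, target)) / len(target)
    scale = sum(abs(t) for t in target) / len(target)
    return 100.0 * mae / scale


def topk_kl(candidate: dict[int, float], target: dict[int, float], k: int = 20) -> float:
    """Assumed definition: KL(candidate || target) over the candidate's top-k tokens.

    Both arguments map token id -> logprob for a single position.
    """
    top = sorted(candidate, key=candidate.get, reverse=True)[:k]
    shared = [tok for tok in top if tok in target]
    if not shared:
        raise ValueError("no overlap between candidate top-k and target tokens")

    def renormalize(logprobs: dict[int, float]) -> dict[int, float]:
        log_total = math.log(sum(math.exp(logprobs[tok]) for tok in shared))
        return {tok: logprobs[tok] - log_total for tok in shared}

    p = renormalize(candidate)
    q = renormalize(target)
    return sum(math.exp(p[tok]) * (p[tok] - q[tok]) for tok in shared)


# Megatron is the candidate and vLLM the target, as in the gates above.
megatron_logprobs = [-1.1, -2.0]
vllm_logprobs = [-1.0, -2.0]
pct = mean_abs_pct(megatron_logprobs, vllm_logprobs)
```

With these definitions a candidate that matches the target exactly scores 0 on both metrics.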

Here are some example numbers for Qwen3.5 35B MoE for replay-on vs off:

| run | mean_abs_pct vs vLLM | MAE | top1 | top20 | KL(Megatron \|\| vLLM) |
| --- | --- | --- | --- | --- | --- |
| source replay-on artifact | 3.923% | 0.0216 | 0.859 | 0.977 | 0.00151 |
| replay-off | 8.761% | 0.0482 | 0.875 | 0.940 | 0.00643 |

So routing replay is quite helpful: it cuts the KL to roughly 1/4 (0.00643 → 0.00151) and the mean absolute percent logprob difference to a bit under 1/2 (8.761% → 3.923%).
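The ratios quoted for the Qwen3.5 35B MoE example can be checked directly from the table values. The short script below uses only numbers from the table. Its last step is an observation rather than a confirmed definition: both rows imply a similar ratio of MAE to mean_abs_pct, which would be consistent with mean_abs_pct being MAE divided by a mean absolute target logprob.

```python
# Values copied from the replay-on vs replay-off table.
replay_on = {"mean_abs_pct": 3.923, "mae": 0.0216, "kl": 0.00151}
replay_off = {"mean_abs_pct": 8.761, "mae": 0.0482, "kl": 0.00643}

# How much of the replay-off mismatch remains with replay on.
kl_ratio = replay_on["kl"] / replay_off["kl"]
pct_ratio = replay_on["mean_abs_pct"] / replay_off["mean_abs_pct"]

# Implied normalizer if mean_abs_pct = 100 * MAE / scale (an assumption).
scale_on = replay_on["mae"] / (replay_on["mean_abs_pct"] / 100.0)
scale_off = replay_off["mae"] / (replay_off["mean_abs_pct"] / 100.0)

print(f"KL ratio: {kl_ratio:.3f}, mean_abs_pct ratio: {pct_ratio:.3f}")
print(f"implied scale: on={scale_on:.3f}, off={scale_off:.3f}")
```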

Throughput

Megatron training-side replay overhead was checked with isolated_backend_train.py using the fair capture-then-replay benchmark:

baseline: 9,720.1 tok/s
replay: 9,480.68 tok/s
delta: -2.46%
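The reported delta follows from the two throughput figures. The snippet below reproduces the arithmetic and uses only the numbers above.

```python
baseline_tok_s = 9720.1   # baseline throughput, tokens per second
replay_tok_s = 9480.68    # throughput with routing replay enabled

# Relative change of replay throughput against the baseline, in percent.
delta_pct = 100.0 * (replay_tok_s - baseline_tok_s) / baseline_tok_s
print(f"delta: {delta_pct:.2f}%")
```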

This benchmark does not measure the vLLM inference-side cost of disabling async scheduling. The impact on ART pipeline-RL should usually be small, but in saturated vLLM-only serving, especially with short decodes and high concurrency, disabling async scheduling can cost more, because async scheduling reduces host-side scheduling gaps.

Qwen/Qwen3.5-35B-A3B full workflow passes, including correctness and sensitivity.

FurtherAI added 30 commits April 13, 2026 17:48

Wire HF parity into validation workflow

2727104

Stabilize megatron HF parity runtime

e835237

Drop HF parity delta checks

84d59e0

Wire lora coverage and correctness into workflow

362160a

Wire merged vllm serving into workflow

8e43cdd

Isolate workflow stages in subprocesses

3580730

Add model support trainability workflow stages

95b07e6

Add realistic packed-position validation and runtime cleanup

592d99e

Use real preprocess in packed position validation

0cf988b

Move megatron preprocess patching into model handlers

1db721a

Replace chat template rollout with conformance suite

9b4c2ac

Wait for dedicated vLLM health before serving

d0a3198

Fix Qwen3.5 trainability and packed position handling

8dd17f6

Log correctness runs and narrow DeepEP gating

faeca8a

WIP snapshot current megatron bridge/model support state

5ac1f0c

Split Megatron runtime trainable modes for HF parity

c15075f

Restore Qwen3.5 text-only SP embedding scatter

0f96868

Restore oracle flex attention eager path

aa708cc

Fix Qwen3.5 GDN LoRA TP shard ordering

cad8003

Gate DeepEP to supported runtime dtypes

383f0aa

Revert invalid flex attention compile toggle

1144295

Restore oracle-only DeepEP fp32 override

1cd848e

Generalize LoRA shard manifests and pin block mask compile backend

df39090

Fix sensitivity harness for Qwen3.5 workflow

5a9388f

Validate packed position ids with oracle metric

6eb6d91

Add vllm separation integration test harness

c307576

Cut over ART core to external vLLM runtime

cb9fa84

Add vLLM separation integration checks

740c79e

Update lockfile for vLLM separation

c29563f

Fix vLLM separation test package imports

31e430d

FurtherAI added 30 commits May 23, 2026 17:37

Restore train-inf rollout temperature

2d6de24

Use compact non-CP oracle topology matrix

1ce63a7

Add durable model support workflow CLI

98b1cd7

Remove native LoRA exclusion from workflow CLI

850ce28

Add vLLM routed expert prefix sidecar

082d0aa

Fix routed expert prefix cache sidecar dependencies

53cd24c

Tune train-inf mismatch gates

003b433

Relax qwen3 train-inf gates

09937e0

Recognize fused moe lora coverage

54855ec

Enable managed MoE routing replay

0f70173

Release routing replay before job cleanup

fdeb42b

Update Qwen3.5 train-inf invariant gate

456ee60

Support dense real-path train-inf topology

bdd6c0e

Ignore token-only MoE routing metadata

491ef59

Treat null route fields as absent

7822790

Fix dense real-path score matching

7192d07

Add real-path base mismatch diagnostics

6593840

Fix real-path base diagnostic scoring

d7a381c

Freeze base diagnostic Megatron worker

db3cffb

Add real-path base mismatch diagnostic

3084544

Add train-inf forward trace diagnostic

4ab349d

Keep forward trace on default vLLM path

5e940a1

Limit vLLM forward trace tensor dumps

fd3c3d4

Capture Megatron final hidden in trace

f6e07d9

Save Megatron logits in forward trace

87cd3a4

Capture Megatron trace submodules for train-inf diagnostics

19297a9

Trace vLLM projection submodules for diagnostics

0286d1e

Add all-architectures model support workflow

9b4e340

Clean train-inf adapter artifacts on pass

c97dbd8

Merge remote-tracking branch 'origin/main' into austin/train_inf_mism…

76177d6

…atch # Conflicts: # pyproject.toml # src/art/preprocessing/tokenize.py # tests/integration/megatron/model_support/test_provider_support.py # uv.lock

FurtherAI wants to merge 274 commits into main from austin/train_inf_mismatch