feat: data plane transfer queue integration #2439
Open: ZhiyuLi-Nvidia wants to merge 159 commits into main from zhiyul/data_plane_plan
Commits
0e2d986
plan
ZhiyuLi-Nvidia 03fd24e
plan: align Stage 4 with rl-arena/verl 1-hop pattern
ZhiyuLi-Nvidia 7aedea8
feat(data-plane): TransferQueue integration for GRPO with driver-side…
ZhiyuLi-Nvidia 9c17127
refactor(data-plane): extract driver-side balanced packing into presh…
ZhiyuLi-Nvidia f1a995b
feat(data-plane): AsyncTrajectoryCollector writes rollouts to TQ when…
ZhiyuLi-Nvidia ec7df8f
feat(data-plane): wire async-on-TQ end-to-end with driver-side balanc…
ZhiyuLi-Nvidia 49db9bb
fix(data-plane): preserve sample order and FLOPs semantics on @dp_dis…
ZhiyuLi-Nvidia 130f713
feat(data-plane): grpo_sync routes logprob/ref-logprob through @dp_di…
ZhiyuLi-Nvidia 5e26441
refactor(data-plane): replace @dp_dispatch with TQPolicy subclass; ad…
ZhiyuLi-Nvidia bd714c8
fix(data-plane): VLM extras, async fan-out, cleanup-on-failure
ZhiyuLi-Nvidia f2a8ba3
docs(data-plane): add API lifecycle doc with verl comparison
ZhiyuLi-Nvidia 680e5dd
feat(data-plane): sync 1-hop trajectory collector + per-sample key li…
ZhiyuLi-Nvidia 8b297f8
refactor(data-plane): extract make_actor_runtime_env, fix N² list copy
ZhiyuLi-Nvidia 941b54d
feat(data-plane): jagged tensors on TQ wire + naming/factory cleanup
ZhiyuLi-Nvidia 975bd05
refactor(data-plane): KVBatchMeta.subset/slice/concat methods
ZhiyuLi-Nvidia 186e792
Mooncake cpu backend
ZhiyuLi-Nvidia fabb9a0
Readability Refactor
ZhiyuLi-Nvidia eb643c0
wip test mooncake
ZhiyuLi-Nvidia b32ffa3
refactor(data-plane): drop dead set_wire_format/_PACK_JAGGED + adapte…
ZhiyuLi-Nvidia 68454ff
refactor(ray.sub): drop NETWORK_INIT_CMDS — MC_TCP_BIND_ADDRESS suffices
ZhiyuLi-Nvidia 2486160
docs(data-plane): consolidate README; drop stale plan/verl refs
ZhiyuLi-Nvidia c2c53d6
feat(data-plane): non-tensor object support on TQ wire
ZhiyuLi-Nvidia 2ba6ef2
feat(grpo-sync): equivalency fixes + content via TQ object column
ZhiyuLi-Nvidia 72835c6
style: fix ruff lint errors and apply ruff format
ZhiyuLi-Nvidia 74ddeba
style: apply pre-commit auto-fixes (ruff)
ZhiyuLi-Nvidia 3b9b827
chore(pyrefly): whitelist all new data_plane files + fix type errors
ZhiyuLi-Nvidia 9dbca27
remove unnecessary script
ZhiyuLi-Nvidia 3533d53
feat(data-plane): decompose message_log at wire boundary
ZhiyuLi-Nvidia 4a8096c
refactor(data-plane): rename DataPlaneClient.get_meta → claim_meta
ZhiyuLi-Nvidia 87414f8
docs(data-plane): tighten DataPlaneClient boundary docstring
ZhiyuLi-Nvidia 9ca449f
fix(data-plane): treat DataPlaneConfig.enabled as required field
ZhiyuLi-Nvidia a123ffe
docs(data-plane): make build_data_plane_client docstring backend-agno…
ZhiyuLi-Nvidia 92c8244
refactor(data-plane): promote codec imports to module top-level
ZhiyuLi-Nvidia efdd82c
refactor(data-plane): rename driver_io → column_io
ZhiyuLi-Nvidia 0d92835
refactor(data-plane): validate dp_world at TQPolicy config time
ZhiyuLi-Nvidia 861294f
refactor(data-plane): centralize packing-meta keys in schema.py
ZhiyuLi-Nvidia cb36ef6
refactor(data-plane): drop redundant dp_world assert in shard_meta_fo…
ZhiyuLi-Nvidia dbbfd19
refactor(data-plane): move DP_SEED_FIELDS to schema.py as DP_TRAIN_FI…
ZhiyuLi-Nvidia a646aeb
fix(data-plane): reject empty meta in shard_meta_for_dp
ZhiyuLi-Nvidia 81cbcd7
refactor(data-plane): print_event → log_event via stdlib logging
ZhiyuLi-Nvidia 808f165
style(data-plane): match repo logger naming convention
ZhiyuLi-Nvidia 0a330d3
refactor(data-plane): convert DataPlaneStats to @dataclass
ZhiyuLi-Nvidia 028dff8
refactor(data-plane): type DataPlaneEvent as TypedDict
ZhiyuLi-Nvidia b1de50f
refactor(data-plane): drop placeholder 0s from _run; make sizes kw-only
ZhiyuLi-Nvidia fd47991
fix(data-plane): route check_consumption_status through _run
ZhiyuLi-Nvidia fc7d6e5
fix(data-plane): route close() through _run
ZhiyuLi-Nvidia 4d94024
perf(data-plane): single sync in to_nested_by_length
ZhiyuLi-Nvidia 0f69257
docs(data-plane): convert codec.py docstrings to Google style
ZhiyuLi-Nvidia e72682e
refactor(data-plane): centralize Layout type alias in schema.py
ZhiyuLi-Nvidia 5e3f2d3
fix(data-plane): validate pad_to_multiple >= 1 in materialize
ZhiyuLi-Nvidia 3c6f7ca
fix(data-plane): fail fast on empty local IP at Mooncake bootstrap
ZhiyuLi-Nvidia d86d84b
fix(data-plane): surface chmod failure when mooncake_master is not exec
ZhiyuLi-Nvidia 13ae181
refactor(data-plane): scope mooncake_cpu 1D workaround to TQDataPlane…
ZhiyuLi-Nvidia c39c580
docs(data-plane): clarify TQ module vs client access convention
ZhiyuLi-Nvidia 3b1d196
docs(data-plane): note trust boundary at pack_object_array pickle site
ZhiyuLi-Nvidia 4fda233
refactor(data-plane): drop codec pickle, use TQ-native NonTensorStack
ZhiyuLi-Nvidia 5e36cd2
refactor(data-plane): drop dead object-array codec helpers
ZhiyuLi-Nvidia a11a8cd
refactor(data-plane): centralize _meta_idx sentinel in schema.py
ZhiyuLi-Nvidia 5669bbb
docs(data-plane): convert interfaces.py docstrings to Google style
ZhiyuLi-Nvidia f602e71
refactor(data-plane): align schema constant names with their values
ZhiyuLi-Nvidia d2d6e98
docs(data-plane): tighten preshard.py docstring to Google style
ZhiyuLi-Nvidia cd022c0
docs(data-plane): convert column_io.py docstrings to Google style
ZhiyuLi-Nvidia a7b14a8
docs(data-plane): convert factory.py docstring to Google style
ZhiyuLi-Nvidia a28e116
docs(data-plane): add Args/Returns blocks to observability.py docstrings
ZhiyuLi-Nvidia 941c084
docs(data-plane): tighten transfer_queue.py docstrings, add Args/Retu…
ZhiyuLi-Nvidia ac171dc
docs(data-plane): add Args/Returns to worker_mixin.py docstrings
ZhiyuLi-Nvidia 4573641
docs(data-plane): add Args/Returns blocks to tq_policy.py docstrings
ZhiyuLi-Nvidia f3ac950
docs(data-plane): convert sync_rollout_actor.py docstrings to Google …
ZhiyuLi-Nvidia c69cbd0
docs(data-plane): add Args/Returns to grpo_sync.py dynamic-sampling h…
ZhiyuLi-Nvidia 18ec172
refactor(data-plane): drop _to_wire's redundant promote_1d kwarg
ZhiyuLi-Nvidia d0e8fdb
fix(data-plane): survive TQ simple-backend NonTensorData wire-strip
ZhiyuLi-Nvidia 9cd0c3a
build(data-plane): pin mooncake-transfer-engine-cuda13 wheel for cu13…
ZhiyuLi-Nvidia 800f89b
chore: ruff auto-fix and ruff-format pass
ZhiyuLi-Nvidia 1d7d0ee
chore(pyrefly): rename driver_io → column_io in whitelist
ZhiyuLi-Nvidia 2a6285c
chore(pyrefly): silence 5 latent type errors with targeted ignore com…
ZhiyuLi-Nvidia e06076e
chore(pyrefly): whitelist nemo_rl/data_plane/schema.py
ZhiyuLi-Nvidia 81a734f
fix(data-plane): preserve object-column identity through TQ wire
ZhiyuLi-Nvidia e6033a9
fix(data-plane): gate TQ write-back on TP×CP×PP leader to avoid dupli…
ZhiyuLi-Nvidia 68110b9
chore: ruff auto-fix and D205 docstring fixes
ZhiyuLi-Nvidia 06175ca
refactor(data-plane): drop async-grpo TQ scaffolding from sync PR
ZhiyuLi-Nvidia a592a0d
refactor(data-plane): consolidate producer codec, caller mints keys
ZhiyuLi-Nvidia b6227f1
test(data-plane): align codec tests with current contract
ZhiyuLi-Nvidia 06fa8a3
refactor(grpo_sync): drop dead batch_cache; make TQPolicy attrs public
ZhiyuLi-Nvidia 4d41c24
refactor(data-plane): extract calibration field filter into named sch…
ZhiyuLi-Nvidia 2447264
refactor(data-plane): make kv_batch_get(select_fields) required
ZhiyuLi-Nvidia b5e4561
refactor(sync-rollout-actor): remove unused wrappers; document full l…
ZhiyuLi-Nvidia 44e28d5
test(data-plane): move data_plane unit tests under tests/unit/ for CI…
ZhiyuLi-Nvidia d283cbb
test(data-plane): apply ruff --fix and import-sort to data_plane unit…
ZhiyuLi-Nvidia 54b24b4
docs: fix broken nemo-gym Core Components link
ZhiyuLi-Nvidia 8818e91
chore(grpo): drop stale mypy comments; rename TQPolicy ctor->actor
ZhiyuLi-Nvidia f6aaecf
fix(data-plane): reject loopback IP; resolve TQ runtime_env pin from …
ZhiyuLi-Nvidia efc0e27
docs(data-plane): rewrite README around sync flow + async proposal
ZhiyuLi-Nvidia 8b535af
docs(data-plane): clarify partition scope and TQ mental model
ZhiyuLi-Nvidia f8b310d
refactor(data-plane): per-row tags on KVBatchMeta; rename slice → dri…
ZhiyuLi-Nvidia 46b14a7
perf(sync-rollout-actor): subset driver_carry via carry_keys
ZhiyuLi-Nvidia 1724362
refactor(grpo-sync): apply overlong filter post-dynamic-sampling
ZhiyuLi-Nvidia 4b51983
refactor(grpo-sync): isolate TQ ops behind TQPolicy/KVBatchMeta façades
ZhiyuLi-Nvidia 228f066
refactor(data-plane): YAML-only defaults for TQ config (terryk §9)
ZhiyuLi-Nvidia 20f290e
docs(data-plane): refresh README around encapsulated TQ path
ZhiyuLi-Nvidia 7f9f6ac
chore: ruff format + pyrefly ignore + underscore-md rename
ZhiyuLi-Nvidia 52495d8
docs(data-plane): drop api-lifecycle doc; realistic concrete examples
ZhiyuLi-Nvidia 9f88424
docs: align nemo-gym Core Components link with main
ZhiyuLi-Nvidia 6c94851
fix(data-plane): close grad_norm collapse + NCCL desync in DP fsdp2 path
ZhiyuLi-Nvidia 8100471
refactor(data-plane): drop _tq() lazy wrapper; fail-fast in check_con…
ZhiyuLi-Nvidia b51a4e4
refactor(grpo-sync): mint uids in rollout actor (verl-style per-promp…
ZhiyuLi-Nvidia c8ca43e
refactor(data-plane): rename KVBatchMeta.keys -> sample_ids (Phase A)
ZhiyuLi-Nvidia 0f45f07
refactor(data-plane): rename DataPlaneClient kwarg keys -> sample_ids…
ZhiyuLi-Nvidia d68ad02
test(data-plane): update KVBatchMeta schema-pin to sample_ids
ZhiyuLi-Nvidia 1ca91e8
refactor(data-plane): rename DataPlaneClient verbs kv_batch_* -> {put…
ZhiyuLi-Nvidia f047682
refactor(data-plane): tighten clear_samples(None) contract; warn on s…
ZhiyuLi-Nvidia aec314d
chore(data-plane): apply ruff format
ZhiyuLi-Nvidia 65f8008
feat(data-plane): align seq-dim across DP ranks via meta-stamped glob…
ZhiyuLi-Nvidia ba3f2f8
test(data-plane): add missing DataPlaneConfig keys to test_seqpack_eq…
ZhiyuLi-Nvidia ac607de
refactor(data-plane): remove _PartitionRecord from TQ adapter
ZhiyuLi-Nvidia 9b75e97
test(data-plane): remove empty tests/unit/data_plane/conftest.py
ZhiyuLi-Nvidia 60a2872
revert(test): restore NUM_MINUTES=150 in prorlv2 recipe sh
ZhiyuLi-Nvidia 8ca9e7a
test(data-plane): drop test_tq_multinode.py
ZhiyuLi-Nvidia 55acd37
docs(data-plane): document DP-aligned forward pad seqlen in README
ZhiyuLi-Nvidia 8289b5a
test(data-plane): drop stale import-isolation tests; merge codec_obje…
ZhiyuLi-Nvidia 1ee5d2d
refactor(data-plane): drop drive-by edits from PR scope
ZhiyuLi-Nvidia 2dd6ec0
test(data-plane): accept attribute-style data_plane access in invariant
ZhiyuLi-Nvidia bfab58a
refactor(data-plane): use attribute-style access on MasterConfig
ZhiyuLi-Nvidia ee4ce24
refactor(data-plane): replace run_grpo dispatch grep with behavioral …
ZhiyuLi-Nvidia bc46bf8
fix(data-plane): use attribute access for loss_fn KL penalty assert
ZhiyuLi-Nvidia f598ae6
fix(data-plane): pre-register fields to dodge TQ controller race
ZhiyuLi-Nvidia 6225fbe
fix(configs): set truncated_importance_sampling_type=tis on recipes t…
ZhiyuLi-Nvidia c89bc43
refactor(data-plane): close four cross-boundary leaks
ZhiyuLi-Nvidia fd8b23a
chore(data-plane): apply ruff format to discard_samples
ZhiyuLi-Nvidia 5c306ed
build: regenerate uv.lock (cu13 mooncake wheel needs requires-python …
ZhiyuLi-Nvidia ec13926
build: regenerate uv.lock against current HEAD
ZhiyuLi-Nvidia 00ec2f5
test(data-plane): consolidate suite under tests/unit/data_plane
ZhiyuLi-Nvidia f61fec5
fix(data-plane): shrink mooncake_cpu segment defaults to fit CI runners
ZhiyuLi-Nvidia 8528b7c
test(data-plane): update _apply_dynamic_sampling tests for policy= param
ZhiyuLi-Nvidia e0eb6cd
fix(data-plane): apply pad_to_seqlen to ALL 2D+ tensors in materialize
ZhiyuLi-Nvidia 977a931
test(data-plane): add missing DataPlaneConfig keys to _TQ_CFG in chao…
ZhiyuLi-Nvidia 511bd5b
test(data-plane): remove storage-actor-kill chaos test
ZhiyuLi-Nvidia 8504cc8
fix(data-plane): exclude MESSAGE_LOG_BULK_FIELDS from FP8 calib request
ZhiyuLi-Nvidia 6495ce8
test(data-plane): pin MESSAGE_LOG_BULK_FIELDS in DP_CALIB_EXCLUDED_FI…
ZhiyuLi-Nvidia a8d3816
test(data-plane): add missing DataPlaneConfig keys to tq_lifecycle fi…
ZhiyuLi-Nvidia b8b42d7
feat(data-plane): route FP8 KV scales through TQ (sync first cut)
ZhiyuLi-Nvidia 7740ba0
Revert "feat(data-plane): route FP8 KV scales through TQ (sync first …
ZhiyuLi-Nvidia 0f82cbf
refactor(data-plane): flip calib filter to positive include-list
ZhiyuLi-Nvidia b717610
test(data-plane): add realistic-shape rollout fixtures + cross-file d…
ZhiyuLi-Nvidia 4b8c9fe
build: refresh uv.lock
ZhiyuLi-Nvidia 7f5cfa9
chore(test): apply ruff isort + blank-line fixes
ZhiyuLi-Nvidia cacb612
fix(data-plane): override _is_writeback_leader in DTensor V1 worker
ZhiyuLi-Nvidia 4d61ea3
test(data-plane): sync grpo_math_1B reference config buffer sizes
ZhiyuLi-Nvidia a99aa6a
test(data-plane): slim test_architecture_invariants to 2 behavioral t…
ZhiyuLi-Nvidia a3ec982
undo unnecessary change
ZhiyuLi-Nvidia dce67a4
build: resolve mooncake-transfer-engine-cuda13 from PyPI instead of G…
ZhiyuLi-Nvidia 42be993
perf(data-plane): skip Ray return of per-token logprob tensors
ZhiyuLi-Nvidia 17fbf5c
perf(data-plane): worker-side suppress per-token logprob Ray return
ZhiyuLi-Nvidia 6b1dff8
refactor(data-plane): drop aggregator path now that logprob workers r…
ZhiyuLi-Nvidia 24cb5bf
refactor(data-plane): make Ray worker_coords the single source of tru…
ZhiyuLi-Nvidia abcea99
Revert "refactor(data-plane): make Ray worker_coords the single sourc…
ZhiyuLi-Nvidia 5815041
fix(data-plane): unify leader-gate on NamedSharding.is_axis_zero; fix…
ZhiyuLi-Nvidia fedc88e
chore: ruff auto-fix and ruff-format pass post-rebase
ZhiyuLi-Nvidia 254984e
build: refresh uv.lock against post-rebase pyproject.toml
ZhiyuLi-Nvidia 92499b3
undo unnecessary change
ZhiyuLi-Nvidia