Refactor: extract setup_static_arena + launch_aicore_kernel into base#922
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (6)
📝 WalkthroughWalkthroughMethods ChangesConsolidation of arena and kernel launch logic into DeviceRunnerBase
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request refactors the DeviceRunner implementations for the a2a3 and a5 platforms by moving the common setup_static_arena and launch_aicore_kernel methods, along with their cached arena size member variables, into the shared DeviceRunnerBase class. This consolidation normalizes the rollback behavior on arena commit failures. A review comment points out a potential issue where sequential failures during re-initialization could lead to an incomplete rollback of previously committed arenas, and suggests using a try-catch block to guarantee a clean release of all resources.
e44b9f3 to
eac2ddb
Compare
Move two more mechanical-clean items into DeviceRunnerBase (-104 net lines): - `setup_static_arena`: bodies were near-identical; rollback-on-failure semantic normalized to a5's (release earlier committed peers when a later region fails). a2a3's prior "keep peers committed on failure" behavior is dropped — that pattern silently masked errors by handing pooled pointers back to the caller alongside a failure rc. - `launch_aicore_kernel`: byte-identical body. Lazy-registers the AICore binary via `rtRegisterAllKernel`, caches the handle, builds rtArgsEx_t, launches via `rtKernelLaunchWithHandleV2`. - `cached_gm_heap_size_`, `cached_gm_sm_size_`, `cached_runtime_arena_size_` moved alongside `setup_static_arena`. Skipped (per-arch divergence too tangled for clean shared body): - `init_l2_perf` / `init_tensor_dump` / `init_pmu` / `init_scope_stats`: set arch-specific KernelArgs fields (a2a3 sets `aicore_ring_addr`; a5 sets `aicore_l2_perf_ring_addrs` / `aicore_pmu_ring_addrs`). - `finalize_collectors`: a5 only stops collectors; a2a3 fully finalizes with halHostUnregister + dep_gen. Both arches built clean. a2a3 onboard smoke 8/8 in 15s.
) Three more small helpers + one tweak to finalize_common(). Net -1 line, but real consolidation: each chunk that used to be duplicated in two arches now lives once. New base helpers (called from each arch's run()): - init_runtime_args_with_metadata(Runtime&) — H2D the Runtime struct via kernel_args_.init_runtime_args, then publish log_level / log_info_v / device_id into KernelArgs from HostLogger. - start_shared_collectors_for_run() — builds the thread_factory lambda and calls .start() on the four shared collectors (l2_perf, dump, pmu, scope_stats) gated by their enable flags. - teardown_shared_collectors_after_run() — the post-sync stop+reconcile+export sequence for those four. l2_perf adds read_phase_header_metadata + export_swimlane_json; dump exports dump files; pmu just reconciles; scope_stats writes JSONL. a2a3 still inlines its dep_gen collector start + teardown (with dep_gen_replay_emit_deps_json) right after the shared helpers — that's a2a3-only. finalize_common() now resets cached_gm_heap_size_ / cached_gm_sm_size_ / cached_runtime_arena_size_ alongside the other identity state. These fields live on base (moved in PR #922) so resetting them there is the natural home; each arch's finalize() no longer needs the three explicit = 0 lines after calling finalize_common(). Both arches built clean. a2a3 onboard smoke (dummy_task, alternating_matmul_add, prepared_callable suite, spmd_basic) — 9/9 passed in 19s. Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
Summary
Two more mechanical-clean items extracted into
DeviceRunnerBase. Net -104 lines.Moved to base
setup_static_arena: bodies were near-identical between a2a3 and a5. The only difference was the rollback-on-failure semantic, which is now normalized to a5's pattern: when a later region fails to commit, earlier committed peers are released.launch_aicore_kernel: byte-identical body. Lazy-registers the AICore binary viartRegisterAllKernel, caches the handle, buildsrtArgsEx_t, launches viartKernelLaunchWithHandleV2. Whetherk_argsis host- or device-resident is decided by the caller'srun().cached_gm_heap_size_/cached_gm_sm_size_/cached_runtime_arena_size_moved alongsidesetup_static_arena.Behavior normalization (call out)
a2a3's
setup_static_arenapreviously kept earlier committed peers alive when a later region failed, on the rationale that "pooled pointers previously returned to callers must stay valid even if this resize attempt aborts." That rationale leaks: if the caller already committed pointers to earlier regions, they keep working; but the caller also gets a failure rc, so it cannot rely on the layout. The earlier-peers-stay-alive behavior masks the failure — the caller might happily continue using stale pointers without realizing the configured runtime arena (or whatever later region failed) never got committed.a5's "release earlier peers on later failure" semantic is the safer default: failure means failure across the board, caller re-tries with a different layout or bails out. The current
setup_static_arenacallers all retry the full layout when it fails, so this is a no-op behavioral change in practice — but it removes a footgun.Skipped (per-arch divergence too tangled)
init_l2_perf/init_tensor_dump/init_pmu/init_scope_stats: set arch-specificKernelArgsfields (a2a3:aicore_ring_addr; a5:aicore_l2_perf_ring_addrs+aicore_pmu_ring_addrs). Unifying needs multiple virtual hooks per method — the virtual overhead exceeds the savings.finalize_collectors: a5 onlystop()s collectors (per-run cleanup); a2a3 fullyfinalize()s withhalHostUnregister+dep_gen. Semantically different functions sharing a name.Verification
dummy_task,alternating_matmul_add,prepared_callablesuite) — 8/8 passed in 15s.Test plan