Refactor: extract sim host shared base + c_api glue (-2800 lines)#932
Conversation
📝 WalkthroughWalkthroughThis PR refactors a2a3 and a5 simulator device runners to share common infrastructure via a new polymorphic base class, consolidating duplicated C API entry points and device-management logic while preserving architecture-specific binary loading and initialization hooks. ChangesSimulator Device Runner Architecture Refactoring
Sequence Diagram(s)sequenceDiagram
participant ChipWorker
participant c_api_shared as c_api_shared.cpp
participant DeviceRunner as DeviceRunner (A2A3/A5)
participant SimDeviceRunnerBase
participant Runtime
participant Kernel
ChipWorker->>c_api_shared: create_device_context()
c_api_shared-->>ChipWorker: DeviceContextHandle
ChipWorker->>c_api_shared: simpler_init(ctx, device_id, aicpu_bin, aicore_bin)
c_api_shared->>c_api_shared: set thread-local runner
c_api_shared->>SimDeviceRunnerBase: attach_current_thread(device_id)
c_api_shared->>SimDeviceRunnerBase: set_executors(aicpu_bin, aicore_bin)
ChipWorker->>c_api_shared: prepare_callable(ctx, id, orch_so)
c_api_shared->>SimDeviceRunnerBase: register_callable(id, orch_data, ...)
ChipWorker->>c_api_shared: run_prepared(ctx, runtime, id, args, block_dim, ...)
c_api_shared->>c_api_shared: placement-new Runtime
c_api_shared->>SimDeviceRunnerBase: bind_callable_to_runtime(runtime, id)
c_api_shared->>SimDeviceRunnerBase: set flags (swimlane, pmu, dep_gen, ...)
c_api_shared->>DeviceRunner: run(runtime, block_dim, ...)
DeviceRunner->>DeviceRunner: ensure_binaries_loaded() [arch-specific]
DeviceRunner->>DeviceRunner: init_l2_swimlane() [arch-specific]
DeviceRunner->>Kernel: spawn AICPU/AICore threads
Kernel->>Kernel: execute kernels
DeviceRunner-->>c_api_shared: timing info
c_api_shared->>c_api_shared: populate PtoRunTiming
c_api_shared-->>ChipWorker: int result
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request refactors the simulation host runtime for both the a2a3 and a5 architectures by extracting shared logic—including the DeviceRunner base class, C API wrappers, and memory allocation—into a common directory. This reduces code duplication and aligns the simulation platform with the onboard refactoring pattern. Feedback on the changes points out a potential null pointer dereference in the shared C API implementation where current_runner() is dereferenced without a defensive null check, which could lead to crashes in unbound threads.
Mirror the onboard host refactor (hw-native-sys#880…hw-native-sys#928) on the sim path. Pulls ~2800 duplicate lines out of src/{a2a3,a5}/platform/sim/host/ into a shared SimDeviceRunnerBase + c_api_shared.cpp + memory_allocator.cpp under src/common/platform/sim/host/. Per-arch DeviceRunner keeps only the bits that genuinely differ between a2a3 and a5 sim: - aicore_execute signature (a5 has extra aicore_pmu_ring_addrs arg) - dlsym'd function-pointer table (a2a3 has dep_gen / pmu_reg_addrs / aicore_rotation_table setters; a5 doesn't) - init_* alloc strategy (a2a3 uses mem_alloc_ via captured lambdas; a5 uses std::malloc via prof_alloc_cb static functions — preserved as-is, no behavior change) - finalize() collector semantics (a2a3 releases shm to mem_alloc_ per-run; a5 stop()s per-run, full finalize at run-end via prof_free_cb — preserved as-is) - run() middle (dep_gen gating on a2a3 only; different SIM_REG_* constants) SimDeviceRunnerBase hosts the byte-identical methods + their state: setup_static_arena, acquire_pooled_*, create_thread, attach_current_thread, allocate_tensor / free_tensor / copy_*, register_callable[_host_orch], unregister_callable, has_callable, bind_callable_to_runtime, prepare_orch_so, upload_chip_callable_buffer, print_handshake_results, release_callable_state, ensure_device_initialized, set_*_enabled / output_prefix accessors, last_device_wall_ns, the shared mem_alloc_ / gm_*_arena_ / callable maps / chip_callable buffer pool / collector instances / kernel_args_ / device_wall ptr / log/dlopen counters. Mechanical-fix: setup_static_arena standardized to "release all arenas on any failure" (matches a2a3 sim + onboard PR hw-native-sys#922). a5 sim had been keeping earlier-committed peers alive on later-region failure; the new common impl drops that to match the onboard invariant. Latent-fix carried over from onboard hw-native-sys#928: c_api_shared's run_prepared wraps the placement-new'd Runtime in RAIIScopeGuard so its dtor fires on every exit path (manual r->~Runtime() in the prior sim c_api was bypassed by catch(...) on exception). Polymorphism via SimDeviceRunnerBase virtuals: ~SimDeviceRunnerBase (public), run(), finalize(), set_dep_gen_enabled() (default no-op, a2a3 overrides). c_api_shared.cpp works through SimDeviceRunnerBase * and dispatches via those virtuals — same pattern as the onboard hw-native-sys#928 split. Per-arch pto_runtime_c_api.cpp shrinks from ~420 lines to ~55 (just create_device_context + ACL stubs). memory_allocator.cpp was byte-identical, deleted from both arch subdirs and lives once in common/. Both arches built clean. nm -D verifies all 17 ChipWorker dlsym targets exported on both sim libhost_runtime.so. ST passes 38/38 on a2a3sim L1+L2 (devices 0,1) and 22/22 on a5sim L1+L2 (devices 0,1); examples spot-checked (scalar_data_test, vector_example, benchmark_bgemm on a2a3sim; vector_example, bgemm on a5sim). Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (7)
src/a5/platform/sim/host/device_runner.cpp (3)
533-543:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
stop_collectors()still misses the scope-stats collector.
run()'s early-return guard now relies on this helper, but this implementation never stopsscope_stats_collector_even though it can be started earlier in the same run. Any failure afterinit_scope_stats()and before the normal success-path teardown leaves that collector running.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a5/platform/sim/host/device_runner.cpp` around lines 533 - 543, stop_collectors() fails to stop scope_stats_collector_, leaving it running on early returns; update DeviceRunner::stop_collectors() to check scope_stats_collector_.is_initialized() and call scope_stats_collector_.stop() so any collector started by init_scope_stats() is properly torn down (this addresses the early-return behavior in run()).
507-520:⚠️ Potential issue | 🟠 Major | ⚡ Quick winFree
device_wall_dev_ptr_before tearing down the allocators.
device_wall_dev_ptr_is allocated viaallocate_tensor()inrun(), butfinalize()releases the arenas and callsmem_alloc_.finalize()beforefree_tensor(device_wall_dev_ptr_). That leaves the last tensor allocation dependent on already-torn-down allocator state.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a5/platform/sim/host/device_runner.cpp` around lines 507 - 520, The teardown frees arenas and finalizes mem_alloc_ before freeing device_wall_dev_ptr_, which was allocated in run() via allocate_tensor(); move the free so that free_tensor(device_wall_dev_ptr_) (and nulling device_wall_dev_ptr_) is called before gm_heap_arena_.release(), gm_sm_arena_.release(), runtime_arena_pool_.release(), and mem_alloc_.finalize(); update the cleanup sequence in the DeviceRunner destructor/cleanup block (the code containing gm_heap_arena_.release(), mem_alloc_.finalize(), clear_cpu_sim_shared_storage(), and device_wall_dev_ptr_ handling) to free the tensor first to avoid using torn-down allocator state.
56-149:⚠️ Potential issue | 🟠 Major | ⚡ Quick winClean up partial SO loads before returning failure.
Any
return -1aftercreate_temp_so_file()ordlopen()leaves the just-created path/handle live. In the AICPU branch that also leavesaicpu_so_loaded_ == false, so a retry can create another temp SO on top of a half-loaded runner state. Please add local rollback for the current SO before each failure exit.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a5/platform/sim/host/device_runner.cpp` around lines 56 - 149, ensure_binaries_loaded can return early after create_temp_so_file/dlopen failures leaving temp paths/handles and function pointers in a half-initialized state; add rollback cleanup before every return -1: for the AICPU branch ensure that on any failure after create_temp_so_file() you remove aicpu_so_path_ (std::remove and clear), dlclose(aicpu_so_handle_) if non-null and set aicpu_so_handle_ = nullptr, reset any loaded function pointers (aicpu_execute_func_, set_platform_regs_func_, etc.) and keep aicpu_so_loaded_ == false; when aicpu_so_loaded_ is set true only after all symbols load successfully. Do the same for the AICore branch: if create_temp_so_file() or dlopen() or dlsym fails, remove and clear aicore_so_path_, dlclose aicore_so_handle_ if set and null it, reset aicore_execute_func_ (and any helper pointers) before returning.src/a2a3/platform/sim/host/device_runner.cpp (4)
262-306:⚠️ Potential issue | 🟠 Major | ⚡ Quick winInstall the collector cleanup guard before the first
init_*call.
perf_cleanupis created after all profiling initializers run. Ifinit_l2_swimlane()succeeds and a laterinit_*fails,run()returns before the guard exists, so the earlier collectors stay initialized and keep their shared memory until a laterfinalize().🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/platform/sim/host/device_runner.cpp` around lines 262 - 306, Create the RAII cleanup guard (perf_cleanup) before any of the init_* calls so that finalize_collectors() is guaranteed on early returns; move the RAIIScopeGuard construction using finalize_collectors() to precede init_l2_swimlane(), init_tensor_dump(), init_pmu(), init_dep_gen(), and init_scope_stats() (or otherwise ensure perf_cleanup is created immediately at run() start) so any partial initialization is cleaned up if a subsequent init_* returns non-zero.
404-418:⚠️ Potential issue | 🟠 Major | ⚡ Quick winStop collector threads on the non-zero AICPU return path.
The collectors are started before kernel launch, but the
runtime_rc != 0branch returns before anystop()call.perf_cleanuponly finalizes/free-backs collector memory; it does not join the mgmt/poll threads, so this path can leave them running against freed buffers.Also applies to: 479-483
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/platform/sim/host/device_runner.cpp` around lines 404 - 418, The non-zero AICPU return path returns before joining/stopping collector threads, risking live threads accessing freed buffers; ensure you call the matching stop() on each collector you started (e.g. l2_swimlane_collector_.stop(), dump_collector_.stop(), pmu_collector_.stop(), dep_gen_collector_.stop(), scope_stats_collector_.stop()) in the runtime_rc != 0 error branch (the same fix also apply to the other early-return block around the 479-483 area) before invoking perf_cleanup()/return so all mgmt/poll threads are joined and collectors are cleanly shut down.
64-108:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRollback partially loaded temp SO state on loader failures.
Any
dlopen/dlsymfailure here returns with temp-path and handle state still populated. The AICPU path is the worst case:aicpu_so_loaded_stays false, so the next retry will create another temp SO and leak the previous handle/path instead of cleaning it up first.Also applies to: 131-150
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/platform/sim/host/device_runner.cpp` around lines 64 - 108, When dlopen or any dlsym (via the load_sym lambda) fails the code currently returns leaving aicpu_so_handle_, aicpu_so_path_ and related symbols populated and aicpu_so_loaded_ false, leaking the temp SO and handle; modify the failure paths so that on any error you dlclose(aicpu_so_handle_) if non-null, unlink/remove the temp file at aicpu_so_path_ (or call the corresponding simpler::common::sim_host cleanup), reset aicpu_so_handle_ and aicpu_so_path_ to empty/null, clear any partially-set function pointers (e.g. aicpu_execute_func_, set_platform_regs_func_, etc.), and ensure aicpu_so_loaded_ remains false; apply the same cleanup logic for the other similar block referenced (lines 131-150) so no handle/path leaks occur on partial load failures.
716-736:⚠️ Potential issue | 🟠 Major | ⚡ Quick winClear every profiling base pointer when collectors are finalized.
finalize_collectors()only zeroeskernel_args_.scope_stats_data_base. After a successful run, the other profiling bases still point at freed shared memory, and the nextrun()forwards those stale addresses back throughset_platform_*even when that feature is disabled.Proposed fix
void DeviceRunner::finalize_collectors() { auto free_cb = [this](void *dev_ptr) -> int { return mem_alloc_.free(dev_ptr); }; if (l2_swimlane_collector_.is_initialized()) { l2_swimlane_collector_.finalize(nullptr, free_cb); + kernel_args_.l2_swimlane_data_base = 0; + kernel_args_.l2_swimlane_aicore_rotation_table = 0; } if (dump_collector_.is_initialized()) { dump_collector_.finalize(nullptr, free_cb); + kernel_args_.dump_data_base = 0; } if (pmu_collector_.is_initialized()) { pmu_collector_.finalize(nullptr, free_cb); + kernel_args_.pmu_data_base = 0; } if (dep_gen_collector_.is_initialized()) { dep_gen_collector_.finalize(nullptr, free_cb); + kernel_args_.dep_gen_data_base = 0; } if (scope_stats_collector_.is_initialized()) { scope_stats_collector_.finalize(nullptr, free_cb); kernel_args_.scope_stats_data_base = 0; } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/platform/sim/host/device_runner.cpp` around lines 716 - 736, In finalize_collectors(), after finalizing each collector (l2_swimlane_collector_, dump_collector_, pmu_collector_, dep_gen_collector_, scope_stats_collector_), clear the corresponding kernel_args_ base pointers so they no longer point at freed shared memory; specifically set kernel_args_.l2_swimlane_data_base, kernel_args_.dump_data_base, kernel_args_.pmu_data_base, kernel_args_.dep_gen_data_base and kernel_args_.scope_stats_data_base to 0 (in the same places where you currently set scope_stats_data_base) to ensure stale addresses are not forwarded on subsequent run() calls.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/common/platform/sim/host/c_api_shared.cpp`:
- Around line 213-233: After attach_current_thread succeeds, exceptions during
building the aicpu/aicore vectors or in runner->set_executors() currently return
-1 without releasing the acquired sim device; change the error path so it calls
the runner cleanup (e.g. runner->finalize_device() or the appropriate release
method on SimDeviceRunnerBase) before returning, ensuring
attach_current_thread/SimDeviceRunnerBase state is rolled back when
set_executors throws; update the catch(...) block after the vector/set_executors
section to call that finalize/release method and then return the error.
- Around line 186-193: finalize_device currently queries the caller thread's
bound device (pto_cpu_sim_get_bound_device) and may release the wrong device;
instead obtain the device id recorded on the runner and bind/release that device
around the runner->finalize() call. Change finalize_device to cast ctx to
SimDeviceRunnerBase*, read the stored device id (e.g. runner->device_id or a
getter on SimDeviceRunnerBase), call pto_cpu_sim_bind_device(dev) if dev >= 0,
invoke runner->finalize(), and then call pto_cpu_sim_release_device(dev) in a
finally/cleanup path; remove reliance on pto_cpu_sim_get_bound_device so
finalize uses the context's device regardless of the caller thread.
- Line 150: The ABI currently only exports get_runtime_size() and requires
callers to placement-new a Runtime into caller-owned storage, which can be
insufficiently aligned because Runtime contains a 64-byte-aligned Handshake; add
an exported function get_runtime_alignment() that returns alignof(Runtime) (or
otherwise change the API so the callee returns/allocates properly aligned
storage) and update documentation/use sites (e.g., run_prepared, the
placement-new call in c_api_shared.cpp and the caller in ChipWorker that uses
runtime_buf_) so callers allocate runtime_buf_ with the returned alignment (or
obtain aligned memory from the API) to ensure placement-new into runtime meets
alignof(Runtime).
In `@src/common/platform/sim/host/device_runner_base.cpp`:
- Around line 52-53: The temp-DSO file is currently made
world-readable/executable with fchmod(fd, 0755), exposing extracted
executor/kernel/AICPU SOs; change the permission to owner-only by using
fchmod(fd, 0700) where the temp file is created (the mkstemp -> fchmod sequence
in device_runner_base.cpp) so dlopen still works but the temp-DSO is not
accessible to other local users; update any related comments to reflect
owner-only permissions.
---
Outside diff comments:
In `@src/a2a3/platform/sim/host/device_runner.cpp`:
- Around line 262-306: Create the RAII cleanup guard (perf_cleanup) before any
of the init_* calls so that finalize_collectors() is guaranteed on early
returns; move the RAIIScopeGuard construction using finalize_collectors() to
precede init_l2_swimlane(), init_tensor_dump(), init_pmu(), init_dep_gen(), and
init_scope_stats() (or otherwise ensure perf_cleanup is created immediately at
run() start) so any partial initialization is cleaned up if a subsequent init_*
returns non-zero.
- Around line 404-418: The non-zero AICPU return path returns before
joining/stopping collector threads, risking live threads accessing freed
buffers; ensure you call the matching stop() on each collector you started (e.g.
l2_swimlane_collector_.stop(), dump_collector_.stop(), pmu_collector_.stop(),
dep_gen_collector_.stop(), scope_stats_collector_.stop()) in the runtime_rc != 0
error branch (the same fix also apply to the other early-return block around the
479-483 area) before invoking perf_cleanup()/return so all mgmt/poll threads are
joined and collectors are cleanly shut down.
- Around line 64-108: When dlopen or any dlsym (via the load_sym lambda) fails
the code currently returns leaving aicpu_so_handle_, aicpu_so_path_ and related
symbols populated and aicpu_so_loaded_ false, leaking the temp SO and handle;
modify the failure paths so that on any error you dlclose(aicpu_so_handle_) if
non-null, unlink/remove the temp file at aicpu_so_path_ (or call the
corresponding simpler::common::sim_host cleanup), reset aicpu_so_handle_ and
aicpu_so_path_ to empty/null, clear any partially-set function pointers (e.g.
aicpu_execute_func_, set_platform_regs_func_, etc.), and ensure aicpu_so_loaded_
remains false; apply the same cleanup logic for the other similar block
referenced (lines 131-150) so no handle/path leaks occur on partial load
failures.
- Around line 716-736: In finalize_collectors(), after finalizing each collector
(l2_swimlane_collector_, dump_collector_, pmu_collector_, dep_gen_collector_,
scope_stats_collector_), clear the corresponding kernel_args_ base pointers so
they no longer point at freed shared memory; specifically set
kernel_args_.l2_swimlane_data_base, kernel_args_.dump_data_base,
kernel_args_.pmu_data_base, kernel_args_.dep_gen_data_base and
kernel_args_.scope_stats_data_base to 0 (in the same places where you currently
set scope_stats_data_base) to ensure stale addresses are not forwarded on
subsequent run() calls.
In `@src/a5/platform/sim/host/device_runner.cpp`:
- Around line 533-543: stop_collectors() fails to stop scope_stats_collector_,
leaving it running on early returns; update DeviceRunner::stop_collectors() to
check scope_stats_collector_.is_initialized() and call
scope_stats_collector_.stop() so any collector started by init_scope_stats() is
properly torn down (this addresses the early-return behavior in run()).
- Around line 507-520: The teardown frees arenas and finalizes mem_alloc_ before
freeing device_wall_dev_ptr_, which was allocated in run() via
allocate_tensor(); move the free so that free_tensor(device_wall_dev_ptr_) (and
nulling device_wall_dev_ptr_) is called before gm_heap_arena_.release(),
gm_sm_arena_.release(), runtime_arena_pool_.release(), and
mem_alloc_.finalize(); update the cleanup sequence in the DeviceRunner
destructor/cleanup block (the code containing gm_heap_arena_.release(),
mem_alloc_.finalize(), clear_cpu_sim_shared_storage(), and device_wall_dev_ptr_
handling) to free the tensor first to avoid using torn-down allocator state.
- Around line 56-149: ensure_binaries_loaded can return early after
create_temp_so_file/dlopen failures leaving temp paths/handles and function
pointers in a half-initialized state; add rollback cleanup before every return
-1: for the AICPU branch ensure that on any failure after create_temp_so_file()
you remove aicpu_so_path_ (std::remove and clear), dlclose(aicpu_so_handle_) if
non-null and set aicpu_so_handle_ = nullptr, reset any loaded function pointers
(aicpu_execute_func_, set_platform_regs_func_, etc.) and keep aicpu_so_loaded_
== false; when aicpu_so_loaded_ is set true only after all symbols load
successfully. Do the same for the AICore branch: if create_temp_so_file() or
dlopen() or dlsym fails, remove and clear aicore_so_path_, dlclose
aicore_so_handle_ if set and null it, reset aicore_execute_func_ (and any helper
pointers) before returning.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: b65b5ee4-92bf-41a6-a86e-609deb7aa17d
📒 Files selected for processing (13)
src/a2a3/platform/sim/host/CMakeLists.txtsrc/a2a3/platform/sim/host/device_runner.cppsrc/a2a3/platform/sim/host/device_runner.hsrc/a2a3/platform/sim/host/pto_runtime_c_api.cppsrc/a5/platform/sim/host/CMakeLists.txtsrc/a5/platform/sim/host/device_runner.cppsrc/a5/platform/sim/host/device_runner.hsrc/a5/platform/sim/host/memory_allocator.cppsrc/a5/platform/sim/host/pto_runtime_c_api.cppsrc/common/platform/sim/host/c_api_shared.cppsrc/common/platform/sim/host/device_runner_base.cppsrc/common/platform/sim/host/device_runner_base.hsrc/common/platform/sim/host/memory_allocator.cpp
💤 Files with no reviewable changes (1)
- src/a5/platform/sim/host/memory_allocator.cpp
Summary
Mirror the onboard host refactor (#880…#928) on the sim path. Pulls ~2800 duplicate lines out of
src/{a2a3,a5}/platform/sim/host/into a sharedSimDeviceRunnerBase+c_api_shared.cpp+memory_allocator.cppundersrc/common/platform/sim/host/. Per-archDeviceRunnerkeeps only the bits that genuinely differ between a2a3 and a5 sim.What moved to
src/common/platform/sim/host/device_runner_base.{h,cpp}— shared base class hosting:setup_static_arena,acquire_pooled_*create_thread,attach_current_thread,ensure_device_initializedallocate_tensor/free_tensor/copy_to_device/copy_from_deviceregister_callable[_host_orch],unregister_callable,has_callable,bind_callable_to_runtimeprepare_orch_so,upload_chip_callable_bufferprint_handshake_results,release_callable_statemem_alloc_/gm_*_arena_/ callable maps / chip_callable buffer pool / collector instances /kernel_args_/ device_wall ptr / log/dlopen countersc_api_shared.cpp— TSD glue + static wrappers + public C ABI (simpler_init,prepare_callable,run_prepared,device_*_ctx,finalize_device, etc.) all written againstSimDeviceRunnerBase *. Same polymorphic pattern as onboard PR Refactor: extract pto_runtime_c_api shared glue into common (-303 lines) #928.memory_allocator.cpp— was byte-identical between arches.What stays per-arch
aicore_executesignature — a5 has an extraaicore_pmu_ring_addrsargpmu_reg_addrs/aicore_rotation_tablesetters; a5 doesn'tinit_*alloc strategy — a2a3 usesmem_alloc_via captured lambdas; a5 usesstd::mallocviaprof_alloc_cbstatics. Preserved as-is; no behavior change in this PR.finalize()collector semantics — a2a3 releases shm tomem_alloc_per-run; a5stop()s per-run, full finalize at run-end viaprof_free_cb. Preserved as-is.run()middle — dep_gen gating on a2a3 only; differentSIM_REG_*constantsPer-arch
pto_runtime_c_api.cppShrinks from ~420 lines to ~55 — just
create_device_context(which needs the concreteDeviceRunnertype) + ACL no-op stubs.Mechanical fixes carried with the move
setup_static_arenastandardized to "release all arenas on any failure" (matches a2a3 sim + onboard PR Refactor: extract setup_static_arena + launch_aicore_kernel into base #922). a5 sim had been keeping earlier-committed peers alive on later-region failure; the common impl drops that to match the onboard invariant.c_api_shared'srun_preparedwraps the placement-new'dRuntimeinRAIIScopeGuardso its dtor fires on every exit path (manualr->~Runtime()in the prior sim c_api was bypassed bycatch(...)on exception).Polymorphism
SimDeviceRunnerBasevirtuals:~SimDeviceRunnerBase(public),run,finalize,set_dep_gen_enabled(default no-op, a2a3 overrides).c_api_shared.cppworks throughSimDeviceRunnerBase *and dispatches via those virtuals — same pattern as onboard #928.Test plan
nm -Dverifies all 17 ChipWorker dlsym targets exported on both sim libhost_runtime.so