Skip to content

Refactor: extract sim host shared base + c_api glue (-2800 lines)#932

Merged
ChaoWao merged 1 commit into
hw-native-sys:mainfrom
hw-native-sys-bot:sim-host-common
May 31, 2026
Merged

Refactor: extract sim host shared base + c_api glue (-2800 lines)#932
ChaoWao merged 1 commit into
hw-native-sys:mainfrom
hw-native-sys-bot:sim-host-common

Conversation

@hw-native-sys-bot
Copy link
Copy Markdown
Collaborator

Summary

Mirror the onboard host refactor (#880#928) on the sim path. Pulls ~2800 duplicate lines out of src/{a2a3,a5}/platform/sim/host/ into a shared SimDeviceRunnerBase + c_api_shared.cpp + memory_allocator.cpp under src/common/platform/sim/host/. Per-arch DeviceRunner keeps only the bits that genuinely differ between a2a3 and a5 sim.

What moved to src/common/platform/sim/host/

  • device_runner_base.{h,cpp} — shared base class hosting:
    • setup_static_arena, acquire_pooled_*
    • create_thread, attach_current_thread, ensure_device_initialized
    • allocate_tensor / free_tensor / copy_to_device / copy_from_device
    • register_callable[_host_orch], unregister_callable, has_callable, bind_callable_to_runtime
    • prepare_orch_so, upload_chip_callable_buffer
    • print_handshake_results, release_callable_state
    • Shared mem_alloc_ / gm_*_arena_ / callable maps / chip_callable buffer pool / collector instances / kernel_args_ / device_wall ptr / log/dlopen counters
  • c_api_shared.cpp — TSD glue + static wrappers + public C ABI (simpler_init, prepare_callable, run_prepared, device_*_ctx, finalize_device, etc.) all written against SimDeviceRunnerBase *. Same polymorphic pattern as onboard PR Refactor: extract pto_runtime_c_api shared glue into common (-303 lines) #928.
  • memory_allocator.cpp — was byte-identical between arches.

What stays per-arch

  • aicore_execute signature — a5 has an extra aicore_pmu_ring_addrs arg
  • dlsym'd function-pointer table — a2a3 has dep_gen / pmu_reg_addrs / aicore_rotation_table setters; a5 doesn't
  • init_* alloc strategy — a2a3 uses mem_alloc_ via captured lambdas; a5 uses std::malloc via prof_alloc_cb statics. Preserved as-is; no behavior change in this PR.
  • finalize() collector semantics — a2a3 releases shm to mem_alloc_ per-run; a5 stop()s per-run, full finalize at run-end via prof_free_cb. Preserved as-is.
  • run() middle — dep_gen gating on a2a3 only; different SIM_REG_* constants

Per-arch pto_runtime_c_api.cpp

Shrinks from ~420 lines to ~55 — just create_device_context (which needs the concrete DeviceRunner type) + ACL no-op stubs.

Mechanical fixes carried with the move

Polymorphism

SimDeviceRunnerBase virtuals: ~SimDeviceRunnerBase (public), run, finalize, set_dep_gen_enabled (default no-op, a2a3 overrides). c_api_shared.cpp works through SimDeviceRunnerBase * and dispatches via those virtuals — same pattern as onboard #928.

Test plan

  • Both arches built clean (a2a3sim + a5sim libhost_runtime.so)
  • nm -D verifies all 17 ChipWorker dlsym targets exported on both sim libhost_runtime.so
  • ST passes 38/38 on a2a3sim L1+L2 (devices 0,1)
  • ST passes 22/22 on a5sim L1+L2 (devices 0,1)
  • Examples spot-checked: scalar_data_test, vector_example, benchmark_bgemm on a2a3sim; vector_example, bgemm on a5sim
  • CI green

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 31, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

This PR refactors a2a3 and a5 simulator device runners to share common infrastructure via a new polymorphic base class, consolidating duplicated C API entry points and device-management logic while preserving architecture-specific binary loading and initialization hooks.

Changes

Simulator Device Runner Architecture Refactoring

Layer / File(s) Summary
Shared polymorphic base class definition
src/common/platform/sim/host/device_runner_base.h
Introduces SimDeviceRunnerBase as a shared abstract base class exposing pure virtual run()/finalize() lifecycle methods, default set_dep_gen_enabled(), and public utility methods for arena setup, pooled arena acquisition, thread creation/binding, tensor allocation/copy, callable registration/binding/unregistration, chip-callable buffer upload, and diagnostic getters/setters. Declares a utility function create_temp_so_file for materializing in-memory executor binaries to temporary files.
Shared base implementation
src/common/platform/sim/host/device_runner_base.cpp
Implements SimDeviceRunnerBase methods: arena setup with idempotency and rollback, pooled arena accessors, thread/device lifecycle binding via thread-local context storage, tensor memory operations using internal allocators, callable registration with ELF-hash-based deduplication and refcounting, chip-callable buffer upload with temp SO creation and dlopen/dlsym address patching, handshake logging, and state release for cleanup.
Shared C API glue layer
src/common/platform/sim/host/c_api_shared.cpp
Implements full C ABI runtime entry points: per-thread runner binding via pthread_key_t, device context lifecycle, tensor memory operations, initialization (simpler_init), callable preparation (prepare_callable) with orch SO registration or host dlopen binding, callable execution (run_prepared) with runtime construction and flag wiring, unregister and dlopen-count query functions, all with exception-to-error-code conversion and cleanup guards.
A2A3 DeviceRunner refactoring
src/a2a3/platform/sim/host/device_runner.h, device_runner.cpp
Refactors DeviceRunner to subclass SimDeviceRunnerBase, narrowing public API to lifecycle methods and set_dep_gen_enabled(). Implements A2A3-specific binary loading (AICPU/AICore temp-SO staging with expanded load_sym lambda covering L2 swimlane rotation-table, dep_gen, and scope-stats setters) and diagnostic initialization hooks via private overrides. Removes prior large thread/memory/callable-registry surface now inherited from base.
A2A3 C API stubs and build integration
src/a2a3/platform/sim/host/pto_runtime_c_api.cpp, CMakeLists.txt
Strips pto_runtime_c_api.cpp to only create_device_context and ACL/comm no-op stubs, removing device-memory/lifecycle/callable implementations now in shared C API. Updates CMakeLists.txt to include common platform sim host directory and link shared sources (device_runner_base.cpp, c_api_shared.cpp, memory_allocator.cpp).
A5 DeviceRunner refactoring
src/a5/platform/sim/host/device_runner.h, device_runner.cpp
Refactors DeviceRunner to subclass SimDeviceRunnerBase, exposing only lifecycle methods while inheriting thread/memory/arena/callable management from base. Implements A5-specific binary loading, diagnostic initialization, and introduces stop_collectors() helper. Removes large prior state/method declarations from header.
A5 C API stubs, memory allocator removal, and build integration
src/a5/platform/sim/host/pto_runtime_c_api.cpp, memory_allocator.cpp, CMakeLists.txt
Strips pto_runtime_c_api.cpp to stubs, deletes A5's local memory_allocator.cpp (now shared), and updates CMakeLists.txt to include common directory and link shared sources.

Sequence Diagram(s)

sequenceDiagram
  participant ChipWorker
  participant c_api_shared as c_api_shared.cpp
  participant DeviceRunner as DeviceRunner (A2A3/A5)
  participant SimDeviceRunnerBase
  participant Runtime
  participant Kernel
  
  ChipWorker->>c_api_shared: create_device_context()
  c_api_shared-->>ChipWorker: DeviceContextHandle
  
  ChipWorker->>c_api_shared: simpler_init(ctx, device_id, aicpu_bin, aicore_bin)
  c_api_shared->>c_api_shared: set thread-local runner
  c_api_shared->>SimDeviceRunnerBase: attach_current_thread(device_id)
  c_api_shared->>SimDeviceRunnerBase: set_executors(aicpu_bin, aicore_bin)
  
  ChipWorker->>c_api_shared: prepare_callable(ctx, id, orch_so)
  c_api_shared->>SimDeviceRunnerBase: register_callable(id, orch_data, ...)
  
  ChipWorker->>c_api_shared: run_prepared(ctx, runtime, id, args, block_dim, ...)
  c_api_shared->>c_api_shared: placement-new Runtime
  c_api_shared->>SimDeviceRunnerBase: bind_callable_to_runtime(runtime, id)
  c_api_shared->>SimDeviceRunnerBase: set flags (swimlane, pmu, dep_gen, ...)
  c_api_shared->>DeviceRunner: run(runtime, block_dim, ...)
  DeviceRunner->>DeviceRunner: ensure_binaries_loaded() [arch-specific]
  DeviceRunner->>DeviceRunner: init_l2_swimlane() [arch-specific]
  DeviceRunner->>Kernel: spawn AICPU/AICore threads
  Kernel->>Kernel: execute kernels
  DeviceRunner-->>c_api_shared: timing info
  c_api_shared->>c_api_shared: populate PtoRunTiming
  c_api_shared-->>ChipWorker: int result
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • hw-native-sys/simpler#921: Main PR updates dlsym loading and calling of new AICPU set_platform_aicore_rotation_table function, directly matching rotation-table plumbing.
  • hw-native-sys/simpler#916: Main PR updates L2 swimlane executor setters and kernel-arg wiring, aligning with L2Perf→L2Swimlane symbol/ABI renames.
  • hw-native-sys/simpler#858: Main PR updates sim device runner binary-loading/unloading paths to dlsym and manage new AICPU scope-stats setter hooks required by scope_stats feature.

Poem

🐰 Architectural harmony blooms anew,
When shared foundations guide the view,
Base class wisdom, inheritance's grace,
Two arches dance in common space.
Code reborn, duplication shed,
Across all platforms, architects led.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 16.16% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main refactoring objective: extracting shared code from per-arch sim host paths and reducing duplication by ~2800 lines.
Description check ✅ Passed The description provides a comprehensive overview of what was moved to common, what remains per-arch, mechanical fixes applied, and test results validating the refactoring.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the simulation host runtime for both the a2a3 and a5 architectures by extracting shared logic—including the DeviceRunner base class, C API wrappers, and memory allocation—into a common directory. This reduces code duplication and aligns the simulation platform with the onboard refactoring pattern. Feedback on the changes points out a potential null pointer dereference in the shared C API implementation where current_runner() is dereferenced without a defensive null check, which could lead to crashes in unbound threads.

Comment thread src/common/platform/sim/host/c_api_shared.cpp
Mirror the onboard host refactor (hw-native-sys#880hw-native-sys#928) on the sim path. Pulls
~2800 duplicate lines out of src/{a2a3,a5}/platform/sim/host/ into a
shared SimDeviceRunnerBase + c_api_shared.cpp + memory_allocator.cpp
under src/common/platform/sim/host/. Per-arch DeviceRunner keeps only
the bits that genuinely differ between a2a3 and a5 sim:

- aicore_execute signature (a5 has extra aicore_pmu_ring_addrs arg)
- dlsym'd function-pointer table (a2a3 has dep_gen / pmu_reg_addrs /
  aicore_rotation_table setters; a5 doesn't)
- init_* alloc strategy (a2a3 uses mem_alloc_ via captured lambdas;
  a5 uses std::malloc via prof_alloc_cb static functions — preserved
  as-is, no behavior change)
- finalize() collector semantics (a2a3 releases shm to mem_alloc_
  per-run; a5 stop()s per-run, full finalize at run-end via
  prof_free_cb — preserved as-is)
- run() middle (dep_gen gating on a2a3 only; different SIM_REG_*
  constants)

SimDeviceRunnerBase hosts the byte-identical methods + their state:
setup_static_arena, acquire_pooled_*, create_thread,
attach_current_thread, allocate_tensor / free_tensor / copy_*,
register_callable[_host_orch], unregister_callable, has_callable,
bind_callable_to_runtime, prepare_orch_so, upload_chip_callable_buffer,
print_handshake_results, release_callable_state, ensure_device_initialized,
set_*_enabled / output_prefix accessors, last_device_wall_ns, the
shared mem_alloc_ / gm_*_arena_ / callable maps / chip_callable
buffer pool / collector instances / kernel_args_ / device_wall ptr /
log/dlopen counters.

Mechanical-fix: setup_static_arena standardized to "release all
arenas on any failure" (matches a2a3 sim + onboard PR hw-native-sys#922). a5 sim
had been keeping earlier-committed peers alive on later-region
failure; the new common impl drops that to match the onboard
invariant.

Latent-fix carried over from onboard hw-native-sys#928: c_api_shared's
run_prepared wraps the placement-new'd Runtime in RAIIScopeGuard so
its dtor fires on every exit path (manual r->~Runtime() in the prior
sim c_api was bypassed by catch(...) on exception).

Polymorphism via SimDeviceRunnerBase virtuals: ~SimDeviceRunnerBase
(public), run(), finalize(), set_dep_gen_enabled() (default no-op,
a2a3 overrides). c_api_shared.cpp works through
SimDeviceRunnerBase * and dispatches via those virtuals — same
pattern as the onboard hw-native-sys#928 split.

Per-arch pto_runtime_c_api.cpp shrinks from ~420 lines to ~55 (just
create_device_context + ACL stubs). memory_allocator.cpp was
byte-identical, deleted from both arch subdirs and lives once in
common/.

Both arches built clean. nm -D verifies all 17 ChipWorker dlsym
targets exported on both sim libhost_runtime.so. ST passes 38/38 on
a2a3sim L1+L2 (devices 0,1) and 22/22 on a5sim L1+L2 (devices 0,1);
examples spot-checked (scalar_data_test, vector_example,
benchmark_bgemm on a2a3sim; vector_example, bgemm on a5sim).

Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
@ChaoWao ChaoWao force-pushed the sim-host-common branch from 36c08a8 to 5ab1585 Compare May 31, 2026 06:30
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (7)
src/a5/platform/sim/host/device_runner.cpp (3)

533-543: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

stop_collectors() still misses the scope-stats collector.

run()'s early-return guard now relies on this helper, but this implementation never stops scope_stats_collector_ even though it can be started earlier in the same run. Any failure after init_scope_stats() and before the normal success-path teardown leaves that collector running.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a5/platform/sim/host/device_runner.cpp` around lines 533 - 543,
stop_collectors() fails to stop scope_stats_collector_, leaving it running on
early returns; update DeviceRunner::stop_collectors() to check
scope_stats_collector_.is_initialized() and call scope_stats_collector_.stop()
so any collector started by init_scope_stats() is properly torn down (this
addresses the early-return behavior in run()).

507-520: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Free device_wall_dev_ptr_ before tearing down the allocators.

device_wall_dev_ptr_ is allocated via allocate_tensor() in run(), but finalize() releases the arenas and calls mem_alloc_.finalize() before free_tensor(device_wall_dev_ptr_). That leaves the last tensor allocation dependent on already-torn-down allocator state.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a5/platform/sim/host/device_runner.cpp` around lines 507 - 520, The
teardown frees arenas and finalizes mem_alloc_ before freeing
device_wall_dev_ptr_, which was allocated in run() via allocate_tensor(); move
the free so that free_tensor(device_wall_dev_ptr_) (and nulling
device_wall_dev_ptr_) is called before gm_heap_arena_.release(),
gm_sm_arena_.release(), runtime_arena_pool_.release(), and
mem_alloc_.finalize(); update the cleanup sequence in the DeviceRunner
destructor/cleanup block (the code containing gm_heap_arena_.release(),
mem_alloc_.finalize(), clear_cpu_sim_shared_storage(), and device_wall_dev_ptr_
handling) to free the tensor first to avoid using torn-down allocator state.

56-149: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Clean up partial SO loads before returning failure.

Any return -1 after create_temp_so_file() or dlopen() leaves the just-created path/handle live. In the AICPU branch that also leaves aicpu_so_loaded_ == false, so a retry can create another temp SO on top of a half-loaded runner state. Please add local rollback for the current SO before each failure exit.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a5/platform/sim/host/device_runner.cpp` around lines 56 - 149,
ensure_binaries_loaded can return early after create_temp_so_file/dlopen
failures leaving temp paths/handles and function pointers in a half-initialized
state; add rollback cleanup before every return -1: for the AICPU branch ensure
that on any failure after create_temp_so_file() you remove aicpu_so_path_
(std::remove and clear), dlclose(aicpu_so_handle_) if non-null and set
aicpu_so_handle_ = nullptr, reset any loaded function pointers
(aicpu_execute_func_, set_platform_regs_func_, etc.) and keep aicpu_so_loaded_
== false; when aicpu_so_loaded_ is set true only after all symbols load
successfully. Do the same for the AICore branch: if create_temp_so_file() or
dlopen() or dlsym fails, remove and clear aicore_so_path_, dlclose
aicore_so_handle_ if set and null it, reset aicore_execute_func_ (and any helper
pointers) before returning.
src/a2a3/platform/sim/host/device_runner.cpp (4)

262-306: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Install the collector cleanup guard before the first init_* call.

perf_cleanup is created after all profiling initializers run. If init_l2_swimlane() succeeds and a later init_* fails, run() returns before the guard exists, so the earlier collectors stay initialized and keep their shared memory until a later finalize().

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/platform/sim/host/device_runner.cpp` around lines 262 - 306, Create
the RAII cleanup guard (perf_cleanup) before any of the init_* calls so that
finalize_collectors() is guaranteed on early returns; move the RAIIScopeGuard
construction using finalize_collectors() to precede init_l2_swimlane(),
init_tensor_dump(), init_pmu(), init_dep_gen(), and init_scope_stats() (or
otherwise ensure perf_cleanup is created immediately at run() start) so any
partial initialization is cleaned up if a subsequent init_* returns non-zero.

404-418: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Stop collector threads on the non-zero AICPU return path.

The collectors are started before kernel launch, but the runtime_rc != 0 branch returns before any stop() call. perf_cleanup only finalizes/free-backs collector memory; it does not join the mgmt/poll threads, so this path can leave them running against freed buffers.

Also applies to: 479-483

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/platform/sim/host/device_runner.cpp` around lines 404 - 418, The
non-zero AICPU return path returns before joining/stopping collector threads,
risking live threads accessing freed buffers; ensure you call the matching
stop() on each collector you started (e.g. l2_swimlane_collector_.stop(),
dump_collector_.stop(), pmu_collector_.stop(), dep_gen_collector_.stop(),
scope_stats_collector_.stop()) in the runtime_rc != 0 error branch (the same fix
also apply to the other early-return block around the 479-483 area) before
invoking perf_cleanup()/return so all mgmt/poll threads are joined and
collectors are cleanly shut down.

64-108: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Rollback partially loaded temp SO state on loader failures.

Any dlopen/dlsym failure here returns with temp-path and handle state still populated. The AICPU path is the worst case: aicpu_so_loaded_ stays false, so the next retry will create another temp SO and leak the previous handle/path instead of cleaning it up first.

Also applies to: 131-150

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/platform/sim/host/device_runner.cpp` around lines 64 - 108, When
dlopen or any dlsym (via the load_sym lambda) fails the code currently returns
leaving aicpu_so_handle_, aicpu_so_path_ and related symbols populated and
aicpu_so_loaded_ false, leaking the temp SO and handle; modify the failure paths
so that on any error you dlclose(aicpu_so_handle_) if non-null, unlink/remove
the temp file at aicpu_so_path_ (or call the corresponding
simpler::common::sim_host cleanup), reset aicpu_so_handle_ and aicpu_so_path_ to
empty/null, clear any partially-set function pointers (e.g. aicpu_execute_func_,
set_platform_regs_func_, etc.), and ensure aicpu_so_loaded_ remains false; apply
the same cleanup logic for the other similar block referenced (lines 131-150) so
no handle/path leaks occur on partial load failures.

716-736: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Clear every profiling base pointer when collectors are finalized.

finalize_collectors() only zeroes kernel_args_.scope_stats_data_base. After a successful run, the other profiling bases still point at freed shared memory, and the next run() forwards those stale addresses back through set_platform_* even when that feature is disabled.

Proposed fix
 void DeviceRunner::finalize_collectors() {
     auto free_cb = [this](void *dev_ptr) -> int {
         return mem_alloc_.free(dev_ptr);
     };
 
     if (l2_swimlane_collector_.is_initialized()) {
         l2_swimlane_collector_.finalize(nullptr, free_cb);
+        kernel_args_.l2_swimlane_data_base = 0;
+        kernel_args_.l2_swimlane_aicore_rotation_table = 0;
     }
     if (dump_collector_.is_initialized()) {
         dump_collector_.finalize(nullptr, free_cb);
+        kernel_args_.dump_data_base = 0;
     }
     if (pmu_collector_.is_initialized()) {
         pmu_collector_.finalize(nullptr, free_cb);
+        kernel_args_.pmu_data_base = 0;
     }
     if (dep_gen_collector_.is_initialized()) {
         dep_gen_collector_.finalize(nullptr, free_cb);
+        kernel_args_.dep_gen_data_base = 0;
     }
     if (scope_stats_collector_.is_initialized()) {
         scope_stats_collector_.finalize(nullptr, free_cb);
         kernel_args_.scope_stats_data_base = 0;
     }
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/platform/sim/host/device_runner.cpp` around lines 716 - 736, In
finalize_collectors(), after finalizing each collector (l2_swimlane_collector_,
dump_collector_, pmu_collector_, dep_gen_collector_, scope_stats_collector_),
clear the corresponding kernel_args_ base pointers so they no longer point at
freed shared memory; specifically set kernel_args_.l2_swimlane_data_base,
kernel_args_.dump_data_base, kernel_args_.pmu_data_base,
kernel_args_.dep_gen_data_base and kernel_args_.scope_stats_data_base to 0 (in
the same places where you currently set scope_stats_data_base) to ensure stale
addresses are not forwarded on subsequent run() calls.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/common/platform/sim/host/c_api_shared.cpp`:
- Around line 213-233: After attach_current_thread succeeds, exceptions during
building the aicpu/aicore vectors or in runner->set_executors() currently return
-1 without releasing the acquired sim device; change the error path so it calls
the runner cleanup (e.g. runner->finalize_device() or the appropriate release
method on SimDeviceRunnerBase) before returning, ensuring
attach_current_thread/SimDeviceRunnerBase state is rolled back when
set_executors throws; update the catch(...) block after the vector/set_executors
section to call that finalize/release method and then return the error.
- Around line 186-193: finalize_device currently queries the caller thread's
bound device (pto_cpu_sim_get_bound_device) and may release the wrong device;
instead obtain the device id recorded on the runner and bind/release that device
around the runner->finalize() call. Change finalize_device to cast ctx to
SimDeviceRunnerBase*, read the stored device id (e.g. runner->device_id or a
getter on SimDeviceRunnerBase), call pto_cpu_sim_bind_device(dev) if dev >= 0,
invoke runner->finalize(), and then call pto_cpu_sim_release_device(dev) in a
finally/cleanup path; remove reliance on pto_cpu_sim_get_bound_device so
finalize uses the context's device regardless of the caller thread.
- Line 150: The ABI currently only exports get_runtime_size() and requires
callers to placement-new a Runtime into caller-owned storage, which can be
insufficiently aligned because Runtime contains a 64-byte-aligned Handshake; add
an exported function get_runtime_alignment() that returns alignof(Runtime) (or
otherwise change the API so the callee returns/allocates properly aligned
storage) and update documentation/use sites (e.g., run_prepared, the
placement-new call in c_api_shared.cpp and the caller in ChipWorker that uses
runtime_buf_) so callers allocate runtime_buf_ with the returned alignment (or
obtain aligned memory from the API) to ensure placement-new into runtime meets
alignof(Runtime).

In `@src/common/platform/sim/host/device_runner_base.cpp`:
- Around line 52-53: The temp-DSO file is currently made
world-readable/executable with fchmod(fd, 0755), exposing extracted
executor/kernel/AICPU SOs; change the permission to owner-only by using
fchmod(fd, 0700) where the temp file is created (the mkstemp -> fchmod sequence
in device_runner_base.cpp) so dlopen still works but the temp-DSO is not
accessible to other local users; update any related comments to reflect
owner-only permissions.

---

Outside diff comments:
In `@src/a2a3/platform/sim/host/device_runner.cpp`:
- Around line 262-306: Create the RAII cleanup guard (perf_cleanup) before any
of the init_* calls so that finalize_collectors() is guaranteed on early
returns; move the RAIIScopeGuard construction using finalize_collectors() to
precede init_l2_swimlane(), init_tensor_dump(), init_pmu(), init_dep_gen(), and
init_scope_stats() (or otherwise ensure perf_cleanup is created immediately at
run() start) so any partial initialization is cleaned up if a subsequent init_*
returns non-zero.
- Around line 404-418: The non-zero AICPU return path returns before
joining/stopping collector threads, risking live threads accessing freed
buffers; ensure you call the matching stop() on each collector you started (e.g.
l2_swimlane_collector_.stop(), dump_collector_.stop(), pmu_collector_.stop(),
dep_gen_collector_.stop(), scope_stats_collector_.stop()) in the runtime_rc != 0
error branch (the same fix also apply to the other early-return block around the
479-483 area) before invoking perf_cleanup()/return so all mgmt/poll threads are
joined and collectors are cleanly shut down.
- Around line 64-108: When dlopen or any dlsym (via the load_sym lambda) fails
the code currently returns leaving aicpu_so_handle_, aicpu_so_path_ and related
symbols populated and aicpu_so_loaded_ false, leaking the temp SO and handle;
modify the failure paths so that on any error you dlclose(aicpu_so_handle_) if
non-null, unlink/remove the temp file at aicpu_so_path_ (or call the
corresponding simpler::common::sim_host cleanup), reset aicpu_so_handle_ and
aicpu_so_path_ to empty/null, clear any partially-set function pointers (e.g.
aicpu_execute_func_, set_platform_regs_func_, etc.), and ensure aicpu_so_loaded_
remains false; apply the same cleanup logic for the other similar block
referenced (lines 131-150) so no handle/path leaks occur on partial load
failures.
- Around line 716-736: In finalize_collectors(), after finalizing each collector
(l2_swimlane_collector_, dump_collector_, pmu_collector_, dep_gen_collector_,
scope_stats_collector_), clear the corresponding kernel_args_ base pointers so
they no longer point at freed shared memory; specifically set
kernel_args_.l2_swimlane_data_base, kernel_args_.dump_data_base,
kernel_args_.pmu_data_base, kernel_args_.dep_gen_data_base and
kernel_args_.scope_stats_data_base to 0 (in the same places where you currently
set scope_stats_data_base) to ensure stale addresses are not forwarded on
subsequent run() calls.

In `@src/a5/platform/sim/host/device_runner.cpp`:
- Around line 533-543: stop_collectors() fails to stop scope_stats_collector_,
leaving it running on early returns; update DeviceRunner::stop_collectors() to
check scope_stats_collector_.is_initialized() and call
scope_stats_collector_.stop() so any collector started by init_scope_stats() is
properly torn down (this addresses the early-return behavior in run()).
- Around line 507-520: The teardown frees arenas and finalizes mem_alloc_ before
freeing device_wall_dev_ptr_, which was allocated in run() via
allocate_tensor(); move the free so that free_tensor(device_wall_dev_ptr_) (and
nulling device_wall_dev_ptr_) is called before gm_heap_arena_.release(),
gm_sm_arena_.release(), runtime_arena_pool_.release(), and
mem_alloc_.finalize(); update the cleanup sequence in the DeviceRunner
destructor/cleanup block (the code containing gm_heap_arena_.release(),
mem_alloc_.finalize(), clear_cpu_sim_shared_storage(), and device_wall_dev_ptr_
handling) to free the tensor first to avoid using torn-down allocator state.
- Around line 56-149: ensure_binaries_loaded can return early after
create_temp_so_file/dlopen failures leaving temp paths/handles and function
pointers in a half-initialized state; add rollback cleanup before every return
-1: for the AICPU branch ensure that on any failure after create_temp_so_file()
you remove aicpu_so_path_ (std::remove and clear), dlclose(aicpu_so_handle_) if
non-null and set aicpu_so_handle_ = nullptr, reset any loaded function pointers
(aicpu_execute_func_, set_platform_regs_func_, etc.) and keep aicpu_so_loaded_
== false; when aicpu_so_loaded_ is set true only after all symbols load
successfully. Do the same for the AICore branch: if create_temp_so_file() or
dlopen() or dlsym fails, remove and clear aicore_so_path_, dlclose
aicore_so_handle_ if set and null it, reset aicore_execute_func_ (and any helper
pointers) before returning.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b65b5ee4-92bf-41a6-a86e-609deb7aa17d

📥 Commits

Reviewing files that changed from the base of the PR and between cee40dd and 5ab1585.

📒 Files selected for processing (13)
  • src/a2a3/platform/sim/host/CMakeLists.txt
  • src/a2a3/platform/sim/host/device_runner.cpp
  • src/a2a3/platform/sim/host/device_runner.h
  • src/a2a3/platform/sim/host/pto_runtime_c_api.cpp
  • src/a5/platform/sim/host/CMakeLists.txt
  • src/a5/platform/sim/host/device_runner.cpp
  • src/a5/platform/sim/host/device_runner.h
  • src/a5/platform/sim/host/memory_allocator.cpp
  • src/a5/platform/sim/host/pto_runtime_c_api.cpp
  • src/common/platform/sim/host/c_api_shared.cpp
  • src/common/platform/sim/host/device_runner_base.cpp
  • src/common/platform/sim/host/device_runner_base.h
  • src/common/platform/sim/host/memory_allocator.cpp
💤 Files with no reviewable changes (1)
  • src/a5/platform/sim/host/memory_allocator.cpp

Comment thread src/common/platform/sim/host/c_api_shared.cpp
Comment thread src/common/platform/sim/host/c_api_shared.cpp
Comment thread src/common/platform/sim/host/c_api_shared.cpp
Comment thread src/common/platform/sim/host/device_runner_base.cpp
@ChaoWao ChaoWao merged commit f97a2dd into hw-native-sys:main May 31, 2026
16 checks passed
@ChaoWao ChaoWao deleted the sim-host-common branch May 31, 2026 06:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants