A2a3/pref bug #1

Open
ChaoZheng109 wants to merge 2 commits into indigo1973:main from ChaoZheng109:a2a3/pref_bug

Conversation

@ChaoZheng109

No description provided.

…layer (hw-native-sys#416)

Wrap the 6-step runtime C API (set_device, get_runtime_size, init_runtime,
enable_profiling, launch_runtime, finalize_runtime) into a single C++ class
with 3 methods: init(), run(), reset(). Exposed via nanobind as _ChipWorker
with a Python ChipWorker wrapper integrating RuntimeBuilder.
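
To illustrate the shape of this wrapper, here is a minimal sketch. It is not the code in chip_worker.{h,cpp}: the `RuntimeApi` function table stands in for the symbols the real class resolves with dlopen/dlsym, every signature is an assumption, and the way the six calls are split across init(), run(), and reset() is a guess made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical function table; the real class resolves these with dlsym.
// All signatures below are assumptions, not the project's C API.
struct RuntimeApi {
    int (*set_device)(int device_id);
    std::size_t (*get_runtime_size)();
    int (*init_runtime)(void *runtime_mem);
    int (*enable_profiling)(void *runtime_mem, bool on);
    int (*launch_runtime)(void *runtime_mem);
    int (*finalize_runtime)(void *runtime_mem);
};

class ChipWorkerSketch {
public:
    explicit ChipWorkerSketch(const RuntimeApi &api) : api_(api) {}

    // init(): bind the device and size the buffer that will hold the runtime.
    void init(int device_id) {
        check(api_.set_device(device_id), "set_device");
        runtime_mem_.assign(api_.get_runtime_size(), 0);
        initialized_ = true;
    }

    // run(): construct the runtime, enable profiling AFTER construction,
    // then launch and finalize.
    void run(bool profiling) {
        if (!initialized_) throw std::logic_error("run() called before init()");
        void *rt = runtime_mem_.data();
        check(api_.init_runtime(rt), "init_runtime");
        if (profiling) check(api_.enable_profiling(rt, true), "enable_profiling");
        check(api_.launch_runtime(rt), "launch_runtime");
        check(api_.finalize_runtime(rt), "finalize_runtime");
    }

    // reset(): drop the runtime buffer so init() can be called again.
    void reset() {
        runtime_mem_.clear();
        initialized_ = false;
    }

    bool initialized() const { return initialized_; }

private:
    static void check(int rc, const char *step) {
        if (rc != 0) throw std::runtime_error(std::string(step) + " failed");
    }

    RuntimeApi api_;
    std::vector<unsigned char> runtime_mem_;
    bool initialized_ = false;
};
```

Passing the function table into the constructor keeps the sketch testable without a shared library; a dlopen-based version would fill the same table from resolved symbols.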

Key changes:
- New src/common/worker/chip_worker.{h,cpp}: dlopen/dlsym-based ChipWorker
- Fix profiling ordering: enable_profiling called AFTER init_runtime
  (placement new was overwriting the flag)
- code_runner.py uses ChipWorker instead of bindings.py ctypes calls
- kernel_compiler.py simplified: removed bindings.py dependency
- bindings.py deleted (ctypes layer fully replaced)
- CI detect-changes: inverted to exclusion-based A5 detection
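
The profiling-ordering fix can be reproduced in isolation. The sketch below assumes a hypothetical `RuntimeSketch` struct (the real runtime layout is not shown in this PR) and shows why a flag set before a placement new on the same storage is lost, while a flag set afterwards survives.

```cpp
#include <cassert>
#include <new>

// Hypothetical stand-in for the runtime object.
struct RuntimeSketch {
    bool profiling_enabled = false;  // reset to false on every construction
    int task_count = 0;
};

// Old order: the flag is set first, then "init_runtime" reconstructs the
// object in place and the default member initializer overwrites the flag.
inline bool flag_when_enabled_before_init(void *storage) {
    RuntimeSketch *rt = new (storage) RuntimeSketch();  // earlier contents of the buffer
    rt->profiling_enabled = true;                       // "enable_profiling" called first
    rt = new (storage) RuntimeSketch();                 // "init_runtime": placement new
    return rt->profiling_enabled;
}

// Fixed order: construct first, then set the flag on the live object.
inline bool flag_when_enabled_after_init(void *storage) {
    RuntimeSketch *rt = new (storage) RuntimeSketch();  // "init_runtime": placement new
    rt->profiling_enabled = true;                       // "enable_profiling" called after
    return rt->profiling_enabled;
}
```

With suitably aligned storage, the first function returns false and the second returns true.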

Co-authored-by: wcwxy <26245345+ChaoWao@users.noreply.github.com>
ChaoZheng109 force-pushed the a2a3/pref_bug branch 6 times, most recently from cf3963e to be73eb9 on March 31, 2026 14:10
…cords

Buffer recycling:
- Replace alloc-per-swap with closed-loop buffer recycling in
  ProfMemoryManager. Completed buffers go to recycled pools instead of
  being freed, and process_ready_entry replenishes free_queues from
  recycled pool → done_queue drain → alloc (last resort).
- Pre-allocate PLATFORM_PROF_BUFFERS_PER_CORE / _PER_THREAD buffers at
  init, seeding 1 into each free_queue and the rest into recycled pools.
- Reduce PLATFORM_PROF_SLOT_COUNT from 8 to 4 (recycling makes deep
  slot rings unnecessary).
- Add proactive replenishment scan in mgmt_loop as safety net for
  depleted cores/threads.
- Free recycled buffers in ProfMemoryManager::stop().
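
The replenishment order described above (recycled pool, then done_queue drain, then a fresh allocation) can be sketched as follows. This is a single-threaded illustration and not ProfMemoryManager itself: the class name, the `Source` enum, and the single free queue are assumptions, and host memory stands in for device buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <memory>
#include <utility>
#include <vector>

struct ProfBuffer {
    std::vector<unsigned char> data;
};

enum class Source { Recycled, DoneQueue, Alloc };

class BufferRecyclerSketch {
public:
    BufferRecyclerSketch(std::size_t prealloc, std::size_t buffer_bytes)
        : buffer_bytes_(buffer_bytes) {
        // Pre-allocate: seed one buffer into the free queue, the rest into the pool.
        for (std::size_t i = 0; i < prealloc; ++i) {
            auto buf = make_buffer();
            if (i == 0) free_queue_.push_back(std::move(buf));
            else recycled_.push_back(std::move(buf));
        }
    }

    // Producer takes a buffer to fill; returns nullptr if the queue is depleted.
    std::unique_ptr<ProfBuffer> take_free() {
        return free_queue_.empty() ? nullptr : pop(free_queue_);
    }

    // Producer hands back a filled buffer that still needs its records read.
    void push_done(std::unique_ptr<ProfBuffer> buf) { done_queue_.push_back(std::move(buf)); }

    // Consumer returns a fully processed buffer instead of freeing it.
    void recycle(std::unique_ptr<ProfBuffer> buf) { recycled_.push_back(std::move(buf)); }

    // Add one buffer to the free queue: recycled pool first, then a done-queue
    // drain, and a fresh allocation only as the last resort.
    Source replenish() {
        std::unique_ptr<ProfBuffer> buf;
        Source src;
        if (!recycled_.empty()) {
            buf = pop(recycled_);
            src = Source::Recycled;
        } else if (!done_queue_.empty()) {
            buf = pop(done_queue_);  // records would be copied out here before reuse
            src = Source::DoneQueue;
        } else {
            buf = make_buffer();
            src = Source::Alloc;
        }
        free_queue_.push_back(std::move(buf));
        return src;
    }

    std::size_t alloc_count() const { return alloc_count_; }

private:
    using Queue = std::deque<std::unique_ptr<ProfBuffer>>;

    static std::unique_ptr<ProfBuffer> pop(Queue &q) {
        auto buf = std::move(q.front());
        q.pop_front();
        return buf;
    }

    std::unique_ptr<ProfBuffer> make_buffer() {
        ++alloc_count_;
        auto buf = std::make_unique<ProfBuffer>();
        buf->data.resize(buffer_bytes_);
        return buf;
    }

    std::size_t buffer_bytes_;
    std::size_t alloc_count_ = 0;
    Queue free_queue_, done_queue_, recycled_;
};
```

In steady state `alloc_count()` stays at the pre-allocated count, because every swap is served from the recycled pool or the done queue.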

Mgmt thread device context:
- Add PerfSetDeviceCallback (dependency-injected like existing alloc/
  register/free callbacks) so the mgmt thread can call rtSetDevice once
  at startup. Without this, rtMalloc fails on the mgmt thread because
  CANN device context is per-thread.
- Onboard device_runner passes rtSetDevice wrapper; sim passes nullptr.
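
A minimal sketch of the injected set-device callback, assuming a hypothetical signature for `PerfSetDeviceCallback` and using a `thread_local` variable to imitate a per-thread device context. `fake_set_device` and `fake_malloc_ok` are stand-ins; no CANN calls are made.

```cpp
#include <cassert>
#include <thread>

// Assumed signature; the PR does not show the real typedef.
using PerfSetDeviceCallback = int (*)(int device_id);

// Stand-in for a per-thread device context: -1 means "no device bound".
thread_local int g_thread_device = -1;

// Stand-in for an rtSetDevice wrapper: binds the calling thread only.
inline int fake_set_device(int device_id) {
    g_thread_device = device_id;
    return 0;
}

// Stand-in for a device allocation: succeeds only if this thread has a device.
inline bool fake_malloc_ok() { return g_thread_device >= 0; }

// Starts a management thread, invokes the injected callback once at startup
// (if one was provided), and reports whether allocation works on that thread.
inline bool run_mgmt_thread(PerfSetDeviceCallback set_device, int device_id) {
    bool ok = false;
    std::thread mgmt([&] {
        if (set_device != nullptr) set_device(device_id);
        ok = fake_malloc_ok();
    });
    mgmt.join();
    return ok;
}
```

Binding a device on the caller's thread does not help the management thread, which is why the callback has to run inside it. In this model a null callback reproduces the failure; the sim path can pass nullptr presumably because it has no per-thread context to bind.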

Implicit task record collection:
- Process ready buffers during the expected_tasks wait phase to prevent
  device memory buildup.
- Add execution_complete signal so poll_and_collect exits promptly after
  stream synchronization instead of relying solely on record counts.
- Add scan_remaining_perf_buffers() to recover partial records from
  active buffers after device execution completes.
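
The exit condition described above can be sketched as a polling loop that stops either when the expected record count is reached or when the completion flag is observed. `CollectorSketch`, its integer "records", and the `max_polls` bound are assumptions for illustration; only the names `execution_complete` and `poll_and_collect` come from this commit message.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <deque>

struct CollectorSketch {
    std::deque<int> ready;   // records in buffers the device has handed over
    std::deque<int> active;  // partial records still sitting in active buffers
    std::atomic<bool> execution_complete{false};
    std::size_t collected = 0;

    void drain_ready() {
        collected += ready.size();
        ready.clear();
    }

    // Models scan_remaining_perf_buffers(): recover partial records.
    void scan_remaining() {
        collected += active.size();
        active.clear();
    }

    // Polls until `expected` records are collected, the completion flag is
    // seen, or `max_polls` iterations have run. Returns the collected count.
    std::size_t poll_and_collect(std::size_t expected, std::size_t max_polls) {
        for (std::size_t i = 0; i < max_polls && collected < expected; ++i) {
            // Read the flag before draining so records published before the
            // flag was set are picked up by this final drain.
            bool done = execution_complete.load(std::memory_order_acquire);
            drain_ready();
            if (done) {
                scan_remaining();
                break;
            }
        }
        return collected;
    }
};
```

Without the flag, a run that produces fewer records than expected would spin until the poll limit; with it, the loop exits on the first iteration after the flag is set.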

Housekeeping:
- Add copyright headers to platform_config.h, runtime.h, aicpu_executor.cpp
- Normalize include guard names to match file paths
- Replace C-style casts with reinterpret_cast in aicpu_executor.cpp
- Reduce RUNTIME_MAX_FANOUT from 512 to 128
- Fix formatting (alignment, line length) across touched files

2 participants