b8d37f1 to 0920e26
…layer (hw-native-sys#416)

Wrap the 6-step runtime C API (set_device, get_runtime_size, init_runtime, enable_profiling, launch_runtime, finalize_runtime) into a single C++ class with 3 methods: init(), run(), reset(). Exposed via nanobind as _ChipWorker with a Python ChipWorker wrapper integrating RuntimeBuilder.

Key changes:
- New src/common/worker/chip_worker.{h,cpp}: dlopen/dlsym-based ChipWorker
- Fix profiling ordering: enable_profiling called AFTER init_runtime (placement new was overwriting the flag)
- code_runner.py uses ChipWorker instead of bindings.py ctypes calls
- kernel_compiler.py simplified: removed bindings.py dependency
- bindings.py deleted (ctypes layer fully replaced)
- CI detect-changes: inverted to exclusion-based A5 detection

Co-authored-by: wcwxy <26245345+ChaoWao@users.noreply.github.com>
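The profiling-ordering fix is easier to see in code. The following is a minimal sketch, not the PR's actual chip_worker.cpp: the `RuntimeApi` struct, the function signatures, the `ChipWorkerSketch` name, and the way the six steps are split across init(), run(), and reset() are all assumptions made for illustration. The real class resolves its entry points with dlopen/dlsym; here they are plain `std::function` members so the example runs without a device library.

```cpp
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the six C entry points. The signatures are
// assumptions; the PR resolves the real symbols via dlopen/dlsym.
struct RuntimeApi {
    std::function<int(int)> set_device;
    std::function<std::size_t()> get_runtime_size;
    std::function<int(void*)> init_runtime;
    std::function<int(void*, bool)> enable_profiling;
    std::function<int(void*)> launch_runtime;
    std::function<int(void*)> finalize_runtime;
};

class ChipWorkerSketch {
public:
    explicit ChipWorkerSketch(RuntimeApi api) : api_(std::move(api)) {}

    // Bind the worker to a device (assumed responsibility of init()).
    void init(int device_id) {
        check(api_.set_device(device_id), "set_device");
        device_set_ = true;
    }

    // Build, optionally profile, launch, and finalize one runtime.
    void run(bool profiling) {
        if (!device_set_) throw std::runtime_error("init() must precede run()");
        std::vector<unsigned char> storage(api_.get_runtime_size());
        void* rt = storage.data();
        check(api_.init_runtime(rt), "init_runtime");
        // init_runtime constructs the runtime in place, so a flag set before
        // it would be overwritten. Profiling is therefore enabled afterwards.
        if (profiling) check(api_.enable_profiling(rt, true), "enable_profiling");
        const int rc = api_.launch_runtime(rt);
        api_.finalize_runtime(rt);  // always finalize, even if launch failed
        check(rc, "launch_runtime");
    }

    // Forget the device binding so init() can be called again.
    void reset() { device_set_ = false; }

private:
    static void check(int rc, const char* step) {
        if (rc != 0) throw std::runtime_error(std::string(step) + " failed");
    }
    RuntimeApi api_;
    bool device_set_ = false;
};

// Test helper: an API whose functions only record the order they were called in.
inline RuntimeApi make_recording_api(std::vector<std::string>& log) {
    RuntimeApi api;
    api.set_device = [&log](int) { log.push_back("set_device"); return 0; };
    api.get_runtime_size = [&log]() -> std::size_t {
        log.push_back("get_runtime_size");
        return 64;
    };
    api.init_runtime = [&log](void*) { log.push_back("init_runtime"); return 0; };
    api.enable_profiling = [&log](void*, bool) {
        log.push_back("enable_profiling");
        return 0;
    };
    api.launch_runtime = [&log](void*) { log.push_back("launch_runtime"); return 0; };
    api.finalize_runtime = [&log](void*) { log.push_back("finalize_runtime"); return 0; };
    return api;
}
```

Passing the entry points in as a struct keeps the call ordering testable on a machine without the device runtime; a dlsym-based loader would simply fill the same struct from a shared library handle.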
cf3963e to be73eb9
…cords

Buffer recycling:
- Replace alloc-per-swap with closed-loop buffer recycling in ProfMemoryManager. Completed buffers go to recycled pools instead of being freed, and process_ready_entry replenishes free_queues in priority order: recycled pool, then done_queue drain, then alloc (last resort).
- Pre-allocate PLATFORM_PROF_BUFFERS_PER_CORE / _PER_THREAD buffers at init, seeding 1 into each free_queue and the rest into recycled pools.
- Reduce PLATFORM_PROF_SLOT_COUNT from 8 to 4 (recycling makes deep slot rings unnecessary).
- Add proactive replenishment scan in mgmt_loop as safety net for depleted cores/threads.
- Free recycled buffers in ProfMemoryManager::stop().

Mgmt thread device context:
- Add PerfSetDeviceCallback (dependency-injected like existing alloc/register/free callbacks) so the mgmt thread can call rtSetDevice once at startup. Without this, rtMalloc fails on the mgmt thread because CANN device context is per-thread.
- Onboard device_runner passes rtSetDevice wrapper; sim passes nullptr.

Implicit task record collection:
- Process ready buffers during the expected_tasks wait phase to prevent device memory buildup.
- Add execution_complete signal so poll_and_collect exits promptly after stream synchronization instead of relying solely on record counts.
- Add scan_remaining_perf_buffers() to recover partial records from active buffers after device execution completes.

Housekeeping:
- Add copyright headers to platform_config.h, runtime.h, aicpu_executor.cpp
- Normalize include guard names to match file paths
- Replace C-style casts with reinterpret_cast in aicpu_executor.cpp
- Reduce RUNTIME_MAX_FANOUT from 512 to 128
- Fix formatting (alignment, line length) across touched files
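The replenishment order described under "Buffer recycling" (recycled pool first, then a done-queue drain, then a fresh allocation as last resort) can be sketched as follows. This is an illustrative single-threaded model under stated assumptions, not ProfMemoryManager itself: `RecyclingPoolSketch`, `Buffer`, and every method name are hypothetical, host memory stands in for device buffers, and the real code handles per-core and per-thread queues and cross-thread synchronization that are omitted here.

```cpp
#include <cstddef>
#include <deque>
#include <memory>
#include <vector>

// Hypothetical profiling buffer; the real buffers live in device memory.
struct Buffer {
    std::vector<unsigned char> bytes;
};

class RecyclingPoolSketch {
public:
    // Pre-allocate `prealloc` buffers: one seeds the free queue,
    // the rest wait in the recycled pool.
    RecyclingPoolSketch(std::size_t prealloc, std::size_t buffer_bytes)
        : buffer_bytes_(buffer_bytes) {
        for (std::size_t i = 0; i < prealloc; ++i) recycled_.push_back(allocate());
        if (!recycled_.empty()) {
            free_queue_.push_back(recycled_.front());
            recycled_.pop_front();
        }
    }

    // Writer side: take a buffer to fill, or nullptr if none is available.
    Buffer* acquire() {
        if (free_queue_.empty()) return nullptr;
        Buffer* b = free_queue_.front();
        free_queue_.pop_front();
        return b;
    }

    // Writer side: hand a filled buffer back for collection.
    void mark_done(Buffer* b) { done_queue_.push_back(b); }

    // Manager side: refill the free queue, preferring reuse over allocation.
    void replenish() {
        if (!free_queue_.empty()) return;
        if (recycled_.empty()) drain_done();                     // 2nd choice
        if (recycled_.empty()) recycled_.push_back(allocate());  // last resort
        free_queue_.push_back(recycled_.front());                // 1st choice
        recycled_.pop_front();
    }

    std::size_t alloc_count() const { return owned_.size(); }

private:
    // Move completed buffers into the recycled pool instead of freeing them.
    // (The real manager would copy the records out first.)
    void drain_done() {
        while (!done_queue_.empty()) {
            recycled_.push_back(done_queue_.front());
            done_queue_.pop_front();
        }
    }

    Buffer* allocate() {
        owned_.push_back(std::make_unique<Buffer>());
        owned_.back()->bytes.resize(buffer_bytes_);
        return owned_.back().get();
    }

    std::size_t buffer_bytes_;
    std::vector<std::unique_ptr<Buffer>> owned_;  // released on destruction
    std::deque<Buffer*> free_queue_;
    std::deque<Buffer*> recycled_;
    std::deque<Buffer*> done_queue_;
};
```

In steady state the allocation count stays at the pre-allocated number, because every buffer that leaves through the free queue eventually returns through the done queue; a new allocation happens only when both reuse sources are empty.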
be73eb9 to 728dd97
No description provided.