Category
Robustness (potential edge-case failure)
Component
Build System
Description
The build system has no first-class sanitizer support: there is no ASAN/TSAN/UBSan toggle in CMakeLists.txt, the nanobind module, toolchain.py, or CI. For a codebase this concurrency- and lifetime-heavy (host orchestrator threads + ~100 sim AICPU/AICore host threads, custom ring allocators, drain/teardown races), memory-safety and data-race bugs currently surface only as intermittent st-sim-* crashes (rc=-11 SIGSEGV / rc=-6 SIGABRT / rc=124 hang), which are slow and painful to root-cause.

This is not hypothetical. The recently-fixed sim-oversubscription bug family is exactly the class sanitizers target:

- fix: serialize drain() teardown against scheduler loop (host UAF) #901 (host hierarchical drain() use-after-free) was effectively unfindable from a stripped core dump; it was only located by hand-injecting -fsanitize=address into a throwaway build (which then hit pip wheel-cache traps). ASAN named the free/alloc/read sites instantly.
- fix: null-guard pending_task in sync_start drain election (sim segfault) #898 came down to a pending_task pointer + a null-deref, exactly what TSAN (race) and UBSan (null deref) flag directly.

A standing sanitizer build + an on-demand CI job would catch these automatically, before they become flaky production crashes.
Location
- CMakeLists.txt (top-level; no sanitizer option)
- python/bindings/CMakeLists.txt (host _task_interface / src/common/hierarchical/, where the fix: serialize drain() teardown against scheduler loop (host UAF) #901 UAF lived; ASAN covers this cleanly)
- simpler_setup/toolchain.py, simpler_setup/runtime_compiler.py (device runtime build flags; would need threading for device-side coverage)
- .github/workflows/ci.yml (no sanitizer job)

Proposed Fix
Stage it by value/effort:
1. ASAN (host) + UBSan, opt-in. Add a SIMPLER_ENABLE_ASAN CMake option (off by default) that appends -fsanitize=address,undefined -fno-omit-frame-pointer -g to the host targets (_task_interface + src/common/hierarchical/ + host runtime). Provide a documented build/run recipe so contributors don't re-hit the pip --no-cache-dir / LD_PRELOAD=libasan setup pain. ASAN+UBSan combine in one build; this is the cheap, high-value first step (covers the fix: serialize drain() teardown against scheduler loop (host UAF) #901 host-UAF class with standard new/delete redzones).
2. TSAN as a separate opt-in build (SIMPLER_ENABLE_TSAN), because ASAN and TSAN can't share one binary. Highest payoff for the scheduler/drain races (the fix: null-guard pending_task in sync_start drain election (sim segfault) #898 class), but needs false-positive triage with the custom sync primitives, so do it second.
3. Optional asan-sim CI job (nightly or label-triggered, since ASAN is ~2-3x slower) running the L2/L3 examples under LD_PRELOAD=libasan on ubuntu-latest (no hardware needed).

Known limitation to document: host-side coverage is clean (standard allocators), but the device runtime compiled into aicpu_sim/aicore_sim uses custom mmap/DeviceArena/HeapRing allocators that bypass ASAN redzones unless manually poisoned. Device-runtime bugs therefore get weaker coverage than host bugs until the custom allocators are wired to ASAN's manual-poisoning API.

Priority
Medium (minor risk, should fix in next few releases)
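
As an illustration of the known limitation noted under Proposed Fix, here is a minimal sketch of how a custom allocator could be wired to ASAN's manual-poisoning API. SketchArena is a hypothetical bump allocator invented for this example; it is not the project's DeviceArena or HeapRing, and the SKETCH_* macro names are assumptions. Only __asan_poison_memory_region / __asan_unpoison_memory_region from sanitizer/asan_interface.h are real ASAN entry points. Without -fsanitize=address the macros compile to no-ops, so the same code builds in a normal configuration.

```cpp
#include <cstddef>
#include <memory>

// Detect an ASAN build: GCC defines __SANITIZE_ADDRESS__, Clang uses __has_feature.
#if defined(__SANITIZE_ADDRESS__)
#  define SKETCH_HAS_ASAN 1
#elif defined(__has_feature)
#  if __has_feature(address_sanitizer)
#    define SKETCH_HAS_ASAN 1
#  endif
#endif

#if defined(SKETCH_HAS_ASAN)
#  include <sanitizer/asan_interface.h>
#  define SKETCH_POISON(p, n)   __asan_poison_memory_region((p), (n))
#  define SKETCH_UNPOISON(p, n) __asan_unpoison_memory_region((p), (n))
#else
// No-ops in non-sanitizer builds.
#  define SKETCH_POISON(p, n)   ((void)(p), (void)(n))
#  define SKETCH_UNPOISON(p, n) ((void)(p), (void)(n))
#endif

// Hypothetical bump arena: everything not currently handed out stays poisoned,
// so a stale pointer or an overrun into unallocated space is reported by ASAN.
class SketchArena {
public:
    explicit SketchArena(std::size_t capacity)
        : buf_(std::make_unique<unsigned char[]>(capacity)),
          capacity_(capacity), used_(0) {
        SKETCH_POISON(buf_.get(), capacity_);
    }

    ~SketchArena() {
        // Unpoison before the backing buffer goes back to the real allocator.
        SKETCH_UNPOISON(buf_.get(), capacity_);
    }

    SketchArena(const SketchArena&) = delete;
    SketchArena& operator=(const SketchArena&) = delete;

    // Returns nullptr when the request does not fit.
    void* alloc(std::size_t n) {
        if (n > capacity_) return nullptr;  // also guards the rounding below
        // Round up to 8 bytes, which matches ASAN's shadow granularity.
        const std::size_t rounded = (n + kAlign - 1) & ~(kAlign - 1);
        if (rounded > capacity_ - used_) return nullptr;
        unsigned char* p = buf_.get() + used_;
        used_ += rounded;
        // Only the n requested bytes become addressable; padding stays poisoned.
        SKETCH_UNPOISON(p, n);
        return p;
    }

    // Releases every allocation at once and re-poisons the whole buffer.
    void reset() {
        SKETCH_POISON(buf_.get(), capacity_);
        used_ = 0;
    }

    std::size_t used() const { return used_; }

private:
    static constexpr std::size_t kAlign = 8;
    std::unique_ptr<unsigned char[]> buf_;
    std::size_t capacity_;
    std::size_t used_;
};
```

With this pattern, reading through a pointer obtained before reset() would be reported by ASAN as a use-after-poison error in a sanitizer build, instead of silently reading recycled memory. A ring allocator would need the same poison/unpoison pairing at its wrap-around and release points.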