[Code Health] Add opt-in sanitizer builds (ASAN/UBSan host, TSAN) + sim CI job #904

@ChaoWao

Category

Robustness (potential edge-case failure)

Component

Build System

Description

The build system has no first-class sanitizer support (no ASAN/TSAN/UBSan toggle in CMakeLists.txt, the nanobind module, toolchain.py, or CI). For a codebase this concurrency- and lifetime-heavy (host orchestrator threads + ~100 sim AICPU/AICore host threads, custom ring allocators, drain/teardown races), memory-safety and data-race bugs currently surface only as intermittent st-sim-* crashes (rc=-11 SIGSEGV / rc=-6 SIGABRT / rc=124 hang), which are slow and painful to root-cause.
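For triage, the return codes quoted above can be decoded mechanically. The following is a minimal sketch, not part of the repository: `classify_rc` is a hypothetical helper. It assumes the codes follow Python's `subprocess` convention (a negative value is the negated signal number) and that rc=124 comes from a coreutils-style `timeout` wrapper.

```python
import signal

# Assumption: hangs are reported by a coreutils-style `timeout` wrapper,
# which exits with status 124 when the time limit is hit.
TIMEOUT_RC = 124


def classify_rc(rc: int) -> str:
    """Map a subprocess-style return code to a short triage label."""
    if rc == 0:
        return "ok"
    if rc == TIMEOUT_RC:
        return "hang (timeout)"
    if rc < 0:
        # subprocess reports death-by-signal as the negated signal number.
        try:
            return signal.Signals(-rc).name
        except ValueError:
            return f"signal {-rc}"
    return f"exit {rc}"
```

A label like this only says how the process died; it does not say where the bad access happened, which is the information a sanitizer report adds.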

This is not hypothetical: the recently fixed sim-oversubscription bug family (for example, the fixes #898 and #901 referenced below) is exactly the class of bug that sanitizers target.

A standing sanitizer build + an on-demand CI job would catch these automatically, before they become flaky production crashes.

Location

  • CMakeLists.txt (top-level — no sanitizer option)
  • python/bindings/CMakeLists.txt (host _task_interface / src/common/hierarchical/, where the host UAF fixed in #901, "fix: serialize drain() teardown against scheduler loop (host UAF)", lived; ASAN covers this cleanly)
  • simpler_setup/toolchain.py, simpler_setup/runtime_compiler.py (device runtime build flags — would need threading for device-side coverage)
  • .github/workflows/ci.yml (no sanitizer job)

Proposed Fix

Stage it by value/effort:

  1. ASAN (host) + UBSan, opt-in. Add a SIMPLER_ENABLE_ASAN CMake option (off by default) that appends -fsanitize=address,undefined -fno-omit-frame-pointer -g to the host targets (_task_interface + src/common/hierarchical/ + host runtime). Provide a documented build/run recipe so contributors don't re-hit the pip --no-cache-dir / LD_PRELOAD=libasan setup pain. ASAN and UBSan combine in one build; this is the cheap, high-value first step (it covers the host-UAF class of #901, "fix: serialize drain() teardown against scheduler loop (host UAF)", with standard new/delete redzones).
  2. TSAN as a separate build (SIMPLER_ENABLE_TSAN), since ASAN and TSAN can't share one binary. Highest payoff for the scheduler/drain races (the class of #898, "fix: null-guard pending_task in sync_start drain election (sim segfault)"), but it needs false-positive triage with the custom sync primitives, so do it second.
  3. Optional asan-sim CI job (nightly or label-triggered, since ASAN is ~2-3x slower) running the L2/L3 examples under LD_PRELOAD=libasan on ubuntu-latest (no hardware needed).
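The flag selection and run environment described in the steps above could be centralized in a small Python helper, for instance next to simpler_setup/toolchain.py. This is a sketch under stated assumptions, not the project's implementation. `sanitizer_flags` and `asan_run_env` are hypothetical names. The TSAN flag set (`-fsanitize=thread` plus the same frame-pointer and debug flags) is assumed, since the issue only spells out the ASAN/UBSan flags. `ASAN_OPTIONS=detect_leaks=0` is an assumed default that keeps interpreter-level leak reports from drowning out real findings.

```python
from typing import Dict, List, Optional

# Flags quoted in the proposal for the combined ASAN + UBSan build.
ASAN_FLAGS = ["-fsanitize=address,undefined", "-fno-omit-frame-pointer", "-g"]
# Assumed TSAN counterpart; the proposal does not list these explicitly.
TSAN_FLAGS = ["-fsanitize=thread", "-fno-omit-frame-pointer", "-g"]


def sanitizer_flags(asan: bool = False, tsan: bool = False) -> List[str]:
    """Return extra compile/link flags for the requested sanitizer build."""
    if asan and tsan:
        # The two runtimes cannot be linked into the same binary.
        raise ValueError("ASAN and TSAN cannot be combined in one binary")
    if asan:
        return list(ASAN_FLAGS)
    if tsan:
        return list(TSAN_FLAGS)
    return []


def asan_run_env(libasan_path: str,
                 base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Build an environment that preloads the ASAN runtime ahead of other libraries."""
    env = dict(base_env or {})
    existing = env.get("LD_PRELOAD")
    # The ASAN runtime has to come first in the preload list.
    env["LD_PRELOAD"] = f"{libasan_path}:{existing}" if existing else libasan_path
    # Assumed default; callers can override by setting ASAN_OPTIONS themselves.
    env.setdefault("ASAN_OPTIONS", "detect_leaks=0")
    return env
```

With GCC, the runtime path passed as `libasan_path` can be obtained from `gcc -print-file-name=libasan.so`; the resulting dictionary can then be passed as `env=` to `subprocess.run` when launching the sim examples.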

Known limitation to document: host-side coverage is clean (standard allocators), but the device runtime compiled into aicpu_sim/aicore_sim uses custom mmap/DeviceArena/HeapRing allocators that bypass ASAN redzones unless manually poisoned — so device-runtime bugs get weaker coverage than host bugs until the custom allocators are wired to ASAN's manual-poisoning API.

Priority

Medium (minor risk, should fix in next few releases)

Labels

code health (Technical debt, robustness, code quality)
