Merged
30 changes: 26 additions & 4 deletions .github/workflows/sanitizers.yml
@@ -25,6 +25,11 @@ jobs:
   sanitizer-sim:
     runs-on: ubuntu-latest
     timeout-minutes: 90
+    # ASAN gates the nightly; TSAN runs informationally. TSAN's ~5-15x slowdown
+    # (vs ASAN's ~1.7x) makes the sim's threaded scheduler livelock on
+    # oversubscription-heavy cases, so its run reliability is still being worked
+    # out: the build is validated, the run is best-effort for now.
+    continue-on-error: ${{ matrix.sanitizer == 'tsan' }}
     strategy:
       fail-fast: false
       matrix:
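The gating rule added in this hunk can be read as a small truth table. The sketch below is only an illustration of that rule: `continue_on_error` mirrors the workflow expression `matrix.sanitizer == 'tsan'`, while `nightly_fails` and the shape of `leg_results` are hypothetical names introduced here, not part of GitHub Actions or this repository.

```python
def continue_on_error(sanitizer: str) -> bool:
    """Mirror of the workflow expression `matrix.sanitizer == 'tsan'`."""
    return sanitizer == "tsan"


def nightly_fails(leg_results: dict[str, bool]) -> bool:
    """Return True if any gating matrix leg failed.

    `leg_results` maps a sanitizer name to whether its leg passed
    (hypothetical shape, for illustration only).
    """
    return any(
        not passed and not continue_on_error(sanitizer)
        for sanitizer, passed in leg_results.items()
    )


# A failing TSAN leg is tolerated; a failing ASAN leg fails the nightly.
print(nightly_fails({"asan": True, "tsan": False}))
print(nightly_fails({"asan": False, "tsan": True}))
```

Under this reading, the TSAN leg still builds and runs, but only the ASAN leg decides the outcome of the nightly.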
@@ -39,9 +44,13 @@ jobs:
         run: |
           sudo add-apt-repository -y ppa:ubuntu-toolchain-r/test
           sudo apt-get update
-          sudo apt-get install -y ninja-build graphviz
-          sudo apt-get install -y g++-15 || sudo apt-get install -y g++
+          # build-essential provides the unversioned gcc/g++ that
+          # _ensure_host_compilers checks for; gcc-15/g++-15 are the actual
+          # compilers the sanitizer build unifies on (GxxToolchain prefer_g15).
+          sudo apt-get install -y ninja-build graphviz build-essential
+          sudo apt-get install -y gcc-15 g++-15 || sudo apt-get install -y gcc g++
           if ! command -v g++-15; then sudo ln -s "$(which g++)" /usr/local/bin/g++-15; fi
+          if ! command -v gcc-15; then sudo ln -s "$(which gcc)" /usr/local/bin/gcc-15; fi

       - name: Set up Python
         uses: actions/setup-python@v6
@@ -59,10 +68,23 @@ jobs:
         run: |
           # Sim unifies host compilation on g++-15, so preload g++-15's runtime.
           LIB=$(g++-15 -print-file-name=lib${{ matrix.sanitizer }}.so)
+          ARCH=$(echo "${{ matrix.platform }}" | sed 's/sim$//')
+          # Scope to the core register / run / dlopen / kernel-compile /
+          # orchestration paths, cap parallelism, and skip the parallel-broadcast
+          # case: ASAN/TSAN slow the sim enough that oversubscription-heavy cases
+          # livelock on a 4-vCPU runner (docs/troubleshooting/sim-oversubscription-hang.md).
+          TARGETS="tests/st/$ARCH/tensormap_and_ringbuffer/prepared_callable"
+          if [ -d "tests/st/$ARCH/tensormap_and_ringbuffer/dynamic_register" ]; then
+            TARGETS="$TARGETS tests/st/$ARCH/tensormap_and_ringbuffer/dynamic_register"
+          fi
+          # Exclude dlopen_count tests: they assert exact dlopen accounting,
+          # which ASAN/TSAN perturb by interposing dlopen (orthogonal to the
+          # memory/race checks the sanitizers are here for).
           LD_PRELOAD="$LIB" \
           ASAN_OPTIONS=detect_leaks=0:abort_on_error=1:halt_on_error=1 \
           UBSAN_OPTIONS=halt_on_error=1:print_stacktrace=1 \
           TSAN_OPTIONS=halt_on_error=1 \
-          pytest examples tests/st --platform ${{ matrix.platform }} --device 0-15 \
-            --sanitizer ${{ matrix.sanitizer }} -v --pto-session-timeout 1200 \
+          pytest $TARGETS --platform ${{ matrix.platform }} --device 0-7 --max-parallel 2 \
+            -k "not parallel_broadcast and not dlopen_count" \
+            --sanitizer ${{ matrix.sanitizer }} -v --pto-session-timeout 600 \
             --pto-isa-commit ${{ env.PTO_ISA_COMMIT }} --clone-protocol https --require-pto-isa
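The target selection in this step (strip the trailing `sim` from the platform, always run `prepared_callable`, add `dynamic_register` only where the directory exists) can be sketched in Python as follows. This is a minimal illustration of the shell logic above, not code from the repository: `select_targets` and `repo_root` are hypothetical names, and the temporary directory stands in for a checkout.

```python
import re
import tempfile
from pathlib import Path


def select_targets(platform: str, repo_root: Path) -> list[str]:
    """Return the scoped pytest targets for a sim platform such as 'a2a3sim'."""
    arch = re.sub(r"sim$", "", platform)  # same effect as: sed 's/sim$//'
    base = Path("tests/st") / arch / "tensormap_and_ringbuffer"
    targets = [(base / "prepared_callable").as_posix()]
    # dynamic_register is optional; include it only when the directory exists.
    if (repo_root / base / "dynamic_register").is_dir():
        targets.append((base / "dynamic_register").as_posix())
    return targets


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Simulate a checkout where only a2a3 ships the dynamic_register suite.
    (root / "tests/st/a2a3/tensormap_and_ringbuffer/dynamic_register").mkdir(parents=True)
    print(select_targets("a2a3sim", root))  # two targets
    print(select_targets("a5sim", root))    # prepared_callable only
```

The existence check keeps a single workflow step usable for both platforms without hard-coding which one has the extra suite.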
15 changes: 12 additions & 3 deletions docs/ci.md
@@ -53,9 +53,18 @@ A **separate** workflow, [`sanitizers.yml`](../.github/workflows/sanitizers.yml)
 runs on a nightly `schedule`, kept out of `ci.yml` so the cron fires only the
 sanitizer jobs, never the PR/self-hosted pipeline. Its
 `sanitizer-sim` job builds the sim runtime + kernels with ASAN or TSAN
-(`pip install --config-settings=cmake.define.SIMPLER_SANITIZER=...`) and runs
-`pytest examples tests/st` under the matching `LD_PRELOAD` (a2a3sim/a5sim,
-ubuntu-only). Not a PR gate; see [testing.md](testing.md#sanitizer-builds-asan--tsan).
+(`pip install --config-settings=cmake.define.SIMPLER_SANITIZER=...`) and runs a
+**scoped** subset under the matching `LD_PRELOAD` (a2a3sim/a5sim, ubuntu-only): the
+`tensormap_and_ringbuffer` `prepared_callable` path (plus `dynamic_register` where
+it exists; a5 has only the former), with `--max-parallel 2` and `-k "not
+parallel_broadcast and not dlopen_count"` (the `dlopen_count` tests assert exact
+dlopen accounting that the sanitizers perturb by interposing `dlopen`). The full
+suite is avoided because ASAN/TSAN slow the sim enough that oversubscription-heavy
+spmd stress cases livelock on a 4-vCPU runner. **ASAN gates the job; TSAN runs
+`continue-on-error`**: its ~5-15x slowdown (vs ASAN's ~1.7x) still livelocks the
+threaded scheduler, so the TSAN build is validated but its run is best-effort
+pending further work. Not a PR gate; see
+[testing.md](testing.md#sanitizer-builds-asan--tsan).

 ### Parallel ST runs on hardware

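The effect of `-k "not parallel_broadcast and not dlopen_count"` can be approximated with a simplified model: a test is kept only if neither keyword occurs in its name. The sketch below assumes plain case-insensitive substring matching on the test name (pytest's real `-k` also considers parent names and markers), and the test names are hypothetical, chosen only to illustrate the filter.

```python
EXCLUDED_KEYWORDS = ("parallel_broadcast", "dlopen_count")


def kept_by_filter(test_name: str, excluded: tuple[str, ...] = EXCLUDED_KEYWORDS) -> bool:
    """Simplified model of -k "not A and not B": keep names containing neither."""
    lowered = test_name.lower()
    return not any(keyword in lowered for keyword in excluded)


# Hypothetical test names, for illustration only.
candidates = [
    "test_prepared_callable_run",
    "test_parallel_broadcast",
    "test_dlopen_count_after_register",
]
kept = [name for name in candidates if kept_by_filter(name)]
print(kept)
```

In this model only the first name survives, which matches the intent described above: the broadcast stress case and the dlopen-accounting tests are deselected, and everything else in the scoped targets still runs.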
14 changes: 9 additions & 5 deletions simpler_setup/runtime_compiler.py
@@ -203,11 +203,15 @@ def _ensure_host_compilers(self):

     @staticmethod
     def _find_executable(name: str) -> bool:
-        """Check if an executable exists (either as absolute path or in PATH)."""
-        if os.path.isfile(name) and os.access(name, os.X_OK):
-            return True
-        result = subprocess.run(["which", name], check=False, capture_output=True, timeout=1)
-        return result.returncode == 0
+        """Whether ``name`` resolves to an executable (absolute path or on PATH).
+
+        ``shutil.which`` is used (in-process) rather than spawning ``which``:
+        under a sanitizer the test process runs with ``LD_PRELOAD=lib{a,t}san.so``
+        and the preloaded runtime can abort an uninstrumented ``which`` child,
+        which would otherwise make this falsely report the compiler missing.
+        ``shutil.which`` already handles abs/relative paths and the X_OK check.
+        """
+        return shutil.which(name) is not None

     def compile(
         self,
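The in-process lookup that replaces the `which` subprocess can be tried in isolation. This is a standalone sketch, not the repository's method: `find_executable` is a hypothetical free-function stand-in for `_find_executable`, and the missing-binary name is made up for the demonstration.

```python
import shutil
import sys


def find_executable(name: str) -> bool:
    """True if `name` is an executable path or resolves to one on PATH."""
    # shutil.which runs in-process, so no child process inherits LD_PRELOAD.
    return shutil.which(name) is not None


# An absolute path to an executable file is checked directly.
print(find_executable(sys.executable))
# A bare name that is not on PATH is reported as missing.
print(find_executable("no-such-compiler-xyz-12345"))
```

Because no child process is spawned, the result does not depend on how a preloaded sanitizer runtime treats an uninstrumented helper binary.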