CI: make the nightly sanitizer sweep pass (ASAN gate, TSAN informational) by ChaoWao · Pull Request #931 · hw-native-sys/simpler

ChaoWao · 2026-05-31T04:08:59Z

Summary

The first nightly Sanitizers run (after #915) failed 4/4 cells. This fixes the workflow so ASAN passes reliably and the build/run plumbing actually works.

Root causes & fixes

gcc not found (TSAN cells errored at setup). Two layers:
- sanitizers.yml installed only g++-15, so the unversioned gcc/g++ that _ensure_host_compilers checks were missing on the ubuntu-24 runner → install build-essential + gcc-15 g++-15 + symlink fallbacks.
- runtime_compiler._find_executable spawned which, which aborts under LD_PRELOAD=libtsan.so (the preloaded runtime kills the uninstrumented child), falsely reporting gcc missing → switched to in-process shutil.which. (This was the real reason TSAN cells died instantly at setup.)
ASAN full-suite TIMEOUT @ 1200s. The full examples tests/st livelocks under ASAN — oversubscription-heavy spmd stress cases on a 4-vCPU runner (documented in sim-oversubscription-hang.md). → scope the run to tensormap_and_ringbuffer dynamic_register + prepared_callable, --max-parallel 2, excluding parallel_broadcast and the dlopen_count tests (those assert exact dlopen accounting that the sanitizers perturb).
TSAN runs hang even scoped (its ~5-15x slowdown vs ASAN's ~1.7x livelocks the threaded scheduler). → continue-on-error on tsan cells: build validated, run best-effort; ASAN gates the job.

Validated on Linux (docker)

ASAN scoped run green (prepared_callable + dynamic_register, dlopen_count/parallel_broadcast excluded)
TSAN builds; the _find_executable fix lets it get past setup (run is informational)
_find_executable → shutil.which keeps test_sanitizers + unit suite green; pre-commit clean

Testing

Local docker validation (above)
Next nightly confirms the ASAN cells go green (TSAN informational)

The first nightly Sanitizers run failed 4/4. Fixes: - sanitizers.yml: install build-essential (the unversioned gcc/g++ that _ensure_host_compilers checks) + gcc-15/g++-15 + symlink fallbacks. The install previously pulled g++-15 only, so plain gcc was missing on the ubuntu-24 runner. - runtime_compiler._find_executable: use shutil.which instead of spawning `which`. Under a sanitizer the test process runs with LD_PRELOAD=lib{a,t}san.so, and the preloaded runtime aborts the uninstrumented `which` child, which made the check falsely report gcc missing (this is why TSAN cells errored at setup). - sanitizers.yml: scope the run to the tensormap_and_ringbuffer dynamic_register + prepared_callable paths, --max-parallel 2, excluding the parallel-broadcast and dlopen_count cases. The full suite livelocked under ASAN/TSAN (oversubscription-heavy spmd stress on a 4-vCPU runner), and the dlopen_count tests assert exact dlopen accounting that the sanitizers perturb. Validated on Linux (docker): ASAN scoped run green; TSAN builds + the gcc-check fix lets it run. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

TSAN's ~5-15x slowdown livelocks the sim's threaded scheduler on oversubscription-heavy cases even in the scoped subset, so mark the tsan matrix cells continue-on-error: the build is validated and the run is best-effort while its reliability under that slowdown is worked out. ASAN (scoped run validated green on Linux) gates the job. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-05-31T04:09:06Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d2f7d306-c751-444e-be79-2311309b092a

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

📝 Walkthrough

Walkthrough

The PR hardens nightly sanitizer testing by treating TSAN as best-effort while keeping ASAN as the gating path, refining the test suite to architecture-specific targets with lower parallelism and tighter resource constraints, improving compiler detection to handle gcc-15 symlinks, and switching compiler discovery from subprocess-based which to shutil.which to avoid process aborts under sanitizer runtimes.

Changes

Sanitizer Workflow & Compiler Discovery

Layer / File(s)	Summary
TSAN/ASAN Error Handling & Gating `.github/workflows/sanitizers.yml`	TSAN matrix job is configured as best-effort with `continue-on-error`, while ASAN remains failure-gating for the workflow.
Compiler Toolchain Setup with GCC-15 Symlinks `.github/workflows/sanitizers.yml`	Compiler setup step installs `build-essential`, ensures `gcc-15`/`g++-15` are available with fallback, and adds symlinks to guarantee unversioned `gcc-15`/`g++-15` commands resolve correctly.
Constrained Sanitized Test Execution `.github/workflows/sanitizers.yml`	Test execution is refocused to specific `tensormap_and_ringbuffer` TARGETS with optional `dynamic_register`, preloads sanitizer runtime, limits devices to 0-7, caps parallelism to 2, excludes problematic tests, and reduces session timeout from 1200 to 600.
Documentation of Sanitizer Sweep `docs/ci.md`	CI documentation is expanded with specific details of sanitizer-sim runtime paths, `LD_PRELOAD` gating, test exclusions, parallelism limits, and ASAN/TSAN behavioral differences.
Compiler Discovery using shutil.which `simpler_setup/runtime_compiler.py`	`_find_executable` replaces manual file checks and subprocess-spawned `which` with `shutil.which(name)` to avoid sanitizer-instrumented process aborts under uninstrumented child processes.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A workflow refined with grace,
TSAN now best-effort's pace,
Compiler symlinks true and clear,
Test paths narrowed, focused here,
Shutil.which avoids the crash—
Sanitizers tamed in a dash!

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely describes the main change: making the nightly sanitizer sweep pass with ASAN as the gate and TSAN as informational, which is the primary objective of this PR.
Description check	✅ Passed	The description provides a detailed explanation of root causes, fixes, and validation steps directly related to the changeset, covering all modified files and their purposes.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request updates the CI documentation to describe the scoped subset of tests run under ASAN/TSAN sanitizers to prevent livelocks, and replaces the external which subprocess call in _find_executable with shutil.which to avoid issues with preloaded sanitizer runtimes. The review feedback suggests simplifying _find_executable by removing redundant file and access checks since shutil.which handles them, and correcting the CMake option name in the documentation from SIMPLER_SANITIZER to SIMPLER_SANITIZERS.

shutil.which already handles absolute/relative paths and the X_OK check, so the os.path.isfile + os.access pre-check is redundant. Drop it.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/ci.md`:
- Around line 57-60: Update the docs so the pytest filter and target-selection
text matches the workflow: state that the scoped subset selects tests under
LD_PRELOAD including tensormap_and_ringbuffer and prepared_callable, that
dynamic_register is included only when present (not always), and explicitly
mention that dlopen_count is excluded; also note the use of --max-parallel 2 and
the exclusion of parallel-broadcast cases (a2a3sim/a5sim, ubuntu-only) to match
the actual configuration.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4839d35f-5359-4ab1-ab75-b65ff126272e

📥 Commits

Reviewing files that changed from the base of the PR and between 39c0445 and 386dd54.

📒 Files selected for processing (3)

.github/workflows/sanitizers.yml
docs/ci.md
simpler_setup/runtime_compiler.py

…in ci.md

ChaoWao and others added 2 commits May 31, 2026 11:56

gemini-code-assist Bot reviewed May 31, 2026

View reviewed changes

Comment thread simpler_setup/runtime_compiler.py Outdated

Comment thread docs/ci.md

fix(pr): simplify _find_executable per review

386dd54

shutil.which already handles absolute/relative paths and the X_OK check, so the os.path.isfile + os.access pre-check is redundant. Drop it.

coderabbitai Bot reviewed May 31, 2026

View reviewed changes

Comment thread docs/ci.md

fix(pr): note dlopen_count exclusion + dynamic_register-when-present …

3c0683a

…in ci.md

ChaoWao merged commit 85a3b63 into hw-native-sys:main May 31, 2026
16 of 29 checks passed

ChaoWao deleted the fix/sanitizers-nightly branch May 31, 2026 07:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CI: make the nightly sanitizer sweep pass (ASAN gate, TSAN informational)#931

CI: make the nightly sanitizer sweep pass (ASAN gate, TSAN informational)#931
ChaoWao merged 4 commits into
hw-native-sys:mainfrom
ChaoWao:fix/sanitizers-nightly

ChaoWao commented May 31, 2026

Uh oh!

coderabbitai Bot commented May 31, 2026 •

edited

Loading

Review skipped

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ChaoWao commented May 31, 2026

Summary

Root causes & fixes

Validated on Linux (docker)

Testing

Uh oh!

coderabbitai Bot commented May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented May 31, 2026 •

edited

Loading