CI: make the TSAN nightly cell gate (scoped light subset + report races)#947
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR refactors the nightly sanitizer workflow to establish both ASAN and TSAN as gating builds, restructure test execution via sanitizer-specific variables (restricting TSAN to serial prepared_callable runs while expanding ASAN to include dynamic_register with controlled parallelism), and updates documentation to match the new scoping and gating policy. ChangesSanitizer nightly sweep refactoring
Possibly related PRs
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request updates the documentation in docs/ci.md to clarify the configuration and execution of the nightly ASAN and TSAN sanitizer simulation jobs, detailing their respective parallelism limits, test scopes, and gating behaviors. The review feedback focuses on improving the clarity, readability, and technical writing style of the updated documentation, suggesting minor phrasing adjustments such as clarifying 'Both gate' and correcting colloquial expressions.
Scope the nightly TSAN cell to the light prepared_callable L2 subset, run serially, and report races without aborting or failing the job. ASAN keeps the broader scope. Drops continue-on-error so both cells gate on hang/crash. TSAN_OPTIONS=halt_on_error=0:exitcode=0 — halt_on_error=0 alone still lets TSAN's default exitcode=66 redden the cell on any reported race; exitcode=0 keeps races in the log without failing (triaging them into a suppressions file is a follow-up). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
a9e73a7 to
6f46423
Compare
Summary
Turns the TSAN nightly cell from
continue-on-error(best-effort, was hanging) into a real gating job, after investigating why it hung — rather than filing it as a follow-up.Finding (docker, Linux): TSAN does not deterministically deadlock. A single light L2 test (
prepared_callable::test_run) passes in ~6.5s and surfaces 66 ThreadSanitizer warnings. The nightly hang was oversubscription — TSAN's ~5-15× slowdown throttles the chip-fork L3 cases' internal AICPU/AICore/host threads into a livelock that--max-parallel(cross-case only) can't help — andhalt_on_error=1aborted on the first reported race.Fix:
prepared_callableL2 tests (still surface races),--max-parallel 1,TSAN_OPTIONS=halt_on_error=0(report races without aborting). Validated green: 2 passed, 66 warnings, no hang.prepared_callable+dynamic_register),--max-parallel 2.continue-on-error— both sanitizers now gate (on hang/crash).TSAN_OPTIONS=… \env prefix andpytest, breaking the\continuation sopytestwould have run withoutLD_PRELOAD. Restructured to an if/else with exported*_OPTIONSandLD_PRELOADon the pytest line.Follow-up (noted, not this PR): triage the 66 reported races into a TSAN suppressions file, then flip
halt_on_error=1so the job gates on new races.Testing
prepared_callablelight subset green (2 passed, 66 warnings); ASAN broad subset greenbash -non the run step clean; pre-commit green