Skip to content

Add sanitizer instrumentation to CI#3

Open
leboshi wants to merge 5 commits into
umbrellio:masterfrom
magellan-ai:ci/sanitizers
Open

Add sanitizer instrumentation to CI#3
leboshi wants to merge 5 commits into
umbrellio:masterfrom
magellan-ai:ci/sanitizers

Conversation

@leboshi
Copy link
Copy Markdown

@leboshi leboshi commented May 8, 2026

Overview

The C++ extension lives at the Ruby/clickhouse-cpp boundary. Bugs at that boundary — heap use-after-free, buffer overruns, undefined pointer arithmetic, races between Ruby threads sharing native state — usually don't fail tests in obvious ways. Instead, they tend to corrupt memory silently and surface much later as production segfaults, mysterious connection drops, or "the worker pool got confused" reports that nobody can reproduce locally. These sanitizers in CI are intended to catch that class of bug deterministically during a normal test run.

Additions to CI configuration

  • A GC.stress = true job runs the full suite with Ruby's GC firing on every allocation, meant to catch bugs like unrooted VALUEs held across allocations in the C extension.
  • AddressSanitizer (ASan) tracks heap allocations and memory access to catch use-after-free, double-free, heap/stack buffer overruns, and stack overflow bugs.
  • UndefinedBehaviorSanitizer (UBSan) flags operations that are technically undefined behavior in C/C++, even when they appear to "work". The C standard permits compilers to do anything they want with UB, including future-version optimizations that'll silently break the code.
  • ThreadSanitizer (TSan) detects data races and lock-order inversions across threads. Since this gem's Pool is meant to genuinely scale across N threads (the GVL gets released around blocking C++ calls), races at the boundary are the highest-impact class of bug we can ship.

ASan + UBSan share a build and run as a per-PR merge gate. TSan is mutually exclusive with ASan and runs as a nightly cron job; its overhead is high enough (5-15x runtime) that gating PRs on it would hurt iteration speed, but the data-race detection is worth running consistently.

Impact

Two real findings have already been caught and fixed by this setup in vendored library code that would otherwise have shipped to production. Both are technically undefined behavior, both are benign on every real implementation today, and both are exactly the class of bug a future compiler optimization could turn into silent corruption. The patches are small, mechanical guards against NULL + 0-style expressions; full diffs in ext/clickhouse_native/patches/.

The sanitizers also flagged that Clients weren't always being closed after specs, which was harmless but generated noise in ASan. The test suite now sweeps the ObjectSpace after every spec, closing open Clients. A new spec covers the GC/dfree path the auto-sweep cleanup bypasses.

Next steps

I have another commit containing documentation on this PR – why I used clang's libsanitizer instead of gcc's, how to read and handle CI failures, notes on subtle library interactions that affect how CI interprets the response from this ASan+UBSan step – as well as an audit of outstanding concerns and missing features, and an in-depth explanation of the specific "gotchas" that exist at the Ruby/C++ boundary and how to mitigate them. The set of documentation felt distinct enough from the scope of this PR that I'm planning to PR it separately after this, but if you'd prefer that I just add that commit to this PR, I can do that instead.

leboshi added 5 commits May 8, 2026 01:30
In practice, libc will short-circuit on n == 0, but this trivial UB is technically a standards violation.
Another case that real usage *should* never run into but is nevertheless concretely wrong upstream.
…d add a spec for the corner case that the sweep may hide.

The new FD test is skipped under GC.stress. Stress reclaims each Client between iterations, so there's never an observable peak; the test premise (FDs grow by ~20) doesn't hold.
Copy link
Copy Markdown
Member

@tycooon tycooon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this is a high-quality, well-justified change. The upstream UB fixes (memcpy on empty string_view, lz4's NULL + 0) are real and worth shipping regardless of whether the sanitizer jobs stay. The CI plumbing is sound. Findings below, ordered by impact.


1. Patch 0003 widens -Werror exemption for every build, not just sanitizer builds — ext/clickhouse_native/patches/0003-no-error-on-nonnull-warning.patch

Adding -Wno-error=nonnull to clickhouse-cpp's CMAKE_CXX_FLAGS happens unconditionally — extconf.rb and the Rakefile both apply this patch on every build, so test, macos-compile, gc-stress, and tsan all silently get the relaxed flag too, not just the sanitize job. The PR's framing is "sanitizer instrumentation," but this is a permanent compiler-strictness loosening of vendored code on the trunk.

If the gcc-11 nonnull warning is real and load-bearing, applying the patch unconditionally is fine — just call that out explicitly in the patch comment header (it's empty right now). If it's only needed when sanitizers are active, scope it via a CMake guard rather than a patch file. As-is, the change is hidden inside a PR labeled "CI sanitizers."

2. FD-count assertion in the GC finalization spec is fragile — spec/clickhouse_native_spec.rb

20.times { ClickhouseNative::Client.new(**CH_KWARGS) }
expect(fd_count.call - baseline).to be >= 20

Each Client allocated inside the block is immediately unreferenced. Ruby's GC can opportunistically reclaim them mid-loop under memory pressure (and definitely will under GC.stress, but the spec skips that). The >= 20 premise relies on no GC firing during the loop — not guaranteed, just usually true. Pinning references first removes the timing dependency:

clients = []
20.times { clients << ClickhouseNative::Client.new(**CH_KWARGS) }
expect(fd_count.call - baseline).to be >= 20
clients.clear
GC.start(full_mark: true, immediate_sweep: true)
GC.start(full_mark: true, immediate_sweep: true)
expect(fd_count.call).to be <= baseline

This also makes the "witness that sockets opened" assertion mean what the comment says.

3. Hardcoded x86_64 in LD_PRELOAD — .github/workflows/ci.yml

$(clang -print-file-name=libclang_rt.asan-x86_64.so):$(clang -print-file-name=libclang_rt.ubsan_standalone-x86_64.so)

ubuntu-24.04 is currently x86_64, but if anyone ever switches to ubuntu-24.04-arm (increasingly common on GHA) this silently breaks: clang -print-file-name returns the basename verbatim when the file isn't found, so LD_PRELOAD ends up with a non-existent path and the sanitizer runtime never loads — the suite would pass under a no-op preload. Same concern for tsan.yml's gcc -print-file-name=libtsan.so (less risky since gcc resolves a non-arch-specific name).

Two defenses worth adding:

ASAN_LIB=$(clang -print-file-name=libclang_rt.asan-x86_64.so)
UBSAN_LIB=$(clang -print-file-name=libclang_rt.ubsan_standalone-x86_64.so)
[ -f "$ASAN_LIB" ] && [ -f "$UBSAN_LIB" ] || { echo "sanitizer runtime missing: $ASAN_LIB $UBSAN_LIB"; exit 1; }
export LD_PRELOAD="$ASAN_LIB:$UBSAN_LIB"

This fails fast both on arch drift and on apt's clang package not pulling libclang-rt-*-dev as a strict dep.

4. TSan nightly is fire-and-forget — .github/workflows/tsan.yml

continue-on-error: true + cron schedule means a failing run shows green at the workflow level and produces no notification. Someone has to remember to check the Actions tab. Fine during soft-launch triage, but worth a TODO to either (a) drop continue-on-error once stable, or (b) add an if: failure() step that opens an issue / posts to a webhook. Otherwise the nightly will silently rot.

5. TSan suppression is broad — .tsan-suppressions

deadlock:rb_native_mutex_lock matches anywhere that function appears in a deadlock report, including hypothetical lock-order issues introduced by our extension's interaction with Ruby's mutex. The extension doesn't call rb_native_mutex_lock directly (verified — no GVL-internal calls in client.cpp), so the practical risk is low, but the suppression's text leaves room for false negatives. The comment is good; consider tightening to a more specific pattern if upstream Ruby gives one (e.g., scoped to thread_pthread.c).

6. Auto-sweep config.after is broader than needed but safe — spec/spec_helper.rb

Walks all ClickhouseNative::Client instances in ObjectSpace after every example, including those owned by Pools that the example still holds via subject memoization. Since RSpec re-evaluates subject(:pool) per example, this is fine in the current spec layout. If anyone later adds a before(:all) Pool (an anti-pattern but possible), the sweep would close its clients between examples and the suite would break in confusing ways. A short comment in the helper warning against that pattern, or restricting the sweep to non-pooled clients (e.g., a marker ivar set by Client.new and cleared by Pool#with), would future-proof it. Low priority.

7. Sanitizer + GC.stress jobs only test Ruby 4.0 — .github/workflows/ci.yml

Existing test matrix runs 3.3 / 3.4 / 4.0. A Ruby-ABI-specific bug at the C-ext boundary (e.g. a stack-VALUE rule changed in 4.0) would slip past. Probably an intentional cost trade-off, but worth a comment in the matrix saying so — otherwise a future contributor will helpfully "add 3.3 too" without thinking about CI minutes.

8. Minor: ASAN_OPTIONS=detect_leaks=0 is correct but disables real leak detection too

LSan false-positives from Ruby's conservative GC are the documented reason, and disabling them is standard practice. Just noting that a forgotten delete in our own C++ code won't surface via this CI — use-after-free and overflow paths still do, which is the higher-impact class. No action needed.

9. Minor: CMake honors CFLAGS/CXXFLAGS env vars only on initial configure

A locally-cached tmp/cpp-build-<arch>/CMakeCache.txt will pin flags from whatever configure first ran. Doesn't matter in CI (ephemeral runner), but a dev who runs rake compile with one set of flags then changes them needs to rm -rf tmp/cpp-build-*. The README's troubleshooting section (or a comment in extconf.rb) might mention this if it bites later. Not a PR blocker.


Things this PR gets right that are worth calling out

  • ASan abort_on_error=1 + UBSan halt_on_error=1: failing-fast is the right call for CI.
  • Idempotent patch application (patch -R --dry-run) — both extconf.rb and the Rakefile cleanly handle re-runs and submodule resets.
  • unique_ptr::reset() for Client#close correctly makes the double-close-from-sweep a no-op.
  • GC.stress block placed before any require — correct; catches issues during extension load too.
  • Spec for the dfree path uses FD count as the observable rather than ObjectSpace, which avoids the "passes on broken dfree" trap (good comment, too).
  • The lz4 patch preserves dictEnd == dictionary semantics in the empty-dict case rather than just silencing the warning.

Suggested before-merge actions

In rough priority order:

  1. Either scope or document patch 0003's permanence (#1 above).
  2. Pin Client refs in the FD-count spec (#2 above).
  3. Add an existence check for the sanitizer runtime files (#3 above).
  4. Add a TODO comment about TSan nightly notification, or remove continue-on-error (#4 above).

The rest are nits / future-proofing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants