Skip to content

fix(profiler): make _class_map access signal-safe in walkVM#512

Draft
jbachorik wants to merge 2 commits intomainfrom
muse/sigsegv-in-recording-writeclasses
Draft

fix(profiler): make _class_map access signal-safe in walkVM#512
jbachorik wants to merge 2 commits intomainfrom
muse/sigsegv-in-recording-writeclasses

Conversation

@jbachorik
Copy link
Copy Markdown
Collaborator

@jbachorik jbachorik commented May 7, 2026

What does this PR do?:

Fixes a SIGSEGV in Recording::writeClasses (Dictionary::collect walking a torn DictTable* chain) by closing two correctness holes around Profiler::_class_map.

  • walkVM at hotspotSupport.cpp:388 previously called the inserting 2-arg Dictionary::lookup from a signal handler, which calls malloc/calloc (dictionary.cpp:25,114). That is async-signal-unsafe and could tear the malloc arena. Switched to the existing read-only bounded_lookup(..., 0); on miss (sentinel INT_MAX) the synthetic vtable-target frame is skipped — the JVMTI / Java-API path will populate the entry from a non-signal context.
  • Recording::writeClasses previously called _classes->collect(map) with no lock while JVMTI shared-lock writers (Profiler::lookupClass, ObjectSampler, LivenessTracker) could mutate the dictionary, and the dump path's exclusive _class_map.clear() ran shortly after. Now wrapped in SharedLockGuard guard(Profiler::instance()->classMapLock()). Multiple shared-lock holders coexist with CAS-based inserters; the exclusive clear() waits until writeClasses returns.
  • Added a public Profiler::classMapLock() accessor next to the existing classMap() accessor.
  • Documented the signal-safety hazard on the inserting Dictionary::lookup overloads.
  • Corrected the misleading comments in writeCpool and writeClasses that previously claimed concurrent access was already safe.

Motivation:

PROF-14549 reports a SIGSEGV in Dictionary::collect reached via Recording::writeClasses during Profiler::dump. Investigation traced the crash to:

  1. walkVM calling inserting lookup from a signal handler — malloc-arena tearing under signal preemption.
  2. writeClasses reading the Dictionary chain unsynchronised against the only writer that bypassed _class_map_lock (the same walkVM site) and against the dump-path clear().

The vtable_target feature is on by default (arguments.h:217 initialises the bit-field with _features{1, 1, 1, 1, 1, 1}), so the buggy site is reachable on every CPU/wall/native-alloc sample whose top frame is a vtable stub.

Additional Notes:

  • The BCI_ALLOC reuse at hotspotSupport.cpp:388 is semantic overloading (frame marker, not allocation event). It was downported from upstream async-profiler PR #1537 and is preserved as-is here — only the signal-safety of the lookup is changed.
  • The same hazard exists in upstream src/stackWalker.cpp:399-405 (verified against commit 5aee9cdb). Upstream's dictionary.cpp has no bounded_lookup helper; the Datadog-local helper added during the OTEL hot-path safety work makes this fix possible without modifying Dictionary itself. Filing the equivalent fix upstream is a follow-up.
  • Trade-off: first-occurrence vtable-target annotations for cold classes are temporarily lost until the JVMTI / Java-API path inserts the class via Profiler::lookupClass. Eliminating the SIGSEGV outweighs the cosmetic loss.
  • The dump-side unlockAll()_class_map_lock.lock() window in profiler.cpp:1371-1375 is left alone: once walkVM no longer writes, the only writers are lock-respecting JVMTI/Java-API paths.

How to test the change?:

  • New unit/stress test at ddprof-lib/src/test/cpp/dictionary_concurrent_ut.cpp covers:
    1. bounded_lookup(..., 0) on a miss returns INT_MAX and does not insert (verified via subsequent collect() returning empty).
    2. bounded_lookup(..., 0) on a hit returns the same id that the inserting lookup produced.
    3. ~500ms concurrent run with 4 shared-lock inserter threads, one shared-lock collector thread, and one exclusive-lock clear thread driving the same Dictionary through an external SpinLock — asserts no crash, well-formed snapshots, and progress on every role. The test is the regression gate for the lock discipline introduced here; it fails (data race / SIGSEGV under TSan/ASan) without the lock and passes with it.
  • Existing CPU / wall / native-alloc / JVMTI alloc / liveness profile tests continue to pass — there is no behavioral change on the lookup hit path.
  • Manual repro recommendation for reviewers: a vtable-heavy workload (e.g. ConcurrentHashMap-heavy benchmarks) with allocation profiling enabled and periodic JavaProfiler.dump, sustained for several minutes, should no longer SIGSEGV.

For Datadog employees:

  • If this PR touches code that signs or publishes builds or packages, or handles credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.
  • JIRA: PROF-14549

Unsure? Have a question? Request a review!

🤖 Generated with Claude Code

@jbachorik jbachorik added the AI label May 7, 2026
@jbachorik jbachorik requested a review from Copilot May 7, 2026 10:12
@jbachorik jbachorik changed the title fix(profiler): make _class_map access signal-safe in walkVM (PROF-14549) fix(profiler): make _class_map access signal-safe in walkVM May 7, 2026

This comment was marked as outdated.

walkVM ran the inserting Dictionary::lookup from a signal handler, calling
malloc/calloc — async-signal-unsafe. Recording::writeClasses also walked
_class_map without holding _class_map_lock, racing the exclusive clear()
that follows in Profiler::dump.

Switch the vtable-target site to bounded_lookup(..., 0) and skip the
synthetic frame on miss; hold _class_map_lock shared while writeClasses
recurses into Dictionary::collect; document the signal-safety hazard on
the inserting overloads; add a concurrent test for the lock discipline.

Fixes PROF-14549.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jbachorik jbachorik force-pushed the muse/sigsegv-in-recording-writeclasses branch from f3e2ccb to 7880e7c Compare May 7, 2026 10:25
@jbachorik jbachorik requested a review from Copilot May 7, 2026 10:28
@jbachorik
Copy link
Copy Markdown
Collaborator Author

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7880e7ce7f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread ddprof-lib/src/main/cpp/hotspot/hotspotSupport.cpp Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Comment thread ddprof-lib/src/test/cpp/dictionary_concurrent_ut.cpp
Codex flagged a remaining race after the initial fix: walkVM's read-only
bounded_lookup still traverses row->next and reads row->keys[j] without
holding _class_map_lock, while Profiler::dump's exclusive _class_map.clear()
runs in the unlockAll -> lock window and free()s those same nodes/strings.

Take _class_map_lock shared via OptionalSharedLockGuard (try-lock so the
signal handler never blocks). When an exclusive clear() is in progress,
drop the synthetic vtable-target frame instead of dereferencing freed
memory.

Also add the missing <cstring> include in dictionary_concurrent_ut.cpp;
strlen() previously worked only via a transitive include through
gtest_crash_handler.h.

Addresses review comments on PR #512.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dd-octo-sts
Copy link
Copy Markdown
Contributor

dd-octo-sts Bot commented May 7, 2026

CI Test Results

Run: #25491321377 | Commit: 54dd2c0 | Duration: 28m 52s (longest job)

All 32 test jobs passed

Status Overview

JDK glibc-aarch64/debug glibc-amd64/debug musl-aarch64/debug musl-amd64/debug
8 - - -
8-ibm - - -
8-j9 - -
8-librca - -
8-orcl - - -
11 - - -
11-j9 - -
11-librca - -
17 - -
17-graal - -
17-j9 - -
17-librca - -
21 - -
21-graal - -
21-librca - -
25 - -
25-graal - -
25-librca - -

Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled

Summary: Total: 32 | Passed: 32 | Failed: 0


Updated: 2026-05-07 11:35:04 UTC

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants