fix(profiler): make _class_map access signal-safe in walkVM#512
Draft
fix(profiler): make _class_map access signal-safe in walkVM#512
Conversation
walkVM ran the inserting Dictionary::lookup from a signal handler, calling malloc/calloc — async-signal-unsafe. Recording::writeClasses also walked _class_map without holding _class_map_lock, racing the exclusive clear() that follows in Profiler::dump. Switch the vtable-target site to bounded_lookup(..., 0) and skip the synthetic frame on miss; hold _class_map_lock shared while writeClasses recurses into Dictionary::collect; document the signal-safety hazard on the inserting overloads; add a concurrent test for the lock discipline. Fixes PROF-14549. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
f3e2ccb to
7880e7c
Compare
Collaborator
Author
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7880e7ce7f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Codex flagged a remaining race after the initial fix: walkVM's read-only bounded_lookup still traverses row->next and reads row->keys[j] without holding _class_map_lock, while Profiler::dump's exclusive _class_map.clear() runs in the unlockAll -> lock window and free()s those same nodes/strings. Take _class_map_lock shared via OptionalSharedLockGuard (try-lock so the signal handler never blocks). When an exclusive clear() is in progress, drop the synthetic vtable-target frame instead of dereferencing freed memory. Also add the missing <cstring> include in dictionary_concurrent_ut.cpp; strlen() previously worked only via a transitive include through gtest_crash_handler.h. Addresses review comments on PR #512. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
CI Test ResultsRun: #25491321377 | Commit:
Status Overview
Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled Summary: Total: 32 | Passed: 32 | Failed: 0 Updated: 2026-05-07 11:35:04 UTC |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do?:
Fixes a SIGSEGV in
Recording::writeClasses(Dictionary::collectwalking a tornDictTable*chain) by closing two correctness holes aroundProfiler::_class_map.walkVMathotspotSupport.cpp:388previously called the inserting 2-argDictionary::lookupfrom a signal handler, which callsmalloc/calloc(dictionary.cpp:25,114). That is async-signal-unsafe and could tear the malloc arena. Switched to the existing read-onlybounded_lookup(..., 0); on miss (sentinelINT_MAX) the synthetic vtable-target frame is skipped — the JVMTI / Java-API path will populate the entry from a non-signal context.Recording::writeClassespreviously called_classes->collect(map)with no lock while JVMTI shared-lock writers (Profiler::lookupClass,ObjectSampler,LivenessTracker) could mutate the dictionary, and the dump path's exclusive_class_map.clear()ran shortly after. Now wrapped inSharedLockGuard guard(Profiler::instance()->classMapLock()). Multiple shared-lock holders coexist with CAS-based inserters; the exclusiveclear()waits untilwriteClassesreturns.Profiler::classMapLock()accessor next to the existingclassMap()accessor.Dictionary::lookupoverloads.writeCpoolandwriteClassesthat previously claimed concurrent access was already safe.Motivation:
PROF-14549 reports a SIGSEGV in
Dictionary::collectreached viaRecording::writeClassesduringProfiler::dump. Investigation traced the crash to:walkVMcalling insertinglookupfrom a signal handler — malloc-arena tearing under signal preemption.writeClassesreading theDictionarychain unsynchronised against the only writer that bypassed_class_map_lock(the samewalkVMsite) and against the dump-pathclear().The
vtable_targetfeature is on by default (arguments.h:217initialises the bit-field with_features{1, 1, 1, 1, 1, 1}), so the buggy site is reachable on every CPU/wall/native-alloc sample whose top frame is a vtable stub.Additional Notes:
BCI_ALLOCreuse athotspotSupport.cpp:388is semantic overloading (frame marker, not allocation event). It was downported from upstream async-profiler PR #1537 and is preserved as-is here — only the signal-safety of the lookup is changed.src/stackWalker.cpp:399-405(verified against commit5aee9cdb). Upstream'sdictionary.cpphas nobounded_lookuphelper; the Datadog-local helper added during the OTEL hot-path safety work makes this fix possible without modifyingDictionaryitself. Filing the equivalent fix upstream is a follow-up.Profiler::lookupClass. Eliminating the SIGSEGV outweighs the cosmetic loss.unlockAll()→_class_map_lock.lock()window inprofiler.cpp:1371-1375is left alone: oncewalkVMno longer writes, the only writers are lock-respecting JVMTI/Java-API paths.How to test the change?:
ddprof-lib/src/test/cpp/dictionary_concurrent_ut.cppcovers:bounded_lookup(..., 0)on a miss returnsINT_MAXand does not insert (verified via subsequentcollect()returning empty).bounded_lookup(..., 0)on a hit returns the same id that the insertinglookupproduced.Dictionarythrough an externalSpinLock— asserts no crash, well-formed snapshots, and progress on every role. The test is the regression gate for the lock discipline introduced here; it fails (data race / SIGSEGV under TSan/ASan) without the lock and passes with it.JavaProfiler.dump, sustained for several minutes, should no longer SIGSEGV.For Datadog employees:
@DataDog/security-design-and-guidance.Unsure? Have a question? Request a review!
🤖 Generated with Claude Code