Skip to content

fix: free line number tables while class pins held#510

Open
jbachorik wants to merge 3 commits intomainfrom
muse/sigsegv-in-recording
Open

fix: free line number tables while class pins held#510
jbachorik wants to merge 3 commits intomainfrom
muse/sigsegv-in-recording

Conversation

@jbachorik
Copy link
Copy Markdown
Collaborator

@jbachorik jbachorik commented May 6, 2026

What does this PR do?:
Fixes SIGSEGV in Recording::cleanupUnreferencedMethods by moving the cleanup call inside finishChunk(), before GetLoadedClasses pins are released via DeleteLocalRef.

Motivation:
PROF-14545: Production crash (SIGSEGV) in Recording::cleanupUnreferencedMethods, first seen in dd-trace-java 1.56.1 after PR #327 introduced method-map cleanup.

Root cause: cleanupUnreferencedMethods() was called after finishChunk() returned, at which point the local references from GetLoadedClasses had already been released. On JVM implementations that free JVMTI-allocated line number table memory when a class is unloaded, the jvmti->Deallocate(_ptr) call inside SharedLineNumberTable::~SharedLineNumberTable() would access freed memory → SIGSEGV.

finishChunk() calls GetLoadedClasses at its start, which pins all loaded classes in memory for the duration. By moving cleanup to run before DeleteLocalRef releases those pins, we guarantee that Deallocate is called while the corresponding class is still loaded.

Additional Notes:
The fix is a 2-line change: finishChunk() gains a do_cleanup=false optional parameter, and switchChunk() passes do_cleanup=true instead of calling cleanupUnreferencedMethods() separately.

How to test the change?:
Added CleanupAfterClassUnloadTest which reproduces the crash scenario:

  1. A dynamically-loaded class with line number tables appears in CPU profiling stack traces (profiler calls GetLineNumberTable and stores the table in method_map)
  2. All references to the class and its class loader are dropped; System.gc() is called until the class is confirmed unloaded via WeakReference
  3. AGE_THRESHOLD+1 dump cycles age the method out of method_map and trigger cleanup
  4. The profiler must survive without SIGSEGV — verified by asserting it produced JFR output

Without the fix, this test will crash on JVM implementations that free JVMTI-allocated memory on class unload. With the fix, it reliably passes.

For Datadog employees:

  • This PR doesn't touch any of that.
  • JIRA: PROF-14545

jbachorik and others added 2 commits May 6, 2026 18:31
cleanupUnreferencedMethods() was called after finishChunk() released
GetLoadedClasses pins. If a JVM freed JVMTI-allocated line number table
memory on class unload, the subsequent jvmti->Deallocate in
SharedLineNumberTable::~SharedLineNumberTable() would hit freed memory.

Move the cleanup call into finishChunk() before releasing class pins so
Deallocate always runs while the corresponding class is still loaded.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nup)

Exercise the crash scenario: dynamic class gets profiled (line number
table stored in method_map), class loader is dropped and GC'd, then
AGE_THRESHOLD+1 dump cycles age the method out and trigger cleanup.
Profiler must survive. Catches the bug on JVMs that free JVMTI-
allocated memory on class unload.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jbachorik jbachorik added the AI label May 6, 2026
@dd-octo-sts
Copy link
Copy Markdown
Contributor

dd-octo-sts Bot commented May 6, 2026

CI Test Results

Run: #25502881382 | Commit: 32538ed | Duration: 33m 29s (longest job)

All 32 test jobs passed

Status Overview

JDK glibc-aarch64/debug glibc-amd64/debug musl-aarch64/debug musl-amd64/debug
8 - - -
8-ibm - - -
8-j9 - -
8-librca - -
8-orcl - - -
11 - - -
11-j9 - -
11-librca - -
17 - -
17-graal - -
17-j9 - -
17-librca - -
21 - -
21-graal - -
21-librca - -
25 - -
25-graal - -
25-librca - -

Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled

Summary: Total: 32 | Passed: 32 | Failed: 0


Updated: 2026-05-07 15:21:32 UTC

@jbachorik jbachorik changed the title fix: free line number tables while class pins held (PROF-14545) fix: free line number tables while class pins held May 6, 2026
@jbachorik jbachorik requested a review from Copilot May 6, 2026 17:18
@jbachorik
Copy link
Copy Markdown
Collaborator Author

@codex review

@chatgpt-codex-connector
Copy link
Copy Markdown

💡 Codex Review

long deadline = System.currentTimeMillis() + 8_000;
while (loaderRef.get() != null && System.currentTimeMillis() < deadline) {
System.gc();
Thread.sleep(30);
}

P2 Badge Assert the class actually unloads

In runs where this loader is not cleared before the 8-second deadline, the loop silently falls through and the test continues with the dynamic class still loaded. That means the later dump cycles exercise cleanup while the class is still pinned/live, so the regression can pass without reproducing the class-unload scenario it is meant to guard; please assert or use an assumption that loaderRef.get() == null before aging out the method.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@jbachorik
Copy link
Copy Markdown
Collaborator Author

Addressed — added assumeTrue(loaderRef.get() == null, …) immediately after the GC poll loop so the test skips rather than silently continuing when class unloading does not occur within the 8 s window. Using assumeTrue (vs assertTrue) because the CI matrix spans JVMs (J9, Graal, musl variants) where class unloading is non-deterministic — a hard failure would be noise, not signal.

- CleanupAfterClassUnloadTest.java:98 — assumeTrue after GC poll so
  test is skipped (not silently passed) on JVMs where class unloading
  is non-deterministic (J9, Graal, musl variants)

Co-Authored-By: muse <muse@noreply>
@jbachorik jbachorik marked this pull request as ready for review May 7, 2026 15:37
@jbachorik jbachorik requested a review from a team as a code owner May 7, 2026 15:37
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b8f17cf30f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +120 to +123
} finally {
try { Files.deleteIfExists(baseFile); } catch (IOException ignored) {}
try { Files.deleteIfExists(dumpFile); } catch (IOException ignored) {}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Stop the manually restarted profiler on abort

When this test aborts at the class-unload assumption or throws anywhere after the manual profiler.execute(...), the finally block only deletes files and leaves the profiler running. Because stopProfiler() was called before the manual start, AbstractProfilerTest.stopped remains true, so the @AfterEach cleanup also skips profiler.stop(). This can leave an active recording that makes subsequent tests fail or observe the wrong profiler state; the manual start should be paired with a stop in this finally path.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant