
cDAC GC stress verification tool (GCSTRESS_CDAC=0x20)#125505

Draft
max-charlamb wants to merge 29 commits into dotnet:main from max-charlamb:cdac-stackreferences-2-with-stress

@max-charlamb

cDAC GC Stress Verification Tool

Adds an in-process verification mode that compares cDAC stack reference walking against the legacy DAC and the runtime at instruction-level GC stress trigger points.

What it does

At each GC stress point (DOTNET_GCStress=0x24, i.e. instruction-level JIT stress 0x4 combined with GCSTRESS_CDAC=0x20), the tool:

  1. Loads the cDAC (mscordaccore_universal) and legacy DAC (mscordaccore) in-process
  2. Collects stack GC references from all three sources (cDAC, legacy DAC, runtime)
  3. Compares cDAC vs DAC (apples-to-apples) and reports mismatches with full ref details
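
The comparison in step 3 can be pictured with a minimal sketch. It assumes a simplified, hypothetical `StackRef` record and a `CompareRefs` helper (neither is taken from cdacgcstress.cpp), and it matches refs by address rather than by index, as one of the commit messages describes, because the two walkers may report refs in a different order. It also assumes each address appears at most once per list.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified view of one stack GC reference.
struct StackRef {
    std::uint64_t address; // location of the slot holding the reference
    std::uint64_t object;  // object the slot points to
    std::uint32_t flags;   // GC flags for the slot
};

// Compare two ref lists as sets keyed by address and return one message per
// mismatch. Assumes addresses are unique within each list.
std::vector<std::string> CompareRefs(const std::vector<StackRef>& cdac,
                                     const std::vector<StackRef>& dac)
{
    std::vector<std::string> mismatches;
    std::unordered_map<std::uint64_t, const StackRef*> dacByAddress;
    for (const StackRef& r : dac)
        dacByAddress[r.address] = &r;

    for (const StackRef& r : cdac) {
        auto it = dacByAddress.find(r.address);
        if (it == dacByAddress.end()) {
            mismatches.push_back("cDAC-only ref at " + std::to_string(r.address));
            continue;
        }
        if (it->second->object != r.object || it->second->flags != r.flags)
            mismatches.push_back("field mismatch at " + std::to_string(r.address));
        dacByAddress.erase(it); // matched, so it is not DAC-only
    }
    for (const auto& kv : dacByAddress)
        mismatches.push_back("DAC-only ref at " + std::to_string(kv.first));
    return mismatches;
}
```

An empty result means the two walkers agree on that stress point; any entry would be what a fail-fast or log-file option acts on.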

New files

  • cdacgcstress.h/cpp — In-process loading, COM adapters, three-way comparison
  • test-cdac-gcstress.ps1 — Build and test script

Integration points

  • GCSTRESS_CDAC=0x20 flag in eeconfig.h
  • GCStressCdacFailFast / GCStressCdacLogFile config vars
  • Hooks in both DoGcStress functions (gccover.cpp)
  • Init/shutdown in ceemain.cpp
  • cdac_reader_flush_cache API for cache invalidation between stress points

Usage

DOTNET_GCStress=0x24 DOTNET_GCStressCdacLogFile=results.txt corerun test.dll

Depends on

Test results

Verified with 1,700+ stress points across 5 test apps (simple allocations, generics, exception handling, closures, recursion) with 0 failures.

Max Charlamb and others added 29 commits March 11, 2026 13:30
… support

Squash of cdac-stackreferences branch changes onto main:
- Implement stack reference enumeration (EnumerateStackRefs)
- Add GC scanning support (GcScanner, GcScanContext, BitStreamReader)
- Add exception handling for stack walks (ExceptionHandling)
- Add IsFunclet/IsFilterFunclet to execution manager
- Add EH clause retrieval for ReadyToRun
- Add data types: EEILExceptionClause, CorCompileExceptionClause,
  CorCompileExceptionLookupEntry, LastReportedFuncletInfo
- Update datadescriptor.inc with new type layouts
- Update SOSDacImpl with improved stack walk support

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Port the native GcInfoDecoder::EnumerateLiveSlots to managed code:
- Add FindSafePoint for partially-interruptible safe point lookup
- Handle partially-interruptible path (1-bit-per-slot and RLE encoded)
- Handle indirect live state table with pointer offset indirection
- Handle fully-interruptible path with chunk-based lifetime transitions
  (couldBeLive bitvectors, final state bits, transition offsets)
- Report untracked slots (always live unless suppressed by flags)
- Add InterruptibleRanges/SlotTable decode points for lazy decoding
- Save safe point and live state bit offsets during body decode
- Add POINTER_SIZE_ENCBASE, LIVESTATE_RLE_*, NUM_NORM_CODE_OFFSETS_*
  constants to IGCInfoTraits (same across all platforms)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix IsFrameless: use StackWalkState.SW_FRAMELESS check
- Fix EnumGcRefs call: pass CodeManagerFlags parameter (was missing)
- Add public access modifier to GetMethodRegionInfo in ExecutionManager_1/2
- Fix redundant equality (== false) in ExecutionManagerCore
- Suppress unused parameter/variable analyzer errors in GcScanner stub

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Wire GcScanner to use IGCInfoDecoder.EnumerateLiveSlots
- Add LiveSlotCallback delegate and EnumerateLiveSlots to IGCInfoDecoder
- Add interface implementation in GcInfoDecoder that wraps the generic method
- Translate register slots to values via IPlatformAgnosticContext
- Translate stack slots using SP/FP base + offset addressing
- Add StackBaseRegister accessor to GcInfoDecoder
- Report live slots to GcScanContext.GCEnumCallback with proper flags
- Add GcScanFlags.None value
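
As an illustration of the base + offset addressing mentioned above, here is a minimal sketch. The `FrameContext` struct and `SlotAddress` function are hypothetical; the three base kinds mirror the GC_CALLER_SP_REL, GC_SP_REL, and GC_FRAMEREG_REL identifiers named elsewhere in this PR.

```cpp
#include <cassert>
#include <cstdint>

// Base kinds for stack slots, named after the GC_*_REL identifiers in this PR.
enum class StackSlotBase { CallerSpRel, SpRel, FrameRegRel };

// Hypothetical subset of a frame's register context.
struct FrameContext {
    std::uint64_t sp;       // current stack pointer
    std::uint64_t callerSp; // caller's SP, obtained by unwinding one frame
    std::uint64_t frameReg; // value of the stack base (frame) register
};

// Resolve a stack slot to an absolute address: pick the base, add the signed offset.
std::uint64_t SlotAddress(const FrameContext& ctx, StackSlotBase base, std::int32_t offset)
{
    std::uint64_t baseValue = 0;
    switch (base) {
        case StackSlotBase::CallerSpRel: baseValue = ctx.callerSp; break;
        case StackSlotBase::SpRel:       baseValue = ctx.sp;       break;
        case StackSlotBase::FrameRegRel: baseValue = ctx.frameReg; break;
    }
    // Sign-extend the offset, then rely on well-defined unsigned wraparound.
    return baseValue + static_cast<std::uint64_t>(static_cast<std::int64_t>(offset));
}
```

Using the wrong base for a slot kind (for example the current SP for a caller-SP-relative slot) yields a plausible-looking but incorrect address, which is the class of bug a later commit in this PR fixes.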

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add StackReferenceData public data class in Abstractions
- Change IStackWalk.WalkStackReferences to return IReadOnlyList<StackReferenceData>
- Update StackWalk_1.WalkStackReferences to convert and return results
- Add ISOSStackRefEnum, ISOSStackRefErrorEnum COM interfaces with GUIDs
- Add SOSStackRefData, SOSStackRefError structs, SOSStackSourceType enum
- Add SOSStackRefEnum class implementing ISOSStackRefEnum (follows SOSHandleEnum pattern)
- Wire up SOSDacImpl.GetStackReferences: find thread by OS ID, walk stack
  references, convert to SOSStackRefData[], return via COM enumerator
- Remove Console.WriteLine debug output from WalkStackReferences

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add three test classes for stack reference enumeration:
- StackReferenceDumpTests: Basic tests using StackWalk debuggee
  (WalkStackReferences returns without throwing, refs have valid source info)
- GCRootsStackReferenceDumpTests: Tests using GCRoots debuggee which keeps
  objects alive on stack via GC.KeepAlive (finds refs, refs point to valid objects)
- PInvokeFrameStackReferenceDumpTests: Tests using PInvokeStub debuggee which
  has InlinedCallFrame on the stack (non-frameless Frame path)

The PInvokeStub tests exercise the Frame::GcScanRoots path which is not yet
implemented (empty else block in WalkStackReferences).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add native C++ changes needed for the data descriptor entries:
- Add friend cdac_data<ExInfo> to ExceptionFlags for m_flags access
- Add LastReportedFuncletInfo struct and field to ExInfo
- Add cdac_data<PatchpointInfo> specialization for LocalCount
- Use cdac_data<ExInfo>::ExceptionFlagsValue for ExceptionFlags offset

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add ScanFrameRoots method that dispatches based on frame type name.
Most frame types use the base Frame::GcScanRoots_Impl which is a no-op.

Key findings documented in the code:
- GCFrame is NOT part of the Frame chain and the DAC does not scan it
- Stub frames (StubDispatch, External, CallCounting, Dynamic, CLRToCOM)
  call PromoteCallerStack to report method arguments — not yet implemented
- InlinedCallFrame, SoftwareExceptionFrame, etc. use the base no-op

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ation

- Fix GcScanSlotLocation register for stack slots: was hardcoded to 0,
  now correctly maps GC_SP_REL→RSP(4), GC_FRAMEREG_REL→stackBaseRegister
- Update GetStackReferences debug block to use set-based comparison
  (match by Address) instead of index-based, since ref ordering may differ
- Validate Object, SourceType, Source, and Flags for each matched ref

Known issue: Some refs have different computed addresses between cDAC and
legacy DAC due to stack slot address computation differences. Needs further
investigation of SP/FP handling during stack walk context management.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fix two bugs in the GCInfoDecoder slot table decoder that caused wrong
slots to be reported as live:

1. When previous slot had non-zero flags, subsequent slots use a FULL
   offset (STACK_SLOT_ENCBASE) not a delta. The managed code incorrectly
   used STACK_SLOT_DELTA_ENCBASE for this case.

2. When previous slot had zero flags, subsequent slots use an unsigned
   delta (DecodeVarLengthUnsigned) with no +1 adjustment. The managed
   code incorrectly used DecodeVarLengthSigned with +1.

Both bugs affected tracked and untracked stack slot sections.

Verified with DOTNET_ENABLE_CDAC=1 and cdb against three debuggee dumps:
all refs now match the legacy DAC exactly (count, Address, Object,
Source, SourceType, Flags, Register, Offset for every ref).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fix two bugs found via deep comparison with native GCInfoDecoder:

1. ARM64GCInfoTraits.DenormalizeStackBaseRegister used 0x29 (41 decimal)
   instead of 29 decimal. ARM64's frame pointer is X29, so the native
   XORs with 29. This would produce wrong addresses for all ARM64
   stack-base-relative GC slots.

2. When ExecutionAborted and instruction offset is not in any
   interruptible range, the native code jumps to ExitSuccess (skips
   all reporting). The managed code incorrectly jumped to
   ReportUntracked, which would over-report untracked slots for
   aborted frames.
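
The first fix is easy to see in isolation. The sketch below uses hypothetical free functions (the real code lives in ARM64GCInfoTraits) to contrast XOR with decimal 29 against XOR with 0x29, which is 41 decimal.

```cpp
#include <cassert>
#include <cstdint>

// ARM64's frame pointer is X29, so the encoded value is XORed with decimal 29.
constexpr std::uint32_t DenormalizeStackBaseRegister(std::uint32_t encoded)
{
    return encoded ^ 29u;
}

// The buggy variant: 0x29 is 41 decimal, not 29.
constexpr std::uint32_t BuggyDenormalizeStackBaseRegister(std::uint32_t encoded)
{
    return encoded ^ 0x29u;
}
```

With an encoded value of 0 (the common frame-pointer case), the correct function yields 29 (X29) while the buggy one yields 41, which is outside the x0-x30 general-purpose register range.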

Also documented the missing scratch register/slot filtering as a
known gap (TODO in ReportSlot). The native ReportSlotToGC checks
IsScratchRegister/IsScratchStackSlot for non-leaf frames; the cDAC
currently reports all slots unconditionally.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Match native safe point skip: always skip numSafePoints * numTracked
  bits in the else branch, matching the native behavior. The indirect
  table case (numBitsPerOffset != 0) combined with interruptible ranges
  is unreachable in practice.
- Add TODO for FindSafePoint binary search optimization (perf only,
  no correctness impact).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add scratch register filtering to match native ReportSlotToGC behavior:

- Add IsScratchRegister to IGCInfoTraits with per-platform implementations:
  - AMD64: preserved = rbx, rbp, rsi, rdi, r12-r15 (Windows ABI)
  - ARM64: preserved = x19-x28; scratch = x0-x17, x29-x30
  - ARM: preserved = r4-r11; scratch = r0-r3, r12, r14
  - Interpreter: no scratch registers
- Add scratch filtering in ReportSlot: skip scratch registers for
  non-leaf frames (when ActiveStackFrame is not set)
- Add ReportFPBasedSlotsOnly filtering: skip register slots and
  non-FP-relative stack slots when flag is set
- Add IsScratchStackSlot check: skip SP-relative slots in the
  outgoing/scratch area for non-leaf frames
- Set ActiveStackFrame flag for the first frameless frame in
  WalkStackReferences (matching native GetCodeManagerFlags behavior)
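
A minimal sketch of the AMD64 case, assuming the standard x64 register numbering (rax=0, rcx=1, rdx=2, rbx=3, rsp=4, rbp=5, rsi=6, rdi=7, r8-r15=8-15) and the Windows ABI preserved set listed above. The function name follows the commit message, but the mask constant and signature are illustrative, not copied from the PR.

```cpp
#include <cassert>
#include <cstdint>

// Bit i is set when register number i is callee-preserved under the Windows
// x64 ABI: rbx(3), rbp(5), rsi(6), rdi(7), r12-r15.
constexpr std::uint32_t kPreservedMaskWindowsAmd64 =
    (1u << 3) | (1u << 5) | (1u << 6) | (1u << 7) |
    (1u << 12) | (1u << 13) | (1u << 14) | (1u << 15);

// A register is scratch if it is not in the preserved set.
// Precondition: 0 <= regNum <= 15.
bool IsScratchRegister(int regNum)
{
    return (kPreservedMaskWindowsAmd64 & (1u << regNum)) == 0;
}
```

For a non-leaf frame, a live slot in a register for which this returns true would be skipped, since the callee may already have overwritten that register.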

Verified with DOTNET_ENABLE_CDAC=1 against three debuggee dumps:
all refs match the legacy DAC exactly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fix 5 issues from PR dotnet#125075 review:

1. datadescriptor.inc: Fix EHInfo type annotation from /*uint16*/ to
   /*pointer*/ — phdrJitEHInfo is PTR_EE_ILEXCEPTION, not uint16.

2. StackWalk.md: Update GetMethodDescPtr(IStackDataFrameHandle) docs
   to describe InlinedCallFrame special case for interop MethodDesc
   reporting at SW_SKIPPED_FRAME positions.

3. BitStreamReader: Replace static host-dependent BitsPerSize
   (IntPtr.Size * 8) with instance-based _bitsPerSize
   (target.PointerSize * 8) for correct cross-architecture analysis.

4. SOSDacImpl: Restore GetMethodDescPtrFromFrame implementation that
   was incorrectly stubbed with E_FAIL. Restores the cDAC
   implementation with debug validation against legacy DAC.

5. ReadyToRunJitManager: Fix GetEHClauses clause address computation
   to include entry.ExceptionInfoRva — was computing from imageBase
   directly, missing the RVA offset to the exception info section.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fix several bugs in the cDAC's stack reference walking that caused
mismatches against the legacy DAC during GC stress testing:

- Fix GC_CALLER_SP_REL using wrong base address: GcScanner used the
  current context's StackPointer for GC_CALLER_SP_REL slots instead
  of the actual caller SP. Fixed by computing the caller SP via
  clone+unwind, with lazy caching to avoid repeated unwinds.

- Fix IsFirst/ActiveStackFrame tracking: The cDAC used a simple
  isFirstFramelessFrame boolean to determine active frame status.
  Replaced with an IsFirst state machine in StackWalkData matching
  native CrawlFrame::isFirst semantics - starts true, set false
  after frameless frames, restored to true after FRAME_ATTR_RESUMABLE
  frames (ResumableFrame, RedirectedThreadFrame, HijackFrame).

- Fix FaultingExceptionFrame incorrectly treated as resumable:
  FaultingExceptionFrame has FRAME_ATTR_FAULTED but NOT
  FRAME_ATTR_RESUMABLE. Including it in the resumable check caused
  IsFirst=true on the wrong managed frame, producing spurious
  scratch register refs.

- Skip Frames below initial context SP in CreateStackWalk: Matches
  the native DAC behavior where StackWalkFrames with a profiler
  filter context skips Frames at lower SP (pushed more recently).
  Without this, RedirectedThreadFrame from GC stress redirect
  incorrectly set IsFirst=true for non-leaf managed frames.

- Refactor scratch stack slot detection into IsScratchStackSlot on
  platform traits (AMD64, ARM64, ARM), matching the native
  GcInfoDecoder per-platform IsScratchStackSlot pattern.
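
The IsFirst state machine described above can be sketched as follows. `FrameKind` and `IsFirstTracker` are hypothetical names, and the handling of non-resumable explicit Frames (state left unchanged) is an assumption, since the commit message only specifies the frameless and resumable transitions.

```cpp
#include <cassert>

// Hypothetical classification of what the stack walker is visiting.
enum class FrameKind { Frameless, ResumableFrame, OtherFrame };

struct IsFirstTracker {
    bool isFirst = true; // starts true

    // Returns the IsFirst value that applies to the frame being visited,
    // then updates the state for the next frame.
    bool Visit(FrameKind kind)
    {
        const bool appliesToThisFrame = isFirst;
        switch (kind) {
            case FrameKind::Frameless:      isFirst = false; break; // cleared after a managed frame
            case FrameKind::ResumableFrame: isFirst = true;  break; // FRAME_ATTR_RESUMABLE restores it
            case FrameKind::OtherFrame:     break;                  // assumption: no change
        }
        return appliesToThisFrame;
    }
};
```

Under this model a FaultingExceptionFrame would be classified as `OtherFrame`, so the managed frame after it is not treated as active.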

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The initial Frame skip used the leaf's SP as the threshold, which
missed active InlinedCallFrames whose address was above the leaf SP
but below the caller SP. These Frames would be processed as SW_FRAME,
causing UpdateContextFromFrame to restore the IP to the P/Invoke
return address within the same method and producing duplicate GC refs.

Use the caller SP (computed by unwinding the initial managed frame)
as the skip threshold, matching the native CheckForSkippedFrames
which uses EnsureCallerContextIsValid + GetSP(pCallerContext). This
correctly skips all Frames between the managed frame and its caller,
including both RedirectedThreadFrame and active InlinedCallFrames.
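
A minimal sketch of the threshold change, assuming a downward-growing stack and representing each explicit Frame by its address only. `FramesToWalk` is a hypothetical helper, not a function from the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Keep only the Frames at or above the threshold. On a downward-growing stack,
// Frames below the threshold were pushed more recently than the managed frame.
std::vector<std::uint64_t> FramesToWalk(const std::vector<std::uint64_t>& frameAddresses,
                                        std::uint64_t thresholdSp)
{
    std::vector<std::uint64_t> kept;
    for (std::uint64_t addr : frameAddresses) {
        if (addr >= thresholdSp)
            kept.push_back(addr);
    }
    return kept;
}
```

With a leaf SP of 0x1000 and a caller SP of 0x1040, a Frame at 0x1010 survives the leaf-SP threshold but is skipped with the caller-SP threshold, which is the case the commit describes for active InlinedCallFrames.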

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Newer fields added to RealCodeHeader (EHInfo), ReadyToRunInfo
(ExceptionInfoSection), and ExceptionInfo (ExceptionFlags,
StackLowBound, StackHighBound, PassNumber, CSFEHClause,
CSFEnclosingClause, CallerOfActualHandlerFrame,
LastReportedFuncletInfo) may not exist in older contract versions.
Guard each with type.Fields.ContainsKey and default to safe values
to prevent KeyNotFoundException when analyzing older dumps.
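
The guard pattern is shown here as a C++ analog of the managed `type.Fields.ContainsKey` check. `FieldInfo`, `TypeFields`, and `ReadFieldOrDefault` are hypothetical, and target memory is modeled as a plain vector of words.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical descriptor entry: where a field lives inside the object.
struct FieldInfo { std::size_t wordIndex; };
using TypeFields = std::unordered_map<std::string, FieldInfo>;

// Read a field if the contract version describes it; otherwise return a safe
// default instead of failing on the missing key.
std::uint64_t ReadFieldOrDefault(const TypeFields& fields, const std::string& name,
                                 const std::vector<std::uint64_t>& objectWords,
                                 std::uint64_t defaultValue)
{
    auto it = fields.find(name);
    if (it == fields.end())
        return defaultValue; // older descriptor: field not present
    return objectWords.at(it->second.wordIndex);
}
```

The default has to be a value the caller already treats as "no data" (for example a null pointer), so that older dumps degrade to reduced functionality rather than an exception.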

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Remove unused usings in GcScanContext.cs (Data namespace, StackWalk_1 static)
- Fix trailing semicolon on class closing brace in StackWalk_1.cs
- Discard unused pMethodDesc assignment in StackWalk_1.cs
- Add buffer length validation in SOSStackRefEnum.Next to prevent IndexOutOfRangeException
- Use Debug.ValidateHResult in GetMethodDescPtrFromFrame to match codebase pattern

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Remove unused 'using Microsoft.Diagnostics.DataContractReader.Contracts.Extensions' from StackWalk_1.cs
- Remove unused 'using System.Linq' and 'using System' from StackReferenceDumpTests.cs
- Remove unused 'using System' from StackRefData.cs and GcScanSlotLocation.cs
- Clear ppEnum.Interface on failure paths in SOSDacImpl.GetStackReferences

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Restore the full GetMethodDescPtr(IStackDataFrameHandle) documentation
  in StackWalk.md that describes the ReportInteropMD special case. The
  docs were incorrectly simplified but the implementation was unchanged.
- Use specific friend declaration in patchpointinfo.h instead of generic
  template friend, matching the codebase convention.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The m_lastReportedFunclet field was added to ExInfo but is never
written by the runtime, making it always zero-initialized. The cDAC
code that reads it can never trigger. Remove the field from ExInfo,
the data descriptor entry, and the managed LastReportedFuncletInfo
data class. Mark the Filter code path as explicitly unreachable
with a TODO for when runtime support is added.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add an in-process cDAC verification mode that runs at GC stress
instruction-level trigger points. At each stress point, the tool:
1. Loads the cDAC (mscordaccore_universal) and legacy DAC in-process
2. Collects stack GC references from cDAC, legacy DAC, and runtime
3. Compares all three and reports mismatches

New files:
- cdacgcstress.h/cpp: In-process cDAC/DAC loading, three-way
  comparison framework with detailed mismatch logging
- test-cdac-gcstress.ps1: Build and test script

Integration:
- GCSTRESS_CDAC=0x20 flag in eeconfig.h
- GCStressCdacFailFast/GCStressCdacLogFile config vars
- Hooks in both DoGcStress functions in gccover.cpp
- Init/shutdown in ceemain.cpp
- cdac_reader_flush_cache API for cache invalidation

Usage: DOTNET_GCStress=0x24 (instruction JIT + cDAC verification)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 12, 2026 17:11
@dotnet-policy-service

Tagging subscribers to this area: @steveisok, @tommcdon, @dotnet/dotnet-diag
See info in area-owners.md if you want to be subscribed.


Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
