cDAC GC stress verification tool (GCSTRESS_CDAC=0x20) #125505
Draft
max-charlamb wants to merge 29 commits into dotnet:main
… support
Squash of cdac-stackreferences branch changes onto main:
- Implement stack reference enumeration (EnumerateStackRefs)
- Add GC scanning support (GcScanner, GcScanContext, BitStreamReader)
- Add exception handling for stack walks (ExceptionHandling)
- Add IsFunclet/IsFilterFunclet to execution manager
- Add EH clause retrieval for ReadyToRun
- Add data types: EEILExceptionClause, CorCompileExceptionClause, CorCompileExceptionLookupEntry, LastReportedFuncletInfo
- Update datadescriptor.inc with new type layouts
- Update SOSDacImpl with improved stack walk support
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Port the native GcInfoDecoder::EnumerateLiveSlots to managed code:
- Add FindSafePoint for partially-interruptible safe point lookup
- Handle partially-interruptible path (1-bit-per-slot and RLE encoded)
- Handle indirect live state table with pointer offset indirection
- Handle fully-interruptible path with chunk-based lifetime transitions (couldBeLive bitvectors, final state bits, transition offsets)
- Report untracked slots (always live unless suppressed by flags)
- Add InterruptibleRanges/SlotTable decode points for lazy decoding
- Save safe point and live state bit offsets during body decode
- Add POINTER_SIZE_ENCBASE, LIVESTATE_RLE_*, NUM_NORM_CODE_OFFSETS_* constants to IGCInfoTraits (same across all platforms)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix IsFrameless: use StackWalkState.SW_FRAMELESS check
- Fix EnumGcRefs call: pass CodeManagerFlags parameter (was missing)
- Add public access modifier to GetMethodRegionInfo in ExecutionManager_1/2
- Fix redundant equality (== false) in ExecutionManagerCore
- Suppress unused parameter/variable analyzer errors in GcScanner stub
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Wire GcScanner to use IGCInfoDecoder.EnumerateLiveSlots
- Add LiveSlotCallback delegate and EnumerateLiveSlots to IGCInfoDecoder
- Add interface implementation in GcInfoDecoder that wraps the generic method
- Translate register slots to values via IPlatformAgnosticContext
- Translate stack slots using SP/FP base + offset addressing
- Add StackBaseRegister accessor to GcInfoDecoder
- Report live slots to GcScanContext.GCEnumCallback with proper flags
- Add GcScanFlags.None value
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add StackReferenceData public data class in Abstractions
- Change IStackWalk.WalkStackReferences to return IReadOnlyList<StackReferenceData>
- Update StackWalk_1.WalkStackReferences to convert and return results
- Add ISOSStackRefEnum, ISOSStackRefErrorEnum COM interfaces with GUIDs
- Add SOSStackRefData, SOSStackRefError structs, SOSStackSourceType enum
- Add SOSStackRefEnum class implementing ISOSStackRefEnum (follows SOSHandleEnum pattern)
- Wire up SOSDacImpl.GetStackReferences: find thread by OS ID, walk stack references, convert to SOSStackRefData[], return via COM enumerator
- Remove Console.WriteLine debug output from WalkStackReferences
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add three test classes for stack reference enumeration:
- StackReferenceDumpTests: Basic tests using StackWalk debuggee (WalkStackReferences returns without throwing, refs have valid source info)
- GCRootsStackReferenceDumpTests: Tests using GCRoots debuggee which keeps objects alive on stack via GC.KeepAlive (finds refs, refs point to valid objects)
- PInvokeFrameStackReferenceDumpTests: Tests using PInvokeStub debuggee which has InlinedCallFrame on the stack (non-frameless Frame path)
The PInvokeStub tests exercise the Frame::GcScanRoots path which is not yet implemented (empty else block in WalkStackReferences).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add native C++ changes needed for the data descriptor entries:
- Add friend cdac_data<ExInfo> to ExceptionFlags for m_flags access
- Add LastReportedFuncletInfo struct and field to ExInfo
- Add cdac_data<PatchpointInfo> specialization for LocalCount
- Use cdac_data<ExInfo>::ExceptionFlagsValue for ExceptionFlags offset
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add ScanFrameRoots method that dispatches based on frame type name. Most frame types use the base Frame::GcScanRoots_Impl which is a no-op.
Key findings documented in the code:
- GCFrame is NOT part of the Frame chain and the DAC does not scan it
- Stub frames (StubDispatch, External, CallCounting, Dynamic, CLRToCOM) call PromoteCallerStack to report method arguments (not yet implemented)
- InlinedCallFrame, SoftwareExceptionFrame, etc. use the base no-op
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ation
- Fix GcScanSlotLocation register for stack slots: was hardcoded to 0, now correctly maps GC_SP_REL→RSP(4), GC_FRAMEREG_REL→stackBaseRegister
- Update GetStackReferences debug block to use set-based comparison (match by Address) instead of index-based, since ref ordering may differ
- Validate Object, SourceType, Source, and Flags for each matched ref
Known issue: Some refs have different computed addresses between cDAC and legacy DAC due to stack slot address computation differences. Needs further investigation of SP/FP handling during stack walk context management.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
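The set-based comparison described in this commit can be illustrated with a small sketch. This is not the code from the PR (the debug block lives in managed SOSDacImpl); `StackRef` and `CompareByAddress` are hypothetical names, and the sketch assumes each address appears at most once per side.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for one reported stack reference.
struct StackRef {
    std::uint64_t address;  // location of the slot holding the reference
    std::uint64_t object;   // object the slot points at
    std::uint32_t flags;    // GC flags reported for the slot
};

// Matches refs by address rather than by position, so ordering differences
// between the two walkers do not register as mismatches.
// Returns one message per mismatch; an empty result means the sets agree.
std::vector<std::string> CompareByAddress(const std::vector<StackRef>& cdac,
                                          const std::vector<StackRef>& legacy) {
    std::vector<std::string> mismatches;
    std::map<std::uint64_t, StackRef> legacyByAddress;
    for (const StackRef& r : legacy) {
        legacyByAddress[r.address] = r;  // assumes unique addresses
    }

    for (const StackRef& r : cdac) {
        auto it = legacyByAddress.find(r.address);
        if (it == legacyByAddress.end()) {
            mismatches.push_back("only in cDAC: " + std::to_string(r.address));
            continue;
        }
        if (it->second.object != r.object || it->second.flags != r.flags) {
            mismatches.push_back("fields differ at: " + std::to_string(r.address));
        }
        legacyByAddress.erase(it);  // consumed; leftovers are legacy-only refs
    }
    for (const auto& entry : legacyByAddress) {
        mismatches.push_back("only in legacy DAC: " + std::to_string(entry.first));
    }
    return mismatches;
}
```

Two lists holding the same refs in a different order compare as equal, whereas an index-based comparison would flag every position.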
Fix two bugs in the GCInfoDecoder slot table decoder that caused wrong slots to be reported as live:
1. When previous slot had non-zero flags, subsequent slots use a FULL offset (STACK_SLOT_ENCBASE) not a delta. The managed code incorrectly used STACK_SLOT_DELTA_ENCBASE for this case.
2. When previous slot had zero flags, subsequent slots use an unsigned delta (DecodeVarLengthUnsigned) with no +1 adjustment. The managed code incorrectly used DecodeVarLengthSigned with +1.
Both bugs affected tracked and untracked stack slot sections. Verified with DOTNET_ENABLE_CDAC=1 and cdb against three debuggee dumps: all refs now match the legacy DAC exactly (count, Address, Object, Source, SourceType, Flags, Register, Offset for every ref).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
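The two rules from this commit can be sketched on already-decoded integers. The bit-level decoding (DecodeVarLengthSigned, DecodeVarLengthUnsigned and the ENCBASE constants) is deliberately left out; `SlotDecodeState` and `NextStackSlotOffset` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical record of the previously decoded stack slot.
struct SlotDecodeState {
    std::int32_t offset;   // normalized stack offset of the previous slot
    std::uint32_t flags;   // flags of the previous slot
};

// 'decodedValue' is whatever the bit stream yields for the next slot:
//  - previous flags non-zero: a FULL offset (read with STACK_SLOT_ENCBASE)
//  - previous flags zero:     an unsigned delta (read with
//                             STACK_SLOT_DELTA_ENCBASE), added with no +1
std::int32_t NextStackSlotOffset(const SlotDecodeState& prev,
                                 std::int32_t decodedValue) {
    if (prev.flags != 0) {
        return decodedValue;            // rule 1: full offset, not a delta
    }
    assert(decodedValue >= 0);          // rule 2: the delta is unsigned
    return prev.offset + decodedValue;  // rule 2: no +1 adjustment
}
```

With a previous slot at offset 16 and zero flags, a decoded delta of 8 yields 24; the buggy +1 variant would have yielded 25 and reported the wrong slot.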
Fix two bugs found via deep comparison with native GCInfoDecoder:
1. ARM64GCInfoTraits.DenormalizeStackBaseRegister used 0x29 (41 decimal) instead of 29 decimal. ARM64's frame pointer is X29, so the native XORs with 29. This would produce wrong addresses for all ARM64 stack-base-relative GC slots.
2. When ExecutionAborted and instruction offset is not in any interruptible range, the native code jumps to ExitSuccess (skips all reporting). The managed code incorrectly jumped to ReportUntracked, which would over-report untracked slots for aborted frames.
Also documented the missing scratch register/slot filtering as a known gap (TODO in ReportSlot). The native ReportSlotToGC checks IsScratchRegister/IsScratchStackSlot for non-leaf frames; the cDAC currently reports all slots unconditionally.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
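The first bug is a hex/decimal slip, which a short sketch makes concrete. The function below mirrors the XOR described in the commit message; the constant name is an assumption and the real implementation is the managed ARM64GCInfoTraits.

```cpp
#include <cassert>
#include <cstdint>

// X29 is the ARM64 frame pointer; the register number is decimal 29.
constexpr std::uint32_t kArm64FramePointer = 29;

// XOR with 29, as the native decoder does per the commit message, so an
// encoded value of 0 denormalizes to X29.
constexpr std::uint32_t DenormalizeStackBaseRegister(std::uint32_t encoded) {
    return encoded ^ kArm64FramePointer;
}

// The buggy constant: 0x29 is 41 decimal, which is not a valid X register.
static_assert(0x29 == 41, "0x29 is hexadecimal for 41, not 29");
static_assert(DenormalizeStackBaseRegister(0) == 29, "encoded 0 maps to X29");
```

Because XOR is its own inverse, applying the same function to 29 gives back 0, which is why the normalized form of the frame pointer is the cheapest value to encode.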
- Match native safe point skip: always skip numSafePoints * numTracked bits in the else branch, matching the native behavior. The indirect table case (numBitsPerOffset != 0) combined with interruptible ranges is unreachable in practice.
- Add TODO for FindSafePoint binary search optimization (perf only, no correctness impact).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add scratch register filtering to match native ReportSlotToGC behavior:
- Add IsScratchRegister to IGCInfoTraits with per-platform implementations:
  - AMD64: preserved = rbx, rbp, rsi, rdi, r12-r15 (Windows ABI)
  - ARM64: preserved = x19-x28; scratch = x0-x17, x29-x30
  - ARM: preserved = r4-r11; scratch = r0-r3, r12, r14
  - Interpreter: no scratch registers
- Add scratch filtering in ReportSlot: skip scratch registers for non-leaf frames (when ActiveStackFrame is not set)
- Add ReportFPBasedSlotsOnly filtering: skip register slots and non-FP-relative stack slots when flag is set
- Add IsScratchStackSlot check: skip SP-relative slots in the outgoing/scratch area for non-leaf frames
- Set ActiveStackFrame flag for the first frameless frame in WalkStackReferences (matching native GetCodeManagerFlags behavior)
Verified with DOTNET_ENABLE_CDAC=1 against three debuggee dumps: all refs match the legacy DAC exactly.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
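As a rough illustration of the AMD64 rule, the sketch below builds a preserved-register mask from the list in the commit message and applies the non-leaf filter. It assumes the standard x86-64 register numbering (rax=0, rcx=1, rdx=2, rbx=3, rsp=4, rbp=5, rsi=6, rdi=7, r8-r15=8-15); `ShouldReportRegisterSlot` is a hypothetical helper, not a function from the PR.

```cpp
#include <cassert>
#include <cstdint>

// Preserved registers per the commit message: rbx, rbp, rsi, rdi, r12-r15.
constexpr std::uint32_t kAmd64PreservedMask =
    (1u << 3) | (1u << 5) | (1u << 6) | (1u << 7) |
    (1u << 12) | (1u << 13) | (1u << 14) | (1u << 15);

// A register is scratch when it is not in the preserved set.
// Note: rsp (4) is not in the listed preserved set, so this mask
// classifies it as scratch.
constexpr bool IsScratchRegister(std::uint32_t regNum) {
    return (kAmd64PreservedMask & (1u << regNum)) == 0;
}

// Scratch registers are only meaningful in the active (leaf) frame; for
// non-leaf frames their contents were clobbered by the callee.
constexpr bool ShouldReportRegisterSlot(std::uint32_t regNum,
                                        bool isActiveStackFrame) {
    return isActiveStackFrame || !IsScratchRegister(regNum);
}
```

A live reference in rax (0) is therefore reported for the leaf frame but skipped for its callers, while one in r12 is reported in both cases.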
Fix 5 issues from PR dotnet#125075 review:
1. datadescriptor.inc: Fix EHInfo type annotation from /*uint16*/ to /*pointer*/; phdrJitEHInfo is PTR_EE_ILEXCEPTION, not uint16.
2. StackWalk.md: Update GetMethodDescPtr(IStackDataFrameHandle) docs to describe InlinedCallFrame special case for interop MethodDesc reporting at SW_SKIPPED_FRAME positions.
3. BitStreamReader: Replace static host-dependent BitsPerSize (IntPtr.Size * 8) with instance-based _bitsPerSize (target.PointerSize * 8) for correct cross-architecture analysis.
4. SOSDacImpl: Restore GetMethodDescPtrFromFrame implementation that was incorrectly stubbed with E_FAIL. Restores the cDAC implementation with debug validation against legacy DAC.
5. ReadyToRunJitManager: Fix GetEHClauses clause address computation to include entry.ExceptionInfoRva; it was computing from imageBase directly, missing the RVA offset to the exception info section.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fix several bugs in the cDAC's stack reference walking that caused mismatches against the legacy DAC during GC stress testing:
- Fix GC_CALLER_SP_REL using wrong base address: GcScanner used the current context's StackPointer for GC_CALLER_SP_REL slots instead of the actual caller SP. Fixed by computing the caller SP via clone+unwind, with lazy caching to avoid repeated unwinds.
- Fix IsFirst/ActiveStackFrame tracking: The cDAC used a simple isFirstFramelessFrame boolean to determine active frame status. Replaced with an IsFirst state machine in StackWalkData matching native CrawlFrame::isFirst semantics: starts true, set false after frameless frames, restored to true after FRAME_ATTR_RESUMABLE frames (ResumableFrame, RedirectedThreadFrame, HijackFrame).
- Fix FaultingExceptionFrame incorrectly treated as resumable: FaultingExceptionFrame has FRAME_ATTR_FAULTED but NOT FRAME_ATTR_RESUMABLE. Including it in the resumable check caused IsFirst=true on the wrong managed frame, producing spurious scratch register refs.
- Skip Frames below initial context SP in CreateStackWalk: Matches the native DAC behavior where StackWalkFrames with a profiler filter context skips Frames at lower SP (pushed more recently). Without this, RedirectedThreadFrame from GC stress redirect incorrectly set IsFirst=true for non-leaf managed frames.
- Refactor scratch stack slot detection into IsScratchStackSlot on platform traits (AMD64, ARM64, ARM), matching the native GcInfoDecoder per-platform IsScratchStackSlot pattern.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
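The IsFirst semantics described above fit in a few lines. The following is a minimal sketch, not the StackWalkData code: `FrameKind` and `IsFirstTracker` are hypothetical names, and leaving the flag unchanged for other explicit Frames is an assumption, since the commit message only specifies the frameless and resumable cases.

```cpp
#include <cassert>

// Hypothetical classification of what the stack walk just processed.
enum class FrameKind {
    Frameless,       // a managed method frame
    ResumableFrame,  // FRAME_ATTR_RESUMABLE: ResumableFrame,
                     // RedirectedThreadFrame, HijackFrame
    OtherFrame       // any other explicit Frame, e.g. FaultingExceptionFrame
};

class IsFirstTracker {
public:
    // True when the next frameless frame should be treated as the active
    // frame, i.e. its scratch registers are still valid.
    bool Current() const { return isFirst_; }

    void Advance(FrameKind processed) {
        switch (processed) {
            case FrameKind::Frameless:
                isFirst_ = false;  // frames above this one are callers
                break;
            case FrameKind::ResumableFrame:
                isFirst_ = true;   // the captured context is a full one
                break;
            case FrameKind::OtherFrame:
                break;             // assumption: flag left unchanged
        }
    }

private:
    bool isFirst_ = true;  // starts true for the leaf
};
```

Treating FaultingExceptionFrame as `OtherFrame` rather than `ResumableFrame` is what keeps the flag false for the managed frame that follows it.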
The initial Frame skip used the leaf's SP as the threshold, which missed active InlinedCallFrames whose address was above the leaf SP but below the caller SP. These Frames would be processed as SW_FRAME, causing UpdateContextFromFrame to restore the IP to the P/Invoke return address within the same method and producing duplicate GC refs. Use the caller SP (computed by unwinding the initial managed frame) as the skip threshold, matching the native CheckForSkippedFrames which uses EnsureCallerContextIsValid + GetSP(pCallerContext). This correctly skips all Frames between the managed frame and its caller, including both RedirectedThreadFrame and active InlinedCallFrames. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
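The effect of moving the threshold from the leaf SP to the caller SP can be shown with made-up addresses. `FramesToProcess` is a hypothetical helper; whether a Frame exactly at the threshold is kept is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Keeps only the explicit Frames at or above 'skipThreshold'. Frames at
// lower addresses were pushed more recently and are skipped.
std::vector<std::uint64_t> FramesToProcess(
        const std::vector<std::uint64_t>& frameAddresses,
        std::uint64_t skipThreshold) {
    std::vector<std::uint64_t> kept;
    for (std::uint64_t address : frameAddresses) {
        if (address >= skipThreshold) {
            kept.push_back(address);
        }
    }
    return kept;
}
```

With a leaf SP of 0x1400, a caller SP of 0x2000 and Frames at 0x1000, 0x1800 and 0x3000, the leaf-SP threshold keeps the Frame at 0x1800 (the active InlinedCallFrame case described above), while the caller-SP threshold keeps only 0x3000.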
Newer fields added to RealCodeHeader (EHInfo), ReadyToRunInfo (ExceptionInfoSection), and ExceptionInfo (ExceptionFlags, StackLowBound, StackHighBound, PassNumber, CSFEHClause, CSFEnclosingClause, CallerOfActualHandlerFrame, LastReportedFuncletInfo) may not exist in older contract versions. Guard each with type.Fields.ContainsKey and default to safe values to prevent KeyNotFoundException when analyzing older dumps. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
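The guard pattern looks roughly like the sketch below. The actual change is in managed code using type.Fields.ContainsKey; here `TypeLayout`, `FieldInfo` and `ReadFieldOrDefault` are hypothetical C++ stand-ins chosen to show the same idea.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical description of one field in a contract-described type.
struct FieldInfo {
    std::uint32_t offset;
};
using TypeLayout = std::unordered_map<std::string, FieldInfo>;

// Reads a field only when the contract describes it; otherwise returns a
// safe default instead of failing the way an unguarded lookup would.
std::uint64_t ReadFieldOrDefault(
        const TypeLayout& fields,
        const std::string& name,
        const std::function<std::uint64_t(std::uint32_t)>& readAtOffset,
        std::uint64_t defaultValue) {
    auto it = fields.find(name);
    if (it == fields.end()) {
        return defaultValue;  // older contract version: field not present
    }
    return readAtOffset(it->second.offset);
}
```

A dump produced by an older runtime simply yields the default for fields such as EHInfo, and the rest of the analysis proceeds.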
- Remove unused usings in GcScanContext.cs (Data namespace, StackWalk_1 static)
- Fix trailing semicolon on class closing brace in StackWalk_1.cs
- Discard unused pMethodDesc assignment in StackWalk_1.cs
- Add buffer length validation in SOSStackRefEnum.Next to prevent IndexOutOfRangeException
- Use Debug.ValidateHResult in GetMethodDescPtrFromFrame to match codebase pattern
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Remove unused 'using Microsoft.Diagnostics.DataContractReader.Contracts.Extensions' from StackWalk_1.cs
- Remove unused 'using System.Linq' and 'using System' from StackReferenceDumpTests.cs
- Remove unused 'using System' from StackRefData.cs and GcScanSlotLocation.cs
- Clear ppEnum.Interface on failure paths in SOSDacImpl.GetStackReferences
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Restore the full GetMethodDescPtr(IStackDataFrameHandle) documentation in StackWalk.md that describes the ReportInteropMD special case. The docs were incorrectly simplified but the implementation was unchanged.
- Use specific friend declaration in patchpointinfo.h instead of generic template friend, matching the codebase convention.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The m_lastReportedFunclet field was added to ExInfo but is never written by the runtime, making it always zero-initialized. The cDAC code that reads it can never trigger. Remove the field from ExInfo, the data descriptor entry, and the managed LastReportedFuncletInfo data class. Mark the Filter code path as explicitly unreachable with a TODO for when runtime support is added. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add an in-process cDAC verification mode that runs at GC stress instruction-level trigger points. At each stress point, the tool:
1. Loads the cDAC (mscordaccore_universal) and legacy DAC in-process
2. Collects stack GC references from cDAC, legacy DAC, and runtime
3. Compares all three and reports mismatches
New files:
- cdacgcstress.h/cpp: In-process cDAC/DAC loading, three-way comparison framework with detailed mismatch logging
- test-cdac-gcstress.ps1: Build and test script
Integration:
- GCSTRESS_CDAC=0x20 flag in eeconfig.h
- GCStressCdacFailFast/GCStressCdacLogFile config vars
- Hooks in both DoGcStress functions in gccover.cpp
- Init/shutdown in ceemain.cpp
- cdac_reader_flush_cache API for cache invalidation
Usage: DOTNET_GCStress=0x24 (instruction JIT + cDAC verification)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
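The usage value 0x24 is the bitwise OR of two stress flags. The sketch below shows the arithmetic; the 0x04 value for the instruction-level JIT flag is inferred from 0x24 minus 0x20, the name `GCSTRESS_INSTR_JIT` is assumed here, and `ParseGcStress` and `CdacVerificationEnabled` are hypothetical helpers rather than runtime APIs.

```cpp
#include <cassert>
#include <cstdlib>

// Flag values: GCSTRESS_CDAC comes from the commit message; the
// instruction-JIT value is inferred from 0x24 = 0x04 | 0x20.
constexpr unsigned GCSTRESS_INSTR_JIT = 0x04;
constexpr unsigned GCSTRESS_CDAC = 0x20;

// Parses a setting such as "0x24" or "24" as hexadecimal.
unsigned ParseGcStress(const char* text) {
    return static_cast<unsigned>(std::strtoul(text, nullptr, 16));
}

// True when the cDAC verification bit is set in the stress level.
bool CdacVerificationEnabled(unsigned gcStressLevel) {
    return (gcStressLevel & GCSTRESS_CDAC) != 0;
}

static_assert((GCSTRESS_INSTR_JIT | GCSTRESS_CDAC) == 0x24,
              "instruction JIT + cDAC verification");
```

Setting only 0x4 keeps ordinary instruction-level GC stress without the verification pass, since the 0x20 bit is then clear.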
Tagging subscribers to this area: @steveisok, @tommcdon, @dotnet/dotnet-diag
cDAC GC Stress Verification Tool
Adds an in-process verification mode that compares cDAC stack reference walking against the legacy DAC and runtime at GC stress instruction-level trigger points.
What it does
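At each GC stress point (DOTNET_GCStress=0x24), the tool:
1. Loads the cDAC (mscordaccore_universal) and legacy DAC in-process
2. Collects stack GC references from cDAC, legacy DAC, and runtime
3. Compares all three and reports mismatches
New files
- cdacgcstress.h/cpp: In-process loading, COM adapters, three-way comparison
- test-cdac-gcstress.ps1: Build and test script
Integration points
- GCSTRESS_CDAC=0x20 flag in eeconfig.h
- GCStressCdacFailFast/GCStressCdacLogFile config vars
- Hooks in both DoGcStress functions (gccover.cpp)
- Init/shutdown in ceemain.cpp
- cdac_reader_flush_cache API for cache invalidation between stress points
Usage
DOTNET_GCStress=0x24 (instruction JIT + cDAC verification)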
Depends on
Test results
Verified with 1,700+ stress points across 5 test apps (simple allocations, generics, exception handling, closures, recursion) with 0 failures.