Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Pull request overview
This PR adds a new JIT/EE interface to expose platform-specific TLS access details for the thread-local allocation context and uses that information to keep GT_ALLOCOBJ through morphing so xarch codegen can emit an inline bump-pointer allocation fast path (with a helper-call slow path).
Changes:
- Introduces `getObjectAllocContextInfo`/`CORINFO_OBJECT_ALLOC_CONTEXT_INFO` across the JIT/EE boundary (plus SuperPMI + NativeAOT plumbing).
- Adds EE support to describe how to locate `t_runtime_thread_locals` (allocation context) via TLS on supported platforms.
- Implements xarch codegen/LSRA handling to generate inline allocation for `GT_ALLOCOBJ` (currently Windows-focused, guarded by config/opts).
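In C++ terms, the inline fast path being generated can be pictured roughly as follows. This is a simplified sketch, not the emitted code: `AllocContext`, `alloc_ptr`, `alloc_limit`, and `AllocSlowPath` are illustrative names standing in for the thread-local allocation context in `t_runtime_thread_locals` and the runtime allocation helper.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Illustrative stand-in for the per-thread allocation context that the PR
// locates via TLS (the real one lives in t_runtime_thread_locals).
struct AllocContext
{
    uint8_t* alloc_ptr;
    uint8_t* alloc_limit;
};

thread_local AllocContext t_allocContext{nullptr, nullptr};

// Stand-in for the slow-path helper call (e.g. CORINFO_HELP_NEWSFAST).
static void* AllocSlowPath(size_t size)
{
    return std::calloc(1, size);
}

// Sketch of the inline fast path: bump alloc_ptr when the object fits,
// otherwise fall back to the helper. The fits-check compares the size
// against the remaining space, so no pointer addition can overflow.
void* AllocateObject(size_t baseSize)
{
    AllocContext* ctx = &t_allocContext;
    uint8_t*      ptr = ctx->alloc_ptr;
    if (baseSize <= static_cast<size_t>(ctx->alloc_limit - ptr))
    {
        ctx->alloc_ptr = ptr + baseSize;
        return ptr; // the real code also stores the MethodTable pointer here
    }
    return AllocSlowPath(baseSize);
}
```

The interesting part for codegen is that everything down to `ctx->alloc_ptr = ptr + baseSize` is a handful of loads, a compare, and a store, with the helper call moved out of line.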
Reviewed changes
Copilot reviewed 31 out of 31 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/coreclr/vm/threadstatics.h | Declares a new EE helper to provide TLS access metadata for the alloc context. |
| src/coreclr/vm/threadstatics.cpp | Implements TLS metadata discovery for t_runtime_thread_locals on multiple targets. |
| src/coreclr/vm/jitinterface.cpp | Exposes CEEInfo::getObjectAllocContextInfo and gates support based on runtime configuration. |
| src/coreclr/vm/amd64/asmhelpers.S | Adds descriptor helpers for TLS access to t_runtime_thread_locals (Apple TLVP + ELF TLSGD). |
| src/coreclr/tools/superpmi/superpmi/icorjitinfo.cpp | Records/replays the new JIT/EE query in SuperPMI. |
| src/coreclr/tools/superpmi/superpmi-shim-simple/icorjitinfo_generated.cpp | Forwards the new API in the simple shim. |
| src/coreclr/tools/superpmi/superpmi-shim-counter/icorjitinfo_generated.cpp | Counts/intercepts the new API in the counter shim. |
| src/coreclr/tools/superpmi/superpmi-shim-collector/icorjitinfo.cpp | Records/replays the new API in the collector shim. |
| src/coreclr/tools/superpmi/superpmi-shared/methodcontext.h | Adds packets + record/replay declarations for alloc-context info. |
| src/coreclr/tools/superpmi/superpmi-shared/methodcontext.cpp | Implements record/dump/replay for alloc-context info. |
| src/coreclr/tools/superpmi/superpmi-shared/lwmlist.h | Adds LWM entry for alloc-context info. |
| src/coreclr/tools/superpmi/superpmi-shared/agnostic.h | Defines the agnostic record type for alloc-context info. |
| src/coreclr/tools/aot/jitinterface/jitinterface_generated.h | Adds callback + wrapper method for getObjectAllocContextInfo. |
| src/coreclr/tools/Common/JitInterface/ThunkGenerator/ThunkInput.txt | Extends thunk generation inputs with the new API + struct. |
| src/coreclr/tools/Common/JitInterface/CorInfoTypes.cs | Adds managed definition of CORINFO_OBJECT_ALLOC_CONTEXT_INFO. |
| src/coreclr/tools/Common/JitInterface/CorInfoImpl_generated.cs | Adds unmanaged callback plumbing for getObjectAllocContextInfo. |
| src/coreclr/tools/Common/JitInterface/CorInfoImpl.cs | Provides a stub implementation for NativeAOT/crossgen2. |
| src/coreclr/jit/objectalloc.cpp | Keeps GT_ALLOCOBJ for codegen-side expansion under certain conditions. |
| src/coreclr/jit/lsraxarch.cpp | Adds LSRA build logic for GT_ALLOCOBJ with internal temps + call kill/defs. |
| src/coreclr/jit/lsrabuild.cpp | Adds kill-set modeling for GT_ALLOCOBJ as NEWSFAST. |
| src/coreclr/jit/lower.cpp | Adds a lowering case placeholder for GT_ALLOCOBJ on xarch. |
| src/coreclr/jit/jitconfigvalues.h | Introduces JitInlineAllocFast config knob. |
| src/coreclr/jit/gtlist.h | Allows GT_ALLOCOBJ in LIR for codegen-side handling. |
| src/coreclr/jit/compiler.h | Caches CORINFO_OBJECT_ALLOC_CONTEXT_INFO in the compiler instance. |
| src/coreclr/jit/codegenxarch.cpp | Implements inline fast-path allocation codegen for GT_ALLOCOBJ (+ slow-path helper call). |
| src/coreclr/jit/codegen.h | Declares genCodeForAllocObj. |
| src/coreclr/jit/ICorJitInfo_wrapper_generated.hpp | Adds wrapper forwarding for the new API. |
| src/coreclr/jit/ICorJitInfo_names_generated.h | Adds the API name for logging/profiling hooks. |
| src/coreclr/inc/jiteeversionguid.h | Updates JIT/EE version GUID due to interface shape change. |
| src/coreclr/inc/icorjitinfoimpl_generated.h | Adds the new virtual method to the generated EE impl interface. |
| src/coreclr/inc/corinfo.h | Defines CORINFO_OBJECT_ALLOC_CONTEXT_INFO + adds new ICorStaticInfo method. |
@EgorBot -windows_x64 -windows_arm64

```cs
using BenchmarkDotNet.Attributes;

public class Bench
{
    [Benchmark]
    public Bench CreateInstance() => new Bench();
}
```
@EgorBot -linux_x64

```cs
using BenchmarkDotNet.Attributes;

public class Bench
{
    [Benchmark]
    public Bench CreateInstance() => new Bench();
}
```
src/coreclr/jit/codegenxarch.cpp (outdated)
```cpp
// ---- Bump allocation (non-GC-interruptible) ----
GetEmitter()->emitDisableGC();
// ...
// Size not known at JIT time — read from MethodTable.m_BaseSize at runtime.
```
Why is that? If you are trying to boil the ocean to try to save a nano second for an allocation, you can also embed the size into the code as a constant.
Well, that was one of the goals for this change (it's even in the description), but initially I just wanted to see if it works as is.
src/coreclr/jit/codegenxarch.cpp (outdated)
```cpp
// Size not known at JIT time — read from MethodTable.m_BaseSize at runtime.
emit->emitIns_R_AR(INS_mov, EA_4BYTE, tmpReg, mtReg, (int)allocInfo->methodTableBaseSizeOffset);
emit->emitIns_R_AR(INS_mov, EA_PTRSIZE, dstReg, allocCtxReg, (int)allocInfo->allocPtrFieldOffset);
emit->emitIns_R_ARX(INS_lea, EA_PTRSIZE, tmpReg, dstReg, tmpReg, 1, 0);
```
This is different from what the allocation helpers do, and it has a potential integer-overflow security bug: you are assuming that allocPtr + size won't overflow.
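To illustrate the concern in a standalone sketch (function names are illustrative, not runtime code): an `allocPtr + size <= allocLimit` check can wrap around and pass for huge sizes, while phrasing the check against the remaining space avoids the addition entirely.

```cpp
#include <cstddef>
#include <cstdint>

// Unsafe check: allocPtr + size can wrap around uintptr_t and compare below
// the limit even though the allocation does not fit.
bool FitsUnsafe(uintptr_t allocPtr, uintptr_t allocLimit, size_t size)
{
    return allocPtr + size <= allocLimit; // wraps for huge sizes
}

// Safe check: compare the size against the remaining space. No addition,
// so no overflow (assumes allocPtr <= allocLimit, which the allocation
// context maintains).
bool FitsSafe(uintptr_t allocPtr, uintptr_t allocLimit, size_t size)
{
    return size <= allocLimit - allocPtr;
}
```

With `allocPtr = 100`, `allocLimit = 200`, and a size near `UINTPTR_MAX`, the unsafe form wraps to a small value and reports a fit, while the safe form correctly rejects it.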
I would be curious whether this improves (and does not regress) anything real. The code bloat is not free, and it won't show up in microbenchmarks.
This assumes that the MethodTable won't be accessed by anything else... Have we studied this pattern in detail? When we have seen expensive memory accesses like this, they often came from frequently written data sitting next to read-only data (the MethodTable in this case).
src/coreclr/jit/codegenxarch.cpp (outdated)
```cpp
// Linux x64: call __tls_get_addr. Save arg registers on the stack.
// Always push an even number of 8-byte values for 16-byte stack alignment.
emit->emitIns_R(INS_push, EA_PTRSIZE, mtReg);
if (isArray)
{
    emit->emitIns_R(INS_push, EA_PTRSIZE, lenReg);
}
else
{
    emit->emitIns_R_I(INS_sub, EA_PTRSIZE, REG_SPBASE, 8);
}
```
On the non-Windows AMD64 path this saves argument registers using push/pop (and sub/add rsp, 8) around the __tls_get_addr call. On x64 the JIT generally assumes RSP is stable after the prolog for unwind/GC stack walking; these dynamic SP adjustments are not reflected in unwind data or CodeGen stack tracking and can break EH/stack walking. Prefer saving/restoring the args in volatile registers (e.g., move MT/len into scratch regs before clobbering REG_ARG_0) or into fixed spill slots in the existing frame instead of adjusting RSP.
```cpp
case GT_ALLOCOBJ:
    if (op1->AsAllocObj()->gtNewHelper != op2->AsAllocObj()->gtNewHelper ||
        op1->AsAllocObj()->gtAllocObjClsHnd != op2->AsAllocObj()->gtAllocObjClsHnd)
    {
        return false;
    }
    break;
```
GenTree::Compare for GT_ALLOCOBJ only compares gtNewHelper and gtAllocObjClsHnd, but GenTreeAllocObj also carries gtHelperHasSideEffects and (under FEATURE_READYTORUN) gtEntryPoint. If any of these can differ, treating nodes as equal can lead to incorrect structural comparisons (e.g., in CSE/value numbering or debug asserts). Consider including gtHelperHasSideEffects and the ReadyToRun entrypoint lookup in the comparison (or document why they are irrelevant/invariant).
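A minimal sketch of a comparison that covers all the fields named above. The struct is an illustrative stand-in for `GenTreeAllocObj` (only the fields mentioned in the comment are modeled, and `gtEntryPointAddr` stands in for the `FEATURE_READYTORUN` entrypoint lookup):

```cpp
#include <cstdint>

// Illustrative stand-in for GenTreeAllocObj; only the fields named in the
// review comment are modeled here.
struct AllocObjNode
{
    unsigned    gtNewHelper;            // helper selected for the allocation
    const void* gtAllocObjClsHnd;       // class handle being allocated
    bool        gtHelperHasSideEffects; // helper side-effect flag
    const void* gtEntryPointAddr;       // stand-in for the R2R entrypoint
};

// Structural equality covering all four fields, so two nodes compare equal
// only when the helper, class handle, side-effect flag, and ReadyToRun
// entrypoint all match.
bool CompareAllocObj(const AllocObjNode* op1, const AllocObjNode* op2)
{
    return op1->gtNewHelper == op2->gtNewHelper &&
           op1->gtAllocObjClsHnd == op2->gtAllocObjClsHnd &&
           op1->gtHelperHasSideEffects == op2->gtHelperHasSideEffects &&
           op1->gtEntryPointAddr == op2->gtEntryPointAddr;
}
```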
- Remove NEWARR_1_VC/PTR inline expansion from objectalloc.cpp and codegen
- Remove constant-size folding at codegen time (no more JIT-time MT reads)
- Remove objectMethodTableOffset field (MT pointer always at offset 0)
- Remove arrayLengthOffset, arrayBaseSize, methodTableComponentSizeOffset
- Remove threadVarsSection (macOS not supported for inline alloc)
- Add tlsRootOffset field for Linux ARM64 (pre-computed tpidr_el0 offset)
- Add genInlineAllocCall for ARM64 in codegenarmarch.cpp
  - Windows: x18 (TEB) + TLS array + index + offset pattern
  - Linux: mrs tpidr_el0 + pre-computed offset (no function call needed)
  - macOS: not supported (set supported=false)
- Add GetRuntimeThreadLocalsVariableOffset assembly stub for ARM64 Linux
- Update CORINFO_OBJECT_ALLOC_CONTEXT_INFO in corinfo.h, CorInfoTypes.cs
- Update SuperPMI agnostic struct and methodcontext rec/dmp/rep
- Update jitinterface.cpp and threadstatics.cpp for new struct layout
- Change TARGET_AMD64 guards to TARGET_AMD64 || TARGET_ARM64

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>


Just an AI-driven experiment to inline allocators in codegen.
My first attempt was to move them to a managed helper, but that was not inlineable (due to the nogc requirement), so we couldn't benefit from constant-folded obj->BaseSize or constant-folded length checks for array allocators, and its GetAllocContext TLS access was inefficient. Also, we shouldn't expand allocators early anyway.