Split Dictionary.Remove inner loops for value-type key optimization by danmoseley · Pull Request #125884 · dotnet/runtime

danmoseley · 2026-03-21T06:46:32Z

Split Dictionary.Remove inner loops for value-type key optimization

FindValue, TryInsert, and CollectionsMarshalHelper.GetValueRefOrAddDefault all split their inner loops into separate value-type and comparer paths:

if (typeof(TKey).IsValueType && comparer == null)
{
    while (...) { ... EqualityComparer<TKey>.Default.Equals(...) ... }
}
else
{
    while (...) { ... comparer.Equals(...) ... }
}

This lets the JIT register-allocate each loop independently. The value-type loop has no virtual calls, so the register allocator can keep all loop variables in registers.

Remove (both overloads) was not split during the PR #75663 refactoring that introduced this pattern. It uses a single loop with an inline ternary:

while (i >= 0)
{
    if (entry.hashCode == hashCode &&
        (typeof(TKey).IsValueType && comparer == null
            ? EqualityComparer<TKey>.Default.Equals(entry.key, key)
            : comparer!.Equals(entry.key, key)))
    { ... }
}

The JIT keeps both code paths alive in a single loop, so the register allocator must plan for the comparer virtual call even on the value-type path. This causes unnecessary stack spills and reloads every iteration.

This PR applies the same split to both Remove(TKey key) and Remove(TKey key, out TValue value).

Codegen impact (x64, `Dictionary<int, int>.Remove`)

Before (single merged loop): 3 spills + 3 reloads per iteration

; LOOP ENTRY - spills happen every iteration
mov  [rsp+0x30], r10d       ; SPILL entries.Length
mov  [rsp+0x34], eax        ; SPILL i
mov  [rsp+0x28], r9         ; SPILL &entries[i]
...
; LOOP ADVANCE - reloads
mov  eax, [rsp+0x34]        ; RELOAD i
mov  eax, [r9+0x04]         ; i = entry.next
cmp  [rsp+0x30], edi        ; RELOAD entries.Length

After (value-type loop): 0 spills, 0 reloads

; LOOP BODY - all registers
cmp  eax, ebp               ; bounds check (register)
lea  rcx, [r13+rcx+0x10]    ; &entries[i] (register)
cmp  [rcx], r14d             ; hashCode (register)
cmp  [rcx+0x08], esi         ; key compare - direct, no virtual call
; LOOP ADVANCE
mov  eax, [rcx+0x04]        ; i = entry.next (register)
cmp  ebp, edi               ; entries.Length (register)

arm64 shows the same pattern: baseline has 1 spill + 2 reloads per iteration (fewer due to more callee-saved registers); modified value-type loop has 0.

Benchmark results

Both builds from identical commit (b38cc2a), only Dictionary.cs differs. BDN v0.16.0-nightly, --coreRun A/B, MannWhitney statistical test with 3ns threshold. Two independent runs: Run 1 (15 iterations), Run 2 (30 iterations, high priority process). Run 2 shown below; Run 1 results were consistent.

Machine: Intel Core i9-14900K 3.20GHz, 24 cores (hybrid P/E), 64GB RAM, Windows 11.

All Remove scenarios (Run 2, 30 iterations)

Type	Method	Size	Baseline (ns)	Modified (ns)	Ratio	MannWhitney
Guid, Int32	Remove_Hit	16	101.0	84.1	0.83	Faster <--
Guid, Int32	Remove_Hit	512	3,081.4	2,864.2	0.93	Faster <--
Guid, Int32	Remove_Hit	4096	30,195.3	27,925.7	0.92	Faster <--
Guid, Int32	Remove_Miss	16	30.4	30.2	0.99	Same
Guid, Int32	Remove_Miss	512	1,080.9	926.8	0.89	Same
Guid, Int32	Remove_Miss	4096	8,750.4	8,850.4	1.01	Slower
Guid, Int32	TryRemove_Hit	16	99.6	87.1	0.87	Faster <--
Guid, Int32	TryRemove_Hit	512	3,657.3	2,975.0	0.82	Faster <--
Guid, Int32	TryRemove_Hit	4096	29,808.7	28,089.5	0.94	Faster <--
Int32, Int32	Remove_Hit	16	67.5	63.3	0.94	Faster <--
Int32, Int32	Remove_Hit	512	2,266.9	2,124.9	0.94	Faster <--
Int32, Int32	Remove_Hit	4096	20,806.0	20,775.1	1.00	Same <--
Int32, Int32	Remove_Miss	16	24.7	23.5	0.95	Same
Int32, Int32	Remove_Miss	512	716.1	668.5	0.93	Faster
Int32, Int32	Remove_Miss	4096	6,414.8	6,647.6	1.04	Slower
Int32, Int32	TryRemove_Hit	16	67.8	63.2	0.93	Faster <--
Int32, Int32	TryRemove_Hit	512	2,316.2	2,092.7	0.90	Faster <--
Int32, Int32	TryRemove_Hit	4096	21,362.0	20,311.9	0.95	Faster <--
String, String	Remove_Hit	16	185.0	185.6	1.00	Same
String, String	Remove_Hit	512	6,986.0	6,827.3	0.98	Faster
String, String	Remove_Hit	4096	109,184.1	113,931.7	1.04	Slower
String, String	Remove_Miss	16	81.3	81.5	1.00	Same
String, String	Remove_Miss	512	2,482.6	2,491.2	1.00	Same
String, String	Remove_Miss	4096	49,162.6	48,452.3	0.99	Faster
String, String	TryRemove_Hit	16	185.8	185.8	1.00	Same
String, String	TryRemove_Hit	512	7,012.7	6,837.4	0.98	Faster
String, String	TryRemove_Hit	4096	113,241.7	111,874.3	0.99	Faster

Analysis

Value-type Hit (Guid, Int32): 6-18% faster — consistent improvement across all sizes and both Remove overloads, statistically significant in both runs. These are marked with <--
String (ref type): neutral — expected, as ref types always store a non-null comparer (since Fix Dictionary perf regression for non-string ref type keys #75663) and always take the comparer path regardless of the loop split.
Miss scenarios: neutral — expected, as the optimization eliminates register spills in the equality comparison, which is rarely reached on a miss (most iterations fail the hashCode check and skip immediately). Small fluctuations in both directions are within machine noise.

Split both Remove(TKey) and Remove(TKey, out TValue) inner loops into separate value-type and comparer paths, matching the existing pattern in FindValue, TryInsert, and CollectionsMarshalHelper. The previous single-loop approach used an inline ternary to choose between EqualityComparer<TKey>.Default.Equals and comparer.Equals, which forced the register allocator to plan for the virtual comparer call even on the value-type path. This caused 3 unnecessary stack spills and reloads per iteration for value-type keys. With split loops, the value-type path has no virtual calls, allowing the JIT to keep all loop variables in registers. Codegen analysis shows the value-type inner loop shrinks from ~133 to ~56 bytes with zero stack traffic vs 6 memory operations per iteration. Benchmark results (MannWhitney statistical test, threshold=3ns): - Guid Remove/TryRemove hit: 8-18% faster (all sizes) - Int32 Remove/TryRemove hit: 4-15% faster (most sizes) - String (ref type): neutral (as expected) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

This PR updates Dictionary<TKey, TValue>.Remove to match the existing loop-splitting pattern used by FindValue / TryInsert, separating the value-type-key fast path (no comparer, devirtualized EqualityComparer<TKey>.Default) from the comparer-based path to improve JIT register allocation and reduce spills in the hot loop.

Changes:

Split the inner chain-walk loop in Remove(TKey) into two loops: (TKey is value type && comparer is null) vs comparer-based.
Apply the same split to Remove(TKey, out TValue) and add explicit “keep overloads in sync” notes.

dotnet-policy-service · 2026-03-21T06:49:13Z

Tagging subscribers to this area: @dotnet/area-system-collections
See info in area-owners.md if you want to be subscribed.

stephentoub · 2026-03-27T00:38:58Z

@MihuBot benchmark Dictionary

MihuBot · 2026-03-27T02:29:22Z

See benchmark results at https://gist.github.com/MihuBot/28f7bffb0197607f3106e866cdca2fb7

danmoseley · 2026-03-27T04:19:29Z

Note

This analysis was generated by Copilot.

MihuBot benchmark analysis

The MihuBot benchmark results cover a wide range of dictionary-related benchmarks, but none of them test Dictionary.Remove, which is the only method changed in this PR. The standard perf suites don't include a Remove benchmark.

Below is the complete list of benchmarks that directly test Dictionary<TKey,TValue> operations (excluding unrelated types like ConcurrentDictionary, FrozenDictionary, ImmutableDictionary, SortedDictionary, and unrelated suites like JSON serialization and LINQ). All ratios are PR/Main.

Perf_Dictionary (3000 items)

Method	Ratio
Clone	1.00
ContainsValue	1.00

DictionarySequentialKeys

Method	Ratio
ContainsValue_17_Int_Int	0.99
ContainsKey_17_Int_32ByteValue	1.00
ContainsKey_17_Int_32ByteRefsValue	1.00
ContainsValue_3k_Int_Int	1.00
ContainsKey_3k_Int_32ByteValue	0.99
ContainsKey_3k_Int_32ByteRefsValue	1.00
TryGetValue_17_Int_32ByteValue	1.01
TryGetValue_17_Int_32ByteRefsValue	1.00
TryGetValue_3k_Int_Int	1.00
TryGetValue_3k_Int_32ByteValue	1.00
TryGetValue_3k_Int_32ByteRefsValue	1.00
TryGetValue_17_Int_Int	1.02

TryGetValue (512 items, Dictionary rows only)

Method	Key, Value Type	Ratio
TryGetValue_True	String, String	1.00
TryGetValue_True	SmallClass, SmallClass	1.00
TryGetValue_True	BigStruct, BigStruct	0.98
TryGetValue_True	Int32, Int32	1.00
TryGetValue_False	String, String	0.99
TryGetValue_False	SmallClass, SmallClass	1.00
TryGetValue_False	BigStruct, BigStruct	1.01
TryGetValue_False	Int32, Int32	1.01

TryAdd (512 items, Dictionary rows only)

Method	Key Type	Ratio
TryAdd_GivenSize	String	1.01
TryAdd_GivenSize	Int32	0.96
TryAdd_DefaultSize	String	1.04
TryAdd_DefaultSize	Int32	0.97

All ratios fall in the 0.96-1.04 noise band. No regressions on untouched Dictionary operations. TryAdd and TryGetValue already had the split-loop pattern from PR #75663, so no change is expected there. The improvement from this PR (value-type key Remove) isn't covered by the existing perf suites.

danmoseley · 2026-03-27T04:23:44Z

Good to go I think?

stephentoub · 2026-03-27T11:00:33Z

Good to go I think?

For something as core as dictionary, I'd prefer to have lab measurements for anything we think is worth improving. Could you add remove benchmarks to the perf repo?

danmoseley · 2026-03-27T19:29:18Z

@stephentoub done, see dotnet/performance#5178 (comment) with the results with and without this PR.

src/libraries/System.Private.CoreLib/src/System/Collections/Generic/Dictionary.cs

danmoseley · 2026-03-29T07:11:54Z

I think this is in good shape to merge.

Copilot AI review requested due to automatic review settings March 21, 2026 06:46

github-actions bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 21, 2026

dotnet-policy-service bot assigned danmoseley Mar 21, 2026

Copilot started reviewing on behalf of danmoseley March 21, 2026 06:47 View session

danmoseley added area-System.Collections and removed area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Mar 21, 2026

Copilot AI reviewed Mar 21, 2026

View reviewed changes

This was referenced Mar 21, 2026

Cannot find 'arm64-v8a' device dotnet/dnceng#2284

Open

Android arm32 device not found (armeabi-v7a architecture unavailable) #125440

Open

danmoseley requested review from eiriktsarpalis and stephentoub March 23, 2026 22:01

MihuBot mentioned this pull request Mar 27, 2026

[Benchmark X64] [danmoseley] Split Dictionary.Remove inner loops for value-t ... MihuBot/runtime-utils#1838

Open

danmoseley mentioned this pull request Mar 27, 2026

Add Dictionary and HashSet Remove/Contains benchmarks dotnet/performance#5178

Open

prozolic reviewed Mar 29, 2026

View reviewed changes

src/libraries/System.Private.CoreLib/src/System/Collections/Generic/Dictionary.cs Show resolved Hide resolved

prozolic reviewed Mar 29, 2026

View reviewed changes

src/libraries/System.Private.CoreLib/src/System/Collections/Generic/Dictionary.cs Show resolved Hide resolved

prozolic reviewed Mar 29, 2026

View reviewed changes

src/libraries/System.Private.CoreLib/src/System/Collections/Generic/Dictionary.cs Show resolved Hide resolved

prozolic reviewed Mar 29, 2026

View reviewed changes

src/libraries/System.Private.CoreLib/src/System/Collections/Generic/Dictionary.cs Show resolved Hide resolved

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Split Dictionary.Remove inner loops for value-type key optimization#125884

Split Dictionary.Remove inner loops for value-type key optimization#125884
danmoseley wants to merge 1 commit intodotnet:mainfrom
danmoseley:dict-remove-split

danmoseley commented Mar 21, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

dotnet-policy-service bot commented Mar 21, 2026

Uh oh!

stephentoub commented Mar 27, 2026

Uh oh!

MihuBot commented Mar 27, 2026

Uh oh!

danmoseley commented Mar 27, 2026

Uh oh!

danmoseley commented Mar 27, 2026

Uh oh!

stephentoub commented Mar 27, 2026

Uh oh!

danmoseley commented Mar 27, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

danmoseley commented Mar 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

danmoseley commented Mar 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!