Skip to content

Split Dictionary.Remove inner loops for value-type key optimization#125884

Open
danmoseley wants to merge 1 commit intodotnet:mainfrom
danmoseley:dict-remove-split
Open

Split Dictionary.Remove inner loops for value-type key optimization#125884
danmoseley wants to merge 1 commit intodotnet:mainfrom
danmoseley:dict-remove-split

Conversation

@danmoseley
Copy link
Copy Markdown
Member

@danmoseley danmoseley commented Mar 21, 2026

Split Dictionary.Remove inner loops for value-type key optimization

FindValue, TryInsert, and CollectionsMarshalHelper.GetValueRefOrAddDefault all split their inner loops into separate value-type and comparer paths:

if (typeof(TKey).IsValueType && comparer == null)
{
    while (...) { ... EqualityComparer<TKey>.Default.Equals(...) ... }
}
else
{
    while (...) { ... comparer.Equals(...) ... }
}

This lets the JIT register-allocate each loop independently. The value-type loop has no virtual calls, so the register allocator can keep all loop variables in registers.

Remove (both overloads) was not split during the PR #75663 refactoring that introduced this pattern. It uses a single loop with an inline ternary:

while (i >= 0)
{
    if (entry.hashCode == hashCode &&
        (typeof(TKey).IsValueType && comparer == null
            ? EqualityComparer<TKey>.Default.Equals(entry.key, key)
            : comparer!.Equals(entry.key, key)))
    { ... }
}

The JIT keeps both code paths alive in a single loop, so the register allocator must plan for the comparer virtual call even on the value-type path. This causes unnecessary stack spills and reloads every iteration.

This PR applies the same split to both Remove(TKey key) and Remove(TKey key, out TValue value).

Codegen impact (x64, Dictionary<int, int>.Remove)

Before (single merged loop): 3 spills + 3 reloads per iteration

; LOOP ENTRY - spills happen every iteration
mov  [rsp+0x30], r10d       ; SPILL entries.Length
mov  [rsp+0x34], eax        ; SPILL i
mov  [rsp+0x28], r9         ; SPILL &entries[i]
...
; LOOP ADVANCE - reloads
mov  eax, [rsp+0x34]        ; RELOAD i
mov  eax, [r9+0x04]         ; i = entry.next
cmp  [rsp+0x30], edi        ; RELOAD entries.Length

After (value-type loop): 0 spills, 0 reloads

; LOOP BODY - all registers
cmp  eax, ebp               ; bounds check (register)
lea  rcx, [r13+rcx+0x10]    ; &entries[i] (register)
cmp  [rcx], r14d             ; hashCode (register)
cmp  [rcx+0x08], esi         ; key compare - direct, no virtual call
; LOOP ADVANCE
mov  eax, [rcx+0x04]        ; i = entry.next (register)
cmp  ebp, edi               ; entries.Length (register)

arm64 shows the same pattern: baseline has 1 spill + 2 reloads per iteration (fewer due to more callee-saved registers); modified value-type loop has 0.

Benchmark results

Both builds from identical commit (b38cc2a), only Dictionary.cs differs. BDN v0.16.0-nightly, --coreRun A/B, MannWhitney statistical test with 3ns threshold. Two independent runs: Run 1 (15 iterations), Run 2 (30 iterations, high priority process). Run 2 shown below; Run 1 results were consistent.

Machine: Intel Core i9-14900K 3.20GHz, 24 cores (hybrid P/E), 64GB RAM, Windows 11.

All Remove scenarios (Run 2, 30 iterations)

Type Method Size Baseline (ns) Modified (ns) Ratio MannWhitney
Guid, Int32 Remove_Hit 16 101.0 84.1 0.83 Faster <--
Guid, Int32 Remove_Hit 512 3,081.4 2,864.2 0.93 Faster <--
Guid, Int32 Remove_Hit 4096 30,195.3 27,925.7 0.92 Faster <--
Guid, Int32 Remove_Miss 16 30.4 30.2 0.99 Same
Guid, Int32 Remove_Miss 512 1,080.9 926.8 0.89 Same
Guid, Int32 Remove_Miss 4096 8,750.4 8,850.4 1.01 Slower
Guid, Int32 TryRemove_Hit 16 99.6 87.1 0.87 Faster <--
Guid, Int32 TryRemove_Hit 512 3,657.3 2,975.0 0.82 Faster <--
Guid, Int32 TryRemove_Hit 4096 29,808.7 28,089.5 0.94 Faster <--
Int32, Int32 Remove_Hit 16 67.5 63.3 0.94 Faster <--
Int32, Int32 Remove_Hit 512 2,266.9 2,124.9 0.94 Faster <--
Int32, Int32 Remove_Hit 4096 20,806.0 20,775.1 1.00 Same <--
Int32, Int32 Remove_Miss 16 24.7 23.5 0.95 Same
Int32, Int32 Remove_Miss 512 716.1 668.5 0.93 Faster
Int32, Int32 Remove_Miss 4096 6,414.8 6,647.6 1.04 Slower
Int32, Int32 TryRemove_Hit 16 67.8 63.2 0.93 Faster <--
Int32, Int32 TryRemove_Hit 512 2,316.2 2,092.7 0.90 Faster <--
Int32, Int32 TryRemove_Hit 4096 21,362.0 20,311.9 0.95 Faster <--
String, String Remove_Hit 16 185.0 185.6 1.00 Same
String, String Remove_Hit 512 6,986.0 6,827.3 0.98 Faster
String, String Remove_Hit 4096 109,184.1 113,931.7 1.04 Slower
String, String Remove_Miss 16 81.3 81.5 1.00 Same
String, String Remove_Miss 512 2,482.6 2,491.2 1.00 Same
String, String Remove_Miss 4096 49,162.6 48,452.3 0.99 Faster
String, String TryRemove_Hit 16 185.8 185.8 1.00 Same
String, String TryRemove_Hit 512 7,012.7 6,837.4 0.98 Faster
String, String TryRemove_Hit 4096 113,241.7 111,874.3 0.99 Faster

Analysis

  • Value-type Hit (Guid, Int32): 6-18% faster — consistent improvement across all sizes and both Remove overloads, statistically significant in both runs. These are marked with <--
  • String (ref type): neutral — expected, as ref types always store a non-null comparer (since Fix Dictionary perf regression for non-string ref type keys #75663) and always take the comparer path regardless of the loop split.
  • Miss scenarios: neutral — expected, as the optimization eliminates register spills in the equality comparison, which is rarely reached on a miss (most iterations fail the hashCode check and skip immediately). Small fluctuations in both directions are within machine noise.

Split both Remove(TKey) and Remove(TKey, out TValue) inner loops into
separate value-type and comparer paths, matching the existing pattern
in FindValue, TryInsert, and CollectionsMarshalHelper.

The previous single-loop approach used an inline ternary to choose
between EqualityComparer<TKey>.Default.Equals and comparer.Equals,
which forced the register allocator to plan for the virtual comparer
call even on the value-type path. This caused 3 unnecessary stack
spills and reloads per iteration for value-type keys.

With split loops, the value-type path has no virtual calls, allowing
the JIT to keep all loop variables in registers. Codegen analysis
shows the value-type inner loop shrinks from ~133 to ~56 bytes with
zero stack traffic vs 6 memory operations per iteration.

Benchmark results (MannWhitney statistical test, threshold=3ns):
- Guid Remove/TryRemove hit: 8-18% faster (all sizes)
- Int32 Remove/TryRemove hit: 4-15% faster (most sizes)
- String (ref type): neutral (as expected)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 21, 2026 06:46
@github-actions github-actions bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 21, 2026
@danmoseley danmoseley added area-System.Collections and removed area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Mar 21, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates Dictionary<TKey, TValue>.Remove to match the existing loop-splitting pattern used by FindValue / TryInsert, separating the value-type-key fast path (no comparer, devirtualized EqualityComparer<TKey>.Default) from the comparer-based path to improve JIT register allocation and reduce spills in the hot loop.

Changes:

  • Split the inner chain-walk loop in Remove(TKey) into two loops: (TKey is value type && comparer is null) vs comparer-based.
  • Apply the same split to Remove(TKey, out TValue) and add explicit “keep overloads in sync” notes.

@dotnet-policy-service
Copy link
Copy Markdown
Contributor

Tagging subscribers to this area: @dotnet/area-system-collections
See info in area-owners.md if you want to be subscribed.

@stephentoub
Copy link
Copy Markdown
Member

@MihuBot benchmark Dictionary

@MihuBot
Copy link
Copy Markdown

MihuBot commented Mar 27, 2026

@danmoseley
Copy link
Copy Markdown
Member Author

Note

This analysis was generated by Copilot.

MihuBot benchmark analysis

The MihuBot benchmark results cover a wide range of dictionary-related benchmarks, but none of them test Dictionary.Remove, which is the only method changed in this PR. The standard perf suites don't include a Remove benchmark.

Below is the complete list of benchmarks that directly test Dictionary<TKey,TValue> operations (excluding unrelated types like ConcurrentDictionary, FrozenDictionary, ImmutableDictionary, SortedDictionary, and unrelated suites like JSON serialization and LINQ). All ratios are PR/Main.

Perf_Dictionary (3000 items)

Method Ratio
Clone 1.00
ContainsValue 1.00

DictionarySequentialKeys

Method Ratio
ContainsValue_17_Int_Int 0.99
ContainsKey_17_Int_32ByteValue 1.00
ContainsKey_17_Int_32ByteRefsValue 1.00
ContainsValue_3k_Int_Int 1.00
ContainsKey_3k_Int_32ByteValue 0.99
ContainsKey_3k_Int_32ByteRefsValue 1.00
TryGetValue_17_Int_32ByteValue 1.01
TryGetValue_17_Int_32ByteRefsValue 1.00
TryGetValue_3k_Int_Int 1.00
TryGetValue_3k_Int_32ByteValue 1.00
TryGetValue_3k_Int_32ByteRefsValue 1.00
TryGetValue_17_Int_Int 1.02

TryGetValue (512 items, Dictionary rows only)

Method Key, Value Type Ratio
TryGetValue_True String, String 1.00
TryGetValue_True SmallClass, SmallClass 1.00
TryGetValue_True BigStruct, BigStruct 0.98
TryGetValue_True Int32, Int32 1.00
TryGetValue_False String, String 0.99
TryGetValue_False SmallClass, SmallClass 1.00
TryGetValue_False BigStruct, BigStruct 1.01
TryGetValue_False Int32, Int32 1.01

TryAdd (512 items, Dictionary rows only)

Method Key Type Ratio
TryAdd_GivenSize String 1.01
TryAdd_GivenSize Int32 0.96
TryAdd_DefaultSize String 1.04
TryAdd_DefaultSize Int32 0.97

All ratios fall in the 0.96-1.04 noise band. No regressions on untouched Dictionary operations. TryAdd and TryGetValue already had the split-loop pattern from PR #75663, so no change is expected there. The improvement from this PR (value-type key Remove) isn't covered by the existing perf suites.

@danmoseley
Copy link
Copy Markdown
Member Author

Good to go I think?

@stephentoub
Copy link
Copy Markdown
Member

Good to go I think?

For something as core as dictionary, I'd prefer to have lab measurements for anything we think is worth improving. Could you add remove benchmarks to the perf repo?

@danmoseley
Copy link
Copy Markdown
Member Author

@stephentoub done, see dotnet/performance#5178 (comment) with the results with and without this PR.

@danmoseley
Copy link
Copy Markdown
Member Author

I think this is in good shape to merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants