Split Dictionary.Remove inner loops for value-type key optimization#125884
Split Dictionary.Remove inner loops for value-type key optimization#125884danmoseley wants to merge 1 commit intodotnet:mainfrom
Conversation
Split both Remove(TKey) and Remove(TKey, out TValue) inner loops into separate value-type and comparer paths, matching the existing pattern in FindValue, TryInsert, and CollectionsMarshalHelper. The previous single-loop approach used an inline ternary to choose between EqualityComparer<TKey>.Default.Equals and comparer.Equals, which forced the register allocator to plan for the virtual comparer call even on the value-type path. This caused 3 unnecessary stack spills and reloads per iteration for value-type keys. With split loops, the value-type path has no virtual calls, allowing the JIT to keep all loop variables in registers. Codegen analysis shows the value-type inner loop shrinks from ~133 to ~56 bytes with zero stack traffic vs 6 memory operations per iteration. Benchmark results (MannWhitney statistical test, threshold=3ns): - Guid Remove/TryRemove hit: 8-18% faster (all sizes) - Int32 Remove/TryRemove hit: 4-15% faster (most sizes) - String (ref type): neutral (as expected) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR updates Dictionary<TKey, TValue>.Remove to match the existing loop-splitting pattern used by FindValue / TryInsert, separating the value-type-key fast path (no comparer, devirtualized EqualityComparer<TKey>.Default) from the comparer-based path to improve JIT register allocation and reduce spills in the hot loop.
Changes:
- Split the inner chain-walk loop in
Remove(TKey)into two loops:(TKey is value type && comparer is null)vs comparer-based. - Apply the same split to
Remove(TKey, out TValue)and add explicit “keep overloads in sync” notes.
|
Tagging subscribers to this area: @dotnet/area-system-collections |
|
@MihuBot benchmark Dictionary |
|
See benchmark results at https://gist.github.com/MihuBot/28f7bffb0197607f3106e866cdca2fb7 |
|
Note This analysis was generated by Copilot. MihuBot benchmark analysisThe MihuBot benchmark results cover a wide range of dictionary-related benchmarks, but none of them test Below is the complete list of benchmarks that directly test Perf_Dictionary (3000 items)
DictionarySequentialKeys
TryGetValue (512 items, Dictionary rows only)
TryAdd (512 items, Dictionary rows only)
All ratios fall in the 0.96-1.04 noise band. No regressions on untouched Dictionary operations. TryAdd and TryGetValue already had the split-loop pattern from PR #75663, so no change is expected there. The improvement from this PR (value-type key Remove) isn't covered by the existing perf suites. |
|
Good to go I think? |
For something as core as dictionary, I'd prefer to have lab measurements for anything we think is worth improving. Could you add remove benchmarks to the perf repo? |
|
@stephentoub done, see dotnet/performance#5178 (comment) with the results with and without this PR. |
src/libraries/System.Private.CoreLib/src/System/Collections/Generic/Dictionary.cs
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Collections/Generic/Dictionary.cs
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Collections/Generic/Dictionary.cs
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Collections/Generic/Dictionary.cs
Show resolved
Hide resolved
|
I think this is in good shape to merge. |
Split Dictionary.Remove inner loops for value-type key optimization
FindValue,TryInsert, andCollectionsMarshalHelper.GetValueRefOrAddDefaultall split their inner loops into separate value-type and comparer paths:This lets the JIT register-allocate each loop independently. The value-type loop has no virtual calls, so the register allocator can keep all loop variables in registers.
Remove(both overloads) was not split during the PR #75663 refactoring that introduced this pattern. It uses a single loop with an inline ternary:The JIT keeps both code paths alive in a single loop, so the register allocator must plan for the comparer virtual call even on the value-type path. This causes unnecessary stack spills and reloads every iteration.
This PR applies the same split to both
Remove(TKey key)andRemove(TKey key, out TValue value).Codegen impact (x64,
Dictionary<int, int>.Remove)Before (single merged loop): 3 spills + 3 reloads per iteration
After (value-type loop): 0 spills, 0 reloads
arm64 shows the same pattern: baseline has 1 spill + 2 reloads per iteration (fewer due to more callee-saved registers); modified value-type loop has 0.
Benchmark results
Both builds from identical commit (b38cc2a), only
Dictionary.csdiffers. BDN v0.16.0-nightly,--coreRunA/B, MannWhitney statistical test with 3ns threshold. Two independent runs: Run 1 (15 iterations), Run 2 (30 iterations, high priority process). Run 2 shown below; Run 1 results were consistent.Machine: Intel Core i9-14900K 3.20GHz, 24 cores (hybrid P/E), 64GB RAM, Windows 11.
All Remove scenarios (Run 2, 30 iterations)
Analysis
<--