Optimize positional lookups with cached prefix-sum tuple#249

Open
EliMunkey wants to merge 1 commit into grantjenks:master from EliMunkey:pr/cumsum-optimization

Conversation

@EliMunkey

Summary

This PR introduces a cached prefix-sum tuple (_cumsum) that accelerates the two most expensive internal methods:

  • _loc (sublist position → flat index): O(log n) Python tree traversal → O(1) tuple lookup
  • _pos (flat index → sublist position): O(log n) Python tree traversal → O(log n) C-level bisect_right

The _cumsum is built during _build_index and invalidated on structural changes. When cumsum is unavailable (e.g., during __delitem__ loops where each delete invalidates it), the original tree traversal is used as a fallback — so write-heavy operations are unaffected.

Additional optimizations

  • __slots__ on SortedList and SortedKeyList
  • Simplified __getitem__ for integers — removed redundant special-case checks already handled by _pos
  • __class__ is int dispatch in __getitem__/__delitem__ (faster than isinstance(x, slice))
  • Deferred attribute lookups in _delete/_expand: _maxes and _load are only loaded in the rare split/merge paths
  • Inline _expand guard in add() — skip function call when index is empty and no split needed
  • Lazy _maxes update in _delete — skip when deleted element is not the sublist maximum
  • _cumsum validation in _check() invariant method for both classes

All optimizations applied consistently to both SortedList and SortedKeyList.
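The `__class__ is int` dispatch deserves a note: an exact-type identity check skips `isinstance`'s subclass machinery, while subclasses (including `bool`) still work through the generic path. A minimal sketch of the idea (`getitem_dispatch` is a hypothetical stand-in, not the library's actual code):

```python
def getitem_dispatch(seq, index):
    """Illustrative dispatch: fast path for exact int, as in the PR."""
    if index.__class__ is int:
        # Exact-type check: cheaper than isinstance, hit on the common case.
        return seq[index]
    if isinstance(index, slice):
        # Slices take the (unchanged) slow path.
        return seq[index]
    # int subclasses such as bool still resolve correctly here.
    return seq[int(index)]
```

Note that `True.__class__ is int` is `False`, so `bool` indices fall through to the generic path rather than the fast path, preserving correctness.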

Benchmark results

Tested on Python 3.14.3 (Apple Silicon), 1M elements. Results verified with interleaved A/B testing (3 trials, best-of-3 per operation) using the project's own benchmark_sortedlist.py:

Per-operation (at 1,000,000 elements)

| Operation | Original | Optimized | Speedup |
| --- | --- | --- | --- |
| getitem | 21.2ms | 6.9ms | +67.5% |
| bisect | 22.3ms | 10.6ms | +52.3% |
| index | 23.7ms | 13.8ms | +41.9% |
| update_large | 338.3ms | 295.2ms | +12.8% |
| remove | 17.8ms | 15.5ms | +12.9% |
| delitem | 37.5ms | 33.6ms | +10.5% |
| update_small | 114.3ms | 103.5ms | +9.4% |
| add | 17.5ms | 16.0ms | +8.9% |
| contains | 10.5ms | 9.6ms | +8.9% |
| count | 15.3ms | 14.0ms | +8.3% |
| priorityqueue | 152.1ms | 141.4ms | +7.1% |
| iter | 56.9ms | 52.9ms | +7.0% |
| multiset | 171.3ms | 162.4ms | +5.2% |
| pop | 5.1ms | 5.1ms | +1.3% |
| ranking | 251.8ms | 249.9ms | +0.8% |
| neighbor | 250.0ms | 270.1ms | -8.1% |
| intervals | 282.9ms | 293.0ms | -3.6% |
| init | 258.7ms | 268.9ms | -3.9% |

Overall: +4.2% across all 18 operations at 1M elements

The mixed workloads (priorityqueue, multiset, ranking) also benefit because they interleave bisect, index, and getitem with add/remove.

How it works

The existing _index tree stores sublist lengths in a dense binary tree, supporting O(log n) traversal for both _pos (downward, root-to-leaf) and _loc (upward, leaf-to-root). These traversals use Python while loops with ~10 iterations each.

The _cumsum tuple stores (0, len₀, len₀+len₁, len₀+len₁+len₂, ...) — a prefix sum with a leading zero. This enables:

  • _loc(pos, idx): simply _cumsum[pos] + idx (O(1), one tuple index + one add)
  • _pos(idx): bisect_right(_cumsum, idx) - 1 to find the sublist, then idx -= _cumsum[pos] for the offset (O(log n) in C via bisect)

The leading zero eliminates the need for pos - 1 indexing or if pos: branching.
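The two lookups can be sketched in a few lines. This is a standalone illustration of the scheme described above, not the PR's code; the sublist lengths are made up, and `itertools.accumulate(..., initial=0)` (Python 3.8+) produces the leading zero:

```python
from bisect import bisect_right
from itertools import accumulate

lengths = [4, 3, 5]  # illustrative sublist lengths
cumsum = tuple(accumulate(lengths, initial=0))  # (0, 4, 7, 12)

def loc(pos, idx):
    """Sublist position -> flat index: O(1), one tuple index + one add."""
    return cumsum[pos] + idx

def pos(idx):
    """Flat index -> (sublist, offset): O(log n) via C-level bisect."""
    p = bisect_right(cumsum, idx) - 1  # rightmost boundary <= idx
    return p, idx - cumsum[p]
```

Because `cumsum[0] == 0`, `bisect_right(...) - 1` is never negative for valid indices, so no `if pos:` branch or `pos - 1` adjustment is needed.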

The _cumsum is invalidated (set to empty tuple) whenever the tree is modified — both on structural changes (splits/merges that clear the tree) and on incremental updates (single add/delete that update tree nodes). The fallback tree traversal handles the write-heavy case (e.g., __delitem__ loops) without regression.
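The cache/invalidate/fallback pattern can be summarized with a toy class. This is a hedged sketch under the assumptions stated in its comments; names echo the PR, but the fallback here is a naive linear scan standing in for the real tree traversal:

```python
class PrefixSumCache:
    """Toy model of the _cumsum lifecycle: build, use, invalidate, fall back."""

    def __init__(self, lists):
        self._lists = lists
        self._cumsum = ()  # empty tuple marks the cache as invalid

    def _build(self):
        # Rebuilt alongside the index: prefix sums with a leading zero.
        total, out = 0, [0]
        for sub in self._lists:
            total += len(sub)
            out.append(total)
        self._cumsum = tuple(out)

    def _loc(self, pos, idx):
        if not self._cumsum:
            # Fallback path (stands in for the tree traversal): used in
            # write-heavy loops where each mutation invalidates the cache.
            return sum(len(s) for s in self._lists[:pos]) + idx
        return self._cumsum[pos] + idx  # O(1) fast path

    def _delete(self, pos, idx):
        del self._lists[pos][idx]
        self._cumsum = ()  # any structural change invalidates the cache
```

The key property is that correctness never depends on the cache: every reader checks for the empty tuple first, so a mutation only has to clear it, never patch it.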

Test plan

  • All 299 existing tests pass (unit tests, coverage tests, stress tests)
  • All 37 doctests pass
  • ruff check — no new lint warnings (5 pre-existing B905 warnings unchanged)
  • ruff format — formatting passes
  • _cumsum validation added to _check() for both SortedList and SortedKeyList
  • Edge cases verified: empty list, single element, single sublist, negative indices, bool indexing, int subclasses, pickle, copy, _reset() with custom load factors
  • A/B benchmarked on Python 3.14 at 100k and 1M elements

🤖 Generated with Claude Code

Add a cached prefix-sum tuple (_cumsum) that accelerates the two most
expensive internal methods: _pos (flat index to sublist position) and
_loc (sublist position to flat index).

The _cumsum is a tuple of cumulative sublist lengths with a leading zero,
built during _build_index and invalidated on structural changes. When
available, _loc becomes O(1) via direct tuple indexing, and _pos becomes
O(log n) via C-level bisect_right — replacing O(log n) Python-level tree
traversal in both cases.

Additional optimizations:
- Add __slots__ to SortedList and SortedKeyList
- Simplify __getitem__ for integer indices by removing redundant checks
  already handled by _pos
- Use __class__ is int dispatch in __getitem__/__delitem__ for faster
  type checking than isinstance(index, slice)
- Defer _maxes and _load attribute lookups to rare split/merge paths
- Skip _expand call in add() when index is empty and no split needed
- Add lazy _maxes update in _delete (skip when deleted element is not max)
- Add _cumsum validation to _check() invariant method

Benchmark results (Python 3.14, 1M elements, A/B tested):

  getitem   +67%    (cumsum _pos + simplified dispatch)
  bisect    +52%    (cumsum _loc)
  index     +42%    (cumsum _loc)
  add       +9%     (inline _expand guard)
  delitem   +10%    (deferred attr lookups)
  remove    +13%    (deferred attr lookups)
  Overall   +4-6%   across all 18 benchmark operations

All changes applied consistently to both SortedList and SortedKeyList.
299 existing tests pass. Zero API changes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
