Skip to content

obs/ash: Performance improvements#170900

Open
alyshanjahani-crl wants to merge 3 commits into
cockroachdb:masterfrom
alyshanjahani-crl:alyshanjahani/ash-perf-fixes
Open

obs/ash: Performance improvements#170900
alyshanjahani-crl wants to merge 3 commits into
cockroachdb:masterfrom
alyshanjahani-crl:alyshanjahani/ash-perf-fixes

Conversation

@alyshanjahani-crl
Copy link
Copy Markdown
Collaborator

@alyshanjahani-crl alyshanjahani-crl commented May 25, 2026

See individual commits for details.

Fixes: #170822
Epic: https://cockroachlabs.atlassian.net/browse/CRDB-63845
Release note: None

@alyshanjahani-crl alyshanjahani-crl requested review from a team as code owners May 25, 2026 19:58
@alyshanjahani-crl alyshanjahani-crl requested review from kyle-a-wong and removed request for a team May 25, 2026 19:58
@trunk-io
Copy link
Copy Markdown
Contributor

trunk-io Bot commented May 25, 2026

Merging to master in this repository is managed by Trunk.

  • To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@alyshanjahani-crl alyshanjahani-crl force-pushed the alyshanjahani/ash-perf-fixes branch from ccee925 to 78d0d15 Compare May 25, 2026 20:00
@alyshanjahani-crl
Copy link
Copy Markdown
Collaborator Author

The KV0 roachtest (kv0/enc=false/nodes=3/cpu=96) didn't show much of an impact from enabling ASH - it was within noise.

┌─────────┬────────┬────────┬────────┬────────┬────────┐
  │ Config  │ Run 1  │ Run 2  │ Run 3  │  Mean  │ Stdev  │
  ├─────────┼────────┼────────┼────────┼────────┼────────┤
  │ ASH off │ 56,503 │ 61,212 │ 57,236 │ 58,317 │ ~2,530 │
  ├─────────┼────────┼────────┼────────┼────────┼────────┤
  │ ASH on  │ 52,320 │ 57,665 │ 61,137 │ 57,041 │ ~4,430 │
  └─────────┴────────┴────────┴────────┴────────┴────────┘

I did one run after including the fixes in this commit with ASH enabled, and observed a throughput of 62,418

The KV50 roachtest is where we saw more of a regression and filed the linked issue and lead to this PR as a result:

Config                                    ops/sec   p50(ms)   p99(ms)
  ──────────────────────────────────────────────────────────────────────
  ASH off (baseline)                        2291.7    3.5       8.1
  ASH on (no fixes)                         1909.3    4.2      10.5
  ASH on + mutex+map only                   2024.4    3.8       9.4
  ASH on + LatchWait + sharded retired      2049.9    3.9       9.4
  ASH on + ALL THREE fixes (run 1)          1971.7    4.1      10.0
  ASH on + ALL THREE fixes (run 2)          2179.0    3.8       8.9

The microbenchmark added in this PR showed:

│  before (syncutil.Map) │   after (mutex+map)                  │
  SetWorkStateWithRange     1504.0 ns/op ± 15%      952.5 ns/op ± 4%   -36.67% (p=0.000)
                             514 B/op                279 B/op           -45.72%
                               8 allocs/op              3 allocs/op     -62.50%

alyshanjahani-crl and others added 3 commits May 25, 2026 16:09
Previously, SetWorkState was called unconditionally at the top of
wait(), meaning every latch acquisition paid the ASH registration
cost even when no conflicting latches existed. In the common
uncontended case, the iterator finds no conflicts and returns
immediately, making the SetWorkState/clearWorkState pair pure
overhead.

Move the SetWorkState call into iterAndWait so it only fires when
the goroutine actually needs to block on a conflicting latch. This
also passes the full Guard into iterAndWait (instead of just the
poison.Policy) so the workload metadata is available at the call
site.

Part of: cockroachdb#170822
Epic: CRDB-63845
Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
The global retiredWorkStates mutex was a contention point under high
QPS: every clearWorkState call across all goroutines serialized on
the same lock to append to the retired list.

Replace the single global retired list with per-shard retired lists
embedded in workStateMapShard. Each shard's retired list has its own
mutex and a proportional cap (maxRetiredWorkStates / numShards),
eliminating cross-shard contention on the retire path.

Part of: cockroachdb#170822
Epic: CRDB-63845
Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
ASH's access pattern — unique, never-reused goroutine IDs as keys —
hits the worst case for syncutil.Map (a fork of sync.Map). Every
SetWorkState inserts a new key, forcing the slow mutex path. When
the sampler calls rangeWorkStates (via Range), it promotes the dirty
map to read and sets dirty=nil. The very next Store then triggers
dirtyLocked(), which allocates a new map and copies all non-expunged
entries from read. Over a 30-minute kv50 benchmark, this creates
synchronized 1-second bursts where every goroutine serializes
through the mutex during the dirty map rebuild.

Replace syncutil.Map with a plain map[int64]*WorkState protected by
a per-shard syncutil.Mutex. This eliminates:
- The dirty map rebuild cycle triggered by Range()
- Stale expunged entry accumulation (delete actually removes keys)
- The double lock acquisition per SetWorkState (Load then Store)

rangeWorkStates now snapshots each shard's entries (value-copied)
under the shard lock, then iterates the snapshot outside the lock.
This keeps lock hold time proportional to shard size rather than
callback execution time.

BenchmarkSetWorkStateWithRange (unique goids + continuous Range)
shows a 37% latency improvement and 63% fewer allocations per op.

Part of: cockroachdb#170822
Epic: CRDB-63845
Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
@alyshanjahani-crl alyshanjahani-crl force-pushed the alyshanjahani/ash-perf-fixes branch from 78d0d15 to b37abe5 Compare May 25, 2026 20:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

obs/ash: reduce SetWorkState hot-path overhead (~17% kv50 throughput regression)

2 participants