
feat(server): cache system prompt as a reusable KV prefix#66

Open
unsaltedbutter-ai wants to merge 1 commit into antirez:main from unsaltedbutter-ai:feat/kv-cache-anchor-user

Conversation

@unsaltedbutter-ai

Summary

Adds --kv-cache-anchor-user, an opt-in flag that cuts cold KV cache stores at the first <|User|> token instead of at a length-aligned boundary. Workloads that resend the same system prompt with varying user content now get a stable cache key over exactly the system tokens, so every subsequent request hits at the system-prompt boundary regardless of how the user message varies.

Motivation

Cold KV cache stores are sized by kv_cache_store_len using boundary_align_tokens and boundary_trim_tokens. With the default boundary_align_tokens of 2048, any prompt shorter than ~2080 tokens has its store length collapse to the full prompt length, so the cache key SHA covers every token including the user message. Workloads that resend the same system prompt with different user content (categorizers, classifiers, single-turn agents, benchmarks) produce a unique SHA per request and never hit.

Tuning --kv-cache-boundary-align-tokens down to 256 helps, but the cut position is a length-arithmetic guess. It can drift if the system prompt's token count changes, and it always leaves between 0 and align - 1 tokens of cacheable system text on the table.
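The collapse described above can be sketched in a few lines. This is a hypothetical reconstruction of the length-arithmetic rule, not the actual `kv_cache_store_len` from `ds4_server.c`; the function name and the trim-before-align order are assumptions chosen to match the ~2080-token threshold quoted above (2048 align + 32 trim):

```c
/* Hypothetical sketch of the length-aligned store rule described above.
 * The real kv_cache_store_len() in ds4_server.c may differ in details.
 * The cut is trimmed, then rounded down to a multiple of align_tokens;
 * when no full aligned block fits, the store collapses to the whole
 * prompt, so the cache key SHA covers the variable user text too. */
static int store_len_sketch(int prompt_len, int align_tokens, int trim_tokens) {
    int cut = ((prompt_len - trim_tokens) / align_tokens) * align_tokens;
    if (cut <= 0)
        return prompt_len; /* collapse: cache the full prompt */
    return cut;
}
```

With the defaults assumed here (align 2048, trim 32), an 870-token prompt collapses to a full-prompt store; with align 256 the cut lands at 768, matching the v2 benchmark run below.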

The DeepSeek chat template emits <|User|> as a single atomic vocabulary token. The tokens preceding it (BOS plus the system text) form a byte-stable prefix that is independent of whatever user message follows. This patch lets the cold-cache cut land exactly at that natural boundary.

What this changes

  • ds4.h / ds4.c: new accessor ds4_token_user exposing vocab.user_id.
  • ds4_server.c:
    • kv_cache_options gains a bool anchor_user_enabled field (default false).
    • kv_disk_cache gains an int user_token_id field, populated at server init from ds4_token_user.
    • New helper kv_cache_anchor_pos walks the prompt tokens and returns the position of the first <|User|>, or -1 when disabled, when the marker is absent, when user_token_id is unset, or when the anchor sits below min_tokens.
    • generate_job prefers the anchor cut over kv_cache_store_len when an anchor is available.
    • New CLI flag --kv-cache-anchor-user enables the feature.
    • Startup log now appends anchor_user=on or anchor_user=off to the existing KV disk cache ... line.
    • Unit test test_kv_cache_anchor_pos_finds_first_user_token covers enable/disable, marker present, marker absent, marker below min_tokens, and unset user_token_id.
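The anchor helper's contract, as listed above, can be sketched as follows. This is an illustrative reconstruction, not the code from the patch; the signature and parameter names are assumptions:

```c
/* Hypothetical sketch of kv_cache_anchor_pos() as described above.
 * Returns the index of the first <|User|> token, or -1 when the
 * feature is disabled, user_token_id is unset, the marker is absent,
 * or the anchor would fall below the configured minimum. */
static int anchor_pos_sketch(const int *tokens, int n_tokens,
                             int user_token_id, int enabled, int min_tokens) {
    if (!enabled || user_token_id < 0)
        return -1;
    for (int i = 0; i < n_tokens; i++) {
        if (tokens[i] == user_token_id)
            return (i >= min_tokens) ? i : -1; /* first hit decides */
    }
    return -1; /* no marker in the prompt */
}
```

Returning -1 for every failure mode keeps the caller simple: any negative result falls through to the existing length-based heuristic.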

Behavior

  • Off by default. Existing behavior is unchanged when the flag is not passed.
  • When enabled and a <|User|> token is found, the cold cache entry covers tokens [0, anchor_pos) (no trim required, since <|User|> sits on an atomic vocabulary boundary).
  • Falls through to the existing length-based alignment heuristic when no <|User|> token is present (raw /v1/completions, non-DeepSeek templates) or when the anchor sits below --kv-cache-min-tokens.
  • Pre-existing evict and continued snapshot behavior is unchanged. The longest-match rule in kv_cache_find_prefix automatically prefers the longer snapshot when a multi-turn conversation is resumed, so anchor entries and conversation snapshots coexist without interference.
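The coexistence argument above rests on longest-match lookup. A minimal sketch, assuming names and a token-array comparison where the real `kv_cache_find_prefix` hashes prefixes: the entry covering the most prompt tokens wins, so a long multi-turn snapshot beats a shorter anchor entry whenever both match.

```c
/* Hypothetical sketch of longest-prefix-match selection; the struct
 * and function names are illustrative, not from ds4_server.c. */
typedef struct {
    const int *tokens; /* cached token prefix */
    int len;           /* number of tokens covered by the entry */
} kv_entry_sketch;

static int find_longest_prefix_sketch(const kv_entry_sketch *entries,
                                      int n_entries,
                                      const int *prompt, int prompt_len) {
    int best = -1, best_len = 0;
    for (int e = 0; e < n_entries; e++) {
        /* skip entries longer than the prompt or no better than the best */
        if (entries[e].len > prompt_len || entries[e].len <= best_len)
            continue;
        int match = 1;
        for (int i = 0; i < entries[e].len; i++) {
            if (entries[e].tokens[i] != prompt[i]) { match = 0; break; }
        }
        if (match) { best = e; best_len = entries[e].len; }
    }
    return best; /* index of the longest matching entry, or -1 */
}
```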

Benchmark

Workload: 100 financial-transaction categorization requests sharing one 870-token system prompt, with the transaction details varying inside the user message. Server launched with --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192 --mtp ... --mtp-draft 2.

| run | wall | tok/s | Δ vs prev | Δ vs v1 |
| --- | --- | --- | --- | --- |
| v1 (initial) | 505.57s | 14.01 | | |
| v2 (first cache fix) | 352.00s | 19.80 | −153.57s (−30.4%) | −30.4% |
| v3 (this cache fix) | 312.39s | 22.55 | −39.61s (−11.3%) | −38.2% |
  • v1: server defaults. No cache hits because the default --kv-cache-boundary-align-tokens of 2048 forces sub-2k-token prompts to be cached as full-prompt blobs whose SHA includes the variable user content, so no entry ever matches a later request.
  • v2: --kv-cache-boundary-align-tokens 256 --kv-cache-min-tokens 512. Length-aligned cut at 768 tokens, which happens to land inside the 870-token system prompt. Hits on the 768-token cache every request, leaving ~190 tokens to prefill.
  • v3: --kv-cache-anchor-user. Cut lands exactly on the first <|User|> token at position 870, the entire system prompt. ~85 tokens to prefill.

Per-request, the prefill phase drops from ~2.67 s (v1) to ~1.25 s (v2) to ~0.83 s (v3). The cache file hash is byte-identical across every request in v3, confirming the cut lands inside the stable system region:

kv cache hit tokens=870 quant=2 load=7.8 ms file=/tmp/ds4-kv/733b68a90eda....kv
chat ctx=870..960:90 prompt start
chat ctx=870..960:90 prompt done 0.833s

Tests

  • New unit test: test_kv_cache_anchor_pos_finds_first_user_token (in the --server group, no model required).
  • Existing test test_kv_cache_store_len_uses_configured_boundary is unchanged; the helper is independent of the anchor path.

Run with:

make test
# or, to run just the model-free tests:
./ds4_test --server

Usage

./ds4-server --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192 \
             --kv-cache-anchor-user \
             [other flags]

Startup log will print anchor_user=on inside the existing KV disk cache ... line. The first request prints kv cache stored tokens=N reason=cold where N is the token position of the first <|User|>. Subsequent requests sharing the same system prompt print kv cache hit tokens=N.

Add --kv-cache-anchor-user, which cuts cold KV cache stores at the
first <|User|> token rather than at a length-aligned boundary.
Workloads that resend the same system prompt with varying user
content (benchmarks, classifiers, single-turn agents) get a stable
cache key covering exactly the system tokens, so every subsequent
request hits at the system-prompt boundary regardless of how the
user message changes.

Off by default. The anchor takes priority over the existing alignment
heuristic when a <|User|> token is found; the heuristic stays in
place for raw /v1/completions and non-DeepSeek templates. Multi-turn
evict and continued snapshots are unchanged and coexist with anchor
entries; the longest-match rule in kv_cache_find_prefix automatically
picks the longer snapshot when both apply.
@unsaltedbutter-ai force-pushed the feat/kv-cache-anchor-user branch from ff6c403 to 983df6c on May 10, 2026 at 22:06
@unsaltedbutter-ai
Author

I've run some more tests, and on agents like Hermes we get an extra cache hit on new context windows with this proposed flag off. A lower --kv-cache-boundary-align-tokens gets a larger chunk of the system prompt hitting the cache. So --kv-cache-anchor-user is good when the system prompt is sha256-identical across requests, as in some of my private workflows, but tuning --kv-cache-boundary-align-tokens works better in agent flows.

