feat(server): cache system prompt as a reusable KV prefix #66
Open
unsaltedbutter-ai wants to merge 1 commit into antirez:main
Conversation
Add `--kv-cache-anchor-user`, which cuts cold KV cache stores at the first `<|User|>` token rather than at a length-aligned boundary. Workloads that resend the same system prompt with varying user content (benchmarks, classifiers, single-turn agents) get a stable cache key covering exactly the system tokens, so every subsequent request hits at the system-prompt boundary regardless of how the user message changes. Off by default. The anchor takes priority over the existing alignment heuristic when a `<|User|>` token is found; the heuristic stays in place for raw `/v1/completions` and non-DeepSeek templates. Multi-turn `evict` and `continued` snapshots are unchanged and coexist with anchor entries; the longest-match rule in `kv_cache_find_prefix` automatically picks the longer snapshot when both apply.
Force-pushed from ff6c403 to 983df6c.
Author
I've run some more tests and on agents like Hermes we get an extra cache hit on new context windows with this proposed flag off. Lower
Summary
Adds `--kv-cache-anchor-user`, an opt-in flag that cuts cold KV cache stores at the first `<|User|>` token instead of at a length-aligned boundary. Workloads that resend the same system prompt with varying user content now get a stable cache key over exactly the system tokens, so every subsequent request hits at the system-prompt boundary regardless of how the user message varies.
Motivation
Cold KV cache stores are sized by `kv_cache_store_len` using `boundary_align_tokens` and `boundary_trim_tokens`. With the default `boundary_align_tokens` of 2048, any prompt shorter than ~2080 tokens has its store length collapse to the full prompt length, so the cache key SHA covers every token, including the user message. Workloads that resend the same system prompt with different user content (categorizers, classifiers, single-turn agents, benchmarks) produce a unique SHA per request and never hit.

Tuning `--kv-cache-boundary-align-tokens` down to 256 helps, but the cut position is a length-arithmetic guess. It can drift if the system prompt's token count changes, and it always leaves between 0 and `align - 1` tokens of cacheable system text on the table.

The DeepSeek chat template emits `<|User|>` as a single atomic vocabulary token. The tokens preceding it (BOS plus the system text) form a byte-stable prefix that is independent of whatever user message follows. This patch lets the cold-cache cut land exactly at that natural boundary.
What this changes
- `ds4.h`/`ds4.c`: new accessor `ds4_token_user` exposing `vocab.user_id`.
- `ds4_server.c`:
  - `kv_cache_options` gains a `bool anchor_user_enabled` field (default `false`).
  - `kv_disk_cache` gains an `int user_token_id` field, populated at server init from `ds4_token_user`.
  - `kv_cache_anchor_pos` walks the prompt tokens and returns the position of the first `<|User|>`, or `-1` when disabled, when the marker is absent, when `user_token_id` is unset, or when the anchor sits below `min_tokens`.
  - `generate_job` prefers the anchor cut over `kv_cache_store_len` when an anchor is available.
  - `--kv-cache-anchor-user` enables the feature.
  - The startup log appends `anchor_user=on` or `anchor_user=off` to the existing `KV disk cache ...` line.
- `test_kv_cache_anchor_pos_finds_first_user_token` covers enable/disable, marker present, marker absent, marker below `min_tokens`, and unset `user_token_id`.
Behavior
- When a `<|User|>` token is found, the cold cache entry covers tokens `[0, anchor_pos)` (no trim required, since `<|User|>` sits on an atomic vocabulary boundary).
- The existing length-aligned heuristic still applies when no `<|User|>` token is present (raw `/v1/completions`, non-DeepSeek templates) or when the anchor sits below `--kv-cache-min-tokens`.
- `evict` and `continued` snapshot behavior is unchanged. The longest-match rule in `kv_cache_find_prefix` automatically prefers the longer snapshot when a multi-turn conversation is resumed, so anchor entries and conversation snapshots coexist without interference.
Benchmark
Workload: 100 financial-transaction categorization requests sharing one 870-token system prompt, with the transaction details varying inside the user message. Server launched with `--ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192 --mtp ... --mtp-draft 2`.

- v1 (defaults): the default `--kv-cache-boundary-align-tokens` of 2048 forces sub-2k-token prompts to be cached as full-prompt blobs whose SHA includes the variable user content, so no entry ever matches a later request.
- v2: `--kv-cache-boundary-align-tokens 256 --kv-cache-min-tokens 512`. Length-aligned cut at 768 tokens, which happens to land inside the 870-token system prompt. Hits on the 768-token cache every request, leaving ~190 tokens to prefill.
- v3: `--kv-cache-anchor-user`. Cut lands exactly on the first `<|User|>` token at position 870, covering the entire system prompt. ~85 tokens to prefill.

Per-request, the prefill phase drops from ~2.67 s (v1) to ~1.25 s (v2) to ~0.83 s (v3). The cache file hash is byte-identical across every request in v3, confirming the cut lands inside the stable system region.
Tests
- `test_kv_cache_anchor_pos_finds_first_user_token` (in the `--server` group, no model required).
- `test_kv_cache_store_len_uses_configured_boundary` is unchanged; the helper is independent of the anchor path.

Run with:
Usage
Startup log will print `anchor_user=on` inside the existing `KV disk cache ...` line. The first request prints `kv cache stored tokens=N reason=cold` where `N` is the token position of the first `<|User|>`. Subsequent requests sharing the same system prompt print `kv cache hit tokens=N`.
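A launch combining the benchmark's cache flags with the new anchor flag might look like the following. The binary name and model argument are placeholders; the flags are the ones named in this PR:

```shell
# Placeholder binary and model path; flags as described in this PR.
./ds4-server --model /path/to/model.gguf \
    --ctx 100000 \
    --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192 \
    --kv-cache-min-tokens 512 \
    --kv-cache-anchor-user
```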