Phase C.2 foundational APIs: server_tokens coexistence + common_speculative_reset#18
Open
WillowOneVision wants to merge 1 commit into
Conversation
Add the foundational APIs required for future per-batch MTP+mmproj coexistence
dispatch, with no runtime behavior change. The next-phase work (gate MTP draft
per-batch based on image presence, opt-in via --allow-mtp-with-mmproj flag) will
build on top of these primitives.
server_tokens (tools/server/server-common.{h,cpp})
- is_pure_text_continuation(from_idx): O(log n) oracle answering "if I decode
from from_idx, will all remaining tokens be pure text?"
- last_image_end_idx(): end-exclusive idx of last media chunk in the buffer.
- get_text_tokens_post_media(): text-token suffix after last media (returned
by value; existing get_text_tokens() asserts !has_mtmd and is unchanged).
common_speculative (common/speculative.{h,cpp})
- virtual void reset() on common_speculative_state (default = cancel()).
- common_speculative_state_mtp::reset() override: drains in-flight async draft
AND zeroes h_idx, prev_n_acc_drafts, zero_accept_streak, last_spec_params.
Post-condition: next begin()/draft() pair behaves as if MTP was just
constructed.
- common_speculative_reset(spec): C wrapper, null-safe, no-op for non-MTP
impls (matches cancel/prepare_next pattern).
Tests
- tests/test-server-tokens.cpp: 4 test groups for server_tokens new APIs
(empty buffer, text-only, mtmd+empty-map, semantics across from_idx ranges).
With-image cases deferred to integration tests once dispatch is wired up.
- tests/test-speculative-mtp.cpp: contract smoke for common_speculative_reset
+ common_speculative_cancel on null spec (runs unconditionally, no model
required).
All tests pass on aarch64 Pi16; full llama-server links clean. Zero behavior
change to default text-only or default multimodal codepaths.
9 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR introduces the foundational APIs (Phase C.2.0 + C.2.1) needed for future true coexistence of MTP speculative decoding and
--mmprojmultimodal loading. No runtime behavior change in this PR — these are pure additions that future per-batch dispatch work will consume.The motivating context is a follow-up to PR #17 (
fix(server): SEGV when --mtp-head + --mmproj are both passed). PR #17 made the combination safe by disabling MTP unconditionally when mmproj is loaded. This PR begins the path to keeping MTP enabled and gating it per-batch based on whether the current decode step is generating pure text continuation (post image encoding) or still consuming image tokens — which is where the actual speedup opportunity lives.The next-phase PR (not in this one) will:
--allow-mtp-with-mmprojCLI flag.server-context.cppto gatecommon_speculative_draft()per batch using these new APIs.What's in this PR
server_tokens(tools/server/server-common.{h,cpp})Three new accessors on the existing
server_tokensstruct, no new state added:bool is_pure_text_continuation(size_t from_idx) const— O(log n) oracle. Returnstrueif there is no media chunk extending pastfrom_idx. Trivial for!has_mtmdand forhas_mtmdwith an empty media map.size_t last_image_end_idx() const— end-exclusive index of the last media chunk in the buffer (start_idx + n_tokens). Returns 0 if there are no media chunks.llama_tokens get_text_tokens_post_media() const— returns the text-token suffix after the last media chunk. Returned by value because the underlying buffer interleaves media positions; callers typically bind to a const ref of the temporary. The existingget_text_tokens()still asserts!has_mtmdand is unchanged — these complement, not replace.common_speculative_reset(common/speculative.{h,cpp})virtual void reset()hook oncommon_speculative_state, defaulting tocancel().common_speculative_state_mtp::reset()override: drains any in-flight async draft (cancel semantics) AND zeroesh_idx,prev_n_acc_drafts,zero_accept_streak, andlast_spec_params. Post-condition: the nextbegin()/draft()pair behaves as if MTP was just constructed.void common_speculative_reset(common_speculative * spec)C wrapper, null-safe, no-op for non-MTP impls — same shape as the existingcommon_speculative_cancel/common_speculative_prepare_next.Use case: at known state-boundaries that are not prompt boundaries but DO invalidate the assistant's hidden-state assumptions — e.g. a slot transitioning from image-encoding (where MTP would be gated off) back to text continuation (where MTP should re-engage from a clean slate). The next few text tokens incur the usual warmup cost, but state desync is avoided. This is the "Option 2 cold-restart" strategy from the design.
Tests
tests/test-server-tokens.cpp(new): 4 test groups covering empty buffer, text-only, mtmd-enabled-with-empty-map, andfrom_idxboundary semantics for the new server_tokens APIs. Runs unconditionally (no model required). The with-image scenarios need a realmtmd_input_chunk(clip model + image file) and will be covered by the integration matrix in the dispatch PR.tests/test-speculative-mtp.cpp(extended): added an unconditional contract smoke at the top ofmain()that callscommon_speculative_cancel(nullptr)andcommon_speculative_reset(nullptr)— verifies the documented null-safety contract without needing any model env vars.Why split foundational APIs vs behavior change
The actual coexistence behavior (per-batch dispatch + state management at image→text boundaries) is a more nuanced change with state-alignment subtleties (see the design brief — Option 1 skip-and-resync vs Option 2 cold-restart, hidden-state warmup costs, etc.). Splitting lets these foundational additions land cheaply for review, and lets the behavior-change PR be reviewed independently on its own merits, behind an opt-in flag.
Test plan
cmake --build build -j4 --target test-server-tokens test-speculative-mtp common server-context— all targets compile clean on aarch64, no new warningscmake --build build -j4 --target llama-server— full server binary links clean (verifies API integrates with consumers)./build/bin/test-server-tokens— 4 groups OK, exit 0./build/bin/test-speculative-mtp— null-spec contract OK, then skip path (no model env vars), exit 0ctest -R "speculative|server-tokens"— 2/2 passedserver_tokens::is_pure_text_continuation/get_text_tokens_post_mediauntil per-batch gate is wired up and a multimodal smoke can exercise the boundary)Notes
get_text_tokens()semantics (assert!has_mtmd, return const ref) are untouched.common_speculative_begin / draft / accept / prepare_next / cancel / set_seq_id / set_h_idxsemantics.reset()is purely additive.feature/turboquant-kv-cache. Does not depend on PR fix(server): SEGV when --mtp-head + --mmproj are both passed #17 (the SEGV fix) for compile or test, though the conceptual sequencing is PR fix(server): SEGV when --mtp-head + --mmproj are both passed #17 → this → dispatch PR.🤖 Generated with Claude Code