Skip to content

Phase C.2 foundational APIs: server_tokens coexistence + common_speculative_reset#18

Open
WillowOneVision wants to merge 1 commit into
AtomicBot-ai:feature/turboquant-kv-cachefrom
WillowOneVision:cecil/phase-c2-foundational-apis
Open

Phase C.2 foundational APIs: server_tokens coexistence + common_speculative_reset#18
WillowOneVision wants to merge 1 commit into
AtomicBot-ai:feature/turboquant-kv-cachefrom
WillowOneVision:cecil/phase-c2-foundational-apis

Conversation

@WillowOneVision
Copy link
Copy Markdown

Summary

This PR introduces the foundational APIs (Phase C.2.0 + C.2.1) needed for future true coexistence of MTP speculative decoding and --mmproj multimodal loading. No runtime behavior change in this PR — these are pure additions that future per-batch dispatch work will consume.

The motivating context is a follow-up to PR #17 (fix(server): SEGV when --mtp-head + --mmproj are both passed). PR #17 made the combination safe by disabling MTP unconditionally when mmproj is loaded. This PR begins the path to keeping MTP enabled and gating it per-batch based on whether the current decode step is generating pure text continuation (post image encoding) or still consuming image tokens — which is where the actual speedup opportunity lives.

The next-phase PR (not in this one) will:

  • Add an opt-in --allow-mtp-with-mmproj CLI flag.
  • Modify the dispatch in server-context.cpp to gate common_speculative_draft() per batch using these new APIs.
  • Implement cold-restart of MTP state at image→text boundaries (Option 2 from the design brief).
  • Add an integration test matrix (10 scenarios) covering text-only, image-only, image+text, and multi-turn cases.

What's in this PR

server_tokens (tools/server/server-common.{h,cpp})

Three new accessors on the existing server_tokens struct, no new state added:

  • bool is_pure_text_continuation(size_t from_idx) const — O(log n) oracle. Returns true if there is no media chunk extending past from_idx. Trivial for !has_mtmd and for has_mtmd with an empty media map.
  • size_t last_image_end_idx() const — end-exclusive index of the last media chunk in the buffer (start_idx + n_tokens). Returns 0 if there are no media chunks.
  • llama_tokens get_text_tokens_post_media() const — returns the text-token suffix after the last media chunk. Returned by value because the underlying buffer interleaves media positions; callers typically bind to a const ref of the temporary. The existing get_text_tokens() still asserts !has_mtmd and is unchanged — these complement, not replace.

common_speculative_reset (common/speculative.{h,cpp})

  • New virtual void reset() hook on common_speculative_state, defaulting to cancel().
  • common_speculative_state_mtp::reset() override: drains any in-flight async draft (cancel semantics) AND zeroes h_idx, prev_n_acc_drafts, zero_accept_streak, and last_spec_params. Post-condition: the next begin() / draft() pair behaves as if MTP was just constructed.
  • void common_speculative_reset(common_speculative * spec) C wrapper, null-safe, no-op for non-MTP impls — same shape as the existing common_speculative_cancel / common_speculative_prepare_next.

Use case: at known state-boundaries that are not prompt boundaries but DO invalidate the assistant's hidden-state assumptions — e.g. a slot transitioning from image-encoding (where MTP would be gated off) back to text continuation (where MTP should re-engage from a clean slate). The next few text tokens incur the usual warmup cost, but state desync is avoided. This is the "Option 2 cold-restart" strategy from the design.

Tests

  • tests/test-server-tokens.cpp (new): 4 test groups covering empty buffer, text-only, mtmd-enabled-with-empty-map, and from_idx boundary semantics for the new server_tokens APIs. Runs unconditionally (no model required). The with-image scenarios need a real mtmd_input_chunk (clip model + image file) and will be covered by the integration matrix in the dispatch PR.
  • tests/test-speculative-mtp.cpp (extended): added an unconditional contract smoke at the top of main() that calls common_speculative_cancel(nullptr) and common_speculative_reset(nullptr) — verifies the documented null-safety contract without needing any model env vars.

Why split foundational APIs vs behavior change

The actual coexistence behavior (per-batch dispatch + state management at image→text boundaries) is a more nuanced change with state-alignment subtleties (see the design brief — Option 1 skip-and-resync vs Option 2 cold-restart, hidden-state warmup costs, etc.). Splitting lets these foundational additions land cheaply for review, and lets the behavior-change PR be reviewed independently on its own merits, behind an opt-in flag.

Test plan

  • cmake --build build -j4 --target test-server-tokens test-speculative-mtp common server-context — all targets compile clean on aarch64, no new warnings
  • cmake --build build -j4 --target llama-server — full server binary links clean (verifies API integrates with consumers)
  • ./build/bin/test-server-tokens — 4 groups OK, exit 0
  • ./build/bin/test-speculative-mtp — null-spec contract OK, then skip path (no model env vars), exit 0
  • ctest -R "speculative|server-tokens" — 2/2 passed
  • With-image integration tests — deferred to the dispatch PR (cannot exercise the with-image branches of server_tokens::is_pure_text_continuation / get_text_tokens_post_media until per-batch gate is wired up and a multimodal smoke can exercise the boundary)

Notes

🤖 Generated with Claude Code

Add the foundational APIs required for future per-batch MTP+mmproj coexistence
dispatch, with no runtime behavior change. The next-phase work (gate MTP draft
per-batch based on image presence, opt-in via --allow-mtp-with-mmproj flag) will
build on top of these primitives.

server_tokens (tools/server/server-common.{h,cpp})
  - is_pure_text_continuation(from_idx): O(log n) oracle answering "if I decode
    from from_idx, will all remaining tokens be pure text?"
  - last_image_end_idx(): end-exclusive idx of last media chunk in the buffer.
  - get_text_tokens_post_media(): text-token suffix after last media (returned
    by value; existing get_text_tokens() asserts !has_mtmd and is unchanged).

common_speculative (common/speculative.{h,cpp})
  - virtual void reset() on common_speculative_state (default = cancel()).
  - common_speculative_state_mtp::reset() override: drains in-flight async draft
    AND zeroes h_idx, prev_n_acc_drafts, zero_accept_streak, last_spec_params.
    Post-condition: next begin()/draft() pair behaves as if MTP was just
    constructed.
  - common_speculative_reset(spec): C wrapper, null-safe, no-op for non-MTP
    impls (matches cancel/prepare_next pattern).

Tests
  - tests/test-server-tokens.cpp: 4 test groups for server_tokens new APIs
    (empty buffer, text-only, mtmd+empty-map, semantics across from_idx ranges).
    With-image cases deferred to integration tests once dispatch is wired up.
  - tests/test-speculative-mtp.cpp: contract smoke for common_speculative_reset
    + common_speculative_cancel on null spec (runs unconditionally, no model
    required).

All tests pass on aarch64 Pi16; full llama-server links clean. Zero behavior
change to default text-only or default multimodal codepaths.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant