Phase C.2 foundational APIs: server_tokens coexistence + common_speculative_reset by WillowOneVision · Pull Request #18 · AtomicBot-ai/atomic-llama-cpp-turboquant

WillowOneVision · 2026-05-21T22:38:14Z

Summary

This PR introduces the foundational APIs (Phase C.2.0 + C.2.1) needed for future true coexistence of MTP speculative decoding and --mmproj multimodal loading. No runtime behavior change in this PR — these are pure additions that future per-batch dispatch work will consume.

The motivating context is a follow-up to PR #17 (fix(server): SEGV when --mtp-head + --mmproj are both passed). PR #17 made the combination safe by disabling MTP unconditionally when mmproj is loaded. This PR begins the path to keeping MTP enabled and gating it per-batch based on whether the current decode step is generating pure text continuation (post image encoding) or still consuming image tokens — which is where the actual speedup opportunity lives.

The next-phase PR (not in this one) will:

Add an opt-in --allow-mtp-with-mmproj CLI flag.
Modify the dispatch in server-context.cpp to gate common_speculative_draft() per batch using these new APIs.
Implement cold-restart of MTP state at image→text boundaries (Option 2 from the design brief).
Add an integration test matrix (10 scenarios) covering text-only, image-only, image+text, and multi-turn cases.

What's in this PR

`server_tokens` (`tools/server/server-common.{h,cpp}`)

Three new accessors on the existing server_tokens struct, no new state added:

bool is_pure_text_continuation(size_t from_idx) const — O(log n) oracle. Returns true if there is no media chunk extending past from_idx. Trivial for !has_mtmd and for has_mtmd with an empty media map.
size_t last_image_end_idx() const — end-exclusive index of the last media chunk in the buffer (start_idx + n_tokens). Returns 0 if there are no media chunks.
llama_tokens get_text_tokens_post_media() const — returns the text-token suffix after the last media chunk. Returned by value because the underlying buffer interleaves media positions; callers typically bind to a const ref of the temporary. The existing get_text_tokens() still asserts !has_mtmd and is unchanged — these complement, not replace.

`common_speculative_reset` (`common/speculative.{h,cpp}`)

New virtual void reset() hook on common_speculative_state, defaulting to cancel().
common_speculative_state_mtp::reset() override: drains any in-flight async draft (cancel semantics) AND zeroes h_idx, prev_n_acc_drafts, zero_accept_streak, and last_spec_params. Post-condition: the next begin() / draft() pair behaves as if MTP was just constructed.
void common_speculative_reset(common_speculative * spec) C wrapper, null-safe, no-op for non-MTP impls — same shape as the existing common_speculative_cancel / common_speculative_prepare_next.

Use case: at known state-boundaries that are not prompt boundaries but DO invalidate the assistant's hidden-state assumptions — e.g. a slot transitioning from image-encoding (where MTP would be gated off) back to text continuation (where MTP should re-engage from a clean slate). The next few text tokens incur the usual warmup cost, but state desync is avoided. This is the "Option 2 cold-restart" strategy from the design.

Tests

tests/test-server-tokens.cpp (new): 4 test groups covering empty buffer, text-only, mtmd-enabled-with-empty-map, and from_idx boundary semantics for the new server_tokens APIs. Runs unconditionally (no model required). The with-image scenarios need a real mtmd_input_chunk (clip model + image file) and will be covered by the integration matrix in the dispatch PR.
tests/test-speculative-mtp.cpp (extended): added an unconditional contract smoke at the top of main() that calls common_speculative_cancel(nullptr) and common_speculative_reset(nullptr) — verifies the documented null-safety contract without needing any model env vars.

Why split foundational APIs vs behavior change

The actual coexistence behavior (per-batch dispatch + state management at image→text boundaries) is a more nuanced change with state-alignment subtleties (see the design brief — Option 1 skip-and-resync vs Option 2 cold-restart, hidden-state warmup costs, etc.). Splitting lets these foundational additions land cheaply for review, and lets the behavior-change PR be reviewed independently on its own merits, behind an opt-in flag.

Test plan

cmake --build build -j4 --target test-server-tokens test-speculative-mtp common server-context — all targets compile clean on aarch64, no new warnings
cmake --build build -j4 --target llama-server — full server binary links clean (verifies API integrates with consumers)
./build/bin/test-server-tokens — 4 groups OK, exit 0
./build/bin/test-speculative-mtp — null-spec contract OK, then skip path (no model env vars), exit 0
ctest -R "speculative|server-tokens" — 2/2 passed
With-image integration tests — deferred to the dispatch PR (cannot exercise the with-image branches of server_tokens::is_pure_text_continuation / get_text_tokens_post_media until per-batch gate is wired up and a multimodal smoke can exercise the boundary)

Notes

No header-level ABI changes that affect existing callers. Existing get_text_tokens() semantics (assert !has_mtmd, return const ref) are untouched.
No changes to common_speculative_begin / draft / accept / prepare_next / cancel / set_seq_id / set_h_idx semantics. reset() is purely additive.
Branched off feature/turboquant-kv-cache. Does not depend on PR fix(server): SEGV when --mtp-head + --mmproj are both passed #17 (the SEGV fix) for compile or test, though the conceptual sequencing is PR fix(server): SEGV when --mtp-head + --mmproj are both passed #17 → this → dispatch PR.

🤖 Generated with Claude Code

Add the foundational APIs required for future per-batch MTP+mmproj coexistence dispatch, with no runtime behavior change. The next-phase work (gate MTP draft per-batch based on image presence, opt-in via --allow-mtp-with-mmproj flag) will build on top of these primitives. server_tokens (tools/server/server-common.{h,cpp}) - is_pure_text_continuation(from_idx): O(log n) oracle answering "if I decode from from_idx, will all remaining tokens be pure text?" - last_image_end_idx(): end-exclusive idx of last media chunk in the buffer. - get_text_tokens_post_media(): text-token suffix after last media (returned by value; existing get_text_tokens() asserts !has_mtmd and is unchanged). common_speculative (common/speculative.{h,cpp}) - virtual void reset() on common_speculative_state (default = cancel()). - common_speculative_state_mtp::reset() override: drains in-flight async draft AND zeroes h_idx, prev_n_acc_drafts, zero_accept_streak, last_spec_params. Post-condition: next begin()/draft() pair behaves as if MTP was just constructed. - common_speculative_reset(spec): C wrapper, null-safe, no-op for non-MTP impls (matches cancel/prepare_next pattern). Tests - tests/test-server-tokens.cpp: 4 test groups for server_tokens new APIs (empty buffer, text-only, mtmd+empty-map, semantics across from_idx ranges). With-image cases deferred to integration tests once dispatch is wired up. - tests/test-speculative-mtp.cpp: contract smoke for common_speculative_reset + common_speculative_cancel on null spec (runs unconditionally, no model required). All tests pass on aarch64 Pi16; full llama-server links clean. Zero behavior change to default text-only or default multimodal codepaths.

github-actions Bot added testing examples server labels May 21, 2026

WillowOneVision mentioned this pull request May 21, 2026

Phase C.2 dispatch behavior: MTP+mmproj coexistence behind --allow-mtp-with-mmproj (5th first-in-world) #19

Open

9 tasks

WillowOneVision marked this pull request as ready for review May 22, 2026 05:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Phase C.2 foundational APIs: server_tokens coexistence + common_speculative_reset#18

Phase C.2 foundational APIs: server_tokens coexistence + common_speculative_reset#18
WillowOneVision wants to merge 1 commit into
AtomicBot-ai:feature/turboquant-kv-cachefrom
WillowOneVision:cecil/phase-c2-foundational-apis

WillowOneVision commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

WillowOneVision commented May 21, 2026

Summary

What's in this PR

server_tokens (tools/server/server-common.{h,cpp})

common_speculative_reset (common/speculative.{h,cpp})

Tests

Why split foundational APIs vs behavior change

Test plan

Notes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

`server_tokens` (`tools/server/server-common.{h,cpp}`)

`common_speculative_reset` (`common/speculative.{h,cpp}`)