fix(server): SEGV when --mtp-head + --mmproj are both passed by WillowOneVision · Pull Request #17 · AtomicBot-ai/atomic-llama-cpp-turboquant

WillowOneVision · 2026-05-21T20:54:00Z

Summary

llama-server consistently segfaults during slot initialization when launched with both --mtp-head <assistant.gguf> and --mmproj <vision.gguf>. This PR diagnoses the root cause empirically (gdb backtrace from a core dump) and ships a minimal two-layer defensive fix that lets the server start with both flags: MTP is cleanly disabled when mmproj is loaded, the crash is gone, and all other paths are preserved.

Reproducer

./llama-server \
    -m gemma-4-E4B-it-Q4_K_M.gguf \
    --mtp-head gemma-4-E4B-it-assistant.Q4_K_M.gguf --spec-type mtp \
    --draft-block-size 3 \
    --mmproj gemma-4-E4B-it-mmproj-F16.gguf \
    --host 127.0.0.1 --port 8090 -t 4 -c 4096 \
    -ctk turbo4 -ctv turbo4 -ctkd turbo4 -ctvd turbo4

Pre-patch log:

srv    load_model: loaded multimodal model, '.../mmproj-F16.gguf'
srv    load_model: speculative decoding is not supported by multimodal, it will be disabled
srv    load_model: initializing slots, n_slots = 4
[Segmentation fault (core dumped)]

Backtrace from core

Program terminated with signal SIGSEGV, Segmentation fault.
#0 llama_context::n_batch() const  (from libllama.so.0)
#1 common_speculative_init(common_params_speculative&, llama_context*)
#2 server_context_impl::load_model(common_params&)
#3 main

Root cause chain (10 steps)

1. common/common.cpp::common_init_from_params calls llama_model_load_mtp_from_file() to load the MTP assistant into the target model; no separate draft context exists by design.
2. tools/server/server-context.cpp:668: when params_spec.type == COMMON_SPECULATIVE_TYPE_MTP, server-context sets params_base.speculative.model_dft = nullptr (correct, since there is no draft context).
3. tools/server/server-context.cpp:736-739: when mctx != nullptr (mmproj loaded), server-context overrides params_base.speculative.type = COMMON_SPECULATIVE_TYPE_NONE and emits the WARN "speculative decoding is not supported by multimodal, it will be disabled". But it does NOT clear params_base.speculative.mparams_dft.path (= the --mtp-head GGUF path).
4. common/speculative.cpp:1250::common_speculative_is_compat(ctx_tgt) returns true. It only checks that the target context supports 2-token decode + partial sequence removal, independent of speculative type. So can_spec = true at server-context.cpp:777.
5. server-context.cpp:795: slot.spec = common_speculative_init(params_base.speculative, slot.ctx).
6. In common_speculative_init: line 1293 if (params.model_dft && params.type != MTP) { ctx_dft = init_from_model(...); }. model_dft is null (set at step 2), so ctx_dft stays nullptr.
7. Line 1322: has_draft = !params.mparams_dft.path.empty() evaluates to true (path was set by --mtp-head at CLI; never cleared at step 3).
8. Line 1330: has_mtp = (params.type == MTP) evaluates to false (was overridden to NONE at step 3).
9. Lines 1364-1369: if (has_draft) { if (has_mtp) {…} else { configs.push_back(DRAFT); } } pushes a COMMON_SPECULATIVE_TYPE_DRAFT config based on the orphaned path.
10. Lines 1383-1390: DRAFT case constructs common_speculative_state_draft(type, ctx_tgt, ctx_dft=nullptr, replacements). The ctor at line 227 calls batch = llama_batch_init(llama_n_batch(ctx_dft), 0, 1) → llama_n_batch(nullptr) → SEGV.

In one sentence: the server's mmproj-disable cleanup at server-context.cpp:738 only clears params.speculative.type but leaves params.speculative.mparams_dft.path set, which lets common_speculative_init push a DRAFT config whose constructor then dereferences a null ctx_dft.
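The has_draft / has_mtp selection described above can be modelled in isolation. The following is a minimal sketch, not the real llama.cpp code: SpecType, SpecParams, select_configs, disable_for_mmproj_prepatch and would_deref_null_draft_ctx are hypothetical stand-ins, and the body of the MTP branch (elided in the description) is assumed to simply record an MTP entry.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for the real types; only the roles of the fields
// (type, draft path, draft model pointer) follow the chain described above.
enum class SpecType { NONE, DRAFT, MTP };

struct SpecParams {
    SpecType    type = SpecType::NONE;
    std::string draft_path;            // stands in for mparams_dft.path
    const void* model_dft = nullptr;   // stands in for model_dft
};

// Config selection driven only by has_draft / has_mtp.
// Assumption: the MTP branch just records an MTP entry (its real body is elided).
std::vector<SpecType> select_configs(const SpecParams& p) {
    const bool has_draft = !p.draft_path.empty();
    const bool has_mtp   = (p.type == SpecType::MTP);
    std::vector<SpecType> configs;
    if (has_draft) {
        configs.push_back(has_mtp ? SpecType::MTP : SpecType::DRAFT);
    }
    return configs;
}

// Pre-patch mmproj disable site: resets the type, leaves the path behind.
void disable_for_mmproj_prepatch(SpecParams& p) {
    p.type = SpecType::NONE;
}

// True when a DRAFT config would be built although no draft context exists,
// which is the state that ends in llama_n_batch(nullptr).
bool would_deref_null_draft_ctx(const SpecParams& p) {
    const void* ctx_dft = p.model_dft; // no draft model means no draft context
    for (SpecType t : select_configs(p)) {
        if (t == SpecType::DRAFT && ctx_dft == nullptr) {
            return true;
        }
    }
    return false;
}
```

Starting from MTP parameters (type MTP, path set, model_dft null), the model reports no problem; after disable_for_mmproj_prepatch the same parameters select a DRAFT config with a null draft context.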

Patch (two-layer defense)

Layer 1 — `common/speculative.cpp::common_speculative_init`

Defensive early bail when type == NONE. Any future caller hitting the same conditions also gets protection.

common_speculative * common_speculative_init(
        common_params_speculative & params,
        llama_context             * ctx_tgt) {
    // Defensive: if speculative was disabled upstream (e.g. server disables it when mmproj
    // is loaded), bail out before any impl construction. Without this guard, a caller that
    // sets params.type=NONE but leaves params.mparams_dft.path (set via --mtp-head/--model-draft)
    // would still trigger the DRAFT config below with ctx_dft=nullptr, crashing in the
    // common_speculative_state_draft ctor at llama_n_batch(ctx_dft).
    if (params.type == COMMON_SPECULATIVE_TYPE_NONE) {
        return nullptr;
    }
    // ...existing code...
}

Layer 2 — `tools/server/server-context.cpp:736-739`

Clear the orphan path + pointer at the disable site.

if (params_base.speculative.type != COMMON_SPECULATIVE_TYPE_NONE) {
    params_base.speculative.type = COMMON_SPECULATIVE_TYPE_NONE;
    // Also clear the draft model path so common_speculative_init does not
    // observe an orphan has_draft=true with type=NONE (would build a DRAFT
    // config and crash on ctx_dft=nullptr). See common/speculative.cpp init.
    params_base.speculative.mparams_dft.path.clear();
    params_base.speculative.model_dft = nullptr;
    SRV_WRN("%s\n", "speculative decoding is not supported by multimodal, it will be disabled");
}

Either layer alone is sufficient to prevent the crash; both together provide redundancy at the producer (server) and consumer (init) ends.
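The claim that either layer alone is sufficient can be checked against a toy model with each layer behind a switch. This is only a sketch under stated assumptions: SpecParams, InitResult, disable_for_mmproj and spec_init are hypothetical names, and the model assumes that init with no draft path and no MTP type yields no speculative state.

```cpp
#include <cassert>
#include <string>

enum class SpecType { NONE, DRAFT, MTP };

// Hypothetical stand-in for the speculative parameters.
struct SpecParams {
    SpecType    type = SpecType::NONE;
    std::string draft_path;            // stands in for mparams_dft.path
    const void* model_dft = nullptr;   // stands in for model_dft
};

enum class InitResult { Disabled, Ok, NullDraftCtx };

// Layer 2 switch: the disable site optionally clears the orphan state.
void disable_for_mmproj(SpecParams& p, bool clear_orphans) {
    p.type = SpecType::NONE;
    if (clear_orphans) {
        p.draft_path.clear();
        p.model_dft = nullptr;
    }
}

// Layer 1 switch: init optionally bails out early when type == NONE.
InitResult spec_init(const SpecParams& p, bool early_bail) {
    if (early_bail && p.type == SpecType::NONE) {
        return InitResult::Disabled;
    }
    const bool has_draft = !p.draft_path.empty();
    const bool has_mtp   = (p.type == SpecType::MTP);
    if (has_draft && !has_mtp && p.model_dft == nullptr) {
        return InitResult::NullDraftCtx; // the crash path in the real code
    }
    if (!has_draft && !has_mtp) {
        return InitResult::Disabled;     // assumption: nothing to initialize
    }
    return InitResult::Ok;
}
```

With both switches off the model reaches NullDraftCtx; turning on either one yields Disabled, and MTP-only parameters that never pass through the disable site still return Ok.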

Verification

| Test | Command line | Result |
|---|---|---|
| MTP + mmproj launch (the original SEGV path) | `--mtp-head + --mmproj` | ✅ Server READY 18s, slots log "speculative decoding context not initialized" 4/4, no crash, no coredump |
| Text-only request (mmproj loaded, MTP disabled) | same cmdline + text query | ✅ "Water sustains all life on Earth." dur 4.4s, finish=stop |
| Image+text request (mmproj loaded, MTP disabled) | same cmdline + image query | ✅ 100 image tokens generated, eval 2.34 tok/s, no crash |
| Regression: MTP-only (no mmproj) | `--mtp-head` (no mmproj) | ✅ slots log "speculative decoding context initialized" 4/4, "Photosynthesis is the process by which plants convert light energy..." dur 10.2s, finish=stop |

The regression test is critical: it confirms the patch does not break the existing MTP-only path. The "initialized" status per slot proves the MTP draft head is loaded and ready to draft when text-only requests come in.

Not a true coexistence patch

This fix prevents the crash but does NOT achieve simultaneous MTP+vision (per-request dispatch). It cleanly disables MTP whenever mmproj is loaded, matching the WARN message's intent. True coexistence would require:

- Per-batch detection of image tokens
- Conditional MTP draft-head invocation (text-only batches → MTP speedup; image-containing batches → standard decode)
- Possibly distinct slot configuration per request type

That work is significantly more invasive and is documented as a follow-up.
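For orientation only, the per-batch dispatch idea could look roughly like the sketch below. It is not part of this PR and does not reflect how the server tracks multimodal chunks: Token, DecodePath, batch_has_image_tokens and choose_decode_path are all hypothetical names invented for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical token record; is_image marks tokens produced by the projector.
struct Token {
    int  id;
    bool is_image;
};

enum class DecodePath { MtpSpeculative, Standard };

// Per-batch detection of image tokens.
bool batch_has_image_tokens(const std::vector<Token>& batch) {
    return std::any_of(batch.begin(), batch.end(),
                       [](const Token& t) { return t.is_image; });
}

// Conditional draft-head invocation: text-only batches may use MTP,
// image-containing batches fall back to standard decode.
DecodePath choose_decode_path(const std::vector<Token>& batch, bool mtp_available) {
    if (!mtp_available || batch_has_image_tokens(batch)) {
        return DecodePath::Standard;
    }
    return DecodePath::MtpSpeculative;
}
```

A text-only batch with MTP available selects MtpSpeculative, while any batch containing an image token selects Standard. Distinct slot configuration per request type is left out of the sketch.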

Test plan

- Reproducer command in description triggers SEGV on master, succeeds on this branch.
- gdb backtrace from core matches the 10-step chain analysis.
- Regression: text-only MTP path still gets MTP speedup post-patch.
- No new behavior for non-MTP, non-mmproj launches (unchanged code paths).
- (Suggested) Add a regression test in tests/test-speculative-mtp.cpp exercising the mmproj-loaded path.

When llama-server is launched with both --mtp-head <assistant.gguf> (Gemma 4 MTP speculative decoding) and --mmproj <vision.gguf> (multimodal projector), the server consistently segfaults during slot initialization.

Crash signature (gdb backtrace from core):

#0 llama_context::n_batch()
#1 common_speculative_init(common_params_speculative&, llama_context*)
#2 server_context_impl::load_model
#3 main

Root cause chain:

1. tools/server/server-context.cpp:738 sets params.speculative.type = NONE when mmproj is loaded (the WARN "speculative decoding is not supported by multimodal, it will be disabled").
2. But params.speculative.mparams_dft.path (set via --mtp-head) is NOT cleared. Stale state.
3. common_speculative_init evaluates has_draft = !params.mparams_dft.path.empty() -> true and has_mtp = (params.type == MTP) -> false (overridden), and falls into the legacy DRAFT branch.
4. common_speculative_state_draft ctor at speculative.cpp:227 calls llama_batch_init(llama_n_batch(ctx_dft), ...) where ctx_dft is nullptr (because params.model_dft was zeroed for the MTP case). Null deref.

Two-layer defensive fix:

- common/speculative.cpp: common_speculative_init returns nullptr early when params.type == COMMON_SPECULATIVE_TYPE_NONE. Defensive against any caller that disables speculative but leaves orphan params.
- tools/server/server-context.cpp: at the mmproj-disable site, also clear params_base.speculative.mparams_dft.path and ...model_dft. Removes the orphan state at the source.

Either layer alone is sufficient; both together provide defense-in-depth.

Verified:

- Repro before patch: SEGV on launch.
- After patch: server boots cleanly, slots log "speculative decoding context not initialized", text-only and image+text requests both work, no crash.
- Regression: same binary without --mmproj still gets MTP speedup (slots log "speculative decoding context initialized" 4/4).

This patch prevents the crash but does NOT achieve concurrent MTP+vision operation (per-batch dispatch). It matches the WARN message intent (MTP cleanly disabled when mmproj is loaded). True coexistence is a separate scope (per-batch image-token detection + conditional draft head invocation).

github-actions Bot added examples server labels May 21, 2026

This was referenced May 21, 2026

Phase C.2 foundational APIs: server_tokens coexistence + common_speculative_reset #18

Open

Phase C.2 dispatch behavior: MTP+mmproj coexistence behind --allow-mtp-with-mmproj (5th first-in-world) #19

Open

WillowOneVision wants to merge 1 commit into AtomicBot-ai:feature/turboquant-kv-cache from WillowOneVision:cecil/fix-mtp-mmproj-segv
