server: fix checkpoints creation #22929
Conversation
Tested the following way:
Hi @jacekpoplawski, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
force-pushed from d878621 to ea9369c
Yes, that seems like a good direction. Have you tested that it works as expected?
This needs dedicated autoparser support for split-marker detection; currently, this assumes that all autoparser models use the ChatML markers (`<|im_start|>` / `<|im_end|>`). I'll try to submit the marker detection code ASAP.
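For illustration, a rough sketch of what per-model marker detection could look like; `chat_markers`, `detect_chat_markers`, and the Llama 3 heuristic are all hypothetical, not existing llama.cpp code:

```cpp
#include <string>

// Hypothetical per-model marker selection; names are illustrative only.
struct chat_markers {
    std::string msg_start;
    std::string msg_end;
};

static chat_markers detect_chat_markers(const std::string & chat_template) {
    // Default to ChatML when the template gives no better hint.
    chat_markers m = { "<|im_start|>", "<|im_end|>" };
    // Example heuristic (assumption): Llama-3-style templates use header/eot markers.
    if (chat_template.find("<|start_header_id|>") != std::string::npos) {
        m = { "<|start_header_id|>", "<|eot_id|>" };
    }
    return m;
}
```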
```cpp
const auto message_spans = json_value(data, "message_spans", json::array());
if (message_spans.is_array()) {
    int32_t last_user_pos = -1;
```
You can probably use 0 as the sentinel value here, since a checkpoint at pos 0 isn't useful. Should help clean up the other logic too.
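If it helps, a minimal sketch of that simplification, written as a drop-in continuation of the snippet above; the span field names (`role`, `start`) are my guesses, not necessarily the PR's actual schema:

```cpp
// 0 as the sentinel: a checkpoint at position 0 is never useful, so
// "no user message found" and "nothing to do" collapse into one check.
int32_t last_user_pos = 0;
for (const auto & span : message_spans) {
    if (json_value(span, "role", std::string()) == "user") {
        last_user_pos = json_value(span, "start", 0);
    }
}
if (last_user_pos > 0 && (size_t) last_user_pos <= prompt.size()) {
    // create the checkpoint right before the latest user message
}
```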
```cpp
if ((size_t) last_user_pos <= prompt.size()) {
    const std::string prefix = prompt.substr(0, (size_t) last_user_pos);
    const auto prefix_tokens = common_tokenize(vocab, prefix, true, true);
```
Just a guess, but this will probably create incorrect checkpoints for multimodal models with at least one image in the prompt.
Yes, you are right, this breaks after the first image.
It works stably for my use case: Pi, Qwen 3.6 27B, 200k ctx, 24 checkpoints. With 8 checkpoints I was able to reproduce it.

As @aldehir pointed out, this does not work correctly with multimodal prompts. I committed a fallback to the old mechanism for that case. Should I add a switch to enable this new mechanism as an option, or should I try to support multimodal prompts as well?

I understand that the impact of this change is significant, but the benefits are also significant: agentic coding is much more responsive now.
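For reference, the fallback could look roughly like this; the chunk types and names below are illustrative, not the actual PR code:

```cpp
#include <vector>

// Illustrative types; the real server has its own prompt/chunk representation.
enum prompt_chunk_type { PROMPT_CHUNK_TEXT, PROMPT_CHUNK_IMAGE };
struct prompt_chunk { prompt_chunk_type type; };

// Use span-based checkpoints only for text-only prompts; any image chunk
// invalidates the character-offset math, so fall back to the old mechanism.
static bool can_use_span_checkpoints(const std::vector<prompt_chunk> & chunks) {
    for (const auto & c : chunks) {
        if (c.type != PROMPT_CHUNK_TEXT) {
            return false;
        }
    }
    return true;
}
```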
Overview
Implemented as requested in #22826 (comment)
Uses chat message spans to create context checkpoints at conversation boundaries, right before the latest user input.
This also adds message span extraction for ChatML-style prompts using `<|im_start|>` ... `<|im_end|>`.
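A minimal sketch of what the span extraction could look like, assuming spans are (start, end) character offsets over the formatted prompt; the PR's actual implementation may differ in detail:

```cpp
#include <string>
#include <vector>

struct msg_span {
    size_t start;
    size_t end;
};

// Scan the prompt for ChatML message boundaries.
static std::vector<msg_span> extract_chatml_spans(const std::string & prompt) {
    static const std::string MSG_START = "<|im_start|>";
    static const std::string MSG_END   = "<|im_end|>";

    std::vector<msg_span> spans;
    size_t pos = 0;
    while ((pos = prompt.find(MSG_START, pos)) != std::string::npos) {
        const size_t end = prompt.find(MSG_END, pos + MSG_START.size());
        if (end == std::string::npos) {
            break; // unterminated message, e.g. the turn being generated
        }
        spans.push_back({ pos, end + MSG_END.size() });
        pos = end + MSG_END.size();
    }
    return spans;
}
```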
Additional information
This is another chapter in my journey toward fixing the server forcing full prompt re-processing due to lack of cache data.
Tested with Pi, GPT-OSS-20B, Qwen3.6 27B and Gemma 4 31B.
Requirements