
server: fix checkpoints creation#22929

Open
jacekpoplawski wants to merge 10 commits into ggml-org:master from jacekpoplawski:fix-checkpoints-creation

Conversation

Contributor

@jacekpoplawski jacekpoplawski commented May 11, 2026

Overview

Implemented as requested in #22826 (comment)

Uses chat message spans to create context checkpoints at conversation boundaries, right before the latest user input.
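The boundary-checkpoint decision visible in the logs below ("checkpoint before user input reached" / "skip checkpoint at X, expected boundary before user input = Y" / "8192 tokens since last checkpoint") can be sketched roughly as follows. This is a hypothetical Python illustration, not the actual llama.cpp C++ code; the function name, parameters, and the 8192 interval are assumptions inferred from the log output.

```python
# Hypothetical sketch of the checkpoint decision inferred from the logs.
# All names are illustrative; the real implementation lives in the server's
# C++ slot-update loop.
CHECKPOINT_INTERVAL = 8192  # assumed from "8192 tokens since last checkpoint"


def should_checkpoint(prompt_n_tokens: int,
                      last_user_boundary: int,
                      last_checkpoint_pos: int) -> bool:
    # 1) Boundary checkpoint: the processed prefix ends exactly at the
    #    conversation boundary right before the latest user input.
    if prompt_n_tokens == last_user_boundary:
        return True
    # 2) Periodic checkpoint while processing a long prompt, so a partial
    #    prefix can still be restored if the request is interrupted.
    if prompt_n_tokens - last_checkpoint_pos >= CHECKPOINT_INTERVAL:
        return True
    # Otherwise skip, as in "skip checkpoint at 3046, expected boundary
    # before user input = 3549".
    return False
```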

This also adds message span extraction for ChatML-style prompts using <|im_start|>...<|im_end|>.
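The span extraction can be pictured with a small sketch. This is a hedged, self-contained Python illustration of the idea (scanning for `<|im_start|>role ... <|im_end|>` pairs and taking the start of the latest user message as the boundary); the function names are hypothetical, and the real implementation works on bytes/tokens rather than Python string offsets.

```python
import re

# Illustrative only: match ChatML messages of the form
# <|im_start|>role\n...content...<|im_end|>
CHATML_RE = re.compile(r"<\|im_start\|>(\w+)\n(.*?)<\|im_end\|>", re.DOTALL)


def message_spans(prompt: str):
    """Return (role, start, end) offsets for each ChatML message."""
    return [(m.group(1), m.start(), m.end()) for m in CHATML_RE.finditer(prompt)]


def last_user_boundary(prompt: str):
    """Offset of the start of the latest user message, or None.

    Corresponds to the "last user boundary: byte_pos=..." log lines,
    modulo str-offset vs byte-offset differences.
    """
    user_spans = [s for s in message_spans(prompt) if s[0] == "user"]
    return user_spans[-1][1] if user_spans else None
```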

Additional information

This is another chapter in my journey toward eliminating the forced full prompt re-processing caused by missing cache data.

Tested with Pi, GPT-OSS-20B, Qwen3.6 27B and Gemma 4 31B

Requirements

@jacekpoplawski jacekpoplawski requested review from a team and pwilkin as code owners May 11, 2026 01:31
@github-actions github-actions Bot added testing Everything test related examples server labels May 11, 2026
@jacekpoplawski
Contributor Author

Tested the following way:

CUDA_VISIBLE_DEVICES=0,1,2 ./bin/llama-server -m /mnt/models2/Qwen/3.6/Qwen3.6-27B-UD-Q8_K_XL.gguf --host 0.0.0.0 --ctx-checkpoints 8 -b 8192 --spec-type ngram-mod --parallel 1 --top-p 0.95 --top-k 20 --min-p 0 --presence-penalty 0 --repeat-penalty 1.0

Details
main: starting the main loop...
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-native
srv  params_from_: message_spans: last user boundary: byte_pos=14764, token_pos=3549
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
srv  get_availabl: updating prompt cache
srv          load:  - looking for better prompt, base f_keep = -1.000, sim = 0.000
srv        update:  - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 262144 tokens, 8589934592 est)
srv  get_availabl: prompt cache update took 0.01 ms
reasoning-budget: activated, budget=2147483647 tokens
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  0 | task 0 | processing task, is_child = 0
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 3562
slot update_slots: id  0 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 3046, batch.n_tokens = 3046, progress = 0.855138
slot update_slots: id  0 | task 0 | n_tokens = 3046, memory_seq_rm [3046, end)
slot update_slots: id  0 | task 0 | checkpoint before user input reached: ending prompt batch at prompt_n_tokens = 3549
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 3549, batch.n_tokens = 503, progress = 0.996350
slot update_slots: id  0 | task 0 | skip checkpoint at 3046, expected boundary before user input = 3549
slot update_slots: id  0 | task 0 | n_tokens = 3549, memory_seq_rm [3549, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 3558, batch.n_tokens = 9, progress = 0.998877
slot create_check: id  0 | task 0 | created context checkpoint 1 of 8 (pos_min = 3548, pos_max = 3548, n_tokens = 3549, size = 149.626 MiB)
slot update_slots: id  0 | task 0 | n_tokens = 3558, memory_seq_rm [3558, end)
slot init_sampler: id  0 | task 0 | init sampler, took 0.51 ms, tokens: text = 3562, total = 3562
slot update_slots: id  0 | task 0 | prompt processing done, n_tokens = 3562, batch.n_tokens = 4
slot update_slots: id  0 | task 0 | skip checkpoint at 3558, expected boundary before user input = 3549
begin: ngram_mod occupancy = 3517/4194304 (0.00)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.0.196 200
reasoning-budget: deactivated (natural end)
~llama_io_write_device: allocated 'CUDA0' buffer 52.992 MiB
~llama_io_write_device: allocated 'CUDA1' buffer 49.875 MiB
~llama_io_write_device: allocated 'CUDA2' buffer 46.758 MiB
slot print_timing: id  0 | task 0 |
prompt eval time =    2163.37 ms /  3562 tokens (    0.61 ms per token,  1646.51 tokens per second)
       eval time =    3204.93 ms /    73 tokens (   43.90 ms per token,    22.78 tokens per second)
      total time =    5368.29 ms /  3635 tokens
draft acceptance rate = 0.01562 (    1 accepted /    64 generated)
statistics ngram_mod: #calls(b,g,a) = 1 71 1, #gen drafts = 1, #acc drafts = 1, #gen tokens = 64, #acc tokens = 1, dur(b,g,a) = 0.516, 0.120, 0.001 ms
slot      release: id  0 | task 0 | stop processing: n_tokens = 3634, truncated = 0
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-native
srv  params_from_: message_spans: last user boundary: byte_pos=15117, token_pos=3636
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.890 (> 0.100 thold), f_keep = 1.000
reasoning-budget: activated, budget=2147483647 tokens
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  0 | task 77 | processing task, is_child = 0
slot update_slots: id  0 | task 77 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 4085
slot update_slots: id  0 | task 77 | n_tokens = 3634, memory_seq_rm [3634, end)
slot update_slots: id  0 | task 77 | checkpoint before user input reached: ending prompt batch at prompt_n_tokens = 3636
slot update_slots: id  0 | task 77 | prompt processing progress, n_tokens = 3636, batch.n_tokens = 2, progress = 0.890086
slot update_slots: id  0 | task 77 | skip checkpoint at 3634, expected boundary before user input = 3636
slot update_slots: id  0 | task 77 | n_tokens = 3636, memory_seq_rm [3636, end)
slot update_slots: id  0 | task 77 | prompt processing progress, n_tokens = 4081, batch.n_tokens = 445, progress = 0.999021
slot create_check: id  0 | task 77 | created context checkpoint 2 of 8 (pos_min = 3635, pos_max = 3635, n_tokens = 3636, size = 149.626 MiB)
slot update_slots: id  0 | task 77 | n_tokens = 4081, memory_seq_rm [4081, end)
slot init_sampler: id  0 | task 77 | init sampler, took 0.55 ms, tokens: text = 4085, total = 4085
slot update_slots: id  0 | task 77 | prompt processing done, n_tokens = 4085, batch.n_tokens = 4
slot update_slots: id  0 | task 77 | skip checkpoint at 4081, expected boundary before user input = 3636
begin: ngram_mod occupancy = 4028/4194304 (0.00)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.0.196 200
reasoning-budget: deactivated (natural end)
slot print_timing: id  0 | task 77 |
prompt eval time =     495.71 ms /   451 tokens (    1.10 ms per token,   909.80 tokens per second)
       eval time =    3886.85 ms /    92 tokens (   42.25 ms per token,    23.67 tokens per second)
      total time =    4382.56 ms /   543 tokens
statistics ngram_mod: #calls(b,g,a) = 2 162 1, #gen drafts = 1, #acc drafts = 1, #gen tokens = 64, #acc tokens = 1, dur(b,g,a) = 1.079, 0.248, 0.001 ms
slot      release: id  0 | task 77 | stop processing: n_tokens = 4176, truncated = 0
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-native
srv  params_from_: message_spans: last user boundary: byte_pos=17241, token_pos=4178
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.248 (> 0.100 thold), f_keep = 1.000
reasoning-budget: activated, budget=2147483647 tokens
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  0 | task 172 | processing task, is_child = 0
slot update_slots: id  0 | task 172 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 16844
slot update_slots: id  0 | task 172 | n_tokens = 4176, memory_seq_rm [4176, end)
slot update_slots: id  0 | task 172 | checkpoint before user input reached: ending prompt batch at prompt_n_tokens = 4178
slot update_slots: id  0 | task 172 | prompt processing progress, n_tokens = 4178, batch.n_tokens = 2, progress = 0.248041
slot update_slots: id  0 | task 172 | n_tokens = 4178, memory_seq_rm [4178, end)
slot update_slots: id  0 | task 172 | prompt processing progress, n_tokens = 12370, batch.n_tokens = 8192, progress = 0.734386
slot update_slots: id  0 | task 172 | n_tokens = 12370, memory_seq_rm [12370, end)
slot update_slots: id  0 | task 172 | 8192 tokens since last checkpoint at 3636, creating new checkpoint during processing at position 16328
slot update_slots: id  0 | task 172 | prompt processing progress, n_tokens = 16328, batch.n_tokens = 3958, progress = 0.969366
slot update_slots: id  0 | task 172 | skip checkpoint at 12370, expected boundary before user input = 4178
slot update_slots: id  0 | task 172 | n_tokens = 16328, memory_seq_rm [16328, end)
slot update_slots: id  0 | task 172 | prompt processing progress, n_tokens = 16840, batch.n_tokens = 512, progress = 0.999763
slot update_slots: id  0 | task 172 | skip checkpoint at 16328, expected boundary before user input = 4178
slot update_slots: id  0 | task 172 | n_tokens = 16840, memory_seq_rm [16840, end)
slot init_sampler: id  0 | task 172 | init sampler, took 2.74 ms, tokens: text = 16844, total = 16844
slot update_slots: id  0 | task 172 | prompt processing done, n_tokens = 16844, batch.n_tokens = 4
slot update_slots: id  0 | task 172 | skip checkpoint at 16840, expected boundary before user input = 4178
begin: ngram_mod occupancy = 14423/4194304 (0.00)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.0.196 200
reasoning-budget: deactivated (natural end)
slot print_timing: id  0 | task 172 |
prompt eval time =    7495.20 ms / 12668 tokens (    0.59 ms per token,  1690.15 tokens per second)
       eval time =    3852.14 ms /    89 tokens (   43.28 ms per token,    23.10 tokens per second)
      total time =   11347.35 ms / 12757 tokens
statistics ngram_mod: #calls(b,g,a) = 3 250 1, #gen drafts = 1, #acc drafts = 1, #gen tokens = 64, #acc tokens = 1, dur(b,g,a) = 2.975, 0.394, 0.001 ms
slot      release: id  0 | task 172 | stop processing: n_tokens = 16932, truncated = 0
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-native
srv  params_from_: message_spans: last user boundary: byte_pos=57632, token_pos=16863
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.211 (> 0.100 thold), f_keep = 0.210
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 16932, total state size = 1208.199 MiB
srv          load:  - looking for better prompt, base f_keep = 0.210, sim = 0.211
srv        update:  - cache state: 1 prompts, 1507.452 MiB (limits: 8192.000 MiB, 262144 tokens, 262144 est)
srv        update:    - prompt 0x728630022c00:   16932 tokens, checkpoints:  2,  1507.452 MiB
srv  get_availabl: prompt cache update took 1136.85 ms
reasoning-budget: activated, budget=2147483647 tokens
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  0 | task 266 | processing task, is_child = 0
slot update_slots: id  0 | task 266 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 16875
slot update_slots: id  0 | task 266 | n_past = 3560, slot.prompt.tokens.size() = 16932, seq_id = 0, pos_min = 16931, n_swa = 0
slot update_slots: id  0 | task 266 | Checking checkpoint with [3635, 3635] against 3560...
slot update_slots: id  0 | task 266 | Checking checkpoint with [3548, 3548] against 3560...
slot update_slots: id  0 | task 266 | restored context checkpoint (pos_min = 3548, pos_max = 3548, n_tokens = 3549, n_past = 3549, size = 149.626 MiB)
slot update_slots: id  0 | task 266 | erased invalidated context checkpoint (pos_min = 3635, pos_max = 3635, n_tokens = 3636, n_swa = 0, pos_next = 3549, size = 149.626 MiB)
slot update_slots: id  0 | task 266 | n_tokens = 3549, memory_seq_rm [3549, end)
slot update_slots: id  0 | task 266 | prompt processing progress, n_tokens = 11741, batch.n_tokens = 8192, progress = 0.695763
slot update_slots: id  0 | task 266 | n_tokens = 11741, memory_seq_rm [11741, end)
slot update_slots: id  0 | task 266 | 8192 tokens since last checkpoint at 3549, creating new checkpoint during processing at position 16359
slot update_slots: id  0 | task 266 | prompt processing progress, n_tokens = 16359, batch.n_tokens = 4618, progress = 0.969422
slot update_slots: id  0 | task 266 | skip checkpoint at 11741, expected boundary before user input = 16863
slot update_slots: id  0 | task 266 | n_tokens = 16359, memory_seq_rm [16359, end)
slot update_slots: id  0 | task 266 | checkpoint before user input reached: ending prompt batch at prompt_n_tokens = 16863
slot update_slots: id  0 | task 266 | prompt processing progress, n_tokens = 16863, batch.n_tokens = 504, progress = 0.999289
slot update_slots: id  0 | task 266 | skip checkpoint at 16359, expected boundary before user input = 16863
slot update_slots: id  0 | task 266 | n_tokens = 16863, memory_seq_rm [16863, end)
slot update_slots: id  0 | task 266 | prompt processing progress, n_tokens = 16871, batch.n_tokens = 8, progress = 0.999763
slot create_check: id  0 | task 266 | created context checkpoint 2 of 8 (pos_min = 16862, pos_max = 16862, n_tokens = 16863, size = 149.626 MiB)
slot update_slots: id  0 | task 266 | n_tokens = 16871, memory_seq_rm [16871, end)
slot init_sampler: id  0 | task 266 | init sampler, took 2.89 ms, tokens: text = 16875, total = 16875
slot update_slots: id  0 | task 266 | prompt processing done, n_tokens = 16875, batch.n_tokens = 4
slot update_slots: id  0 | task 266 | skip checkpoint at 16871, expected boundary before user input = 16863
begin: ngram_mod occupancy = 14593/4194304 (0.00)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.0.196 200
reasoning-budget: deactivated (natural end)
slot print_timing: id  0 | task 266 |
prompt eval time =    8060.61 ms / 13326 tokens (    0.60 ms per token,  1653.22 tokens per second)
       eval time =    2922.18 ms /    71 tokens (   41.16 ms per token,    24.30 tokens per second)
      total time =   10982.80 ms / 13397 tokens
draft acceptance rate = 0.10938 (    7 accepted /    64 generated)
statistics ngram_mod: #calls(b,g,a) = 4 313 2, #gen drafts = 2, #acc drafts = 2, #gen tokens = 128, #acc tokens = 8, dur(b,g,a) = 4.865, 0.524, 0.002 ms
slot      release: id  0 | task 266 | stop processing: n_tokens = 16945, truncated = 0
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-native
srv  params_from_: message_spans: last user boundary: byte_pos=57981, token_pos=16947
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.570 (> 0.100 thold), f_keep = 1.000
reasoning-budget: activated, budget=2147483647 tokens
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  0 | task 336 | processing task, is_child = 0
slot update_slots: id  0 | task 336 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 29730
slot update_slots: id  0 | task 336 | n_tokens = 16945, memory_seq_rm [16945, end)
slot update_slots: id  0 | task 336 | checkpoint before user input reached: ending prompt batch at prompt_n_tokens = 16947
slot update_slots: id  0 | task 336 | prompt processing progress, n_tokens = 16947, batch.n_tokens = 2, progress = 0.570030
slot update_slots: id  0 | task 336 | n_tokens = 16947, memory_seq_rm [16947, end)
slot update_slots: id  0 | task 336 | prompt processing progress, n_tokens = 25139, batch.n_tokens = 8192, progress = 0.845577
slot update_slots: id  0 | task 336 | n_tokens = 25139, memory_seq_rm [25139, end)
slot update_slots: id  0 | task 336 | 8192 tokens since last checkpoint at 16863, creating new checkpoint during processing at position 29214
slot update_slots: id  0 | task 336 | prompt processing progress, n_tokens = 29214, batch.n_tokens = 4075, progress = 0.982644
slot update_slots: id  0 | task 336 | skip checkpoint at 25139, expected boundary before user input = 16947
slot update_slots: id  0 | task 336 | n_tokens = 29214, memory_seq_rm [29214, end)
slot update_slots: id  0 | task 336 | prompt processing progress, n_tokens = 29726, batch.n_tokens = 512, progress = 0.999865
slot update_slots: id  0 | task 336 | skip checkpoint at 29214, expected boundary before user input = 16947
slot update_slots: id  0 | task 336 | n_tokens = 29726, memory_seq_rm [29726, end)
slot init_sampler: id  0 | task 336 | init sampler, took 4.74 ms, tokens: text = 29730, total = 29730
slot update_slots: id  0 | task 336 | prompt processing done, n_tokens = 29730, batch.n_tokens = 4
slot update_slots: id  0 | task 336 | skip checkpoint at 29726, expected boundary before user input = 16947
begin: ngram_mod occupancy = 26770/4194304 (0.01)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.0.196 200
reasoning-budget: deactivated (natural end)
slot print_timing: id  0 | task 336 |
prompt eval time =    9569.65 ms / 12785 tokens (    0.75 ms per token,  1336.00 tokens per second)
       eval time =   20765.44 ms /   454 tokens (   45.74 ms per token,    21.86 tokens per second)
      total time =   30335.09 ms / 13239 tokens
statistics ngram_mod: #calls(b,g,a) = 5 766 2, #gen drafts = 2, #acc drafts = 2, #gen tokens = 128, #acc tokens = 8, dur(b,g,a) = 8.193, 1.324, 0.002 ms
slot      release: id  0 | task 336 | stop processing: n_tokens = 30183, truncated = 0
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-native
srv  params_from_: message_spans: last user boundary: byte_pos=109526, token_pos=30104
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.560 (> 0.100 thold), f_keep = 0.559
reasoning-budget: activated, budget=2147483647 tokens
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  0 | task 795 | processing task, is_child = 0
slot update_slots: id  0 | task 795 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 30119
slot update_slots: id  0 | task 795 | n_past = 16873, slot.prompt.tokens.size() = 30183, seq_id = 0, pos_min = 30182, n_swa = 0
slot update_slots: id  0 | task 795 | Checking checkpoint with [16862, 16862] against 16873...
slot update_slots: id  0 | task 795 | restored context checkpoint (pos_min = 16862, pos_max = 16862, n_tokens = 16863, n_past = 16863, size = 149.626 MiB)
slot update_slots: id  0 | task 795 | n_tokens = 16863, memory_seq_rm [16863, end)
slot update_slots: id  0 | task 795 | prompt processing progress, n_tokens = 25055, batch.n_tokens = 8192, progress = 0.831867
slot update_slots: id  0 | task 795 | n_tokens = 25055, memory_seq_rm [25055, end)
slot update_slots: id  0 | task 795 | 8192 tokens since last checkpoint at 16863, creating new checkpoint during processing at position 29603
slot update_slots: id  0 | task 795 | prompt processing progress, n_tokens = 29603, batch.n_tokens = 4548, progress = 0.982868
slot update_slots: id  0 | task 795 | skip checkpoint at 25055, expected boundary before user input = 30104
slot update_slots: id  0 | task 795 | n_tokens = 29603, memory_seq_rm [29603, end)
slot update_slots: id  0 | task 795 | checkpoint before user input reached: ending prompt batch at prompt_n_tokens = 30104
slot update_slots: id  0 | task 795 | prompt processing progress, n_tokens = 30104, batch.n_tokens = 501, progress = 0.999502
slot update_slots: id  0 | task 795 | skip checkpoint at 29603, expected boundary before user input = 30104
slot update_slots: id  0 | task 795 | n_tokens = 30104, memory_seq_rm [30104, end)
slot update_slots: id  0 | task 795 | prompt processing progress, n_tokens = 30115, batch.n_tokens = 11, progress = 0.999867
slot create_check: id  0 | task 795 | created context checkpoint 3 of 8 (pos_min = 30103, pos_max = 30103, n_tokens = 30104, size = 149.626 MiB)
slot update_slots: id  0 | task 795 | n_tokens = 30115, memory_seq_rm [30115, end)
slot init_sampler: id  0 | task 795 | init sampler, took 5.31 ms, tokens: text = 30119, total = 30119
slot update_slots: id  0 | task 795 | prompt processing done, n_tokens = 30119, batch.n_tokens = 4
slot update_slots: id  0 | task 795 | skip checkpoint at 30115, expected boundary before user input = 30104
begin: ngram_mod occupancy = 27278/4194304 (0.01)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.0.196 200
reasoning-budget: deactivated (natural end)
slot print_timing: id  0 | task 795 |
prompt eval time =    8979.40 ms / 13256 tokens (    0.68 ms per token,  1476.27 tokens per second)
       eval time =    3708.26 ms /    88 tokens (   42.14 ms per token,    23.73 tokens per second)
      total time =   12687.65 ms / 13344 tokens
draft acceptance rate = 0.12500 (    8 accepted /    64 generated)
statistics ngram_mod: #calls(b,g,a) = 6 845 3, #gen drafts = 3, #acc drafts = 3, #gen tokens = 192, #acc tokens = 16, dur(b,g,a) = 11.561, 4.845, 0.416 ms
slot      release: id  0 | task 795 | stop processing: n_tokens = 30206, truncated = 0
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-native
srv  params_from_: message_spans: last user boundary: byte_pos=109945, token_pos=30208
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.972 (> 0.100 thold), f_keep = 1.000
reasoning-budget: activated, budget=2147483647 tokens
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  0 | task 881 | processing task, is_child = 0
slot update_slots: id  0 | task 881 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 31092
slot update_slots: id  0 | task 881 | n_tokens = 30206, memory_seq_rm [30206, end)
slot update_slots: id  0 | task 881 | checkpoint before user input reached: ending prompt batch at prompt_n_tokens = 30208
slot update_slots: id  0 | task 881 | prompt processing progress, n_tokens = 30208, batch.n_tokens = 2, progress = 0.971568
slot update_slots: id  0 | task 881 | n_tokens = 30208, memory_seq_rm [30208, end)
slot update_slots: id  0 | task 881 | prompt processing progress, n_tokens = 30576, batch.n_tokens = 368, progress = 0.983404
slot update_slots: id  0 | task 881 | n_tokens = 30576, memory_seq_rm [30576, end)
slot update_slots: id  0 | task 881 | prompt processing progress, n_tokens = 31088, batch.n_tokens = 512, progress = 0.999871
slot update_slots: id  0 | task 881 | skip checkpoint at 30576, expected boundary before user input = 30208
slot update_slots: id  0 | task 881 | n_tokens = 31088, memory_seq_rm [31088, end)
slot init_sampler: id  0 | task 881 | init sampler, took 4.97 ms, tokens: text = 31092, total = 31092
slot update_slots: id  0 | task 881 | prompt processing done, n_tokens = 31092, batch.n_tokens = 4
slot update_slots: id  0 | task 881 | skip checkpoint at 31088, expected boundary before user input = 30208
begin: ngram_mod occupancy = 27973/4194304 (0.01)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.0.196 200
reasoning-budget: deactivated (natural end)
slot print_timing: id  0 | task 881 |
prompt eval time =     830.98 ms /   886 tokens (    0.94 ms per token,  1066.21 tokens per second)
       eval time =   36471.01 ms /   795 tokens (   45.88 ms per token,    21.80 tokens per second)
      total time =   37301.99 ms /  1681 tokens
statistics ngram_mod: #calls(b,g,a) = 7 1639 3, #gen drafts = 3, #acc drafts = 3, #gen tokens = 192, #acc tokens = 16, dur(b,g,a) = 15.050, 6.134, 0.416 ms
slot      release: id  0 | task 881 | stop processing: n_tokens = 31886, truncated = 0
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-native
srv  params_from_: message_spans: last user boundary: byte_pos=115431, token_pos=31825
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.946 (> 0.100 thold), f_keep = 0.945
reasoning-budget: activated, budget=2147483647 tokens
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  0 | task 1680 | processing task, is_child = 0
slot update_slots: id  0 | task 1680 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 31844
slot update_slots: id  0 | task 1680 | n_past = 30117, slot.prompt.tokens.size() = 31886, seq_id = 0, pos_min = 31885, n_swa = 0
slot update_slots: id  0 | task 1680 | Checking checkpoint with [30103, 30103] against 30117...
slot update_slots: id  0 | task 1680 | restored context checkpoint (pos_min = 30103, pos_max = 30103, n_tokens = 30104, n_past = 30104, size = 149.626 MiB)
slot update_slots: id  0 | task 1680 | n_tokens = 30104, memory_seq_rm [30104, end)
slot update_slots: id  0 | task 1680 | prompt processing progress, n_tokens = 31328, batch.n_tokens = 1224, progress = 0.983796
slot update_slots: id  0 | task 1680 | n_tokens = 31328, memory_seq_rm [31328, end)
slot update_slots: id  0 | task 1680 | checkpoint before user input reached: ending prompt batch at prompt_n_tokens = 31825
slot update_slots: id  0 | task 1680 | prompt processing progress, n_tokens = 31825, batch.n_tokens = 497, progress = 0.999403
slot update_slots: id  0 | task 1680 | skip checkpoint at 31328, expected boundary before user input = 31825
slot update_slots: id  0 | task 1680 | n_tokens = 31825, memory_seq_rm [31825, end)
slot update_slots: id  0 | task 1680 | prompt processing progress, n_tokens = 31840, batch.n_tokens = 15, progress = 0.999874
slot create_check: id  0 | task 1680 | created context checkpoint 4 of 8 (pos_min = 31824, pos_max = 31824, n_tokens = 31825, size = 149.626 MiB)
slot update_slots: id  0 | task 1680 | n_tokens = 31840, memory_seq_rm [31840, end)
slot init_sampler: id  0 | task 1680 | init sampler, took 5.29 ms, tokens: text = 31844, total = 31844
slot update_slots: id  0 | task 1680 | prompt processing done, n_tokens = 31844, batch.n_tokens = 4
slot update_slots: id  0 | task 1680 | skip checkpoint at 31840, expected boundary before user input = 31825
begin: ngram_mod occupancy = 28823/4194304 (0.01)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.0.196 200
reasoning-budget: deactivated (natural end)
slot print_timing: id  0 | task 1680 |
prompt eval time =    1590.88 ms /  1740 tokens (    0.91 ms per token,  1093.73 tokens per second)
       eval time =   13703.82 ms /   303 tokens (   45.23 ms per token,    22.11 tokens per second)
      total time =   15294.70 ms /  2043 tokens
draft acceptance rate = 0.01562 (    1 accepted /    64 generated)
statistics ngram_mod: #calls(b,g,a) = 8 1940 4, #gen drafts = 4, #acc drafts = 4, #gen tokens = 256, #acc tokens = 17, dur(b,g,a) = 18.595, 6.556, 0.417 ms
slot      release: id  0 | task 1680 | stop processing: n_tokens = 32146, truncated = 0
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-native
srv  params_from_: message_spans: last user boundary: byte_pos=116352, token_pos=32112
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.991 (> 0.100 thold), f_keep = 0.991
reasoning-budget: activated, budget=2147483647 tokens
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  0 | task 1987 | processing task, is_child = 0
slot update_slots: id  0 | task 1987 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 32124
slot update_slots: id  0 | task 1987 | n_past = 31842, slot.prompt.tokens.size() = 32146, seq_id = 0, pos_min = 32145, n_swa = 0
slot update_slots: id  0 | task 1987 | Checking checkpoint with [31824, 31824] against 31842...
slot update_slots: id  0 | task 1987 | restored context checkpoint (pos_min = 31824, pos_max = 31824, n_tokens = 31825, n_past = 31825, size = 149.626 MiB)
slot update_slots: id  0 | task 1987 | n_tokens = 31825, memory_seq_rm [31825, end)
slot update_slots: id  0 | task 1987 | checkpoint before user input reached: ending prompt batch at prompt_n_tokens = 32112
slot update_slots: id  0 | task 1987 | prompt processing progress, n_tokens = 32112, batch.n_tokens = 287, progress = 0.999626
slot update_slots: id  0 | task 1987 | skip checkpoint at 31825, expected boundary before user input = 32112
slot update_slots: id  0 | task 1987 | n_tokens = 32112, memory_seq_rm [32112, end)
slot update_slots: id  0 | task 1987 | prompt processing progress, n_tokens = 32120, batch.n_tokens = 8, progress = 0.999875
slot create_check: id  0 | task 1987 | created context checkpoint 5 of 8 (pos_min = 32111, pos_max = 32111, n_tokens = 32112, size = 149.626 MiB)
slot update_slots: id  0 | task 1987 | n_tokens = 32120, memory_seq_rm [32120, end)
slot init_sampler: id  0 | task 1987 | init sampler, took 5.32 ms, tokens: text = 32124, total = 32124
slot update_slots: id  0 | task 1987 | prompt processing done, n_tokens = 32124, batch.n_tokens = 4
slot update_slots: id  0 | task 1987 | skip checkpoint at 32120, expected boundary before user input = 32112
begin: ngram_mod occupancy = 29157/4194304 (0.01)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.0.196 200
reasoning-budget: deactivated (natural end)
slot print_timing: id  0 | task 1987 |
prompt eval time =     547.43 ms /   299 tokens (    1.83 ms per token,   546.18 tokens per second)
       eval time =    1478.92 ms /    34 tokens (   43.50 ms per token,    22.99 tokens per second)
      total time =    2026.36 ms /   333 tokens
statistics ngram_mod: #calls(b,g,a) = 9 1973 4, #gen drafts = 4, #acc drafts = 4, #gen tokens = 256, #acc tokens = 17, dur(b,g,a) = 22.176, 6.602, 0.417 ms
slot      release: id  0 | task 1987 | stop processing: n_tokens = 32157, truncated = 0
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-native
srv  params_from_: message_spans: last user boundary: byte_pos=116481, token_pos=32140
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.999 (> 0.100 thold), f_keep = 0.999
reasoning-budget: activated, budget=2147483647 tokens
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  0 | task 2024 | processing task, is_child = 0
slot update_slots: id  0 | task 2024 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 32154
slot update_slots: id  0 | task 2024 | n_past = 32122, slot.prompt.tokens.size() = 32157, seq_id = 0, pos_min = 32156, n_swa = 0
slot update_slots: id  0 | task 2024 | Checking checkpoint with [32111, 32111] against 32122...
slot update_slots: id  0 | task 2024 | restored context checkpoint (pos_min = 32111, pos_max = 32111, n_tokens = 32112, n_past = 32112, size = 149.626 MiB)
slot update_slots: id  0 | task 2024 | n_tokens = 32112, memory_seq_rm [32112, end)
slot update_slots: id  0 | task 2024 | checkpoint before user input reached: ending prompt batch at prompt_n_tokens = 32140
slot update_slots: id  0 | task 2024 | prompt processing progress, n_tokens = 32140, batch.n_tokens = 28, progress = 0.999565
slot update_slots: id  0 | task 2024 | skip checkpoint at 32112, expected boundary before user input = 32140
slot update_slots: id  0 | task 2024 | n_tokens = 32140, memory_seq_rm [32140, end)
slot update_slots: id  0 | task 2024 | prompt processing progress, n_tokens = 32150, batch.n_tokens = 10, progress = 0.999876
slot update_slots: id  0 | task 2024 | n_tokens = 32150, memory_seq_rm [32150, end)
slot init_sampler: id  0 | task 2024 | init sampler, took 5.37 ms, tokens: text = 32154, total = 32154
slot update_slots: id  0 | task 2024 | prompt processing done, n_tokens = 32154, batch.n_tokens = 4
slot update_slots: id  0 | task 2024 | skip checkpoint at 32150, expected boundary before user input = 32140
begin: ngram_mod occupancy = 29214/4194304 (0.01)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.0.196 200
reasoning-budget: deactivated (natural end)
slot print_timing: id  0 | task 2024 |
prompt eval time =     195.19 ms /    42 tokens (    4.65 ms per token,   215.18 tokens per second)
       eval time =    1266.99 ms /    29 tokens (   43.69 ms per token,    22.89 tokens per second)
      total time =    1462.18 ms /    71 tokens
statistics ngram_mod: #calls(b,g,a) = 10 2001 4, #gen drafts = 4, #acc drafts = 4, #gen tokens = 256, #acc tokens = 17, dur(b,g,a) = 25.750, 6.643, 0.417 ms
slot      release: id  0 | task 2024 | stop processing: n_tokens = 32182, truncated = 0
srv  update_slots: all slots are idle

@ggml-gh-bot

ggml-gh-bot Bot commented May 11, 2026

Hi @jacekpoplawski, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 3 open PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@jacekpoplawski jacekpoplawski force-pushed the fix-checkpoints-creation branch from d878621 to ea9369c Compare May 11, 2026 01:40
@jacekpoplawski jacekpoplawski changed the title Fix checkpoints creation server: fix checkpoints creation May 11, 2026
@jacekpoplawski
Contributor Author

CUDA_VISIBLE_DEVICES=0,1,2 ./bin/llama-server -m /mnt/models1/Google/gemma-4-31B-it-UD-Q8_K_XL.gguf --host 0.0.0.0 --ctx-checkpoints 8 -b 8192 --spec-type ngram-mod

Details
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-gemma4
srv  params_from_: message_spans: last user boundary: byte_pos=13864, token_pos=3447
slot get_availabl: id  3 | task -1 | selected slot by LRU, t_last = -1
srv  get_availabl: updating prompt cache
srv          load:  - looking for better prompt, base f_keep = -1.000, sim = 0.000
srv        update:  - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 262144 tokens, 8589934592 est)
srv  get_availabl: prompt cache update took 0.01 ms
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  3 | task 0 | processing task, is_child = 0
slot update_slots: id  3 | task 0 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 3458
slot update_slots: id  3 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  3 | task 0 | prompt processing progress, n_tokens = 2942, batch.n_tokens = 2942, progress = 0.850781
slot update_slots: id  3 | task 0 | n_tokens = 2942, memory_seq_rm [2942, end)
slot update_slots: id  3 | task 0 | checkpoint before user input reached: ending prompt batch at prompt_n_tokens = 3447
slot update_slots: id  3 | task 0 | prompt processing progress, n_tokens = 3447, batch.n_tokens = 505, progress = 0.996819
slot update_slots: id  3 | task 0 | skip checkpoint at 2942, expected boundary before user input = 3447
slot update_slots: id  3 | task 0 | n_tokens = 3447, memory_seq_rm [3447, end)
slot update_slots: id  3 | task 0 | prompt processing progress, n_tokens = 3454, batch.n_tokens = 7, progress = 0.998843
slot create_check: id  3 | task 0 | created context checkpoint 1 of 8 (pos_min = 0, pos_max = 3446, n_tokens = 3447, size = 2693.009 MiB)
slot update_slots: id  3 | task 0 | n_tokens = 3454, memory_seq_rm [3454, end)
slot init_sampler: id  3 | task 0 | init sampler, took 0.45 ms, tokens: text = 3458, total = 3458
slot update_slots: id  3 | task 0 | prompt processing done, n_tokens = 3458, batch.n_tokens = 4
slot update_slots: id  3 | task 0 | skip checkpoint at 3454, expected boundary before user input = 3447
begin: ngram_mod occupancy = 3409/4194304 (0.00)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.0.196 200
reasoning-budget: activated, budget=2147483647 tokens
reasoning-budget: deactivated (natural end)
slot print_timing: id  3 | task 0 |
prompt eval time =    4093.67 ms /  3458 tokens (    1.18 ms per token,   844.72 tokens per second)
       eval time =    3406.95 ms /    73 tokens (   46.67 ms per token,    21.43 tokens per second)
      total time =    7500.62 ms /  3531 tokens
draft acceptance rate = 0.06250 (    4 accepted /    64 generated)
statistics ngram_mod: #calls(b,g,a) = 1 68 1, #gen drafts = 1, #acc drafts = 1, #gen tokens = 64, #acc tokens = 4, dur(b,g,a) = 0.493, 0.098, 0.001 ms
slot      release: id  3 | task 0 | stop processing: n_tokens = 3530, truncated = 0
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-gemma4
srv  params_from_: message_spans: last user boundary: byte_pos=13864, token_pos=3447
slot get_availabl: id  3 | task -1 | selected slot by LCP similarity, sim_best = 0.874 (> 0.100 thold), f_keep = 0.989
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  3 | task 73 | processing task, is_child = 0
slot update_slots: id  3 | task 73 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 3993
slot update_slots: id  3 | task 73 | n_tokens = 3490, memory_seq_rm [3490, end)
slot update_slots: id  3 | task 73 | prompt processing progress, n_tokens = 3989, batch.n_tokens = 499, progress = 0.998998
slot update_slots: id  3 | task 73 | skip checkpoint at 3490, expected boundary before user input = 3447
slot update_slots: id  3 | task 73 | n_tokens = 3989, memory_seq_rm [3989, end)
slot init_sampler: id  3 | task 73 | init sampler, took 0.49 ms, tokens: text = 3993, total = 3993
slot update_slots: id  3 | task 73 | prompt processing done, n_tokens = 3993, batch.n_tokens = 4
slot update_slots: id  3 | task 73 | skip checkpoint at 3989, expected boundary before user input = 3447
begin: ngram_mod occupancy = 3945/4194304 (0.00)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.0.196 200
reasoning-budget: activated, budget=2147483647 tokens
reasoning-budget: deactivated (natural end)
slot print_timing: id  3 | task 73 |
prompt eval time =     486.93 ms /   503 tokens (    0.97 ms per token,  1033.01 tokens per second)
       eval time =    2965.44 ms /    62 tokens (   47.83 ms per token,    20.91 tokens per second)
      total time =    3452.37 ms /   565 tokens
statistics ngram_mod: #calls(b,g,a) = 2 129 1, #gen drafts = 1, #acc drafts = 1, #gen tokens = 64, #acc tokens = 4, dur(b,g,a) = 1.050, 0.165, 0.001 ms
slot      release: id  3 | task 73 | stop processing: n_tokens = 4054, truncated = 0
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-gemma4
srv  params_from_: message_spans: last user boundary: byte_pos=13864, token_pos=3447
slot get_availabl: id  3 | task -1 | selected slot by LCP similarity, sim_best = 0.980 (> 0.100 thold), f_keep = 0.991
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  3 | task 137 | processing task, is_child = 0
slot update_slots: id  3 | task 137 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 4097
slot update_slots: id  3 | task 137 | n_tokens = 4017, memory_seq_rm [4017, end)
slot update_slots: id  3 | task 137 | prompt processing progress, n_tokens = 4093, batch.n_tokens = 76, progress = 0.999024
slot update_slots: id  3 | task 137 | skip checkpoint at 4017, expected boundary before user input = 3447
slot update_slots: id  3 | task 137 | n_tokens = 4093, memory_seq_rm [4093, end)
slot init_sampler: id  3 | task 137 | init sampler, took 0.50 ms, tokens: text = 4097, total = 4097
slot update_slots: id  3 | task 137 | prompt processing done, n_tokens = 4097, batch.n_tokens = 4
slot update_slots: id  3 | task 137 | skip checkpoint at 4093, expected boundary before user input = 3447
begin: ngram_mod occupancy = 4071/4194304 (0.00)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.0.196 200
reasoning-budget: activated, budget=2147483647 tokens
reasoning-budget: deactivated (natural end)
slot print_timing: id  3 | task 137 |
prompt eval time =     122.85 ms /    80 tokens (    1.54 ms per token,   651.21 tokens per second)
       eval time =    6455.73 ms /   132 tokens (   48.91 ms per token,    20.45 tokens per second)
      total time =    6578.58 ms /   212 tokens
draft acceptance rate = 0.01562 (    1 accepted /    64 generated)
statistics ngram_mod: #calls(b,g,a) = 3 259 2, #gen drafts = 2, #acc drafts = 2, #gen tokens = 128, #acc tokens = 5, dur(b,g,a) = 1.622, 0.334, 0.002 ms
slot      release: id  3 | task 137 | stop processing: n_tokens = 4228, truncated = 0
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-gemma4
srv  params_from_: message_spans: last user boundary: byte_pos=13864, token_pos=3447
slot get_availabl: id  3 | task -1 | selected slot by LCP similarity, sim_best = 0.216 (> 0.100 thold), f_keep = 0.991
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  3 | task 270 | processing task, is_child = 0
slot update_slots: id  3 | task 270 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 19367
slot update_slots: id  3 | task 270 | n_tokens = 4190, memory_seq_rm [4190, end)
slot update_slots: id  3 | task 270 | prompt processing progress, n_tokens = 12382, batch.n_tokens = 8192, progress = 0.639335
slot update_slots: id  3 | task 270 | n_tokens = 12382, memory_seq_rm [12382, end)
slot update_slots: id  3 | task 270 | 8192 tokens since last checkpoint at 3447, creating new checkpoint during processing at position 18851
slot update_slots: id  3 | task 270 | prompt processing progress, n_tokens = 18851, batch.n_tokens = 6469, progress = 0.973357
slot update_slots: id  3 | task 270 | skip checkpoint at 12382, expected boundary before user input = 3447
slot update_slots: id  3 | task 270 | n_tokens = 18851, memory_seq_rm [18851, end)
slot update_slots: id  3 | task 270 | prompt processing progress, n_tokens = 19363, batch.n_tokens = 512, progress = 0.999793
slot update_slots: id  3 | task 270 | skip checkpoint at 18851, expected boundary before user input = 3447
slot update_slots: id  3 | task 270 | n_tokens = 19363, memory_seq_rm [19363, end)
slot init_sampler: id  3 | task 270 | init sampler, took 2.81 ms, tokens: text = 19367, total = 19367
slot update_slots: id  3 | task 270 | prompt processing done, n_tokens = 19367, batch.n_tokens = 4
slot update_slots: id  3 | task 270 | skip checkpoint at 19363, expected boundary before user input = 3447
begin: ngram_mod occupancy = 15256/4194304 (0.00)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.0.196 200
reasoning-budget: activated, budget=2147483647 tokens
reasoning-budget: deactivated (natural end)
slot print_timing: id  3 | task 270 |
prompt eval time =   11658.13 ms / 15177 tokens (    0.77 ms per token,  1301.84 tokens per second)
       eval time =    3870.86 ms /    77 tokens (   50.27 ms per token,    19.89 tokens per second)
      total time =   15528.99 ms / 15254 tokens
statistics ngram_mod: #calls(b,g,a) = 4 335 2, #gen drafts = 2, #acc drafts = 2, #gen tokens = 128, #acc tokens = 5, dur(b,g,a) = 3.670, 0.441, 0.002 ms
slot      release: id  3 | task 270 | stop processing: n_tokens = 19443, truncated = 0
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-gemma4
srv  params_from_: message_spans: last user boundary: byte_pos=56567, token_pos=19234
slot get_availabl: id  3 | task -1 | selected slot by LCP similarity, sim_best = 0.180 (> 0.100 thold), f_keep = 0.178
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 19443, total state size = 5119.261 MiB
srv          load:  - looking for better prompt, base f_keep = 0.178, sim = 0.180
srv        update:  - cache state: 1 prompts, 7812.270 MiB (limits: 8192.000 MiB, 262144 tokens, 262144 est)
srv        update:    - prompt 0x5816dafca4c0:   19443 tokens, checkpoints:  1,  7812.270 MiB
srv  get_availabl: prompt cache update took 5754.02 ms
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  3 | task 351 | processing task, is_child = 0
slot update_slots: id  3 | task 351 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 19244
slot update_slots: id  3 | task 351 | n_past = 3458, slot.prompt.tokens.size() = 19443, seq_id = 3, pos_min = 14835, n_swa = 1024
slot update_slots: id  3 | task 351 | Checking checkpoint with [0, 3446] against 2434...
slot update_slots: id  3 | task 351 | restored context checkpoint (pos_min = 0, pos_max = 3446, n_tokens = 3447, n_past = 3446, size = 2693.009 MiB)
slot update_slots: id  3 | task 351 | n_tokens = 3446, memory_seq_rm [3446, end)
slot update_slots: id  3 | task 351 | prompt processing progress, n_tokens = 11638, batch.n_tokens = 8192, progress = 0.604760
slot update_slots: id  3 | task 351 | n_tokens = 11638, memory_seq_rm [11638, end)
slot update_slots: id  3 | task 351 | prompt processing progress, n_tokens = 18728, batch.n_tokens = 7090, progress = 0.973186
slot update_slots: id  3 | task 351 | n_tokens = 18728, memory_seq_rm [18728, end)
slot update_slots: id  3 | task 351 | checkpoint before user input reached: ending prompt batch at prompt_n_tokens = 19234
slot update_slots: id  3 | task 351 | prompt processing progress, n_tokens = 19234, batch.n_tokens = 506, progress = 0.999480
slot update_slots: id  3 | task 351 | skip checkpoint at 18728, expected boundary before user input = 19234
slot update_slots: id  3 | task 351 | n_tokens = 19234, memory_seq_rm [19234, end)
slot update_slots: id  3 | task 351 | prompt processing progress, n_tokens = 19240, batch.n_tokens = 6, progress = 0.999792
slot create_check: id  3 | task 351 | created context checkpoint 2 of 8 (pos_min = 14626, pos_max = 19233, n_tokens = 19234, size = 3600.054 MiB)
slot update_slots: id  3 | task 351 | n_tokens = 19240, memory_seq_rm [19240, end)
slot init_sampler: id  3 | task 351 | init sampler, took 3.19 ms, tokens: text = 19244, total = 19244
slot update_slots: id  3 | task 351 | prompt processing done, n_tokens = 19244, batch.n_tokens = 4
slot update_slots: id  3 | task 351 | skip checkpoint at 19240, expected boundary before user input = 19234
begin: ngram_mod occupancy = 15428/4194304 (0.00)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.0.196 200
reasoning-budget: activated, budget=2147483647 tokens
reasoning-budget: deactivated (natural end)
slot print_timing: id  3 | task 351 |
prompt eval time =   14532.32 ms / 15798 tokens (    0.92 ms per token,  1087.09 tokens per second)
       eval time =    1013.75 ms /    21 tokens (   48.27 ms per token,    20.72 tokens per second)
      total time =   15546.08 ms / 15819 tokens
statistics ngram_mod: #calls(b,g,a) = 5 355 2, #gen drafts = 2, #acc drafts = 2, #gen tokens = 128, #acc tokens = 5, dur(b,g,a) = 5.697, 0.469, 0.002 ms
slot      release: id  3 | task 351 | stop processing: n_tokens = 19264, truncated = 0
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-gemma4
srv  params_from_: message_spans: last user boundary: byte_pos=56567, token_pos=19234
slot get_availabl: id  3 | task -1 | selected slot by LCP similarity, sim_best = 0.590 (> 0.100 thold), f_keep = 0.999
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  3 | task 377 | processing task, is_child = 0
slot update_slots: id  3 | task 377 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 32621
slot update_slots: id  3 | task 377 | n_tokens = 19244, memory_seq_rm [19244, end)
slot update_slots: id  3 | task 377 | prompt processing progress, n_tokens = 27436, batch.n_tokens = 8192, progress = 0.841053
slot update_slots: id  3 | task 377 | n_tokens = 27436, memory_seq_rm [27436, end)
slot update_slots: id  3 | task 377 | 8192 tokens since last checkpoint at 19234, creating new checkpoint during processing at position 32105
slot update_slots: id  3 | task 377 | prompt processing progress, n_tokens = 32105, batch.n_tokens = 4669, progress = 0.984182
slot update_slots: id  3 | task 377 | skip checkpoint at 27436, expected boundary before user input = 19234
slot update_slots: id  3 | task 377 | n_tokens = 32105, memory_seq_rm [32105, end)
slot update_slots: id  3 | task 377 | prompt processing progress, n_tokens = 32617, batch.n_tokens = 512, progress = 0.999877
slot update_slots: id  3 | task 377 | skip checkpoint at 32105, expected boundary before user input = 19234
slot update_slots: id  3 | task 377 | n_tokens = 32617, memory_seq_rm [32617, end)
slot init_sampler: id  3 | task 377 | init sampler, took 4.94 ms, tokens: text = 32621, total = 32621
slot update_slots: id  3 | task 377 | prompt processing done, n_tokens = 32621, batch.n_tokens = 4
slot update_slots: id  3 | task 377 | skip checkpoint at 32617, expected boundary before user input = 19234
begin: ngram_mod occupancy = 28107/4194304 (0.01)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.0.196 200
reasoning-budget: activated, budget=2147483647 tokens
reasoning-budget: deactivated (natural end)
slot print_timing: id  3 | task 377 |
prompt eval time =   12903.45 ms / 13377 tokens (    0.96 ms per token,  1036.70 tokens per second)
       eval time =   33479.54 ms /   636 tokens (   52.64 ms per token,    19.00 tokens per second)
      total time =   46382.99 ms / 14013 tokens
statistics ngram_mod: #calls(b,g,a) = 6 990 2, #gen drafts = 2, #acc drafts = 2, #gen tokens = 128, #acc tokens = 5, dur(b,g,a) = 9.193, 1.463, 0.002 ms
slot      release: id  3 | task 377 | stop processing: n_tokens = 33256, truncated = 0
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-gemma4
srv  params_from_: message_spans: last user boundary: byte_pos=108386, token_pos=32971
slot get_availabl: id  3 | task -1 | selected slot by LCP similarity, sim_best = 0.989 (> 0.100 thold), f_keep = 0.981
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  3 | task 1017 | processing task, is_child = 0
slot update_slots: id  3 | task 1017 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 32982
slot update_slots: id  3 | task 1017 | n_tokens = 32621, memory_seq_rm [32621, end)
slot update_slots: id  3 | task 1017 | checkpoint before user input reached: ending prompt batch at prompt_n_tokens = 32971
slot update_slots: id  3 | task 1017 | prompt processing progress, n_tokens = 32971, batch.n_tokens = 350, progress = 0.999667
slot update_slots: id  3 | task 1017 | skip checkpoint at 32621, expected boundary before user input = 32971
slot update_slots: id  3 | task 1017 | n_tokens = 32971, memory_seq_rm [32971, end)
slot update_slots: id  3 | task 1017 | prompt processing progress, n_tokens = 32978, batch.n_tokens = 7, progress = 0.999879
slot create_check: id  3 | task 1017 | created context checkpoint 3 of 8 (pos_min = 28648, pos_max = 32970, n_tokens = 32971, size = 3377.394 MiB)
slot update_slots: id  3 | task 1017 | n_tokens = 32978, memory_seq_rm [32978, end)
slot init_sampler: id  3 | task 1017 | init sampler, took 5.53 ms, tokens: text = 32982, total = 32982
slot update_slots: id  3 | task 1017 | prompt processing done, n_tokens = 32982, batch.n_tokens = 4
slot update_slots: id  3 | task 1017 | skip checkpoint at 32978, expected boundary before user input = 32971
begin: ngram_mod occupancy = 28774/4194304 (0.01)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.0.196 200
reasoning-budget: activated, budget=2147483647 tokens
reasoning-budget: deactivated (natural end)
slot print_timing: id  3 | task 1017 |
prompt eval time =    2789.51 ms /   361 tokens (    7.73 ms per token,   129.41 tokens per second)
       eval time =    8041.81 ms /   155 tokens (   51.88 ms per token,    19.27 tokens per second)
      total time =   10831.32 ms /   516 tokens
statistics ngram_mod: #calls(b,g,a) = 7 1144 2, #gen drafts = 2, #acc drafts = 2, #gen tokens = 128, #acc tokens = 5, dur(b,g,a) = 12.740, 1.720, 0.002 ms
slot      release: id  3 | task 1017 | stop processing: n_tokens = 33136, truncated = 0
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-gemma4
srv  params_from_: message_spans: last user boundary: byte_pos=108386, token_pos=32971
slot get_availabl: id  3 | task -1 | selected slot by LCP similarity, sim_best = 0.997 (> 0.100 thold), f_keep = 0.999
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  3 | task 1175 | processing task, is_child = 0
slot update_slots: id  3 | task 1175 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 33193
slot update_slots: id  3 | task 1175 | n_tokens = 33108, memory_seq_rm [33108, end)
slot update_slots: id  3 | task 1175 | prompt processing progress, n_tokens = 33189, batch.n_tokens = 81, progress = 0.999879
slot update_slots: id  3 | task 1175 | skip checkpoint at 33108, expected boundary before user input = 32971
slot update_slots: id  3 | task 1175 | n_tokens = 33189, memory_seq_rm [33189, end)
slot init_sampler: id  3 | task 1175 | init sampler, took 5.50 ms, tokens: text = 33193, total = 33193
slot update_slots: id  3 | task 1175 | prompt processing done, n_tokens = 33193, batch.n_tokens = 4
slot update_slots: id  3 | task 1175 | skip checkpoint at 33189, expected boundary before user input = 32971
begin: ngram_mod occupancy = 29007/4194304 (0.01)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.0.196 200
reasoning-budget: activated, budget=2147483647 tokens
reasoning-budget: deactivated (natural end)
slot print_timing: id  3 | task 1175 |
prompt eval time =     172.46 ms /    85 tokens (    2.03 ms per token,   492.86 tokens per second)
       eval time =    3546.63 ms /    69 tokens (   51.40 ms per token,    19.46 tokens per second)
      total time =    3719.09 ms /   154 tokens
statistics ngram_mod: #calls(b,g,a) = 8 1212 2, #gen drafts = 2, #acc drafts = 2, #gen tokens = 128, #acc tokens = 5, dur(b,g,a) = 16.295, 1.829, 0.002 ms
slot      release: id  3 | task 1175 | stop processing: n_tokens = 33261, truncated = 0
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-gemma4
srv  params_from_: message_spans: last user boundary: byte_pos=108386, token_pos=32971
slot get_availabl: id  3 | task -1 | selected slot by LCP similarity, sim_best = 0.972 (> 0.100 thold), f_keep = 0.999
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  3 | task 1246 | processing task, is_child = 0
slot update_slots: id  3 | task 1246 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 34193
slot update_slots: id  3 | task 1246 | n_tokens = 33230, memory_seq_rm [33230, end)
slot update_slots: id  3 | task 1246 | prompt processing progress, n_tokens = 33677, batch.n_tokens = 447, progress = 0.984909
slot update_slots: id  3 | task 1246 | n_tokens = 33677, memory_seq_rm [33677, end)
slot update_slots: id  3 | task 1246 | prompt processing progress, n_tokens = 34189, batch.n_tokens = 512, progress = 0.999883
slot update_slots: id  3 | task 1246 | skip checkpoint at 33677, expected boundary before user input = 32971
slot update_slots: id  3 | task 1246 | n_tokens = 34189, memory_seq_rm [34189, end)
slot init_sampler: id  3 | task 1246 | init sampler, took 5.74 ms, tokens: text = 34193, total = 34193
slot update_slots: id  3 | task 1246 | prompt processing done, n_tokens = 34193, batch.n_tokens = 4
slot update_slots: id  3 | task 1246 | skip checkpoint at 34189, expected boundary before user input = 32971
begin: ngram_mod occupancy = 30013/4194304 (0.01)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.0.196 200
slot print_timing: id  3 | task 1246 |
prompt eval time =    1217.33 ms /   963 tokens (    1.26 ms per token,   791.08 tokens per second)
       eval time =   23152.00 ms /   443 tokens (   52.26 ms per token,    19.13 tokens per second)
      total time =   24369.32 ms /  1406 tokens
statistics ngram_mod: #calls(b,g,a) = 9 1654 2, #gen drafts = 2, #acc drafts = 2, #gen tokens = 128, #acc tokens = 5, dur(b,g,a) = 19.960, 2.572, 0.002 ms
slot      release: id  3 | task 1246 | stop processing: n_tokens = 34635, truncated = 0
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-gemma4
srv  params_from_: message_spans: last user boundary: byte_pos=113544, token_pos=34470
slot get_availabl: id  3 | task -1 | selected slot by LCP similarity, sim_best = 0.956 (> 0.100 thold), f_keep = 0.952
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  3 | task 1692 | processing task, is_child = 0
slot update_slots: id  3 | task 1692 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 34487
slot update_slots: id  3 | task 1692 | n_tokens = 32982, memory_seq_rm [32982, end)
slot update_slots: id  3 | task 1692 | prompt processing progress, n_tokens = 33971, batch.n_tokens = 989, progress = 0.985038
slot update_slots: id  3 | task 1692 | n_tokens = 33971, memory_seq_rm [33971, end)
slot update_slots: id  3 | task 1692 | checkpoint before user input reached: ending prompt batch at prompt_n_tokens = 34470
slot update_slots: id  3 | task 1692 | prompt processing progress, n_tokens = 34470, batch.n_tokens = 499, progress = 0.999507
slot update_slots: id  3 | task 1692 | skip checkpoint at 33971, expected boundary before user input = 34470
slot update_slots: id  3 | task 1692 | n_tokens = 34470, memory_seq_rm [34470, end)
slot update_slots: id  3 | task 1692 | prompt processing progress, n_tokens = 34483, batch.n_tokens = 13, progress = 0.999884
slot create_check: id  3 | task 1692 | created context checkpoint 4 of 8 (pos_min = 30027, pos_max = 34469, n_tokens = 34470, size = 3471.146 MiB)
slot update_slots: id  3 | task 1692 | n_tokens = 34483, memory_seq_rm [34483, end)
slot init_sampler: id  3 | task 1692 | init sampler, took 5.74 ms, tokens: text = 34487, total = 34487
slot update_slots: id  3 | task 1692 | prompt processing done, n_tokens = 34487, batch.n_tokens = 4
slot update_slots: id  3 | task 1692 | skip checkpoint at 34483, expected boundary before user input = 34470
begin: ngram_mod occupancy = 30517/4194304 (0.01)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.0.196 200
reasoning-budget: activated, budget=2147483647 tokens
reasoning-budget: deactivated (natural end)
slot print_timing: id  3 | task 1692 |
prompt eval time =    4045.72 ms /  1505 tokens (    2.69 ms per token,   372.00 tokens per second)
       eval time =   36678.33 ms /   675 tokens (   54.34 ms per token,    18.40 tokens per second)
      total time =   40724.04 ms /  2180 tokens
statistics ngram_mod: #calls(b,g,a) = 10 2328 2, #gen drafts = 2, #acc drafts = 2, #gen tokens = 128, #acc tokens = 5, dur(b,g,a) = 23.688, 3.734, 0.002 ms
slot      release: id  3 | task 1692 | stop processing: n_tokens = 35161, truncated = 0
srv  update_slots: all slots are idle

@jacekpoplawski jacekpoplawski marked this pull request as ready for review May 11, 2026 02:03
@ggerganov
Member

Yes, that seems like a good direction. Have you tested that it works as expected?

@pwilkin
Member

pwilkin commented May 11, 2026

This needs dedicated autoparser support for split-marker detection; as written, this assumes that all autoparser models use the ChatML markers (<|im_start|> etc.), which is incorrect.

I'll try to submit the marker detection code ASAP.


const auto message_spans = json_value(data, "message_spans", json::array());
if (message_spans.is_array()) {
int32_t last_user_pos = -1;
Contributor


You can probably use 0 as the sentinel value here, since a checkpoint at pos 0 isn't useful. Should help clean up the other logic too.


if ((size_t) last_user_pos <= prompt.size()) {
const std::string prefix = prompt.substr(0, (size_t) last_user_pos);
const auto prefix_tokens = common_tokenize(vocab, prefix, true, true);
Contributor


Just a guess, but this will probably create incorrect checkpoints for multimodal models with at least one image in the prompt.

Contributor Author


Yes, you are right, this breaks after the first image.

@jacekpoplawski
Contributor Author

Yes, that seems like a good direction. Have you tested that it works as expected?

It works stably for my use case: Pi, Qwen 3.6 27B, 200k ctx, 24 checkpoints.

With 8 checkpoints I was able to reproduce the forced full prompt re-processing... but with 24 I can work for hours without issues.

As @aldehir pointed out, this does not work correctly with multimodal prompts. I committed a fallback to the old mechanism for that case.

Should I add a switch to enable this new mechanism as an option, or should I try to support multimodal prompts as well?

I understand that the impact of this change is significant, but the benefits are also significant: agentic coding is much more responsive now.
