
Input injection behaviour changed while streaming, after 0.3.32? #110

@asagi4


I have some code that streams the LLM output while it generates, and injects a </think> into the prompt if it detects the model spending too much time thinking.

Previously, with 0.3.32, calling llm.create_completion with an empty prompt would simply continue generation from the existing context. This made it straightforward to "inject" text into the assistant's turn by passing it as the prompt, and to pass an empty prompt otherwise to continue where the model left off.

With newer versions, something has changed: the model completely forgets the existing context and starts producing essentially random output, as if the </think> had been a user message.

To be clear, I'm not looking for a multi-turn conversation; I only want to inject a bit of text into the assistant's current turn when it has generated too much text, to essentially force it to stop thinking.

The code in question (slightly modified to remove ComfyUI stuff):

def execute(llm, prompt, think=False, seed=-1, max_output=4096, max_think=4096, settings_json="{}"):
        stop_tokens = ["<end_of_turn>", "<eos>"]
        gen_params = {"stop": stop_tokens, "max_tokens": 1024}

        try:
            settings = json.loads(settings_json)
            gen_params.update(settings)
        except ValueError as e:
            log.warning("Failed to parse settings JSON: %s. Using defaults", e)

        # llama-cpp-python already accepts "presence_penalty" as-is; only
        # "repetition_penalty" needs to be renamed to its "repeat_penalty" kwarg.
        if "repetition_penalty" in gen_params:
            gen_params["repeat_penalty"] = gen_params.pop("repetition_penalty")

        if llm.chat_handler:
            llm.chat_handler.extra_template_arguments["enable_thinking"] = think

       
        user_parts = [p for p in prompt["user"] if p["type"] != "image_url"]
        
        base_messages: list[dict] = []
        if prompt["system"]:
            base_messages.append({"role": "system", "content": prompt["system"]})
        base_messages.append({"role": "user", "content": user_parts})

        accumulated = ""
        attempt = 0
        log.info("LLM response:")
        next_prompt = ""
        force_finish = False
        finish = None  # finish_reason from the last streamed chunk
        while attempt < _MAX_CONTINUATIONS:
            if force_finish:
                break
            attempt += 1
            if attempt == 1:
                r = llm.create_chat_completion(
                    messages=base_messages,
                    stream=True,
                    **gen_params,
                )
            else:
                accumulated += next_prompt
                print(next_prompt, end="")
                r = llama_cpp.llama_chat_format._convert_text_completion_chunks_to_chat(
                    llm.create_completion(
                        prompt=next_prompt,
                        stream=True,
                        **gen_params,
                    )
                )
                next_prompt = ""
            for chunk in r:
                choice = chunk["choices"][0]
                # Read finish_reason before skipping chunks without content,
                # so the final chunk's finish_reason is not lost.
                finish = choice["finish_reason"]
                delta = choice["delta"]
                if "content" not in delta:
                    continue
                text = delta["content"] or ""
                accumulated += text
                print(text, end="")
            if think and len(accumulated) >= max_think and _THINK_CLOSE not in accumulated:
                next_prompt = " ...\nI'm overthinking, let's produce the final output.\n</think>"

            if len(accumulated) >= max_output:
                log.info("Generated too much text, aborting")
                force_finish = True
                break
            if think and _THINK_CLOSE not in accumulated:
                continue
            if finish != "length":
                print()
                break
            else:
                print()
                log.warning("Response truncated after thinking due to max tokens")
            print()

        return accumulated

Is there a better way to do this with newer versions of llama-cpp-python? It is entirely possible that this code just worked by accident previously.
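One workaround I have considered, to avoid depending on implicit KV-cache state entirely (a sketch, not a confirmed fix): keep the full raw text generated so far and resend prompt = transcript + injection to create_completion on every continuation. llama.cpp reuses the cached prefix that matches the previous evaluation, so only the newly appended text should be re-evaluated. The helper below just builds that prompt; build_continuation_prompt is my name, and base_prompt would have to be the already chat-templated prompt string for the model, which this sketch does not produce:

```python
# Hypothetical sketch: instead of continuing with an empty/short prompt and
# relying on implicit cached context, resend the entire raw transcript on
# every continuation call and let llama.cpp's prefix cache skip the
# already-evaluated tokens.

def build_continuation_prompt(base_prompt: str, generated: str, injection: str = "") -> str:
    """Full prompt for the next create_completion call."""
    return base_prompt + generated + injection

# Usage (requires a loaded Llama instance and a raw templated prompt):
# full = llm.create_completion(
#     prompt=build_continuation_prompt(raw_prompt, accumulated, "</think>"),
#     stream=True, **gen_params)
```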
