
Input injection behaviour changed while streaming, after 0.3.32? #110

@asagi4


I have some code that streams the LLM output while it generates, and injects a </think> into the prompt if it detects the model spending too much time thinking.

Previously, with 0.3.32, calling llm.create_completion with an empty prompt would simply continue generation from the existing context. This made it straightforward to "inject" text into the assistant's turn by passing it as the prompt, and to pass an empty prompt otherwise to continue where the model left off.

With newer versions, something has changed: the model completely forgets the existing context and starts producing essentially random output, as if the </think> had been a user message.

To be clear, I'm not looking for a multi-turn conversation; I only want to inject a bit of text into the assistant's current turn when it has generated too much text, to essentially force it to stop thinking.

The code in question (slightly modified to remove ComfyUI stuff):

def execute(llm, prompt, think=False, seed=-1, max_output=4096, max_think=4096, settings_json="{}"):
        stop_tokens = ["<end_of_turn>", "<eos>"]
        gen_params = {"stop": stop_tokens, "max_tokens": 1024}

        try:
            settings = json.loads(settings_json)
            gen_params.update(settings)
        except ValueError as e:
            log.warning("Failed to parse settings JSON: %s. Using defaults", e)

        # llama-cpp-python already accepts "presence_penalty" as-is; only
        # "repetition_penalty" needs to be renamed to its "repeat_penalty" kwarg.
        if "repetition_penalty" in gen_params:
            gen_params["repeat_penalty"] = gen_params.pop("repetition_penalty")

        if llm.chat_handler:
            llm.chat_handler.extra_template_arguments["enable_thinking"] = think

       
        user_parts = [p for p in prompt["user"] if p["type"] != "image_url"]
        
        base_messages: list[dict] = []
        if prompt["system"]:
            base_messages.append({"role": "system", "content": prompt["system"]})
        base_messages.append({"role": "user", "content": user_parts})

        accumulated = ""
        attempt = 0
        log.info("LLM response:")
        next_prompt = ""
        force_finish = False
        finish = None  # finish_reason from the last streamed chunk
        while attempt < _MAX_CONTINUATIONS:
            if force_finish:
                break
            attempt += 1
            if attempt == 1:
                r = llm.create_chat_completion(
                    messages=base_messages,
                    stream=True,
                    **gen_params,
                )
            else:
                accumulated += next_prompt
                print(next_prompt, end="")
                r = llama_cpp.llama_chat_format._convert_text_completion_chunks_to_chat(
                    llm.create_completion(
                        prompt=next_prompt,
                        stream=True,
                        **gen_params,
                    )
                )
                next_prompt = ""
            for chunk in r:
                choice = chunk["choices"][0]
                # Read finish_reason before skipping chunks without content,
                # so the final chunk's finish_reason is not lost.
                finish = choice["finish_reason"]
                delta = choice["delta"]
                if "content" not in delta:
                    continue
                text = delta["content"] or ""
                accumulated += text
                print(text, end="")
            if think and len(accumulated) >= max_think and _THINK_CLOSE not in accumulated:
                next_prompt = " ...\nI'm overthinking, let's produce the final output.\n</think>"

            if len(accumulated) >= max_output:
                log.info("Generated too much text, aborting")
                force_finish = True
                break
            if think and _THINK_CLOSE not in accumulated:
                continue
            if finish != "length":
                print()
                break
            else:
                print()
                log.warning("Response truncated after thinking due to max tokens")
            print()

        return accumulated

Is there a better way to do this with newer versions of llama-cpp-python? It is entirely possible that this code just worked by accident previously.
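One workaround I have considered, to avoid depending on implicit KV-cache state entirely (a sketch, not a confirmed fix): keep the full raw text generated so far and resend prompt = transcript + injection to create_completion on every continuation. llama.cpp reuses the cached prefix that matches the previous evaluation, so only the newly appended text should be re-evaluated. The helper below just builds that prompt; build_continuation_prompt is my name, and base_prompt would have to be the already chat-templated prompt string for the model, which this sketch does not produce:

```python
# Hypothetical sketch: instead of continuing with an empty/short prompt and
# relying on implicit cached context, resend the entire raw transcript on
# every continuation call and let llama.cpp's prefix cache skip the
# already-evaluated tokens.

def build_continuation_prompt(base_prompt: str, generated: str, injection: str = "") -> str:
    """Full prompt for the next create_completion call."""
    return base_prompt + generated + injection

# Usage (requires a loaded Llama instance and a raw templated prompt):
# full = llm.create_completion(
#     prompt=build_continuation_prompt(raw_prompt, accumulated, "</think>"),
#     stream=True, **gen_params)
```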
