I have some code that streams the LLM output as it generates and injects a </think> into the prompt when it detects the model spending too much time thinking.
Previously, with 0.3.32, calling llm.create_completion with an empty prompt would simply continue generation, so it was straightforward to "inject" text into the generation as the assistant by putting it in the prompt, and otherwise pass an empty prompt to continue where the model left off.
With newer versions, something has changed: the model completely forgets the existing context and produces essentially random output, as if the </think> had been a user message.
To be clear, I'm not looking for a multi-turn conversation; I only want to inject a bit of text into the assistant's current turn when it has generated too much text, to essentially force it to stop thinking.
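To illustrate the control flow I'm after, here is a simulation with a fake token stream (no model involved; stream_with_injection is a name I made up for this sketch, not anything from llama-cpp-python):

```python
THINK_CLOSE = "</think>"
INJECT = " ...\nI'm overthinking, let's produce the final output.\n" + THINK_CLOSE

def stream_with_injection(chunks, max_think):
    """Accumulate streamed text; once max_think characters arrive without a
    closing think tag, append the forced injection and hand control back."""
    accumulated = ""
    for text in chunks:
        accumulated += text
        if len(accumulated) >= max_think and THINK_CLOSE not in accumulated:
            # At this point the real code would feed INJECT back into the
            # model as a continuation prompt, not just append it.
            return accumulated + INJECT
    return accumulated

# Simulated model output that never closes its think block:
out = stream_with_injection(["<think>hm", "mm", "mm"], max_think=6)
```

The real question is how to make that continuation step work again.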
The code in question (slightly modified to remove ComfyUI stuff):
# _MAX_CONTINUATIONS and _THINK_CLOSE ("</think>") are module-level constants
# defined elsewhere (removed along with the ComfyUI glue)
def execute(llm, prompt, think=False, seed=-1, max_output=4096, max_think=4096,
            settings_json="{}"):
    stop_tokens = ["<end_of_turn>", "<eos>"]
    gen_params = {"stop": stop_tokens, "max_tokens": 1024}
    try:
        settings = json.loads(settings_json)
        gen_params.update(settings)
    except ValueError as e:
        log.warning("Failed to parse settings JSON: %s. Using defaults", e)
    # Map to the parameter name llama-cpp-python expects; "presence_penalty"
    # is already its native name, so only "repetition_penalty" needs renaming
    if "repetition_penalty" in gen_params:
        gen_params["repeat_penalty"] = gen_params.pop("repetition_penalty")
    if llm.chat_handler:
        llm.chat_handler.extra_template_arguments["enable_thinking"] = think
    user_parts = [p for p in prompt["user"] if p["type"] != "image_url"]
    base_messages: list[dict] = []
    if prompt["system"]:
        base_messages.append({"role": "system", "content": prompt["system"]})
    base_messages.append({"role": "user", "content": user_parts})
    accumulated = ""
    attempt = 0
    log.info("LLM response:")
    next_prompt = ""
    force_finish = False
    while attempt < _MAX_CONTINUATIONS:
        if force_finish:
            break
        attempt += 1
        if attempt == 1:
            r = llm.create_chat_completion(
                messages=base_messages,
                stream=True,
                **gen_params,
            )
        else:
            # Continue the assistant's turn, optionally injecting text first
            accumulated += next_prompt
            print(next_prompt, end="")
            r = llama_cpp.llama_chat_format._convert_text_completion_chunks_to_chat(
                llm.create_completion(
                    prompt=next_prompt,
                    stream=True,
                    **gen_params,
                )
            )
            next_prompt = ""
        finish = None
        for chunk in r:
            delta = chunk["choices"][0]["delta"]
            if "content" not in delta:
                continue
            text = delta["content"] or ""
            accumulated += text
            finish = chunk["choices"][0]["finish_reason"]
            print(text, end="")
            if think and len(accumulated) >= max_think and _THINK_CLOSE not in accumulated:
                next_prompt = " ...\nI'm overthinking, let's produce the final output.\n</think>"
            if len(accumulated) >= max_output:
                log.info("Generated too much text, aborting")
                force_finish = True
                break
        # Still thinking: loop around for another continuation
        if think and _THINK_CLOSE not in accumulated:
            continue
        if finish != "length":
            print()
            break
        print()
        log.warning("Response truncated after thinking due to max tokens")
    print()
    return accumulated
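For reference, the settings-normalization part at the top can be exercised on its own (a minimal sketch; build_gen_params is a name I invented for this snippet — llama-cpp-python's create_completion natively takes repeat_penalty and presence_penalty, so only repetition_penalty needs renaming):

```python
import json
import logging

log = logging.getLogger(__name__)

def build_gen_params(settings_json, stop_tokens=("<end_of_turn>", "<eos>")):
    """Merge user-supplied JSON settings over the defaults and rename
    keys to what llama-cpp-python's completion calls expect."""
    params = {"stop": list(stop_tokens), "max_tokens": 1024}
    try:
        params.update(json.loads(settings_json))
    except ValueError as e:  # json.JSONDecodeError subclasses ValueError
        log.warning("Failed to parse settings JSON: %s. Using defaults", e)
    if "repetition_penalty" in params:
        params["repeat_penalty"] = params.pop("repetition_penalty")
    return params
```

This keeps the JSON-parsing failure path non-fatal, the same way the full function does.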
Is there a better way to do this with newer versions of llama-cpp-python? It is entirely possible that this code just worked by accident previously.