
fix: Remove MiniMax thinking content from streaming responses #7314

Open
alei37 wants to merge 6 commits into AstrBotDevs:master from
alei37:fix/minimax-streaming-thinking

Conversation


@alei37 alei37 commented Apr 2, 2026

Problem Description

When MiniMax (an OpenAI-compatible API addressed by URL) is used with streaming enabled, the model's thinking content is sent to the user along with the reply body. With streaming disabled, only the reply body is output, as expected.

Root Cause

How the MiniMax model sends its thinking content

In streaming responses, the MiniMax model embeds thinking content in delta.content, wrapped in <think>...</think> tags. For example:

Chunk 1: <think>\n用户
Chunk 2: 地回复...</think>\n\n晚上好!

When thinking content spans multiple chunks, the <think> and </think> tags end up split across different chunks.

The defect in the streaming path

In the _query_stream method (the streaming path), the original code sent delta.content directly, with no handling of thinking tags:

if delta and delta.content:
    completion_text = self._normalize_content(delta.content, strip=False)
    llm_response.result_chain = MessageChain(chain=[Comp.Plain(completion_text)])
    _y = True

When the thinking tags are split across chunks:

  • Chunk 1: <think>\n用户 → no complete regex match → sent as-is
  • Chunk 2: 地回复...</think>\n\n晚上好 → no complete regex match → sent as-is

Result: <think>\n用户地回复...</think>\n\n晚上好 is delivered to the user in full.
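The failure can be reproduced with a minimal sketch; the per-chunk filter and the English chunk payloads below are illustrative stand-ins, not the project's actual code:

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def naive_filter(chunk: str) -> str:
    # Per-chunk filtering: removes only tag pairs that are
    # complete within a single chunk.
    return THINK_RE.sub("", chunk)

# Neither chunk contains a complete <think>...</think> pair,
# so nothing is removed and the thinking content leaks through.
chunks = ["<think>\nuser", " reply...</think>\n\nGood evening!"]
sent = "".join(naive_filter(c) for c in chunks)
# sent == "<think>\nuser reply...</think>\n\nGood evening!"
```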

Modifications

Core idea

Maintain a thinking_buffer in the streaming handler to track incomplete thinking tags that span chunks.

Modified file

astrbot/core/provider/sources/openai_source.py

Changes

  1. Add thinking-buffer variables (at the start of the _query_stream method):
state = ChatCompletionStreamState()

# Track partial thinking tags across chunks for MiniMax-style reasoning
thinking_buffer = ""
in_thinking_block = False

async for chunk in stream:
  2. Rework the delta.content handling logic:
if delta and delta.content:
    completion_text = self._normalize_content(delta.content, strip=False)

    # Handle partial <think>...</think> tags that may span multiple chunks (MiniMax)
    if thinking_buffer:
        completion_text = thinking_buffer + completion_text
        thinking_buffer = ""

    thinking_pattern = re.compile(r"<think>(.*?)</think>", re.DOTALL)

    # Extract complete thinking blocks
    for match in thinking_pattern.finditer(completion_text):
        think_content = match.group(1).strip()
        if think_content:
            if llm_response.reasoning_content:
                llm_response.reasoning_content += "\n" + think_content
            else:
                llm_response.reasoning_content = think_content

    # Remove complete thinking blocks
    completion_text = thinking_pattern.sub("", completion_text)

    # Handle incomplete thinking block at chunk boundary
    think_start = completion_text.rfind("<think>")
    think_end = completion_text.rfind("</think>")

    if think_start != -1 and (think_end == -1 or think_end < think_start):
        # Unclosed opening tag found, buffer it
        thinking_buffer = completion_text[think_start:]
        completion_text = completion_text[:think_start]
    elif think_end != -1 and think_end > think_start:
        # Thinking block closed, clear buffer
        thinking_buffer = ""

    completion_text = completion_text.strip()

    if completion_text:
        llm_response.result_chain = MessageChain(chain=[Comp.Plain(completion_text)])
        _y = True

Processing flow example

Input Chunk 1: <think>\n用户
  → unclosed <think> detected, buffer = "<think>\n用户"
  → yield ""

Input Chunk 2: 地回复...</think>\n\n晚上好
  → prepend buffer: "<think>\n用户地回复...</think>\n\n晚上好"
  → extracted thinking content: "用户地回复..."
  → after removing the thinking tags: "晚上好"
  → yield "晚上好"
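The flow above can be exercised end-to-end with a stand-alone sketch of the same buffering logic (the ThinkingFilter class and feed method are illustrative names, not the actual patch):

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

class ThinkingFilter:
    """Buffers a partial <think> tag across streaming chunks."""

    def __init__(self) -> None:
        self.buffer = ""
        self.reasoning: list[str] = []

    def feed(self, chunk: str) -> str:
        # Prepend any buffered partial tag from the previous chunk.
        text = self.buffer + chunk
        self.buffer = ""
        # Collect complete thinking blocks as reasoning content.
        for m in THINK_RE.finditer(text):
            if m.group(1).strip():
                self.reasoning.append(m.group(1).strip())
        text = THINK_RE.sub("", text)
        # An unclosed opening tag is buffered for the next chunk.
        start = text.rfind("<think>")
        if start != -1:
            self.buffer = text[start:]
            text = text[:start]
        return text

f = ThinkingFilter()
visible = [f.feed(c) for c in ["<think>\nuser", " reply...</think>\n\nGood evening!"]]
# visible == ["", "\n\nGood evening!"]; f.reasoning == ["user reply..."]
```

Feeding the two chunks from the example yields an empty string for the first chunk and only the visible text for the second, while the thinking content lands in reasoning.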

Impact

Affected providers

All providers that inherit from ProviderOpenAIOfficial (i.e. those using the OpenAI-compatible format).

Backward compatibility

  • If a provider does not emit thinking content in the <think>...</think> format, this change has no effect.

  • If a provider already handles thinking tags in non-streaming mode, streaming behavior is now consistent with it.

  • This is NOT a breaking change.

Screenshots or Test Results


Related Issues

#7013
#6647
#6745

Checklist

  • 😊 If there are new features added in the PR, I have discussed it with the authors through issues/emails, etc.

  • 👀 My changes have been well-tested, and "Verification Steps" and "Screenshots" have been provided above.

  • 🤓 I have ensured that no new dependencies are introduced, OR if new dependencies are introduced, they have been added to the appropriate locations in requirements.txt and pyproject.toml.

  • 😮 My changes do not introduce malicious code.

Summary by Sourcery

Handle MiniMax-style reasoning tags in OpenAI-compatible streaming responses to prevent internal thinking content from being sent to end users while still capturing it as structured reasoning metadata.

New Features:

  • Capture <think>...</think> reasoning segments from streaming responses into reasoning_content for OpenAI-compatible providers using MiniMax-style thinking tags.

Bug Fixes:

  • Prevent MiniMax thinking content wrapped in <think>...</think> from being emitted to users in streaming responses, even when tags are split across chunks.

Enhancements:

  • Introduce chunk-level buffering and parsing of thinking tags in _query_stream to align streaming behavior with non-streaming responses.

@auto-assign auto-assign bot requested review from Raven95676 and advent259141 April 2, 2026 16:50
@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Apr 2, 2026
Contributor

@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 2 issues, and left some high level feedback:

  • The newly introduced in_thinking_block variable is never used and should either be wired into the logic or removed to avoid confusion about the stream parsing state.
  • The thinking_pattern = re.compile(...) is created on every streamed chunk; consider moving this to a module- or class-level constant to avoid repeated compilation in long-running streams.
Individual Comments

### Comment 1
<location path="astrbot/core/provider/sources/openai_source.py" line_range="609-613" />
<code_context>
+                    # We closed a thinking block, clear any buffered content
+                    thinking_buffer = ""
+                
+                # Strip whitespace but preserve structure
+                completion_text = completion_text.strip()
+                
+                # Only yield if there's actual text content remaining
+                if completion_text:
+                    llm_response.result_chain = MessageChain(
+                        chain=[Comp.Plain(completion_text)],
</code_context>
<issue_to_address>
**issue (bug_risk):** Stripping `completion_text` reintroduces the inter-chunk spacing bug this code was originally avoiding.

Using `completion_text.strip()` here can reintroduce the original streaming bug: if one chunk ends with `"hello "` and the next starts with `"world"`, you send `"hello"` then `"world"`, and the client’s concatenation loses the space. This contradicts the earlier decision not to strip streaming chunks. Please avoid `strip()` here, or restrict cleanup to artifacts around `<think>` tags without changing user-visible leading/trailing spaces.
</issue_to_address>
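The spacing loss the reviewer describes is easy to see in isolation (illustrative chunks):

```python
# Streaming clients concatenate yielded chunks verbatim, so
# stripping each chunk deletes the whitespace that joins words.
chunks = ["hello ", "world"]
joined_raw = "".join(chunks)                          # "hello world"
joined_stripped = "".join(c.strip() for c in chunks)  # "helloworld"
```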

### Comment 2
<location path="astrbot/core/provider/sources/openai_source.py" line_range="539-545" />
<code_context>

         state = ChatCompletionStreamState()
+        
+        # Track partial thinking tags across chunks for MiniMax-style reasoning
+        thinking_buffer = ""
+        in_thinking_block = False

         async for chunk in stream:
</code_context>
<issue_to_address>
**suggestion:** `in_thinking_block` is currently unused and can be removed or wired into the logic.

This flag is written but never read in the streaming loop. If block state tracking is no longer needed, remove it; if it is, integrate it into the buffering/parse logic so it meaningfully complements `thinking_buffer` and `rfind` rather than remaining unused.

```suggestion
        state = ChatCompletionStreamState()

        # Track partial thinking tags across chunks for MiniMax-style reasoning
        thinking_buffer = ""

        async for chunk in stream:
```
</issue_to_address>


@dosubot dosubot bot added the area:provider The bug / feature is about AI Provider, Models, LLM Agent, LLM Agent Runner. label Apr 2, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request implements logic to extract and handle reasoning content within tags for streaming responses, specifically supporting MiniMax-style reasoning by buffering partial tags across chunks. Feedback includes removing an unused variable, optimizing regex compilation by moving it outside the loop, and resetting the response chain per iteration to prevent data leakage. Additionally, the yield flag should be set when reasoning content is extracted to ensure chunks are processed, and a strip() call should be removed to preserve essential whitespace between streaming chunks.

Comment on lines +541 to 545
        # Track partial thinking tags across chunks for MiniMax-style reasoning
        thinking_buffer = ""
        in_thinking_block = False

        async for chunk in stream:

high

The variable in_thinking_block is initialized but never used in the subsequent logic. Additionally, for better performance, the regex pattern should be compiled once outside the streaming loop. Also, since the llm_response object is reused across all chunks in the stream, we should clear the result_chain at the start of each iteration to prevent content from previous chunks from leaking into the current one (which can happen if a chunk contains only reasoning content or metadata).

        # Track partial thinking tags across chunks for MiniMax-style reasoning
        thinking_buffer = ""
        thinking_pattern = re.compile(r"<think>(.*?)</think>", re.DOTALL)

        async for chunk in stream:
            llm_response.result_chain = None

Comment on lines +586 to +592
for match in thinking_pattern.finditer(completion_text):
    think_content = match.group(1).strip()
    if think_content:
        if llm_response.reasoning_content:
            llm_response.reasoning_content += "\n" + think_content
        else:
            llm_response.reasoning_content = think_content

high

When reasoning content is successfully extracted from the message body, the _y flag must be set to True. This ensures that the chunk is yielded to the consumer even if the remaining completion_text is empty (e.g., when a chunk contains only the end of a thinking block). Without this, the reasoning content might be delayed or lost.

Suggested change (the block above, with _y set when reasoning content is extracted):

    # Extract complete thinking blocks
    for match in thinking_pattern.finditer(completion_text):
        think_content = match.group(1).strip()
        if think_content:
            if llm_response.reasoning_content:
                llm_response.reasoning_content += "\n" + think_content
            else:
                llm_response.reasoning_content = think_content
            _y = True

thinking_buffer = ""

# Strip whitespace but preserve structure
completion_text = completion_text.strip()

high

Calling strip() on every chunk will remove leading and trailing spaces that are essential for correctly joining words split across chunk boundaries. This negates the strip=False setting used in _normalize_content and will cause words to be merged incorrectly (e.g., "Hello " and "world" becoming "Helloworld").

thinking_buffer = ""

# Find all thinking blocks in this chunk
thinking_pattern = re.compile(r"<think>(.*?)</think>", re.DOTALL)

medium

This line is now redundant as the regex pattern is compiled once outside the loop for efficiency.

Member

@Soulter Soulter left a comment


Placing the changes into a helper function would be better.

Move inline thinking block extraction logic from streaming loop into a
separate _extract_thinking_blocks helper method for better code
organization and maintainability.
Author

alei37 commented Apr 3, 2026

Updated: the thinking-block extraction now uses a helper function.

@alei37 alei37 requested a review from Soulter April 3, 2026 08:12
alei37 added 2 commits April 3, 2026 16:17
- Handle usage in chunks with empty choices (MiniMax sends usage in final chunk with choices=[])
- Yield the usage chunk so caller can capture it
- Accumulate token usage from chunk responses in tool_loop_agent_runner

This fixes token usage display not showing for MiniMax-M2.7 in webui.
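The accumulation described in this commit can be sketched as follows; the Usage fields mirror the OpenAI usage schema, and accumulate is a hypothetical helper, not the runner's actual code:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Usage:
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0

def accumulate(total: Usage, chunk_usage: Optional[Usage]) -> Usage:
    # Most chunks carry no usage; MiniMax reports it only in the
    # final chunk (which has choices=[]), so None is simply skipped.
    if chunk_usage is None:
        return total
    return Usage(
        total.prompt_tokens + chunk_usage.prompt_tokens,
        total.completion_tokens + chunk_usage.completion_tokens,
        total.total_tokens + chunk_usage.total_tokens,
    )

total = Usage()
for u in [None, None, Usage(12, 34, 46)]:
    total = accumulate(total, u)
# total == Usage(prompt_tokens=12, completion_tokens=34, total_tokens=46)
```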
Author

alei37 commented Apr 5, 2026

Fixed the issue where MiniMax token usage was lost in transit.

…sages

When MiniMax sends a usage-only chunk (choices=[]) after content chunks,
the old code yielded the shared llm_response object which still contained
the previous result_chain, causing duplicate messages on the frontend.

Now creates a separate LLMResponse for usage-only chunks to avoid
carrying over stale content.
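The fix described in this commit message can be sketched like so; FakeLLMResponse and handle_chunk are illustrative names, not the actual AstrBot API:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class FakeLLMResponse:
    result_chain: list = field(default_factory=list)
    usage: Optional[dict] = None

def handle_chunk(shared: FakeLLMResponse, chunk: dict) -> FakeLLMResponse:
    if not chunk["choices"] and chunk.get("usage"):
        # Usage-only chunk: return a fresh response object so the
        # previous chunk's result_chain is not yielded again.
        return FakeLLMResponse(usage=chunk["usage"])
    shared.result_chain = [chunk["choices"][0]["delta"]["content"]]
    return shared

shared = FakeLLMResponse()
streamed = handle_chunk(shared, {"choices": [{"delta": {"content": "hi"}}]})
final = handle_chunk(shared, {"choices": [], "usage": {"total_tokens": 5}})
# `final` carries the usage but an empty result_chain.
```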

Labels

area:provider The bug / feature is about AI Provider, Models, LLM Agent, LLM Agent Runner. size:M This PR changes 30-99 lines, ignoring generated files.
