Skip to content

fix: overhaul conversation extraction and add optional LLM validation#7

Open
NotYuSheng wants to merge 1 commit into
mainfrom
fix/message-tagging-and-conversation-grouping
Open

fix: overhaul conversation extraction and add optional LLM validation#7
NotYuSheng wants to merge 1 commit into
mainfrom
fix/message-tagging-and-conversation-grouping

Conversation

@NotYuSheng
Copy link
Copy Markdown
Owner

Closes #1, #2, #3, #4, #5, #6

Summary

Test plan

  • Run telegram_extract.py against a real result.json export and verify multi-turn output
  • Verify conversations split correctly across long silence gaps
  • Verify consecutive messages from same sender are merged into one turn
  • Run convert_to_sharegpt.py and verify ShareGPT output structure
  • Test validator with DIALOGSMITH_LLM_VALIDATE=false — confirm it skips cleanly
  • Test validator with a valid ANTHROPIC_API_KEY — confirm filtering and audit log

🤖 Generated with Claude Code

- Fixes incorrect time gap reference (was using wrong message index)
- Collects all consecutive messages from same sender as one turn (no more dropped messages)
- Splits chats into distinct conversations using a configurable silence gap threshold
- Groups full back-and-forth into multi-turn samples instead of isolated pairs
- Updates convert_to_sharegpt.py for the new multi-turn format
- Adds validator.py: optional LLM pass that scores each sample for coherence
  and quality, controlled via DIALOGSMITH_LLM_VALIDATE env var
- Adds .env.example documenting all new configuration options

Closes #1, #2, #3, #4, #5, #6

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces multi-turn conversation support for Telegram data extraction and ShareGPT conversion. Key changes include a refactored extraction script that groups messages into distinct conversations based on time gaps, and a new LLM-based validator that uses Anthropic's API to score conversation coherence and quality. Feedback focuses on resolving a logic redundancy between message chaining and turn merging, improving the robustness of JSON parsing from LLM responses, and parallelizing API requests to enhance validation performance.

Comment on lines +136 to +137
if turns and turns[-1]["role"] == role:
turns[-1]["text"] += "\n" + turn_text
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The MESSAGE_CHAIN_THRESHOLD (currently 30s) is effectively ignored because consecutive turns with the same role are unconditionally merged here. Since collect_turn already stops when the sender changes, any consecutive calls to it within the same conversation will necessarily have the same role and thus be merged. If the goal is to strictly group all consecutive messages from the same sender into a single turn for ShareGPT compatibility, the threshold check in collect_turn (lines 68-69) is redundant. If you intended to split turns based on time, this merge logic should be removed, though that might produce non-alternating roles which are incompatible with some training formats.

Comment thread scripts/validator.py
)

raw = response.content[0].text.strip()
result = json.loads(raw)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Directly calling json.loads() on the LLM response is fragile. LLMs frequently include markdown formatting (e.g., json ... ) or conversational preambles even when instructed to return only JSON. It is more robust to extract the JSON block using a regular expression.

Suggested change
result = json.loads(raw)
import re
match = re.search(r'\{.*\}', raw, re.DOTALL)
if not match:
raise ValueError(f"No JSON object found in LLM response: {raw}")
result = json.loads(match.group())

Comment thread scripts/validator.py
passed = []
filtered = []

for i, turns in enumerate(samples):
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This loop performs LLM validation sequentially, which will be slow for large datasets. Consider using concurrent.futures.ThreadPoolExecutor to parallelize the API requests and significantly improve performance, especially since the validator runs once per sample.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Prompt only captures first message when other person sends multiple in a row

1 participant