fix: overhaul conversation extraction and add optional LLM validation#7
fix: overhaul conversation extraction and add optional LLM validation#7NotYuSheng wants to merge 1 commit into
Conversation
- Fixes incorrect time gap reference (was using wrong message index) - Collects all consecutive messages from same sender as one turn (no more dropped messages) - Splits chats into distinct conversations using a configurable silence gap threshold - Groups full back-and-forth into multi-turn samples instead of isolated pairs - Updates convert_to_sharegpt.py for the new multi-turn format - Adds validator.py: optional LLM pass that scores each sample for coherence and quality, controlled via DIALOGSMITH_LLM_VALIDATE env var - Adds .env.example documenting all new configuration options Closes #1, #2, #3, #4, #5, #6 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request introduces multi-turn conversation support for Telegram data extraction and ShareGPT conversion. Key changes include a refactored extraction script that groups messages into distinct conversations based on time gaps, and a new LLM-based validator that uses Anthropic's API to score conversation coherence and quality. Feedback focuses on resolving a logic redundancy between message chaining and turn merging, improving the robustness of JSON parsing from LLM responses, and parallelizing API requests to enhance validation performance.
| if turns and turns[-1]["role"] == role: | ||
| turns[-1]["text"] += "\n" + turn_text |
There was a problem hiding this comment.
The MESSAGE_CHAIN_THRESHOLD (currently 30s) is effectively ignored because consecutive turns with the same role are unconditionally merged here. Since collect_turn already stops when the sender changes, any consecutive calls to it within the same conversation will necessarily have the same role and thus be merged. If the goal is to strictly group all consecutive messages from the same sender into a single turn for ShareGPT compatibility, the threshold check in collect_turn (lines 68-69) is redundant. If you intended to split turns based on time, this merge logic should be removed, though that might produce non-alternating roles which are incompatible with some training formats.
| ) | ||
|
|
||
| raw = response.content[0].text.strip() | ||
| result = json.loads(raw) |
There was a problem hiding this comment.
Directly calling json.loads() on the LLM response is fragile. LLMs frequently include markdown formatting (e.g., json ... ) or conversational preambles even when instructed to return only JSON. It is more robust to extract the JSON block using a regular expression.
| result = json.loads(raw) | |
| import re | |
| match = re.search(r'\{.*\}', raw, re.DOTALL) | |
| if not match: | |
| raise ValueError(f"No JSON object found in LLM response: {raw}") | |
| result = json.loads(match.group()) |
| passed = [] | ||
| filtered = [] | ||
|
|
||
| for i, turns in enumerate(samples): |
There was a problem hiding this comment.
Closes #1, #2, #3, #4, #5, #6
Summary
last_unixtimewithin the chain instead of using the raw previous list indexCONVERSATION_GAP_THRESHOLD(default 1 hour)convert_to_sharegpt.pyfor the new multi-turn format (human/gptrole mapping)scripts/validator.py: optional LLM pass that scores each sample for coherence and quality — controlled viaDIALOGSMITH_LLM_VALIDATEenv var, gracefully skipped when disabled or no API key is set.env.exampledocumentingDIALOGSMITH_LLM_VALIDATE,DIALOGSMITH_LLM_MODEL,ANTHROPIC_API_KEYTest plan
telegram_extract.pyagainst a realresult.jsonexport and verify multi-turn outputconvert_to_sharegpt.pyand verify ShareGPT output structureDIALOGSMITH_LLM_VALIDATE=false— confirm it skips cleanlyANTHROPIC_API_KEY— confirm filtering and audit log🤖 Generated with Claude Code