fix: overhaul conversation extraction and add optional LLM validation by NotYuSheng · Pull Request #7 · NotYuSheng/DialogSmith

NotYuSheng · 2026-05-20T00:09:22Z

Summary

Time gap check references wrong message index #2 Fixed time gap check — now tracks last_unixtime within the chain instead of using the raw previous list index
Prompt only captures first message when other person sends multiple in a row #1 / Skipped non-message events silently reset time gap reference during response cascade #4 Collects all consecutive messages from the same sender as a single turn; non-message events no longer corrupt the time reference
No conversation context — each prompt/response pair is extracted in isolation #3 Splits each chat into distinct conversations using a configurable CONVERSATION_GAP_THRESHOLD (default 1 hour)
No conversation context — each prompt/response pair is extracted in isolation #3 / Output is always single-turn; multi-turn conversations are never grouped #5 Groups full back-and-forth exchanges into multi-turn samples instead of isolated prompt/response pairs
Output is always single-turn; multi-turn conversations are never grouped #5 Updated convert_to_sharegpt.py for the new multi-turn format (human/gpt role mapping)
Add optional LLM validation pass for conversation grouping quality #6 New scripts/validator.py: optional LLM pass that scores each sample for coherence and quality — controlled via DIALOGSMITH_LLM_VALIDATE env var, gracefully skipped when disabled or no API key is set
Added .env.example documenting DIALOGSMITH_LLM_VALIDATE, DIALOGSMITH_LLM_MODEL, ANTHROPIC_API_KEY

Test plan

Run telegram_extract.py against a real result.json export and verify multi-turn output
Verify conversations split correctly across long silence gaps
Verify consecutive messages from same sender are merged into one turn
Run convert_to_sharegpt.py and verify ShareGPT output structure
Test validator with DIALOGSMITH_LLM_VALIDATE=false — confirm it skips cleanly
Test validator with a valid ANTHROPIC_API_KEY — confirm filtering and audit log

🤖 Generated with Claude Code

- Fixes incorrect time gap reference (was using wrong message index) - Collects all consecutive messages from same sender as one turn (no more dropped messages) - Splits chats into distinct conversations using a configurable silence gap threshold - Groups full back-and-forth into multi-turn samples instead of isolated pairs - Updates convert_to_sharegpt.py for the new multi-turn format - Adds validator.py: optional LLM pass that scores each sample for coherence and quality, controlled via DIALOGSMITH_LLM_VALIDATE env var - Adds .env.example documenting all new configuration options Closes #1, #2, #3, #4, #5, #6 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request introduces multi-turn conversation support for Telegram data extraction and ShareGPT conversion. Key changes include a refactored extraction script that groups messages into distinct conversations based on time gaps, and a new LLM-based validator that uses Anthropic's API to score conversation coherence and quality. Feedback focuses on resolving a logic redundancy between message chaining and turn merging, improving the robustness of JSON parsing from LLM responses, and parallelizing API requests to enhance validation performance.

gemini-code-assist · 2026-05-20T00:11:21Z

+                    if turns and turns[-1]["role"] == role:
+                        turns[-1]["text"] += "\n" + turn_text


The MESSAGE_CHAIN_THRESHOLD (currently 30s) is effectively ignored because consecutive turns with the same role are unconditionally merged here. Since collect_turn already stops when the sender changes, any consecutive calls to it within the same conversation will necessarily have the same role and thus be merged. If the goal is to strictly group all consecutive messages from the same sender into a single turn for ShareGPT compatibility, the threshold check in collect_turn (lines 68-69) is redundant. If you intended to split turns based on time, this merge logic should be removed, though that might produce non-alternating roles which are incompatible with some training formats.

gemini-code-assist · 2026-05-20T00:11:21Z

+    )
+
+    raw = response.content[0].text.strip()
+    result = json.loads(raw)


Directly calling json.loads() on the LLM response is fragile. LLMs frequently include markdown formatting (e.g., json ... ) or conversational preambles even when instructed to return only JSON. It is more robust to extract the JSON block using a regular expression.

Suggested change

result = json.loads(raw)

import re

match = re.search(r'\{.*\}', raw, re.DOTALL)

if not match:

raise ValueError(f"No JSON object found in LLM response: {raw}")

result = json.loads(match.group())

gemini-code-assist · 2026-05-20T00:11:21Z

+    passed = []
+    filtered = []
+
+    for i, turns in enumerate(samples):


This loop performs LLM validation sequentially, which will be slow for large datasets. Consider using concurrent.futures.ThreadPoolExecutor to parallelize the API requests and significantly improve performance, especially since the validator runs once per sample.

gemini-code-assist Bot reviewed May 20, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: overhaul conversation extraction and add optional LLM validation#7

fix: overhaul conversation extraction and add optional LLM validation#7
NotYuSheng wants to merge 1 commit into
mainfrom
fix/message-tagging-and-conversation-grouping

NotYuSheng commented May 20, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 20, 2026

Uh oh!

gemini-code-assist Bot May 20, 2026

Uh oh!

gemini-code-assist Bot May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		if turns and turns[-1]["role"] == role:
		turns[-1]["text"] += "\n" + turn_text

-    result = json.loads(raw)
+    import re
+    match = re.search(r'\{.*\}', raw, re.DOTALL)
+    if not match:
+        raise ValueError(f"No JSON object found in LLM response: {raw}")
+    result = json.loads(match.group())

Conversation

NotYuSheng commented May 20, 2026

Summary

Test plan

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant