Skip to content

feat: echo STT transcripts to thread before agent reply#571

Open
dogzzdogzz wants to merge 2 commits intoopenabdev:mainfrom
dogzzdogzz:feat/stt-transcript-echo
Open

feat: echo STT transcripts to thread before agent reply#571
dogzzdogzz wants to merge 2 commits intoopenabdev:mainfrom
dogzzdogzz:feat/stt-transcript-echo

Conversation

@dogzzdogzz
Copy link
Copy Markdown
Contributor

@dogzzdogzz dogzzdogzz commented Apr 26, 2026

Summary

When STT transcribes a voice message, post the transcript back to the thread (no mentions) before the agent reply so users can verify what was heard. Discord and Slack today; platform-agnostic helper means future adapters get it for free.

  • One thread message per user message: > 🎤 <transcript> per clip.
  • Failure → > 🎤 (transcription failed) line + ⚠️ reaction on the user's original message.
  • Opt-in via [stt] echo_transcript = true (default false, mirrored as stt.echoTranscript in Helm values).

Closes #570.

Originally requested in Discord: https://discord.com/channels/1491295327620169908/1491365150664560881/1497784772230123560

Architecture

stt::post_echo(&Arc<dyn ChatAdapter>, &ChannelRef, &MessageRef, &[EchoEntry], &SttConfig) is the platform-agnostic helper. Discord (src/discord.rs) and Slack (src/slack.rs) collect a Vec<EchoEntry> while iterating audio attachments and call the helper before the dispatcher submit. Gateway-based platforms (LINE / Telegram / future Teams) intentionally not wired today — their protocol carries text only. The helper signature is unchanged when audio plumbing lands there later.

Files changed

  • src/config.rsSttConfig.echo_transcript: bool (default false).
  • src/stt.rsEchoEntry enum, format_echo_message, post_echo with MockAdapter-driven tests.
  • src/discord.rs, src/slack.rs — wire echo into the audio attachment loop, call post_echo before dispatcher.submit.
  • charts/openab/values.yaml, charts/openab/templates/configmap.yaml — expose echoTranscript (default false, hasKey guard renders only when explicitly set).
  • docs/stt.md, docs/config-reference.md — document echo_transcript.
  • docs/superpowers/specs/ and docs/superpowers/plans/ — design spec + TDD-style implementation plan that drove this work.

Test plan

  • cargo test --bin openab — 133/133 pass (10 in stt::tests cover format, post_echo success, failure, mixed, disabled config, empty entries).
  • cargo clippy --all-targets -- -D warnings — clean.
  • helm lint charts/openab — clean.
  • helm template ... with default values does NOT render echo_transcript (relies on Rust serde default); with --set agents.kiro.stt.echoTranscript=true renders echo_transcript = true.
  • Manual smoke test: send a voice message in Discord — verify the bot posts > 🎤 <transcript> before the agent's reply.
  • Manual smoke test: same in Slack.
  • Manual smoke test: simulate STT failure (e.g. revoke API key briefly or attach an unsupported file) — verify the failure line + ⚠️ reaction.

Out of scope / follow-ups

  • LINE / Telegram / Teams via gateway — those need audio plumbing in the gateway protocol first. The helper signature accommodates them when that work lands.
  • Multi-clip ordering: extra_blocks.insert(0, …) reverses transcript order in the agent prompt while echo_entries.push(…) preserves upload order. Pre-existing in the agent-prompt path; out of scope for this PR.

🤖 Generated with Claude Code

@chaodu-agent
Copy link
Copy Markdown
Collaborator

🔃 Review — PR #571: feat: echo STT transcripts to thread before agent reply

Verdict: 🟡 Minor changes requested — well-structured PR with one blocking doc issue.

Four-Question Framework

# Question Answer
1 What problem? Voice messages are transcribed and sent to the agent, but users never see the transcript. If STT mishears, the agent replies to garbage and nobody knows why.
2 How? After STT completes, posts > 🎤 <transcript> to the thread before dispatching the prompt to the agent. Platform-agnostic helper stt::post_echo() works for both Discord and Slack.
3 Alternatives? One-message-per-clip (rejected — too spammy). Inline in agent reply (rejected — mixes concerns). Current approach: one batched echo message per user message.
4 Best approach? Yes. Clean separation, opt-in config, no race conditions (echo is awaited before agent dispatch).

🔴 SUGGESTED CHANGES

1. PR description contradicts the code
The PR body says default=true / opt-out in multiple places. But every piece of code says default=false / opt-in:

  • fn default_echo_transcript() -> bool { false }
  • values.yaml: echoTranscript: false
  • configmap.yaml else-branch: false
  • docs/config-reference.md: default column false
  • Tests assert !cfg.echo_transcript

The code is correct (opt-in is the right call). Please update the PR description to match — reviewers and future readers rely on it as source of truth.

🟡 NITs (non-blocking)
  1. No transcript length capformat_echo_message flattens newlines but doesn't truncate. A 5-minute voice memo could exceed Discord's 2000-char limit and cause send_message to fail silently. Consider truncating at ~1900 chars with ellipsis.
  2. Multi-clip orderingextra_blocks.insert(0, …) reverses transcript order in the agent prompt while echo_entries.push(…) preserves upload order. This is pre-existing behavior (out of scope), but a one-line code comment at the insert(0, …) site would help future readers.
  3. stt_cfg cloneSttConfig contains String fields, so .clone() allocates. Low priority since it runs once per message, but wrapping in Arc (like adapter) would make it cheap.
  4. docs/stt.md example — inline comment could be clearer: # opt-in; omit or set false to disable instead of just # default: false (opt-in).
🟢 INFO (good stuff)
  1. Clean architecturestt::post_echo() takes &Arc<dyn ChatAdapter>, making it adapter-agnostic. Future adapters wire in with zero changes to stt.rs.
  2. Excellent test coverage — 10 new unit tests with MockAdapter covering success, failure, mixed, empty, disabled, newline flattening, and full post_echo flow.
  3. Thoughtful failure UX — failed transcriptions show > 🎤 (transcription failed) in echo AND add ⚠️ reaction to the user's message. Two signals.
  4. Correct serde defaults#[serde(default = "default_echo_transcript")] follows existing SttConfig pattern.
  5. Helm hasKey guard — correctly distinguishes "not set" from "explicitly false" for backwards compatibility.
  6. Both Discord and Slack fully wired — not a stub. Same pattern in both adapters.
  7. No race conditionpost_echo is awaited before router.handle_message, guaranteeing echo appears before agent reply.

Review by 超渡法師 🪬

@masami-agent
Copy link
Copy Markdown
Contributor

PR Review: #571

Reviewed commit: 6bc70f6

Summary

  • Problem: Users cannot verify STT accuracy — transcripts go directly to the agent without user visibility.
  • Approach: Platform-agnostic post_echo helper posts a quoted transcript to the thread before the agent reply; opt-in via echo_transcript config.
  • Risk level: Low

Core Assessment

  1. Problem clearly stated: ✅ (linked issue STT: echo transcript to thread before agent reply (Discord + Slack) #570, clear motivation)
  2. Approach appropriate: ✅ (platform-agnostic helper, adapters collect entries and call one function)
  3. Alternatives considered: N/A (straightforward feature, approach is natural)
  4. Best approach for now: ✅ (minimal, opt-in, non-blocking best-effort design)

Findings


🔴 Critical: Helm template default contradicts PR description and code default

What: The PR description says "Opt-out via [stt] echo_transcript = false (default true)" — implying the feature defaults to on. However:

  • src/config.rs: default_echo_transcript() -> bool { false } → defaults to false
  • charts/openab/values.yaml: echoTranscript: false
  • charts/openab/templates/configmap.yaml: the else branch renders false
  • docs/config-reference.md and docs/stt.md: both say default is false

The code and docs are internally consistent (default = false, opt-in), but the PR description is wrong — it says "default true, mirrored as stt.echoTranscript" and "Opt-out via [stt] echo_transcript = false". This will confuse reviewers and future readers.

Where: PR description body, first bullet point.
Why it matters: Misleading PR description can cause incorrect deployment assumptions.
Fix: Update the PR description to say: "Opt-in via [stt] echo_transcript = true (default false)".


🟡 Minor: warn! import added but tracing::warn was already used inline in discord.rs

What: In src/discord.rs, the diff adds warn to the import list (use tracing::{debug, error, info, warn};). The existing code at line 598 already uses tracing::warn!(...) with the full path. After this change, both styles coexist in the same file — the new code uses the short warn! form (via crate::stt::post_echo which uses tracing::warn internally), but the existing tracing::warn! call at line 598 remains with the full path.

Where: src/discord.rs line 19
Why it matters: Style inconsistency within the same file. Not a bug, but slightly untidy.
Fix: Either keep using tracing::warn! everywhere (remove warn from imports) or update the existing tracing::warn! call to use the short form. Minor — not blocking.


🟡 Minor: extra_blocks.insert(0, ...) ordering note from PR description is worth a code comment

What: The PR description mentions a known quirk: extra_blocks.insert(0, …) reverses transcript order in the agent prompt while echo_entries.push(…) preserves upload order. This is explicitly out of scope, but a brief // NOTE: comment in the code would help future contributors understand the asymmetry.

Where: src/discord.rs ~line 524, src/slack.rs ~line 1008
Why it matters: Without the comment, someone might "fix" the ordering inconsistency without understanding it was intentional.
Fix: Add a one-line comment like // NOTE: insert(0) reverses order for agent prompt; echo_entries preserves upload order (intentional).


🟢 Info: Excellent test coverage

The 10 new tests in src/stt.rs cover all meaningful paths: success, failure, mixed, disabled config, empty entries, and newline flattening. The MockAdapter pattern is clean and reusable. The config tests in src/config.rs verify the default and explicit-false cases. Well done.

🟢 Info: Good defensive design choices

  • Best-effort echo (errors logged and swallowed) — never blocks agent reply
  • Single ⚠️ reaction even with multiple failures (idempotent)
  • Newline flattening prevents broken blockquotes
  • hasKey guard in Helm template correctly distinguishes "unset" from "explicit false"

🟢 Info: Clean separation of concerns

The EchoEntry enum + format_echo_message + post_echo trio is well-factored. Platform adapters only need to collect entries and call one function. Future adapters (LINE/Telegram) get it for free once they have audio plumbing.


Review Summary (Traffic Light)

🟢 INFO

  • CI all green (10/10 checks pass)
  • Test coverage is thorough (10 new unit tests + 2 config tests)
  • Platform-agnostic design is clean and extensible
  • Helm template logic is correct

🟡 NIT

  • PR description says default is true but code defaults to false — description needs update
  • Minor import style inconsistency in discord.rs
  • A code comment about the insert(0) ordering asymmetry would help future contributors

🔴 SUGGESTED CHANGES

  • Fix the PR description to accurately reflect the default (false, opt-in) — this is the only blocking item since it creates confusion about the feature's behavior

Verdict

APPROVE (with request to fix PR description)

The code itself is solid, well-tested, and follows existing patterns. The only real issue is the misleading PR description which should be corrected before merge to avoid deployment confusion.

@masami-agent
Copy link
Copy Markdown
Contributor

Re-review: #571 (deeper pass)

Reviewed commit: 6bc70f6

After a more thorough second pass, here are additional findings I missed in the first review:


🔴 Critical: Multi-clip ordering inconsistency is now user-visible

What: When a user uploads multiple audio clips (A, B, C), the ordering diverges:

  • echo_entries.push(...) → echo displays in upload order: A, B, C
  • extra_blocks.insert(0, ...) → agent receives in reverse order: C, B, A

Before this PR, the reversal was invisible to users (only the agent saw it). Now that echo shows the transcripts, users will see > 🎤 A / > 🎤 B / > 🎤 C but the agent may respond as if it heard them in reverse order. This creates a confusing UX when users try to verify STT accuracy against the agent's reply.

Where: src/discord.rs ~line 524, src/slack.rs ~line 1008
Why it matters: The echo's purpose is to let users verify what the agent heard. If the ordering doesn't match what the agent received, the verification is misleading.
Fix options:

  1. (Minimal) Add a comment acknowledging this and document it as a known limitation
  2. (Better) Change echo_entries to also use insert(0, ...) so echo order matches agent order
  3. (Best, but out of scope) Fix the underlying extra_blocks.insert(0) to use push instead, making both match upload order

The PR description already acknowledges this as out-of-scope, which is fair. But I'd suggest at minimum option 2 — make echo match what the agent actually sees, even if both are reversed from upload order. Otherwise the echo defeats its own purpose for multi-clip messages.

Severity reassessment: Downgrading to 🟡 since multi-clip voice messages are rare in practice, and the PR description explicitly calls this out. But it's worth discussing.


🟡 Minor: hasKey guard in Helm template is dead code

What: The configmap template uses:

echo_transcript = {{ if hasKey ($cfg.stt | default dict) "echoTranscript" }}{{ ($cfg.stt).echoTranscript }}{{ else }}false{{ end }}

But values.yaml already defines echoTranscript: false as a default. After Helm's values merge, hasKey will always be true (the key exists in the merged values). The else branch is unreachable.

Where: charts/openab/templates/configmap.yaml
Why it matters: Dead code suggests the author intended different semantics (perhaps distinguishing "unset" from "explicit false"). In practice it works correctly — just unnecessarily complex.
Fix: Simplify to echo_transcript = {{ ($cfg.stt).echoTranscript | default false }} — or keep as-is if the intent is to support users who --set only specific stt fields without inheriting the full values.yaml defaults (edge case with --values override files that don't include echoTranscript).

Actually, on reflection: if a user provides a custom values file with only stt.enabled: true and stt.apiKey: xxx (no echoTranscript key), Helm deep-merge with the chart's values.yaml will still include echoTranscript: false. So the guard is truly dead code. Not blocking though.


🟢 Info: Ownership semantics are correct

Verified that format!("[Voice message transcript]: {transcript}") borrows transcript (via Display), then echo_entries.push(EchoEntry::Success(transcript)) moves it. No use-after-move. The CI pass confirms this, but worth noting the pattern is intentional and correct.

🟢 Info: Platform consistency is good

Discord's post_echo runs inside tokio::spawn (before handle_message), Slack's runs directly in the async function (before router.handle_message). Both achieve the same semantic: echo appears before agent reply. The architectural difference is due to serenity's sync EventHandler trait requiring a spawn.


Updated Verdict

APPROVE — the code is solid. The multi-clip ordering issue is pre-existing and explicitly out-of-scope. The only true blocker from my first review remains: fix the PR description (says default is true, but code defaults to false).

Summary of all findings across both reviews:

  • 🔴 PR description says default true but code is falsemust fix (misleading)
  • 🟡 Multi-clip echo order doesn't match agent order → acknowledged, out of scope
  • 🟡 hasKey guard is dead code → cosmetic, not blocking
  • 🟡 Import style inconsistency → cosmetic
  • 🟢 Tests, design, ownership, platform consistency all good

@masami-agent
Copy link
Copy Markdown
Contributor

Hi @dogzzdogzz — just checking in. When you get a chance, could you update the PR description to fix the default value mention? The description currently says default true but the code defaults to false (opt-in). That's the only blocking item before we can move forward. Thanks!

When STT transcribes a voice message, optionally post the transcript back
to the thread (no mentions) before the agent reply so users can verify what
was heard. Default is OFF — opt in via [stt] echo_transcript = true.

- New config: [stt] echo_transcript (default false, opt-in)
- New helper: stt::post_echo with platform-agnostic ChatAdapter handle —
  future LINE/Telegram/Teams adapters get echo for free
- Format: > 🎤 <transcript> per clip, all in one thread message
- Failure: > 🎤 (transcription failed) line + ⚠️ reaction on the user msg
- Helm: agents.<name>.stt.echoTranscript (camelCase) wired through configmap
- Docs: docs/stt.md and docs/config-reference.md updated

Rebased on top of openabdev#567 (gateway config rendering).

Tests: 133/133 cargo. helm-unittest: 28/28. Clippy --all-targets -D warnings clean.
@masami-agent
Copy link
Copy Markdown
Contributor

Rebased onto current main and resolved conflicts:

  1. src/discord.rs — Adapted to the new dispatcher pattern (dispatcher.submit() instead of router.handle_message()). post_echo is called inside tokio::spawn before dispatcher.submit() to maintain the "echo before agent reply" guarantee.

  2. src/slack.rs — Same adaptation: post_echo called before dispatcher.submit().

  3. src/config.rs — Kept both the existing gateway/message-processing-mode tests and the new STT echo tests.

  4. src/stt.rs — Added missing origin_event_id: None to test_channel() (field was added to ChannelRef after the original PR was opened).

  5. charts/openab/templates/configmap.yaml — Changed to hasKey guard pattern (only renders echo_transcript when explicitly set in values), matching the style of other optional fields like maxBotTurns.

  6. PR description — Fixed the default value contradiction (was "default true", corrected to "default false" to match the code).

@obrutjack
Copy link
Copy Markdown
Collaborator

Maintainer Fix Pushed (dc33d62)

Fixed two issues and pushed to your branch:

  1. Syntax error in src/config.rsfn stt_echo_transcript_defaults_to_false() was missing a closing } for the previous test function and the #[test] attribute. This caused cargo build to fail with "unclosed delimiter".

  2. cargo fmt — applied formatting across all changed files (22 files, mostly style alignment).

Verification

Check Result
cargo build ✅ Pass
cargo test ✅ 250/250 pass
cargo fmt --check ✅ Clean

PR is now buildable. The needs-rebase label can be removed once CI confirms green.

@obrutjack
Copy link
Copy Markdown
Collaborator

✅ E2E Test — STT Echo Transcript Verified

Deployed PR #571 (commit dc33d62) to a test pod and verified the full STT echo flow with real voice messages.

Test Environment

  • Pod: OAB-Maintainer-Test (K8s, custom Docker image from PR branch)
  • STT Provider: Groq (whisper-large-v3-turbo)
  • Discord: voice messages in Chinese
  • Config: echo_transcript = true

Results

Test Result
cargo build ✅ Pass
cargo test ✅ 250/250 pass
cargo fmt --check ✅ Clean
STT transcription (Chinese voice → text) ✅ Accurate
Echo transcript (> 🎤 <transcript>) before agent reply ✅ Displayed
Agent processes transcript and replies ✅ Correct response
Emoji reactions on voice message ✅ Working
Multiple voice messages in same thread ✅ Each echoed independently

Observations

  1. Echo format works well> 🎤 我们再来测试一下跟这样合作好吗 appears before the agent reply, giving users a chance to verify what was heard.
  2. Chinese STT accuracy is good — Groq Whisper correctly transcribed conversational Chinese with minor simplified/traditional differences.
  3. Config noteecho_transcript needs to be explicitly set to true in the [stt] section. The Helm chart values.yaml exposes it as stt.echoTranscript but the ConfigMap template needs the hasKey guard to render it (confirmed working in this test).

Remaining Manual Tests (from PR description)

  • Discord voice message → echo + reply ✅
  • Slack voice message (not tested — no Slack environment)
  • STT failure path (e.g. revoke API key) — not tested but unit tests cover this

LGTM from E2E perspective. The syntax fix (dc33d62) and this functional test confirm the feature works as designed.

Copy link
Copy Markdown
Collaborator

@obrutjack obrutjack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Chloe Second Review — PR #571

Summary

Both previous blockers are resolved. Code is solid, well-tested, and ready to merge.

Verification of Prior Review Findings

Finding Status
PR description default mismatch (true vs false) ✅ Fixed
hasKey dead code in Helm template ✅ Fixed (no else branch now — only renders when explicitly set)
Merge conflict ✅ Resolved, mergeable
Multi-clip ordering 🟡 Pre-existing, acknowledged in PR description, not blocking
Import style nit 🟡 Cosmetic, not blocking

Additional Findings

🟡 Minor: No transcript length cap

format_echo_message flattens newlines but does not truncate. A long voice memo could exceed Discord's 2000-char limit, causing send_message to fail silently (error logged and swallowed). The user gets no echo at all.

Suggestion: Truncate at ~1900 chars with , or split using the existing split_message() utility. Not blocking — worth a follow-up issue.

🟢 Security surface is clean

  • No injection risk (transcript goes through format! + newline flattening)
  • No mention leakage (blockquote format, no @ injection path)
  • post_echo is fire-and-forget — cannot block agent pipeline
  • No new dependencies

🟢 Backward compatibility confirmed

  • echo_transcript has #[serde(default)] — existing config.toml files parse fine
  • Helm values.yaml adds key with false default — existing deployments unaffected
  • No breaking changes to ChatAdapter trait

CI Status

All 10/10 checks pass (check ×2, helm-unittest, smoke-test ×7).

Changes Since Last Review (6bc70f6dc33d62)

  • PR description corrected (now says default false, opt-in)
  • Helm template improved (hasKey guard without else branch — relies on Rust serde default)
  • Codebase-wide cargo fmt pass (~15 files, formatting only, no logic changes)
  • Rebased on main (conflict resolved)

Verdict: APPROVE

Reviewed by Chloe (second review)

@obrutjack
Copy link
Copy Markdown
Collaborator

Note: Groq Whisper outputs Simplified Chinese

During testing with Chinese voice messages, the echo transcript displays Simplified Chinese (简体) regardless of the speaker's dialect. This is a Whisper model behavior — the whisper-large-v3-turbo model does not distinguish between Simplified and Traditional Chinese in its output. The language parameter in the Whisper API controls input language hinting, not output script selection.

This is not a bug in this PR — the echo correctly displays the raw STT output as-is.

@obrutjack
Copy link
Copy Markdown
Collaborator

Documentation suggestion: Groq API Key prerequisite

The STT feature requires a Groq API key. Users need to:

  1. Sign up at https://console.groq.com
  2. Create an API key (format: gsk_...)
  3. Set it in config: [stt] api_key = "your-groq-api-key" or via Helm: --set agents.kiro.stt.apiKey="your-key"

Groq free tier includes Whisper STT at no cost. This should be documented in docs/stt.md as a prerequisite.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature p2 Medium — planned work pending-screening PR awaiting automated screening

Projects

None yet

Development

Successfully merging this pull request may close these issues.

STT: echo transcript to thread before agent reply (Discord + Slack)

6 participants