feat(gateway): feishu voice message STT via gateway audio attachment#761

Open
wangyuyan-agent wants to merge 5 commits into openabdev:main from wangyuyan-agent:feat/gateway-feishu-voice-stt

Conversation


@wangyuyan-agent commented May 6, 2026

Summary

Adds voice message (speech-to-text) support for the Feishu gateway adapter. When a user sends a voice message, the gateway downloads the opus/ogg audio, passes it to core as a base64-encoded "audio" attachment, and core transcribes it via the existing [stt] infrastructure before injecting the transcript into the LLM prompt.

This also introduces the "audio" attachment type to the gateway protocol — making it trivial for LINE/Telegram adapters to add voice support in the future (only the download logic differs per platform).

Feishu user sends voice message
    │
    ▼
Gateway: im.message.receive_v1 (msg_type=audio)
    │  parse content → extract file_key
    │  GET /im/v1/messages/{id}/resources/{key}?type=file
    │  base64 encode → Attachment{type:"audio", mime:"audio/ogg"}
    ▼
WebSocket → Core: GatewayEvent with audio attachment
    │
    ▼
Core: [stt] enabled?
    ├── Yes → decode base64 → stt::transcribe() → inject "[Voice message transcript]: ..."
    │         → LLM processes transcript as text
    └── No  → silently skip (graceful degradation)
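The core-side branch of the diagram can be sketched as follows. The names here (SttConfig, handle_audio_attachment, the transcribe stub) are illustrative stand-ins, not the PR's exact API; the real path base64-decodes the attachment and calls the [stt] provider.

```rust
// Illustrative sketch of core's audio-attachment branch (hypothetical names;
// the real code lives in src/gateway.rs and calls stt::transcribe()).

struct SttConfig {
    enabled: bool,
}

// Stand-in for the real Whisper-backed transcription call.
fn transcribe(_audio: &[u8]) -> Result<String, String> {
    Ok("hello from voice".to_string())
}

/// Returns the text to inject into the prompt, or None on graceful skip.
fn handle_audio_attachment(cfg: &SttConfig, audio: &[u8]) -> Option<String> {
    if !cfg.enabled {
        return None; // [stt] disabled: silently skip, matching Discord behavior
    }
    match transcribe(audio) {
        Ok(text) => Some(format!("[Voice message transcript]: {}", text)),
        Err(err) => {
            eprintln!("stt failed, skipping audio attachment: {}", err);
            None
        }
    }
}
```

The key property is that both the disabled and the failure path return None, so a voice message never crashes the pipeline; it simply contributes nothing to the prompt.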

⚠️ Dependency

Stacks on #746 and #744. Please merge in order: #744 → #746 → this PR. Will rebase onto main once dependencies land.

Prior Art

| Feature | OpenAB (Discord) | OpenClaw | Hermes Agent | This PR (Feishu) |
| --- | --- | --- | --- | --- |
| Voice detection | audio/* MIME on attachment | Skill-based (plugin) | Built-in voice mode | msg_type == "audio" |
| STT engine | OpenAI-compatible (Groq default) | Extensible providers (Yandex, Whisper, Gemini) | Built-in (Whisper, configurable) | Same as Discord — reuses core [stt] |
| Audio format | ogg/opus | ogg, wav, mp3 | ogg/opus | opus/ogg (Feishu native) |
| Where STT runs | In adapter (direct core access) | In skill process | In gateway | In core (gateway passes raw audio) |
| Fallback on failure | Silent skip + 🎤 reaction | Error message to user | Configurable | Silent skip (matches Discord) |
| Config | [stt] section | Per-skill config | voice: YAML | Same [stt] — zero new config |

Design Trade-offs

Why STT in Core (not Gateway)?

  • Gateway's reqwest doesn't have multipart feature (needed for Whisper API)
  • Gateway would need to hold API keys, manage config, handle retries
  • Core already has stt.rs + media.rs — reuse > rewrite
  • Verdict: Keep gateway lightweight. STT is "understanding", belongs in core.

Why base64 over WebSocket (not streaming/binary)?

  • Feishu voice messages capped at 60s → ~1-2MB opus → ~2.7MB base64
  • Whisper API requires complete file (no streaming input)
  • Binary WS frames would save 33% bandwidth but require protocol changes
  • Verdict: Not worth the complexity for <3MB payloads.
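The size math behind this verdict checks out: base64 emits 4 output bytes for every 3 input bytes, rounded up for padding. A quick sketch (helper names are mine, not the PR's):

```rust
// Base64 output length with padding: 4 bytes out per 3 bytes in, rounded up.
fn base64_len(input_bytes: usize) -> usize {
    (input_bytes + 2) / 3 * 4
}

// Percentage growth over the raw payload.
fn overhead_pct(input_bytes: usize) -> f64 {
    100.0 * (base64_len(input_bytes) as f64 / input_bytes as f64 - 1.0)
}
```

For a 2 MB opus payload this gives 2,796,204 bytes, about 2.67 MB, i.e. the ~33% overhead and the "~2.7MB base64" figure cited above.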

Why no user feedback on STT failure?

  • Discord adapter also silently skips (established pattern)
  • Adding error feedback requires knowing user's language, platform-specific reply formatting
  • Verdict: Match Discord behavior for v1. Can add feedback in follow-up.

Changes

  • gateway/src/adapters/feishu.rs: Allow msg_type=audio, add MediaRef::Audio, add download_feishu_audio(), handle in both WS and webhook paths
  • src/gateway.rs: Add stt: SttConfig to GatewayParams, add "audio" attachment handler (decode → transcribe → inject), warn on decode failure
  • src/main.rs: Pass cfg.stt.clone() to GatewayParams
  • docs/feishu.md: Add audio row to message type table
  • docs/stt.md: Update from Discord-only to multi-platform wording
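As a rough sketch of the adapter-side change in feishu.rs (the variant exists in the PR, but the dispatch function and its shape here are hypothetical), the new msg_type simply joins the existing media dispatch:

```rust
// Hypothetical shape of the adapter-side media dispatch; only the "audio"
// arm is new in this PR. The actual download then hits
// GET /im/v1/messages/{id}/resources/{key}?type=file.
#[derive(Debug, PartialEq)]
enum MediaRef {
    Image { file_key: String },
    File { file_key: String },
    Audio { file_key: String }, // new: Feishu voice messages
}

fn media_ref_for(msg_type: &str, file_key: &str) -> Option<MediaRef> {
    let key = file_key.to_string();
    match msg_type {
        "image" => Some(MediaRef::Image { file_key: key }),
        "file" => Some(MediaRef::File { file_key: key }),
        "audio" => Some(MediaRef::Audio { file_key: key }),
        _ => None, // unsupported msg_type: ignore
    }
}
```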

Configuration

Uses the existing [stt] section — no new configuration:

[stt]
enabled = true
# Default: Groq free tier (auto-detects GROQ_API_KEY env var)
# model = "whisper-large-v3-turbo"
# base_url = "https://api.groq.com/openai/v1"

See docs/stt.md for full setup guide.

Testing

  • Gateway: 102 tests pass
  • Core: 197 tests pass
  • E2E: Feishu private chat → voice message → download → STT → LLM responds ✅

Feishu API Facts

  • Event: msg_type=audio, content: {"file_key":"...", "duration":N}
  • Download: same API as file (/im/v1/messages/{id}/resources/{key}?type=file)
  • Format: opus/ogg — Whisper supports it natively; no transcoding needed
  • Permission: im:message (already required)
  • Size: typical 0.5-2MB for 60s voice (test messages: 5-16KB for 2-7s)
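To illustrate the event shape above, extracting file_key from the content JSON looks roughly like this. The real adapter would use a proper JSON parser such as serde_json; this stdlib-only sketch exists only to show the payload shape:

```rust
// Feishu audio message content: {"file_key":"...","duration":N}
// Naive stdlib-only extraction for illustration; real code should parse JSON.
fn extract_file_key(content: &str) -> Option<String> {
    let after = content.split("\"file_key\":\"").nth(1)?;
    let key = after.split('"').next()?;
    if key.is_empty() { None } else { Some(key.to_string()) }
}
```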

Known Limitations (v1)

  1. STT failure → silent skip (no user feedback). Matches Discord behavior.
  2. Base64 overhead (~33%) — negligible for actual voice messages (<2MB).
  3. No duration-based filtering — very short voice messages (accidental taps) still get transcribed.

Discussion

https://discord.com/channels/1491295327620169908/1500160821567684660

Once the bot replies in a thread, subsequent messages in that thread
bypass @mention gating — matching Discord's default 'involved' mode.

- Add participated_threads cache (HashMap<thread_id, Instant>)
- Bypass mention gating when message is in a participated thread
- Record participation on successful reply to a thread
- TTL controlled by FEISHU_SESSION_TTL_HOURS (default 24h)
- Cache eviction at 1000 entries (oldest-half strategy)
- 3 new tests for participation logic
- Extract check_thread_participated() helper to reduce duplication
- Add comments explaining intentional poisoned-mutex recovery
- Improve eviction: drop TTL-expired entries first, then oldest half
- Add comment clarifying session_ttl_secs=0 disables participation tracking
- Update bot_turns comment: remove TODO, note existing eviction pattern

Add AllowUsers enum (Involved/Mentions/MultibotMentions) controlled by
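The participated_threads cache described above might be sketched like this (struct and method names are illustrative, not the commit's exact code):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Sketch of the participated_threads cache: thread_id -> last participation.
// TTL comes from FEISHU_SESSION_TTL_HOURS; 0 disables tracking entirely.
struct ThreadParticipation {
    ttl: Duration,
    threads: HashMap<String, Instant>,
}

impl ThreadParticipation {
    fn new(ttl_hours: u64) -> Self {
        Self {
            ttl: Duration::from_secs(ttl_hours * 3600),
            threads: HashMap::new(),
        }
    }

    /// Record participation after a successful reply in a thread.
    fn record(&mut self, thread_id: &str) {
        if self.ttl.is_zero() {
            return; // TTL of 0 disables participation tracking
        }
        self.threads.insert(thread_id.to_string(), Instant::now());
    }

    /// True if the bot replied in this thread within the TTL window,
    /// i.e. @mention gating should be bypassed.
    fn check(&self, thread_id: &str) -> bool {
        self.threads
            .get(thread_id)
            .map_or(false, |t| t.elapsed() < self.ttl)
    }
}
```

The eviction policy (drop TTL-expired entries first, then the oldest half once the map hits 1000 entries) is omitted here for brevity.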
FEISHU_ALLOW_USER_MESSAGES env var. In multibot-mentions mode, once
another bot is @mentioned in a participated thread, require @mention
for all bots — prevents multiple bots from responding simultaneously.

Multibot detection strategy:
- If FEISHU_TRUSTED_BOT_IDS configured: exact match
- Otherwise: infer from allowed_users (mention not self and not in
  allowed_users → assumed to be another bot)
- Only triggers in threads where bot has already participated

This avoids requiring users to discover per-app open_ids for other bots.

- Add msg_type=audio support to feishu adapter (parse, download, base64 encode)
- Add MediaRef::Audio variant and download_feishu_audio() function
- Add "audio" attachment type to core gateway handler (decode → stt::transcribe)
- Pass SttConfig to gateway handler via GatewayParams
- Update docs/feishu.md and docs/stt.md for multi-platform voice support

Feishu voice messages (opus/ogg) are downloaded by the gateway, passed as
base64-encoded audio attachments to core, and transcribed via the existing
[stt] infrastructure (Groq Whisper by default). This is the first gateway
platform to support audio — LINE/Telegram can reuse the core-side handler.

Tested: 102 gateway tests + 197 core tests pass. E2E verified.
@wangyuyan-agent wangyuyan-agent requested a review from thepagent as a code owner May 6, 2026 15:58
@shaun-agent

OpenAB PR Screening

This is auto-generated by the OpenAB project-screening flow for context collection and reviewer handoff.
Click 👍 if you find this useful. Human review will be done within 24 hours. We appreciate your support and contribution 🙏

Screening report

Intent

PR #761 adds Feishu voice message support to the OpenAB gateway. The concrete problem is that Feishu users can currently send text-like messages through the gateway, but voice messages are not converted into prompt input, so the agent cannot respond meaningfully to spoken input.

The PR proposes downloading Feishu audio message resources, forwarding them to core as base64 audio attachments, and letting the existing OpenAB STT pipeline transcribe them before prompt injection.

Classification

Feature.

Behavioral change: Feishu gateway messages with msg_type=audio become usable agent input. The gateway extracts the Feishu file_key, downloads the opus/ogg payload, wraps it as an Attachment { type: "audio", mime: "audio/ogg" }, and sends it to core. Core decodes the attachment and, when [stt] is enabled, transcribes it and injects the transcript into the LLM prompt.

The PR also generalizes the gateway protocol by introducing an audio attachment type that future adapters such as LINE or Telegram could reuse.

Who It Serves

Primary beneficiaries: Feishu end users who want to interact with OpenAB agents using voice messages.

Secondary beneficiaries: gateway adapter maintainers, because the PR creates a reusable protocol path for audio attachments instead of making Feishu-specific STT logic a one-off.

Operational beneficiaries: deployers who already use [stt], because the feature claims to require no new configuration.

Rewritten Prompt

Implement Feishu voice-message support through the gateway attachment protocol.

When the Feishu adapter receives im.message.receive_v1 with msg_type == "audio", parse the message content, extract file_key, download the resource from Feishu using the existing message-resource API, and attach the downloaded opus/ogg bytes to the outgoing gateway event as base64 with type: "audio" and an accurate MIME type.

In core gateway handling, recognize audio attachments. If STT is enabled, decode the base64 payload, transcribe it through the existing stt::transcribe path, and inject a clear voice transcript marker into the prompt. If STT is disabled or decoding/transcription fails, degrade without crashing and log enough context for operators to diagnose the failure.

Cover both Feishu websocket and webhook paths. Add focused tests for audio message parsing, resource download behavior, gateway event attachment shape, STT-enabled transcript injection, and graceful behavior when STT is disabled or payload decoding fails. Update Feishu and STT docs to describe multi-platform voice attachment support.

Merge Pitch

This is worth advancing because voice input is a real user-facing capability gap for Feishu deployments, and the proposed architecture mostly reuses OpenAB’s existing STT configuration and transcription path.

Risk profile is moderate. The user-facing feature is straightforward, but the PR touches gateway protocol semantics, Feishu adapter behavior, core prompt construction, and STT error handling. The likely reviewer concern is whether the new generic audio attachment type is well-defined enough for future adapters, and whether core should silently skip failed audio transcription versus surfacing a clearer operator-visible warning.

Best-Practice Comparison

Relevant OpenClaw principles:

  • Explicit delivery routing is relevant. The gateway should pass audio as a typed attachment with enough metadata for core to handle it predictably.
  • Isolated executions are partially relevant. STT should remain inside the existing core transcription boundary rather than embedding provider-specific transcription inside the Feishu adapter.
  • Retry/backoff and run logs are relevant for the Feishu media download path. A failed download should be visible in logs and should not break the whole message pipeline.
  • Durable job persistence and gateway-owned scheduling are not directly relevant. This is event-driven message handling, not scheduled execution.

Relevant Hermes Agent principles:

  • Fresh session per scheduled run is not relevant because this PR handles live inbound messages, not scheduled jobs.
  • Self-contained prompts are relevant in a narrower sense: the injected transcript should be explicit and attributable, such as [Voice message transcript]: ..., so the model understands the source of the text.
  • Atomic writes and file locking are not relevant unless the implementation persists downloaded audio or intermediate state, which it should avoid if possible.
  • Gateway daemon tick model is not relevant to this direct event path.

Overall, the proposed direction fits the reference systems where they emphasize typed handoff, clear execution boundaries, and operator-observable failures. Scheduling and durable job-state principles do not apply.

Implementation Options

Conservative option: Feishu-only audio support using existing STT in core.

Keep the current PR narrow. Add msg_type=audio handling only to Feishu, forward an audio attachment, and let core transcribe it through existing [stt]. Avoid broader protocol redesign beyond documenting the new attachment type.

Balanced option: Formalize gateway audio attachments as a small cross-adapter contract.

Accept Feishu support, but also define the gateway protocol expectations for audio attachments: required fields, MIME handling, max size behavior, error logging, and what core does when STT is disabled. Add reusable helper functions so LINE and Telegram adapters can later plug in only their platform-specific download logic.
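Under the balanced option, the cross-adapter contract might amount to little more than a small shared validator that every adapter runs before forwarding audio to core. All names here are hypothetical, and the 3 MB cap is an assumption sized to Feishu's 60-second limit:

```rust
// Hypothetical shared audio-attachment contract for gateway adapters.
// The 3 MB cap is an assumed limit, not a value from the PR.
const MAX_AUDIO_BYTES: usize = 3 * 1024 * 1024;

struct AudioAttachment {
    mime: String,   // e.g. "audio/ogg"
    bytes: Vec<u8>, // raw audio; adapters base64-encode at the wire boundary
}

fn validate_audio(att: &AudioAttachment) -> Result<(), String> {
    if !att.mime.starts_with("audio/") {
        return Err(format!("unexpected MIME type: {}", att.mime));
    }
    if att.bytes.is_empty() {
        return Err("empty audio payload".to_string());
    }
    if att.bytes.len() > MAX_AUDIO_BYTES {
        return Err(format!("audio too large: {} bytes", att.bytes.len()));
    }
    Ok(())
}
```

A LINE or Telegram adapter would then only supply its platform-specific download, reusing the same validation and attachment shape.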

Ambitious option: Introduce a media ingestion layer for gateway adapters.

Create a gateway-level media abstraction for files, images, audio, and future media types, with shared download limits, content-type detection, logging, retry policy, and typed conversion into core attachments. Feishu audio becomes the first consumer, but the system is designed for all rich-message platforms.

Comparison Table

| Option | Speed to ship | Complexity | Reliability | Maintainability | User impact | Fit for OpenAB right now |
| --- | --- | --- | --- | --- | --- | --- |
| Conservative: Feishu-only support | High | Low-Medium | Medium | Medium | High for Feishu users | Good |
| Balanced: audio attachment contract | Medium | Medium | High | High | High for Feishu, enables future adapters | Best |
| Ambitious: full media ingestion layer | Low | High | Potentially high | High if completed well | Broader long-term impact | Premature unless more media work is queued |

Recommendation

Advance the PR using the balanced option.

The feature is valuable enough to move forward, but the merge discussion should focus on making audio attachments a clear gateway contract rather than only a Feishu implementation detail. That gives reviewers a concrete standard to validate: attachment shape, MIME expectations, STT-disabled behavior, failure logging, and test coverage across websocket and webhook paths.

Sequence it as one mergeable step: land Feishu voice support plus the minimal reusable audio attachment contract. Defer a broader media ingestion layer until at least one more adapter needs similar download-and-forward behavior.
