feat(gateway): implement multimodal (image/document/audio) inbound support for LINE and Telegram #757
Pull request overview
This PR adds inbound multimodal support (images, text documents, and audio) for the Custom Gateway’s LINE and Telegram adapters, including shared gateway-side image resizing/compression and Core-side audio transcription via the configured STT.
Changes:
- Add gateway-side media utilities (`resize_and_compress`, size limits) and wire them into adapters.
- Implement Telegram + LINE inbound attachment downloading/encoding and inclusion in `GatewayEvent` attachments.
- Extend Core gateway adapter to decode attachments and (optionally) transcribe inbound audio when STT is enabled.
Reviewed changes
Copilot reviewed 11 out of 12 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| src/main.rs | Pass STT config into the Core gateway adapter params. |
| src/gateway.rs | Convert gateway attachments into Core ContentBlocks, including audio transcription support. |
| gateway/src/media.rs | New shared media module for image resize/compress + download size limits. |
| gateway/src/main.rs | Register the new media module; minor formatting changes. |
| gateway/src/adapters/telegram.rs | Add inbound photo/document/audio handling and media download helpers (currently has compile/logic issues). |
| gateway/src/adapters/line.rs | Add inbound image/audio handling and LINE media download helper (currently has compile issues). |
| gateway/src/adapters/feishu.rs | Refactor to reuse shared media module; mostly formatting. |
| gateway/src/adapters/googlechat.rs | Formatting and test fixture updates to include empty attachments. |
| gateway/src/adapters/teams.rs | Formatting only. |
| gateway/Cargo.lock | Bump openab-gateway lockfile version entry. |
| docs/telegram.md | Document Telegram inbound file/image/audio support. |
| docs/line.md | Document LINE inbound image/audio support. |
    let Some(msg) = update.message else {
        return axum::http::StatusCode::OK;
    };
    let Some(text) = msg.text.as_deref() else {
        return axum::http::StatusCode::OK;
    };
    if text.trim().is_empty() {
    let is_voice = msg.voice.is_some();
    let is_audio = msg.audio.is_some();
    let text = msg.text.as_deref().or(msg.caption.as_deref()).unwrap_or("");

    if text.trim().is_empty() && !is_photo && !is_document && !is_voice && !is_audio {
        return axum::http::StatusCode::OK;
    }

    let mut attachments = Vec::new();
    if is_photo || is_document || is_voice || is_audio {
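The forwarding decision in the Telegram excerpt above (use the text, fall back to the caption, and drop the update only when there is neither text nor media) can be isolated into a small testable helper. This is a minimal sketch, not the PR's code: `InboundMessage` and its boolean fields are hypothetical stand-ins for the adapter's real message struct.

```rust
/// Hypothetical, trimmed-down view of a Telegram message.
/// The adapter's real struct carries full photo/document/voice/audio payloads.
#[derive(Default)]
struct InboundMessage {
    text: Option<String>,
    caption: Option<String>,
    has_photo: bool,
    has_document: bool,
    has_voice: bool,
    has_audio: bool,
}

impl InboundMessage {
    /// Message text, falling back to the media caption, then to "".
    fn body(&self) -> &str {
        self.text
            .as_deref()
            .or(self.caption.as_deref())
            .unwrap_or("")
    }

    /// True when any supported attachment kind is present.
    fn has_media(&self) -> bool {
        self.has_photo || self.has_document || self.has_voice || self.has_audio
    }

    /// Forward the update when it has non-blank text or any media.
    fn should_forward(&self) -> bool {
        !self.body().trim().is_empty() || self.has_media()
    }
}
```

Keeping the predicate separate from the webhook handler makes the "caption-only photo" and "voice note with no text" cases easy to cover in unit tests.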
    let max_size = if attachment_type == "image" {
        IMAGE_MAX_DOWNLOAD
    } else {
        AUDIO_MAX_DOWNLOAD
    };
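With the `if`/`else` above, every non-image attachment type (including documents, if they reach this path) inherits `AUDIO_MAX_DOWNLOAD`. If a per-type cap is intended, a `match` makes the mapping explicit. The sketch below is an assumption-laden illustration: the constant values are made up, and `DOCUMENT_MAX_DOWNLOAD` and the `"document"` type string are hypothetical names not taken from the PR.

```rust
// Assumed values for illustration only; the real limits live in gateway/src/media.rs.
const IMAGE_MAX_DOWNLOAD: u64 = 10 * 1024 * 1024;
const AUDIO_MAX_DOWNLOAD: u64 = 20 * 1024 * 1024;
// Hypothetical separate cap for text documents (not present in the PR excerpt).
const DOCUMENT_MAX_DOWNLOAD: u64 = 512 * 1024;

/// Map an attachment type to its download cap; unknown types get no cap and are rejected.
fn max_download_for(attachment_type: &str) -> Option<u64> {
    match attachment_type {
        "image" => Some(IMAGE_MAX_DOWNLOAD),
        "audio" => Some(AUDIO_MAX_DOWNLOAD),
        "document" => Some(DOCUMENT_MAX_DOWNLOAD),
        _ => None,
    }
}

/// True when the type is known and `len` does not exceed its cap.
fn within_limit(attachment_type: &str, len: u64) -> bool {
    max_download_for(attachment_type).map_or(false, |max| len <= max)
}
```

Returning `None` for unknown types forces the caller to decide what to do with them instead of silently applying the audio limit.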
    caption: Option<String>,
    #[serde(default)]
    entities: Vec<TelegramEntity>,
    #[serde(default)]
    let is_text = msg.message_type == "text";
    let is_audio = msg.message_type == "audio";

    if !is_text && !is_image && !is_audio {
        continue;
    }
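The LINE filter above compares `message_type` strings in several places. One alternative, sketched here under the assumption that only text, image, and audio are handled, is to parse the string once into an enum so unsupported types (video, sticker, location, and so on) fall out as `None`. `LineMessageKind` is a hypothetical name, not part of the PR.

```rust
/// Hypothetical enum for the LINE message types this adapter handles.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LineMessageKind {
    Text,
    Image,
    Audio,
}

impl LineMessageKind {
    /// Parse LINE's `message.type` string; unsupported types return None.
    fn parse(message_type: &str) -> Option<Self> {
        match message_type {
            "text" => Some(Self::Text),
            "image" => Some(Self::Image),
            "audio" => Some(Self::Audio),
            _ => None,
        }
    }

    /// Image and audio messages require a content download; text does not.
    fn needs_download(self) -> bool {
        !matches!(self, Self::Text)
    }
}
```

In the event loop this would read as `let Some(kind) = LineMessageKind::parse(&msg.message_type) else { continue; };`, which replaces the three boolean flags with a single value.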
    let max_size = if attachment_type == "image" {
        IMAGE_MAX_DOWNLOAD
    } else {
        AUDIO_MAX_DOWNLOAD
    };
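A size cap only helps if it is enforced while reading, not after the whole body is already in memory. The following is a synchronous, standard-library sketch of that idea; the actual adapters are async and presumably stream an HTTP response, so `read_bounded` is a hypothetical helper that only illustrates the pattern.

```rust
use std::io::{self, Read};

/// Read at most `max_bytes` from `reader`, failing if the source holds more.
/// Hypothetical helper; the adapters would apply the same idea to a streamed HTTP body.
fn read_bounded<R: Read>(reader: R, max_bytes: u64) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    // Allow one byte past the limit so an oversized source can be detected
    // without ever buffering more than max_bytes + 1 bytes.
    reader
        .take(max_bytes.saturating_add(1))
        .read_to_end(&mut buf)?;
    if buf.len() as u64 > max_bytes {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "media exceeds download limit",
        ));
    }
    Ok(buf)
}
```

Checking a declared `Content-Length` first is a cheap early exit, but the bounded read is still needed because that header can be absent or wrong.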
    let new_w = (f64::from(w) * ratio) as u32;
    let new_h = (f64::from(h) * ratio) as u32;
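The `as u32` casts above truncate toward zero, so a very thin image (for example 10000×1) can end up with a computed side of 0. A minimal sketch of a guarded version follows; `scaled_dimensions` and `max_side` are hypothetical names, and how `ratio` is derived in `gateway/src/media.rs` is an assumption here (longest side scaled down to `max_side`).

```rust
/// Scale (w, h) so the longer side is at most `max_side`, preserving aspect ratio.
/// Hypothetical helper: assumes the ratio is max_side / longest side.
fn scaled_dimensions(w: u32, h: u32, max_side: u32) -> (u32, u32) {
    let longest = w.max(h);
    // Nothing to do for images that already fit (or are degenerate).
    if longest == 0 || longest <= max_side {
        return (w, h);
    }
    let ratio = f64::from(max_side) / f64::from(longest);
    // Round instead of truncating, and never let a side collapse to zero.
    let new_w = ((f64::from(w) * ratio).round() as u32).max(1);
    let new_h = ((f64::from(h) * ratio).round() as u32).max(1);
    (new_w, new_h)
}
```

Clamping to at least 1 keeps the resize call from receiving a zero dimension, at the cost of slightly distorting the aspect ratio for extreme inputs.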
    "audio" => {
        if stt.enabled {
            use base64::Engine;
            if let Ok(bytes) = base64::engine::general_purpose::STANDARD.decode(&att.data) {
                let client = reqwest::Client::new();
                if let Some(text) = crate::stt::transcribe(
                    &client,
                    &stt,
                    bytes,
                    att.filename.clone(),
                    &att.mime_type
                ).await {
                    extra_blocks.push(ContentBlock::Text {
                        text: format!("[Audio: {}]", text),
                    });
                }
            }
        }
OpenAB PR Screening

This is auto-generated by the OpenAB project-screening flow for context collection and reviewer handoff.

Screening report

## Intent

PR #757 aims to add inbound multimodal message handling to the OpenAB custom gateway for LINE and Telegram. The user-visible problem is that users on those platforms can currently interact reliably with text, but images, voice/audio, and Telegram text documents are not fully accepted, normalized, and forwarded into the agent/core flow.

## Feat

This is a gateway feature PR. It adds inbound support for:

Outbound multimodal support is explicitly out of scope: text remains supported, while outbound images/audio are not implemented.

## Who It Serves

Primary beneficiaries:

## Rewritten Prompt

Implement inbound multimodal support for LINE and Telegram in the custom gateway. Requirements:

Review carefully for unrelated adapter churn outside LINE, Telegram, shared media, gateway wiring, and docs.

## Merge Pitch

This PR moves OpenAB closer to parity with real messaging-platform usage: users commonly send screenshots, photos, voice notes, and small documents instead of plain text. Inbound-only support is a sensible first step because it unlocks richer agent context without committing to platform-specific outbound media delivery.

Risk profile is moderate. The main reviewer concern should be scope control: the file list includes substantial changes to Feishu, Google Chat, and Teams even though the PR title is LINE and Telegram focused. Review should determine whether those changes are necessary gateway-interface adjustments or unrelated refactor noise.

## Best-Practice Comparison

OpenClaw principles that apply:

OpenClaw principles that are less central here:

Hermes Agent principles that apply:

Hermes Agent principles that are less central here:

Overall, the most relevant best practices are explicit routing, bounded media handling, failure logging, and keeping expensive processing isolated from fragile webhook request paths.

## Implementation Options

Option 1: Conservative, LINE/Telegram-only inbound support

Option 2: Balanced, shared gateway media abstraction

Option 3: Ambitious, durable async media pipeline

## Comparison Table

## Recommendation

Advance this item, but steer review toward the balanced option. The feature is valuable and user-facing, but the merge discussion should focus on scope discipline: confirm that non-LINE/Telegram adapter changes are required by a shared gateway interface rather than incidental churn. If they are not required, split them out.

Recommended sequencing:
890bba5 to cf22796
## 🔃 Review: feat(gateway): implement multimodal inbound support for LINE and Telegram

### What problem does this solve?

LINE and Telegram users can send images, voice notes, and documents, but the gateway previously only forwarded text messages. This PR enables multimodal inbound: images are resized/compressed, audio is passed through for STT transcription, and text documents are read and forwarded as content.

### How does it solve it?

### What was considered?

### Verdict

🟡 CHANGES REQUESTED: Good feature, solid implementation, but a few issues need addressing before merge.

### Detailed notes

🟢 INFO (good patterns):

🔴 SUGGESTED CHANGES:

🟡 NIT (non-blocking):
🚀 Multimodal Inbound Support for LINE & Telegram
This PR implements end-to-end multimodal support (images, text documents, and audio/voice) for LINE and Telegram integrations via the Custom Gateway.
Closes #690
Implementation Matrix
Key Features:
`gateway/src/media.rs` to reduce bandwidth and memory pressure.

Discord Discussion URL
https://discord.com/channels/1491295327620169908/1496171374711148665/1499859716409393172
Verification:
`cargo check` and `cargo test`.