feat(gateway): implement multimodal (image/document/audio) inbound support for LINE and Telegram by iamninihuang · Pull Request #757 · openabdev/openab

iamninihuang · 2026-05-06T12:53:01Z

🚀 Multimodal Inbound Support for LINE & Telegram

This PR implements end-to-end multimodal support (images, text documents, and audio/voice) for LINE and Telegram integrations via the Custom Gateway.

Closes #690

Implementation Matrix

Direction	Text	Images	Audio/Voice
Inbound (User → Bot)	✅	✅	✅
Outbound (Bot → User)	✅	❌	❌

Key Features:

Shared Media Module: Centralized image processing (resizing to 1200px, JPEG compression) in gateway/src/media.rs to reduce bandwidth and memory pressure.
Audio/Voice Support: LINE/Telegram audio messages are downloaded and sent to the Core. The Core uses the configured STT (Whisper/Groq) to transcribe voice messages into text for the agent.
Document Support: Telegram text-based files (up to 512KB) are automatically read and passed to the agent as Markdown code blocks.
Platform Integration: Full webhook handling and media downloading implemented for both LINE and Telegram adapters.
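
The resize step of the shared media module can be illustrated with a small, dependency-free sketch. It only computes target dimensions for a 1200px cap and does not decode or encode images. The names `MAX_DIMENSION` and `scaled_dimensions` are assumptions made for this example, not identifiers confirmed in gateway/src/media.rs, and the rounding and clamping are one possible choice rather than the PR's behavior.

```rust
/// Longest side allowed after resizing; 1200 matches the figure quoted above.
/// The constant name is hypothetical.
const MAX_DIMENSION: u32 = 1200;

/// Returns the target (width, height) for an image, preserving aspect ratio.
/// Images already within the limit are returned unchanged.
fn scaled_dimensions(w: u32, h: u32) -> (u32, u32) {
    let longest = w.max(h);
    if longest <= MAX_DIMENSION {
        return (w, h);
    }
    let ratio = f64::from(MAX_DIMENSION) / f64::from(longest);
    // Round instead of truncating, and clamp to 1 so that extreme aspect
    // ratios never produce a zero-sized dimension.
    let new_w = ((f64::from(w) * ratio).round() as u32).max(1);
    let new_h = ((f64::from(h) * ratio).round() as u32).max(1);
    (new_w, new_h)
}
```

Under these assumptions a 2400x1200 image maps to 1200x600, and a very wide 10000x3 image keeps a height of 1 instead of collapsing to 0.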

Discord Discussion URL
https://discord.com/channels/1491295327620169908/1496171374711148665/1499859716409393172

Verification:

Verified via cargo check and cargo test.
Manual verification of image scaling and audio transcription flow.

Copilot

Pull request overview

This PR adds inbound multimodal support (images, text documents, and audio) for the Custom Gateway’s LINE and Telegram adapters, including shared gateway-side image resizing/compression and Core-side audio transcription via the configured STT.

Changes:

Add gateway-side media utilities (resize_and_compress, size limits) and wire them into adapters.
Implement Telegram + LINE inbound attachment downloading/encoding and inclusion in GatewayEvent attachments.
Extend Core gateway adapter to decode attachments and (optionally) transcribe inbound audio when STT is enabled.

Reviewed changes

Copilot reviewed 11 out of 12 changed files in this pull request and generated 7 comments.


File	Description
src/main.rs	Pass STT config into the Core gateway adapter params.
src/gateway.rs	Convert gateway attachments into Core `ContentBlock`s, including audio transcription support.
gateway/src/media.rs	New shared media module for image resize/compress + download size limits.
gateway/src/main.rs	Register the new `media` module; minor formatting changes.
gateway/src/adapters/telegram.rs	Add inbound photo/document/audio handling and media download helpers (currently has compile/logic issues).
gateway/src/adapters/line.rs	Add inbound image/audio handling and LINE media download helper (currently has compile issues).
gateway/src/adapters/feishu.rs	Refactor to reuse shared media module; mostly formatting.
gateway/src/adapters/googlechat.rs	Formatting and test fixture updates to include empty attachments.
gateway/src/adapters/teams.rs	Formatting only.
gateway/Cargo.lock	Bump `openab-gateway` lockfile version entry.
docs/telegram.md	Document Telegram inbound file/image/audio support.
docs/line.md	Document LINE inbound image/audio support.


    let Some(msg) = update.message else {
        return axum::http::StatusCode::OK;
    };
-    let Some(text) = msg.text.as_deref() else {
-        return axum::http::StatusCode::OK;
-    };
-    if text.trim().is_empty() {
+    let is_voice = msg.voice.is_some();
+    let is_audio = msg.audio.is_some();
+    let text = msg.text.as_deref().or(msg.caption.as_deref()).unwrap_or("");
+
+    if text.trim().is_empty() && !is_photo && !is_document && !is_voice && !is_audio {
        return axum::http::StatusCode::OK;
    }

+    let mut attachments = Vec::new();
+    if is_photo || is_document || is_voice || is_audio {


+    let max_size = if attachment_type == "image" {
+        IMAGE_MAX_DOWNLOAD
+    } else {
+        AUDIO_MAX_DOWNLOAD
+    };


+    caption: Option<String>,
    #[serde(default)]
    entities: Vec<TelegramEntity>,
+    #[serde(default)]


+        let is_text = msg.message_type == "text";
+        let is_audio = msg.message_type == "audio";
+
+        if !is_text && !is_image && !is_audio {
            continue;
        }


+    let max_size = if attachment_type == "image" {
+        IMAGE_MAX_DOWNLOAD
+    } else {
+        AUDIO_MAX_DOWNLOAD
+    };


+        let new_w = (f64::from(w) * ratio) as u32;
+        let new_h = (f64::from(h) * ratio) as u32;


+                                            "audio" => {
+                                                if stt.enabled {
+                                                    use base64::Engine;
+                                                    if let Ok(bytes) = base64::engine::general_purpose::STANDARD.decode(&att.data) {
+                                                        let client = reqwest::Client::new();
+                                                        if let Some(text) = crate::stt::transcribe(
+                                                            &client,
+                                                            &stt,
+                                                            bytes,
+                                                            att.filename.clone(),
+                                                            &att.mime_type
+                                                        ).await {
+                                                            extra_blocks.push(ContentBlock::Text {
+                                                                text: format!("[Audio: {}]", text),
+                                                            });
+                                                        }
+                                                    }
+                                                }
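
The control flow of the audio branch in the hunk above can be sketched in isolation. This is a simplified, synchronous stand-in: `ContentBlock` and `SttConfig` here are hypothetical minimal mirrors of the Core types, the transcriber is injected as a closure instead of calling `crate::stt::transcribe`, and the base64 decoding step is omitted to keep the example free of external crates.

```rust
/// Hypothetical stand-in for the Core's content block type.
#[derive(Debug, PartialEq)]
enum ContentBlock {
    Text { text: String },
}

/// Hypothetical stand-in for the STT settings; only `enabled` is modelled.
struct SttConfig {
    enabled: bool,
}

/// Turns already-decoded audio bytes into a text block, or returns `None`
/// when STT is disabled or the transcriber yields nothing.
fn audio_to_block<F>(stt: &SttConfig, audio: &[u8], transcribe: F) -> Option<ContentBlock>
where
    F: FnOnce(&[u8]) -> Option<String>,
{
    if !stt.enabled {
        return None;
    }
    let text = transcribe(audio)?;
    Some(ContentBlock::Text {
        text: format!("[Audio: {}]", text),
    })
}
```

In this sketch a disabled STT config and a failed transcription both drop the attachment silently, which mirrors the nested `if` structure of the hunk; whether the real handler should log those cases is a separate review question.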


shaun-agent · 2026-05-06T13:01:48Z

OpenAB PR Screening

This is auto-generated by the OpenAB project-screening flow for context collection and reviewer handoff.
Click 👍 if you find this useful. Human review will be done within 24 hours. We appreciate your support and contribution 🙏

Title: feat(gateway): implement multimodal (image/document/audio) inbound support for LINE and Telegram
Source: feat(gateway): implement multimodal (image/document/audio) inbound support for LINE and Telegram #757
Status: moved to PR-Screening
Generated at: 2026-05-06T13:01:48.169Z
Discord thread: https://discord.com/channels/1488041051187974246/1501569407673962546

Screening report

Intent

PR #757 aims to add inbound multimodal message handling to the OpenAB custom gateway for LINE and Telegram. The user-visible problem is that users on those platforms can currently interact reliably with text, but images, voice/audio, and Telegram text documents are not fully accepted, normalized, and forwarded into the agent/core flow.

Feat

This is a gateway feature PR.

It adds inbound support for:

LINE images and audio/voice
Telegram images, audio/voice, and small text documents
Shared image preprocessing through gateway/src/media.rs
Core-side delivery of downloaded media so configured STT can transcribe audio into text for the agent
Documentation updates for LINE and Telegram behavior

Outbound multimodal support is explicitly out of scope: text remains supported, while outbound images/audio are not implemented.

Who It Serves

Primary beneficiaries:

LINE and Telegram end users who expect bots to handle media-rich messages
Deployers operating OpenAB gateways for LINE or Telegram communities
Agent runtime operators who need inbound media normalized before it reaches the core
Maintainers who want shared gateway media handling instead of one-off adapter logic

Rewritten Prompt

Implement inbound multimodal support for LINE and Telegram in the custom gateway.

Requirements:

Accept inbound image messages from LINE and Telegram.
Download media through each platform’s authenticated media APIs.
Normalize images through a shared gateway media helper: resize large images to a maximum dimension of 1200px and JPEG-compress them before forwarding.
Accept inbound LINE and Telegram audio/voice messages and forward them to core so the configured STT provider can transcribe them for the agent.
Accept Telegram text documents up to 512KB and pass their contents to the agent as Markdown code blocks.
Keep outbound multimodal delivery out of scope.
Preserve existing text-message behavior across all gateway adapters.
Add or update docs for LINE and Telegram.
Add tests or focused verification covering media download, size limits, image normalization, and unsupported media cases.
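
As an illustration of the document requirement above, the following sketch wraps already-decoded text in a Markdown code block and enforces the 512KB limit. The function names, the `DOCUMENT_MAX_BYTES` constant, and the `File:` header line are assumptions for this example, and 512KB is interpreted as 512 * 1024 bytes. The fence is lengthened when the document itself contains backtick runs, so the block cannot be closed early.

```rust
/// Assumed limit: 512KB interpreted as 512 * 1024 bytes (hypothetical name).
const DOCUMENT_MAX_BYTES: usize = 512 * 1024;

/// Length of the longest run of consecutive backticks in `text`.
fn longest_backtick_run(text: &str) -> usize {
    let mut longest = 0;
    let mut current = 0;
    for ch in text.chars() {
        if ch == '`' {
            current += 1;
            longest = longest.max(current);
        } else {
            current = 0;
        }
    }
    longest
}

/// Wraps a text document in a Markdown code block, or returns `None`
/// when the document exceeds the size limit.
fn document_to_markdown(filename: &str, text: &str) -> Option<String> {
    if text.len() > DOCUMENT_MAX_BYTES {
        return None;
    }
    // Use a fence one backtick longer than any run inside the document,
    // with a minimum of three backticks.
    let fence = "`".repeat(longest_backtick_run(text).max(2) + 1);
    Some(format!("File: {}\n{}\n{}\n{}", filename, fence, text, fence))
}
```

A caller would pass the decoded file contents and skip the attachment, ideally with a log line, when `None` comes back.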

Review carefully for unrelated adapter churn outside LINE, Telegram, shared media, gateway wiring, and docs.

Merge Pitch

This PR moves OpenAB closer to parity with real messaging-platform usage: users commonly send screenshots, photos, voice notes, and small documents instead of plain text. Inbound-only support is a sensible first step because it unlocks richer agent context without committing to platform-specific outbound media delivery.

Risk profile is moderate. The main reviewer concern should be scope control: the file list includes substantial changes to Feishu, Google Chat, and Teams even though the PR title is LINE and Telegram focused. Review should determine whether those changes are necessary gateway-interface adjustments or unrelated refactor noise.

Best-Practice Comparison

OpenClaw principles that apply:

Explicit delivery routing is relevant because downloaded media must be associated with the correct platform conversation and forwarded into the correct core/session path.
Isolated executions are relevant if media processing or STT handoff can be expensive or failure-prone.
Run logs and retry/backoff are partially relevant for media downloads, especially platform API failures and transient network errors.

OpenClaw principles that are less central here:

Gateway-owned scheduling and durable job persistence are not the main concern unless media processing is moved async or retried after webhook acknowledgement.

Hermes Agent principles that apply:

Fresh session per scheduled run is not directly relevant, but the broader principle of self-contained prompts does apply: document contents and transcribed audio should arrive with enough context for the agent to understand the user action.
Atomic writes and file locking are relevant only if downloaded media or intermediate state is persisted to disk.

Hermes Agent principles that are less central here:

Gateway daemon tick model is not a natural fit for webhook-driven LINE and Telegram inbound media unless the implementation evolves toward queued background processing.

Overall, the most relevant best practices are explicit routing, bounded media handling, failure logging, and keeping expensive processing isolated from fragile webhook request paths.

Implementation Options

Option 1: Conservative, LINE/Telegram-only inbound support
Keep changes tightly scoped to line.rs, telegram.rs, shared media.rs, core gateway payload wiring, and docs. Reject or ignore unsupported media types with clear logs. Avoid broad adapter refactors.

Option 2: Balanced, shared gateway media abstraction
Introduce a common inbound media representation used by LINE and Telegram now, with adapter-safe compatibility for other platforms. Centralize image normalization, document limits, MIME checks, and audio forwarding while minimizing changes to unrelated adapters.

Option 3: Ambitious, durable async media pipeline
Treat inbound media as gateway jobs: acknowledge webhooks quickly, persist media metadata, process downloads/transforms asynchronously, retry transient failures, and emit structured run logs. This would align more closely with OpenClaw-style durable execution but is a larger architectural step.

Comparison Table

Option	Speed to ship	Complexity	Reliability	Maintainability	User impact	Fit for OpenAB right now
Conservative LINE/Telegram-only	High	Low-Medium	Medium	Medium	High for LINE/Telegram users	Strong if PR scope needs tightening
Shared gateway media abstraction	Medium	Medium	Medium-High	High	High, with future adapter reuse	Best balance for this PR
Durable async media pipeline	Low	High	High	Medium-High	High, especially at scale	Likely too large for this PR

Recommendation

Advance this item, but steer review toward the balanced option.

The feature is valuable and user-facing, but the merge discussion should focus on scope discipline: confirm that non-LINE/Telegram adapter changes are required by a shared gateway interface rather than incidental churn. If they are not required, split them out.

Recommended sequencing:

Merge inbound LINE/Telegram media support with shared media helpers and clear limits.
Keep outbound media out of scope.
Add follow-up issues for async/durable media processing, richer retry logging, and outbound multimodal support.

chaodu-agent · 2026-05-06T15:03:22Z

🔃 Review: feat(gateway): implement multimodal inbound support for LINE and Telegram

What problem does this solve?

LINE and Telegram users can send images, voice notes, and documents, but the gateway previously only forwarded text messages. This PR enables multimodal inbound — images are resized/compressed, audio is passed through for STT transcription, and text documents are read and forwarded as content.

How does it solve it?

Shared media.rs module — extracts resize_and_compress() and size constants from feishu.rs into a shared module. All adapters now use the same image processing pipeline.
LINE adapter — handles image and audio message types, downloads content via LINE Content API, processes images, and attaches to the GatewayEvent.
Telegram adapter — handles photos (picks largest), documents (text-only, ≤512KB), voice notes, and audio files via Telegram getFile API.
Core gateway handler (src/gateway.rs) — adds audio attachment handling: decodes base64, calls existing crate::stt::transcribe(), and injects [Audio: <text>] as a content block.
SttConfig passed to gateway params — enables the gateway adapter to use STT when audio arrives.
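
Telegram delivers each photo as a list of sizes, so "picks largest" amounts to a maximum over that list. A minimal sketch follows, assuming a `PhotoSize` struct that mirrors only the `file_id`, `width`, and `height` fields of the Bot API object; the struct layout and the `largest_photo` name are hypothetical and may differ from what the adapter defines.

```rust
/// Hypothetical mirror of the Telegram PhotoSize fields that matter here.
#[derive(Debug, PartialEq)]
struct PhotoSize {
    file_id: String,
    width: u32,
    height: u32,
}

/// Picks the entry with the largest pixel area, or `None` for an empty list.
fn largest_photo(photos: &[PhotoSize]) -> Option<&PhotoSize> {
    photos
        .iter()
        // Widen to u64 so width * height cannot overflow.
        .max_by_key(|p| u64::from(p.width) * u64::from(p.height))
}
```

The returned `file_id` would then be passed to getFile to resolve the download path.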

What was considered?

Outbound multimodal is explicitly out of scope (noted in the implementation matrix)
Follows existing patterns from the Feishu adapter (which already had image/file support)
Size limits: 10MB images, 20MB audio, 512KB documents
Text document allowlist uses the same extension list as Feishu
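
The three limits listed above could be selected through a single lookup, sketched here under stated assumptions. `IMAGE_MAX_DOWNLOAD` and `AUDIO_MAX_DOWNLOAD` appear in the diff, but their types and exact values are not shown there; `DOCUMENT_MAX_DOWNLOAD`, the `"document"` tag, the `max_download_size` name, and the reading of MB/KB as powers of two are all assumptions for this example.

```rust
// Values follow the limits quoted in the review; the u64 type and the
// binary interpretation of MB/KB are assumptions.
const IMAGE_MAX_DOWNLOAD: u64 = 10 * 1024 * 1024;
const AUDIO_MAX_DOWNLOAD: u64 = 20 * 1024 * 1024;
// Hypothetical constant: the diff only shows the image and audio names.
const DOCUMENT_MAX_DOWNLOAD: u64 = 512 * 1024;

/// Maps an attachment type to its download limit. Unknown types return
/// `None` so the caller can reject them instead of falling back to the
/// largest limit.
fn max_download_size(attachment_type: &str) -> Option<u64> {
    match attachment_type {
        "image" => Some(IMAGE_MAX_DOWNLOAD),
        "audio" => Some(AUDIO_MAX_DOWNLOAD),
        "document" => Some(DOCUMENT_MAX_DOWNLOAD),
        _ => None,
    }
}
```

Compared with an `if`/`else` that treats every non-image type as audio, an explicit match keeps a new attachment type from silently inheriting the 20MB limit.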

Verdict

🟡 CHANGES REQUESTED — Good feature, solid implementation, but a few issues need addressing before merge.

Detailed notes

🟢 INFO — Good patterns:

Extracting resize_and_compress into shared media.rs is the right call — DRY across 3 adapters
Audio transcription reuses existing crate::stt::transcribe() — no new STT code
Size limit checks are done both on Content-Length header AND actual bytes (defense in depth)
GIF passthrough preserved in the shared module
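
The defense-in-depth size check noted above can be sketched as two independent gates. Both function names are hypothetical, and the sketch leaves out the HTTP client entirely: it only shows the decision made on the declared Content-Length and then on the bytes actually received.

```rust
/// First gate: reject early when the server declares an oversized body.
/// A missing header is allowed through, because the second gate still applies.
fn declared_size_ok(content_length: Option<u64>, max: u64) -> bool {
    content_length.map_or(true, |len| len <= max)
}

/// Second gate: check the bytes actually received, since Content-Length
/// can be absent or inaccurate.
fn actual_size_ok(body: &[u8], max: u64) -> bool {
    body.len() as u64 <= max
}
```

The first gate avoids downloading obviously oversized files, while the second protects against responses whose header is missing or understates the body.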

🔴 SUGGESTED CHANGES:

Feishu formatting changes inflate the diff — The PR includes ~200 lines of pure rustfmt reformatting in feishu.rs and googlechat.rs (line wrapping, argument alignment). These are unrelated to the multimodal feature and make the actual changes harder to review. Please split formatting into a separate commit or PR, or at minimum note it clearly in the PR description.
reqwest::Client::new() created per-request in LINE/Telegram download functions — Each media download creates a fresh reqwest::Client. This skips connection pooling and adds TLS handshake overhead. The Feishu adapter passes a shared &reqwest::Client from the adapter struct. LINE and Telegram should do the same — pass the existing client from the webhook handler state.
Missing SttConfig clone derivation check — src/main.rs adds stt: cfg.stt.clone(). Verify that SttConfig derives Clone. If it already does (likely), this is fine. If not, this won't compile.
Telegram download_telegram_document — missing final bytes of diff — The diff appears truncated. Please confirm the function handles the case where the downloaded file is not valid UTF-8 (should use String::from_utf8_lossy like Feishu does).
Version bump in gateway/Cargo.lock — Bumps gateway from 0.1.0 to 0.4.0. Is this intentional? If so, the corresponding Cargo.toml version should also be bumped. If this is just a lock file artifact, please verify.
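
For the UTF-8 point in the list above, a lossy decode could look like the following sketch. `decode_text` is a hypothetical helper name; the only standard-library behavior relied on is that `String::from_utf8_lossy` borrows valid input and allocates a new string with U+FFFD replacement characters for invalid input.

```rust
use std::borrow::Cow;

/// Decodes bytes as UTF-8, replacing invalid sequences with U+FFFD.
/// The boolean reports whether any replacement happened, which a caller
/// could log for observability.
fn decode_text(bytes: &[u8]) -> (String, bool) {
    match String::from_utf8_lossy(bytes) {
        // Borrowed means the input was already valid UTF-8.
        Cow::Borrowed(s) => (s.to_owned(), false),
        // Owned means at least one invalid sequence was replaced.
        Cow::Owned(s) => (s, true),
    }
}
```

Returning the flag lets the adapter forward the text in either case while still recording that the file was not clean UTF-8.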

🟡 NIT (non-blocking):

LINE audio default MIME type "audio/x-m4a": the IANA-registered MIME type for M4A audio is "audio/mp4", and "audio/x-m4a" is a widely used unregistered alias. Both work, but "audio/mp4" is more widely recognized.
Consider logging the file size after successful download for observability (LINE and Telegram adapters currently only log on failure).

ijbhxhu added 3 commits May 6, 2026 20:29

feat(gateway)!: implement multimodal (image/document/audio) support for LINE and Telegram

c814503

feat(audio): implement inbound audio/voice support with STT for LINE and Telegram

2d251ab

docs: document Audio/Voice support for LINE and Telegram

5d76f5b

iamninihuang requested a review from thepagent as a code owner May 6, 2026 12:53

Copilot AI review requested due to automatic review settings May 6, 2026 12:53

github-actions Bot added the pending-screening (PR awaiting automated screening) and closing-soon (PR missing Discord Discussion URL, will auto-close in 3 days) labels May 6, 2026

iamninihuang mentioned this pull request May 6, 2026

feat(gateway): support images and audio for LINE/Telegram #690

Open

Copilot started reviewing on behalf of iamninihuang May 6, 2026 12:53

Copilot AI reviewed May 6, 2026


github-actions Bot removed the closing-soon (PR missing Discord Discussion URL, will auto-close in 3 days) label May 6, 2026

fix: resolve clippy warnings and missing fields in gateway adapters

cf22796

iamninihuang force-pushed the feat/690-multimodal-line-telegram branch from 890bba5 to cf22796 Compare May 6, 2026 13:10

github-actions Bot added the pending-maintainer label May 6, 2026

chaodu-agent added pending-contributor and removed pending-maintainer labels May 6, 2026

feat(gateway): address review comments for multimodal inbound

e5dd849

iamninihuang wants to merge 5 commits into openabdev:main from iamninihuang:feat/690-multimodal-line-telegram


5 participants
