
feat(gateway): implement multimodal (image/document/audio) inbound support for LINE and Telegram #757

Open
iamninihuang wants to merge 5 commits into openabdev:main from iamninihuang:feat/690-multimodal-line-telegram

Conversation

@iamninihuang
Contributor

@iamninihuang iamninihuang commented May 6, 2026

🚀 Multimodal Inbound Support for LINE & Telegram

This PR implements end-to-end multimodal support (images, text documents, and audio/voice) for LINE and Telegram integrations via the Custom Gateway.

Closes #690

Implementation Matrix

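| Direction | Text | Images | Audio/Voice |
| --- | --- | --- | --- |
| Inbound (User → Bot) | Supported | Supported | Supported |
| Outbound (Bot → User) | Supported | Not implemented | Not implemented |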

Key Features:

  1. Shared Media Module: Centralized image processing (resizing to 1200px, JPEG compression) in gateway/src/media.rs to reduce bandwidth and memory pressure.
  2. Audio/Voice Support: LINE/Telegram audio messages are downloaded and sent to the Core. The Core uses the configured STT (Whisper/Groq) to transcribe voice messages into text for the agent.
  3. Document Support: Telegram text-based files (up to 512KB) are automatically read and passed to the agent as Markdown code blocks.
  4. Platform Integration: Full webhook handling and media downloading implemented for both LINE and Telegram adapters.
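The 1200px cap in feature 1 comes down to a small piece of aspect-ratio arithmetic. The following is a minimal sketch of that arithmetic only, not the code in gateway/src/media.rs: the function name `target_dimensions`, the constant name `MAX_DIMENSION`, and the choice to round and clamp each edge to at least 1 pixel are assumptions made for illustration.

```rust
/// Longest edge allowed after resizing (1200px, per the PR description).
/// The constant name is hypothetical.
const MAX_DIMENSION: u32 = 1200;

/// Returns the target (width, height) for an image, preserving aspect ratio.
/// Images already within the cap are returned unchanged.
fn target_dimensions(w: u32, h: u32) -> (u32, u32) {
    let longest = w.max(h);
    if longest <= MAX_DIMENSION {
        return (w, h);
    }
    let ratio = f64::from(MAX_DIMENSION) / f64::from(longest);
    // Assumption: round instead of truncating, and clamp to 1 so an
    // extreme aspect ratio never yields a zero-sized edge.
    let new_w = ((f64::from(w) * ratio).round() as u32).max(1);
    let new_h = ((f64::from(h) * ratio).round() as u32).max(1);
    (new_w, new_h)
}
```

With this sketch a 4000x3000 photo would be scaled to 1200x900, while an 800x600 image would pass through untouched.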

Discord Discussion URL
https://discord.com/channels/1491295327620169908/1496171374711148665/1499859716409393172

Verification:

  • Verified via cargo check and cargo test.
  • Manual verification of image scaling and audio transcription flow.

@iamninihuang iamninihuang requested a review from thepagent as a code owner May 6, 2026 12:53
Copilot AI review requested due to automatic review settings May 6, 2026 12:53
@github-actions github-actions bot added the pending-screening (PR awaiting automated screening) and closing-soon (PR missing Discord Discussion URL; will auto-close in 3 days) labels May 6, 2026

Copilot AI left a comment

Pull request overview

This PR adds inbound multimodal support (images, text documents, and audio) for the Custom Gateway’s LINE and Telegram adapters, including shared gateway-side image resizing/compression and Core-side audio transcription via the configured STT.

Changes:

  • Add gateway-side media utilities (resize_and_compress, size limits) and wire them into adapters.
  • Implement Telegram + LINE inbound attachment downloading/encoding and inclusion in GatewayEvent attachments.
  • Extend Core gateway adapter to decode attachments and (optionally) transcribe inbound audio when STT is enabled.

Reviewed changes

Copilot reviewed 11 out of 12 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| src/main.rs | Pass STT config into the Core gateway adapter params. |
| src/gateway.rs | Convert gateway attachments into Core ContentBlocks, including audio transcription support. |
| gateway/src/media.rs | New shared media module for image resize/compress + download size limits. |
| gateway/src/main.rs | Register the new media module; minor formatting changes. |
| gateway/src/adapters/telegram.rs | Add inbound photo/document/audio handling and media download helpers (currently has compile/logic issues). |
| gateway/src/adapters/line.rs | Add inbound image/audio handling and LINE media download helper (currently has compile issues). |
| gateway/src/adapters/feishu.rs | Refactor to reuse shared media module; mostly formatting. |
| gateway/src/adapters/googlechat.rs | Formatting and test fixture updates to include empty attachments. |
| gateway/src/adapters/teams.rs | Formatting only. |
| gateway/Cargo.lock | Bump openab-gateway lockfile version entry. |
| docs/telegram.md | Document Telegram inbound file/image/audio support. |
| docs/line.md | Document LINE inbound image/audio support. |

Comment on lines 93 to +105
      let Some(msg) = update.message else {
          return axum::http::StatusCode::OK;
      };
    - let Some(text) = msg.text.as_deref() else {
    -     return axum::http::StatusCode::OK;
    - };
    - if text.trim().is_empty() {
    + let is_voice = msg.voice.is_some();
    + let is_audio = msg.audio.is_some();
    + let text = msg.text.as_deref().or(msg.caption.as_deref()).unwrap_or("");
    +
    + if text.trim().is_empty() && !is_photo && !is_document && !is_voice && !is_audio {
          return axum::http::StatusCode::OK;
      }
    +
    + let mut attachments = Vec::new();
    + if is_photo || is_document || is_voice || is_audio {
Comment on lines +344 to +348
    let max_size = if attachment_type == "image" {
        IMAGE_MAX_DOWNLOAD
    } else {
        AUDIO_MAX_DOWNLOAD
    };
caption: Option<String>,
#[serde(default)]
entities: Vec<TelegramEntity>,
#[serde(default)]
Comment on lines +94 to 99
    let is_text = msg.message_type == "text";
    let is_audio = msg.message_type == "audio";

    if !is_text && !is_image && !is_audio {
        continue;
    }
Comment on lines +304 to +308
    let max_size = if attachment_type == "image" {
        IMAGE_MAX_DOWNLOAD
    } else {
        AUDIO_MAX_DOWNLOAD
    };
Comment thread gateway/src/media.rs
Comment on lines +23 to +24
let new_w = (f64::from(w) * ratio) as u32;
let new_h = (f64::from(h) * ratio) as u32;
Comment thread src/gateway.rs Outdated
Comment on lines +676 to +693
    "audio" => {
        if stt.enabled {
            use base64::Engine;
            if let Ok(bytes) = base64::engine::general_purpose::STANDARD.decode(&att.data) {
                let client = reqwest::Client::new();
                if let Some(text) = crate::stt::transcribe(
                    &client,
                    &stt,
                    bytes,
                    att.filename.clone(),
                    &att.mime_type
                ).await {
                    extra_blocks.push(ContentBlock::Text {
                        text: format!("[Audio: {}]", text),
                    });
                }
            }
        }
@github-actions github-actions bot removed the closing-soon (PR missing Discord Discussion URL; will auto-close in 3 days) label May 6, 2026
@shaun-agent
Contributor

OpenAB PR Screening

This is auto-generated by the OpenAB project-screening flow for context collection and reviewer handoff.
Click 👍 if you find this useful. Human review will be done within 24 hours. We appreciate your support and contribution 🙏

Screening report

Intent

PR #757 aims to add inbound multimodal message handling to the OpenAB custom gateway for LINE and Telegram. The user-visible problem is that users on those platforms can currently interact reliably with text, but images, voice/audio, and Telegram text documents are not fully accepted, normalized, and forwarded into the agent/core flow.

Feat

This is a gateway feature PR.

It adds inbound support for:

  • LINE images and audio/voice
  • Telegram images, audio/voice, and small text documents
  • Shared image preprocessing through gateway/src/media.rs
  • Core-side delivery of downloaded media so configured STT can transcribe audio into text for the agent
  • Documentation updates for LINE and Telegram behavior

Outbound multimodal support is explicitly out of scope: text remains supported, while outbound images/audio are not implemented.

Who It Serves

Primary beneficiaries:

  • LINE and Telegram end users who expect bots to handle media-rich messages
  • Deployers operating OpenAB gateways for LINE or Telegram communities
  • Agent runtime operators who need inbound media normalized before it reaches the core
  • Maintainers who want shared gateway media handling instead of one-off adapter logic

Rewritten Prompt

Implement inbound multimodal support for LINE and Telegram in the custom gateway.

Requirements:

  • Accept inbound image messages from LINE and Telegram.
  • Download media through each platform’s authenticated media APIs.
  • Normalize images through a shared gateway media helper: resize large images to a maximum dimension of 1200px and JPEG-compress them before forwarding.
  • Accept inbound LINE and Telegram audio/voice messages and forward them to core so the configured STT provider can transcribe them for the agent.
  • Accept Telegram text documents up to 512KB and pass their contents to the agent as Markdown code blocks.
  • Keep outbound multimodal delivery out of scope.
  • Preserve existing text-message behavior across all gateway adapters.
  • Add or update docs for LINE and Telegram.
  • Add tests or focused verification covering media download, size limits, image normalization, and unsupported media cases.
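The document requirement above (512KB cap, contents delivered as a Markdown code block) can be sketched in a few lines of standard-library Rust. This is an illustration under stated assumptions, not the PR's implementation: the names `DOC_MAX_BYTES`, `document_to_markdown`, and `longest_backtick_run`, the `File:` header line, and the variable-length fence are all hypothetical choices.

```rust
/// Upper bound for text documents (512KB, per the requirement above).
/// The constant name is hypothetical.
const DOC_MAX_BYTES: usize = 512 * 1024;

/// Length of the longest consecutive run of backticks in `s`.
fn longest_backtick_run(s: &str) -> usize {
    let mut best = 0;
    let mut current = 0;
    for c in s.chars() {
        if c == '`' {
            current += 1;
            best = best.max(current);
        } else {
            current = 0;
        }
    }
    best
}

/// Wraps a downloaded text document in a Markdown code block.
/// Returns `None` when the payload exceeds the size limit.
fn document_to_markdown(filename: &str, bytes: &[u8]) -> Option<String> {
    if bytes.len() > DOC_MAX_BYTES {
        return None;
    }
    // Lossy decoding replaces invalid UTF-8 sequences with U+FFFD
    // instead of rejecting the whole document.
    let text = String::from_utf8_lossy(bytes);
    // Assumption: use a fence at least three backticks long and longer than
    // any backtick run in the content, so the document cannot close the
    // block early.
    let fence = "`".repeat(longest_backtick_run(&text).max(2) + 1);
    Some(format!("File: {filename}\n{fence}\n{text}\n{fence}"))
}
```

Returning `None` for oversized files leaves the caller free to decide whether to log, ignore, or reply with an error, which matches the "unsupported media cases" the prompt asks reviewers to cover.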

Review carefully for unrelated adapter churn outside LINE, Telegram, shared media, gateway wiring, and docs.

Merge Pitch

This PR moves OpenAB closer to parity with real messaging-platform usage: users commonly send screenshots, photos, voice notes, and small documents instead of plain text. Inbound-only support is a sensible first step because it unlocks richer agent context without committing to platform-specific outbound media delivery.

Risk profile is moderate. The main reviewer concern should be scope control: the file list includes substantial changes to Feishu, Google Chat, and Teams even though the PR title is LINE and Telegram focused. Review should determine whether those changes are necessary gateway-interface adjustments or unrelated refactor noise.

Best-Practice Comparison

OpenClaw principles that apply:

  • Explicit delivery routing is relevant because downloaded media must be associated with the correct platform conversation and forwarded into the correct core/session path.
  • Isolated executions are relevant if media processing or STT handoff can be expensive or failure-prone.
  • Run logs and retry/backoff are partially relevant for media downloads, especially platform API failures and transient network errors.
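For the retry/backoff point, a minimal synchronous sketch of exponential backoff around a fallible download might look like the following. The helper name `retry_with_backoff` and its signature are hypothetical, and the blocking `thread::sleep` is used only to keep the example standard-library-only; an async webhook handler would presumably use its runtime's timer instead.

```rust
use std::thread;
use std::time::Duration;

/// Calls `op` at least once and at most `max_attempts` times, doubling the
/// delay after each failure. Returns the first success or the last error.
/// `op` receives the zero-based attempt number.
fn retry_with_backoff<T, E>(
    max_attempts: u32,
    base_delay: Duration,
    mut op: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(err) if attempt + 1 >= max_attempts => return Err(err),
            Err(_) => {
                // Delay grows as 1x, 2x, 4x, ... the base delay; the shift
                // is capped so the multiplier itself cannot overflow.
                let factor = 1u32 << attempt.min(16);
                thread::sleep(base_delay * factor);
                attempt += 1;
            }
        }
    }
}
```

A caller would wrap the platform media request in the closure and log each failed attempt there, which also covers the run-log half of the principle.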

OpenClaw principles that are less central here:

  • Gateway-owned scheduling and durable job persistence are not the main concern unless media processing is moved async or retried after webhook acknowledgement.

Hermes Agent principles that apply:

  • Fresh session per scheduled run is not directly relevant, but the broader principle of self-contained prompts does apply: document contents and transcribed audio should arrive with enough context for the agent to understand the user action.
  • Atomic writes and file locking are relevant only if downloaded media or intermediate state is persisted to disk.

Hermes Agent principles that are less central here:

  • Gateway daemon tick model is not a natural fit for webhook-driven LINE and Telegram inbound media unless the implementation evolves toward queued background processing.

Overall, the most relevant best practices are explicit routing, bounded media handling, failure logging, and keeping expensive processing isolated from fragile webhook request paths.

Implementation Options

Option 1: Conservative, LINE/Telegram-only inbound support
Keep changes tightly scoped to line.rs, telegram.rs, shared media.rs, core gateway payload wiring, and docs. Reject or ignore unsupported media types with clear logs. Avoid broad adapter refactors.

Option 2: Balanced, shared gateway media abstraction
Introduce a common inbound media representation used by LINE and Telegram now, with adapter-safe compatibility for other platforms. Centralize image normalization, document limits, MIME checks, and audio forwarding while minimizing changes to unrelated adapters.

Option 3: Ambitious, durable async media pipeline
Treat inbound media as gateway jobs: acknowledge webhooks quickly, persist media metadata, process downloads/transforms asynchronously, retry transient failures, and emit structured run logs. This would align more closely with OpenClaw-style durable execution but is a larger architectural step.
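To make Option 3 concrete, here is a deliberately small sketch of the "acknowledge quickly, process later" shape using a standard-library channel and a worker thread. Everything in it is hypothetical (`MediaJob`, `spawn_worker`, `handle_webhook`), and it omits the persistence, retries, and structured logs that the option calls for; it only shows how the webhook path can hand off work without waiting for it.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical job record: just enough metadata to fetch the media later.
#[derive(Debug, Clone, PartialEq)]
struct MediaJob {
    platform: String,
    message_id: String,
}

/// Spawns a worker that drains the queue and returns how many jobs it handled.
/// `process` stands in for download, transform, and forward-to-core.
fn spawn_worker(
    process: impl Fn(&MediaJob) + Send + 'static,
) -> (Sender<MediaJob>, JoinHandle<usize>) {
    let (tx, rx) = mpsc::channel::<MediaJob>();
    let handle = thread::spawn(move || {
        let mut processed = 0;
        // The loop ends once every sender has been dropped.
        for job in rx {
            process(&job);
            processed += 1;
        }
        processed
    });
    (tx, handle)
}

/// Webhook-side entry point: enqueue the job and return immediately.
/// Returns false if the worker has shut down and the job was not queued.
fn handle_webhook(queue: &Sender<MediaJob>, job: MediaJob) -> bool {
    queue.send(job).is_ok()
}
```

Because the queue lives in memory, jobs are lost on restart; a durable variant would write the job record to storage before acknowledging the webhook.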

Comparison Table

| Option | Speed to ship | Complexity | Reliability | Maintainability | User impact | Fit for OpenAB right now |
| --- | --- | --- | --- | --- | --- | --- |
| Conservative LINE/Telegram-only | High | Low-Medium | Medium | Medium | High for LINE/Telegram users | Strong if PR scope needs tightening |
| Shared gateway media abstraction | Medium | Medium | Medium-High | High | High, with future adapter reuse | Best balance for this PR |
| Durable async media pipeline | Low | High | High | Medium-High | High, especially at scale | Likely too large for this PR |

Recommendation

Advance this item, but steer review toward the balanced option.

The feature is valuable and user-facing, but the merge discussion should focus on scope discipline: confirm that non-LINE/Telegram adapter changes are required by a shared gateway interface rather than incidental churn. If they are not required, split them out.

Recommended sequencing:

  1. Merge inbound LINE/Telegram media support with shared media helpers and clear limits.
  2. Keep outbound media out of scope.
  3. Add follow-up issues for async/durable media processing, richer retry logging, and outbound multimodal support.

@iamninihuang iamninihuang force-pushed the feat/690-multimodal-line-telegram branch from 890bba5 to cf22796 May 6, 2026 13:10
@chaodu-agent
Collaborator

🔃 Review: feat(gateway): implement multimodal inbound support for LINE and Telegram

What problem does this solve?

LINE and Telegram users can send images, voice notes, and documents, but the gateway previously only forwarded text messages. This PR enables multimodal inbound — images are resized/compressed, audio is passed through for STT transcription, and text documents are read and forwarded as content.

How does it solve it?

  1. Shared media.rs module — extracts resize_and_compress() and size constants from feishu.rs into a shared module. All adapters now use the same image processing pipeline.
  2. LINE adapter — handles image and audio message types, downloads content via LINE Content API, processes images, and attaches to the GatewayEvent.
  3. Telegram adapter — handles photos (picks largest), documents (text-only, ≤512KB), voice notes, and audio files via Telegram getFile API.
  4. Core gateway handler (src/gateway.rs) — adds audio attachment handling: decodes base64, calls existing crate::stt::transcribe(), and injects [Audio: <text>] as a content block.
  5. SttConfig passed to gateway params — enables the gateway adapter to use STT when audio arrives.
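Step 3 mentions that the Telegram adapter "picks largest" among the photo variants. Telegram delivers a photo message as an array of `PhotoSize` objects, each with a `file_id`, `width`, and `height`. A minimal sketch of one way to choose among them follows; the struct is trimmed to the fields used here, `pick_largest` is a hypothetical name, and comparing by pixel area is an assumption, since the PR's actual selection rule is not shown in this conversation.

```rust
/// Subset of Telegram's `PhotoSize` object relevant to selection.
#[derive(Debug, Clone, PartialEq)]
struct PhotoSize {
    file_id: String,
    width: u32,
    height: u32,
}

/// Returns the variant with the largest pixel area, or `None` for an empty
/// list. Areas are computed in u64 so large dimensions cannot overflow.
fn pick_largest(photos: &[PhotoSize]) -> Option<&PhotoSize> {
    photos
        .iter()
        .max_by_key(|p| u64::from(p.width) * u64::from(p.height))
}
```

The selected `file_id` is what would then be passed to the getFile call described in the same step.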

What was considered?

  • Outbound multimodal is explicitly out of scope (noted in the implementation matrix)
  • Follows existing patterns from the Feishu adapter (which already had image/file support)
  • Size limits: 10MB images, 20MB audio, 512KB documents
  • Text document allowlist uses the same extension list as Feishu

Verdict

🟡 CHANGES REQUESTED — Good feature, solid implementation, but a few issues need addressing before merge.

Detailed notes

🟢 INFO — Good patterns:

  • Extracting resize_and_compress into shared media.rs is the right call — DRY across 3 adapters
  • Audio transcription reuses existing crate::stt::transcribe() — no new STT code
  • Size limit checks are done both on Content-Length header AND actual bytes (defense in depth)
  • GIF passthrough preserved in the shared module
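The "defense in depth" size check noted above can be illustrated with two small gates: one on the advertised Content-Length and one on the bytes actually received. This is a sketch, not the adapter code. The constant names match the excerpts quoted in the Copilot review, but their exact byte values (10 and 20 MiB here), the `SizeError` type, and the helper names are assumptions.

```rust
/// Limits quoted in this review: 10MB for images, 20MB for audio.
/// Assumption: expressed here in binary megabytes.
const IMAGE_MAX_DOWNLOAD: u64 = 10 * 1024 * 1024;
const AUDIO_MAX_DOWNLOAD: u64 = 20 * 1024 * 1024;

/// Hypothetical error type distinguishing which gate rejected the download.
#[derive(Debug, PartialEq)]
enum SizeError {
    DeclaredTooLarge(u64),
    BodyTooLarge(u64),
}

/// Picks the limit for an attachment type, mirroring the quoted excerpt.
fn max_size_for(attachment_type: &str) -> u64 {
    if attachment_type == "image" {
        IMAGE_MAX_DOWNLOAD
    } else {
        AUDIO_MAX_DOWNLOAD
    }
}

/// First gate: reject on the advertised Content-Length, if one was sent.
fn check_declared(content_length: Option<u64>, max: u64) -> Result<(), SizeError> {
    match content_length {
        Some(len) if len > max => Err(SizeError::DeclaredTooLarge(len)),
        _ => Ok(()),
    }
}

/// Second gate: reject on the bytes actually received, because the header
/// may be missing or inaccurate.
fn check_body(body: &[u8], max: u64) -> Result<(), SizeError> {
    let len = body.len() as u64;
    if len > max {
        Err(SizeError::BodyTooLarge(len))
    } else {
        Ok(())
    }
}
```

The first gate avoids downloading an obviously oversized file at all; the second catches responses that omit or understate the header.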

🔴 SUGGESTED CHANGES:

  1. Feishu formatting changes inflate the diff — The PR includes ~200 lines of pure rustfmt reformatting in feishu.rs and googlechat.rs (line wrapping, argument alignment). These are unrelated to the multimodal feature and make the actual changes harder to review. Please split formatting into a separate commit or PR, or at minimum note it clearly in the PR description.

  2. reqwest::Client::new() created per-request in LINE/Telegram download functions — Each media download creates a fresh reqwest::Client. This skips connection pooling and adds TLS handshake overhead. The Feishu adapter passes a shared &reqwest::Client from the adapter struct. LINE and Telegram should do the same — pass the existing client from the webhook handler state.

  3. Missing SttConfig Clone derivation check: src/main.rs adds stt: cfg.stt.clone(). Verify that SttConfig derives Clone. If it already does (likely), this is fine. If not, this won't compile.

  4. Telegram download_telegram_document — missing final bytes of diff — The diff appears truncated. Please confirm the function handles the case where the downloaded file is not valid UTF-8 (should use String::from_utf8_lossy like Feishu does).

  5. Version bump in gateway/Cargo.lock — Bumps gateway from 0.1.0 to 0.4.0. Is this intentional? If so, the corresponding Cargo.toml version should also be bumped. If this is just a lock file artifact, please verify.

🟡 NIT (non-blocking):

  • LINE audio default MIME type "audio/x-m4a": M4A files are served as either "audio/mp4" or "audio/x-m4a". Both work, but "audio/mp4" is the registered type and is more widely recognized.
  • Consider logging the file size after successful download for observability (LINE and Telegram adapters currently only log on failure).

Labels

pending-contributor, pending-screening (PR awaiting automated screening)

Development

Successfully merging this pull request may close these issues.

feat(gateway): support images and audio for LINE/Telegram

5 participants