Skip to content

[TRTLLM-12533][refactor] Move Media IO modality loading into MediaIO Interfaces#14010

Open
aswinvisva wants to merge 1 commit into
NVIDIA:mainfrom
aswinvisva:avisva/media_io_refactor
Open

[TRTLLM-12533][refactor] Move Media IO modality loading into MediaIO Interfaces#14010
aswinvisva wants to merge 1 commit into
NVIDIA:mainfrom
aswinvisva:avisva/media_io_refactor

Conversation

@aswinvisva
Copy link
Copy Markdown
Collaborator

@aswinvisva aswinvisva commented May 11, 2026

Summary by CodeRabbit

Release Notes

  • Improvements
    • Standardized media loading across images, audio, and video for more consistent behavior.
    • Enhanced support for multiple input formats including bytes, base64-encoded data, URLs, and local file paths.
    • Improved reliability and error handling in multimodal content processing.

Review Change Stack

Description

This PR created the MediaIO interface for video, image and audio. It added just the modality-specific logic for merging media IO kwargs at the request level and the server level. This PR moves some more modality-specific logic for loading IO data (URLs, data, local file) and postprocessing the IO data into the MediaIO interface.

Test Coverage

Tests added to the test_chat_utils.py which covers IO loading, updated the tests to load with the new MediaIO functions.

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

…interfaces

Signed-off-by: Aswin Visva <31215515+aswinvisva@users.noreply.github.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 11, 2026

📝 Walkthrough

Walkthrough

This PR refactors the media I/O system from a kwargs-resolution function to a generic BaseMediaIO[_MediaT] factory-based architecture. It introduces URL-scheme dispatch (http/https, data: URIs, local files), standardized abstract methods across Image/Audio/Video implementations, and integrates the factory into chat message parsing for all modality types and base64 audio decoding.

Changes

Media I/O Refactoring

Layer / File(s) Summary
Generic Media I/O Architecture
tensorrt_llm/inputs/media_io.py
Adds generic typing (TypeVar _MediaT) and transforms BaseMediaIO to BaseMediaIO(ABC, Generic[_MediaT]) with factory method create(...) and merge_kwargs(...) for shallow config merging.
Async Load Dispatch & Helpers
tensorrt_llm/inputs/media_io.py
Implements async_load(url) with URL-scheme dispatch: http/https fetch, data: base64 URI parsing, local file/bare-path delegation. Adds abstract methods load_bytes, load_base64, load_file and async helper load_base64_async(...).
Image Media I/O Implementation
tensorrt_llm/inputs/media_io.py
Implements load_bytes, load_base64, load_file entry points and centralizes output via _postprocess(): returns torch tensor (with torchvision.transforms.ToTensor on configured device) when format="pt", otherwise returns PIL image.
Audio Media I/O Implementation
tensorrt_llm/inputs/media_io.py
Implements decode entry points using soundfile for bytes/base64/files. Overrides async_load(url) to accept only http/https and local paths, rejecting other schemes. Normalizes file:// URIs via _normalize_file_uri.
Video Media I/O Implementation
tensorrt_llm/inputs/media_io.py
Adds constructor parameters num_frames, fps, format, device, extract_audio. Implements load_bytes (writes to temp .mp4), load_base64, load_file entry points. Preserves merge_kwargs coupling: dropping fps or num_frames defaults when only one is overridden by request.
Media I/O Factory & Lookup
tensorrt_llm/serve/chat_utils.py
Introduces _make_media_io(modality, server_config, request_kwargs) to select BaseMediaIO subclass from MEDIA_IO_REGISTRY and construct via create(...) with merged server/request configuration. Replaces removed resolve_media_io_kwargs(...) function.
URL-Based Content Loading
tensorrt_llm/serve/chat_utils.py
Updates parse_chat_message_content_part for image_url, video_url, audio_url parts: replaces separate async_load_* calls with _make_media_io(...) factory and media_io.async_load(url), logging merged kwargs before load.
Base64 Audio Decoding
tensorrt_llm/serve/chat_utils.py
Updates input_audio branch to decode base64 via audio_io.load_base64_async("audio/raw", audio_data) instead of async_load_audio(..., is_base64=True). Removes stale imports for async_load_* functions.
Documentation Reference
tensorrt_llm/inputs/utils.py
Updates comment in MultimodalDataTracker.__init__ to reference merge_media_io_kwargs (replacing resolve_media_io_kwargs) for where per-request media_io_kwargs overrides merge with server defaults.
Test Coverage & Validation
tests/unittest/llmapi/apps/test_chat_utils.py
Updates imports to use _make_media_io. Refactors exception-propagation tests to patch *MediaIO.async_load methods. Adds configuration tests validating shallow-merge behavior and video fps/num_frames coupling. Rewrites loader-forwarding tests to spy on _make_media_io and assert merged config baked into constructed MediaIO instances.

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main refactoring: moving media IO modality loading logic into the MediaIO interfaces.
Description check ✅ Passed The description adequately explains the context and changes, referencing the prior PR, describing the refactoring approach, and listing test coverage updates.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (3)
tensorrt_llm/serve/chat_utils.py (1)

94-105: 💤 Low value

Consider using MediaModality Literal type for the modality parameter.

The modality parameter accepts any string, but MEDIA_IO_REGISTRY only contains specific keys ("image", "video", "audio"). Using the MediaModality Literal type (already defined in tensorrt_llm.inputs.media_io) would provide compile-time type safety.

🔧 Proposed fix
+from tensorrt_llm.inputs.media_io import MEDIA_IO_REGISTRY, BaseMediaIO, MediaModality

 def _make_media_io(
-    modality: str,
+    modality: MediaModality,
     server_config: Optional[MultimodalServerConfig],
     request_kwargs: Optional[Dict[str, Dict[str, Any]]],
 ) -> BaseMediaIO:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/serve/chat_utils.py` around lines 94 - 105, The _make_media_io
function currently types modality as str but should use the narrower
MediaModality literal type to match MEDIA_IO_REGISTRY keys; update the signature
of _make_media_io to accept modality: MediaModality, import MediaModality from
tensorrt_llm.inputs.media_io, and adjust any callers if needed to satisfy the
stronger type (or add explicit casts where caller values are validated) so the
registry lookup and create call remain unchanged.
tensorrt_llm/inputs/media_io.py (2)

124-128: 💤 Low value

Consider renaming format parameter to avoid shadowing the builtin.

The format parameter shadows Python's built-in format() function. Consider renaming to image_format, output_format, or data_format for clarity and to avoid the shadowing.

♻️ Proposed fix
-    def __init__(self, format: str = "pt", device: str = "cpu", **kwargs: Any) -> None:
-        assert format in ["pt", "pil"], "format must be either Pytorch or PIL"
-        super().__init__(format=format, device=device, **kwargs)
-        self.format = format
+    def __init__(self, output_format: str = "pt", device: str = "cpu", **kwargs: Any) -> None:
+        assert output_format in ["pt", "pil"], "format must be either Pytorch or PIL"
+        super().__init__(format=output_format, device=device, **kwargs)
+        self.format = output_format
         self.device = device

As per coding guidelines, avoid shadowing builtins in Python code.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/inputs/media_io.py` around lines 124 - 128, The constructor
parameter name format shadows Python's built-in format()—rename it (e.g.,
image_format) in the __init__ signature, update the assertion to check
image_format in ["pt","pil"], pass image_format as the keyword to
super().__init__(...) instead of format, and set self.image_format (or chosen
name) instead of self.format; also update any other references to self.format or
the parameter elsewhere in the class/module to the new name to avoid the builtin
shadowing.

186-208: 💤 Low value

Consider renaming format parameter to avoid shadowing the builtin.

The format parameter shadows Python's built-in format() function. Consider renaming to video_format, output_format, or data_format.

♻️ Proposed fix
     def __init__(
         self,
         num_frames: int = 10,
         fps: int = 30,
-        format: str = "pt",
+        output_format: str = "pt",
         device: str = "cpu",
         extract_audio: bool = False,
         **kwargs: Any,
     ) -> None:
-        assert format in ["pt", "pil"], "format must be either Pytorch or PIL"
+        assert output_format in ["pt", "pil"], "format must be either Pytorch or PIL"
         super().__init__(
             num_frames=num_frames,
             fps=fps,
-            format=format,
+            format=output_format,
             device=device,
             extract_audio=extract_audio,
             **kwargs,
         )
         self.num_frames = num_frames
         self.fps = fps
-        self.format = format
+        self.format = output_format
         self.device = device
         self.extract_audio = extract_audio

As per coding guidelines, avoid shadowing builtins in Python code.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/inputs/media_io.py` around lines 186 - 208, The constructor
currently uses the parameter name `format`, shadowing Python's built-in
`format()`; rename this parameter to something like `video_format` (or
`output_format`) in the __init__ signature, update the assert to check
`video_format in ["pt","pil"]`, pass `video_format=video_format` to the
super().__init__ call (replacing the old format=...), and update the instance
attribute assignment (self.video_format = video_format) and any internal uses of
`self.format` to `self.video_format`; also update any callers/tests that
construct this class to use the new parameter name.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@tensorrt_llm/inputs/media_io.py`:
- Around line 124-128: The constructor parameter name format shadows Python's
built-in format()—rename it (e.g., image_format) in the __init__ signature,
update the assertion to check image_format in ["pt","pil"], pass image_format as
the keyword to super().__init__(...) instead of format, and set
self.image_format (or chosen name) instead of self.format; also update any other
references to self.format or the parameter elsewhere in the class/module to the
new name to avoid the builtin shadowing.
- Around line 186-208: The constructor currently uses the parameter name
`format`, shadowing Python's built-in `format()`; rename this parameter to
something like `video_format` (or `output_format`) in the __init__ signature,
update the assert to check `video_format in ["pt","pil"]`, pass
`video_format=video_format` to the super().__init__ call (replacing the old
format=...), and update the instance attribute assignment (self.video_format =
video_format) and any internal uses of `self.format` to `self.video_format`;
also update any callers/tests that construct this class to use the new parameter
name.

In `@tensorrt_llm/serve/chat_utils.py`:
- Around line 94-105: The _make_media_io function currently types modality as
str but should use the narrower MediaModality literal type to match
MEDIA_IO_REGISTRY keys; update the signature of _make_media_io to accept
modality: MediaModality, import MediaModality from tensorrt_llm.inputs.media_io,
and adjust any callers if needed to satisfy the stronger type (or add explicit
casts where caller values are validated) so the registry lookup and create call
remain unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0670a46b-59a9-41e4-b7d5-fc1a3e8bcf9b

📥 Commits

Reviewing files that changed from the base of the PR and between e197b69 and d64377d.

📒 Files selected for processing (4)
  • tensorrt_llm/inputs/media_io.py
  • tensorrt_llm/inputs/utils.py
  • tensorrt_llm/serve/chat_utils.py
  • tests/unittest/llmapi/apps/test_chat_utils.py


from tensorrt_llm.inputs.utils import (
VideoData,
_get_aiohttp_session,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think these should moved in here? Conceptually, it makes more sense, since media_io is a more descriptive module name for media loading code than utils.

It also means we don't have to import protected functions.

import soundfile
from PIL import Image

from tensorrt_llm.inputs.utils import (
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#13800 moved this to multimodal_data instead of utils.


Receives the original URL string (not a parsed path) so that
subclasses can match the modality-specific normalization done by the
legacy free functions.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: let's not mention these "legacy free functions".

"""
...

async def async_load(self, url: str) -> _MediaT:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

else:
raise ValueError(f"Unsupported URL scheme: {parsed.scheme!r}")

async def load_base64_async(self, media_type: str, data: str) -> _MediaT:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary at the interface level, or can callers take care of this?

super().__init__(**kwargs)

async def async_load(self, url: str) -> Tuple[np.ndarray, int]:
# Override to match `async_load_audio`, which does not accept `data:`
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't understand this part - why do we need to overload this?

extract_audio=extract_audio,
**kwargs,
)
self.num_frames = num_frames
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar question here re whether these need to be public.

"""

def __init__(self, **kwargs: Any) -> None:
self._kwargs = kwargs
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems the only real use for this is to log debug statements in chat_utils.py - could the logging be moved to create instead? It can include the cls.__name__ in the debug string.

media_io_cls = MEDIA_IO_REGISTRY.get(modality, BaseMediaIO)
return media_io_cls.merge_kwargs(default_kwargs, runtime_kwargs)
return media_io_cls.create(
(server_kwargs or {}).get(modality),
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: could just instantiate server_kwargs = ... else {} at line 100.

"tensorrt_llm.serve.chat_utils.async_load_image", return_value=MagicMock()
) as mock_loader:

from tensorrt_llm.serve import chat_utils as _chat_utils
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: move to module-level?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants