[TRTLLM-12533][refactor] Move Media IO modality loading into MediaIO Interfaces by aswinvisva · Pull Request #14010 · NVIDIA/TensorRT-LLM

aswinvisva · 2026-05-11T22:29:09Z

Summary by CodeRabbit

Release Notes

Improvements
- Standardized media loading across images, audio, and video for more consistent behavior.
- Enhanced support for multiple input formats including bytes, base64-encoded data, URLs, and local file paths.
- Improved reliability and error handling in multimodal content processing.

Description

This PR created the MediaIO interface for video, image and audio. It added just the modality-specific logic for merging media IO kwargs at the request level and the server level. This PR moves some more modality-specific logic for loading IO data (URLs, data, local file) and postprocessing the IO data into the MediaIO interface.

Test Coverage

Tests added to the test_chat_utils.py which covers IO loading, updated the tests to load with the new MediaIO functions.

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

…interfaces Signed-off-by: Aswin Visva <31215515+aswinvisva@users.noreply.github.com>

coderabbitai · 2026-05-11T22:35:03Z

📝 Walkthrough

Walkthrough

This PR refactors the media I/O system from a kwargs-resolution function to a generic BaseMediaIO[_MediaT] factory-based architecture. It introduces URL-scheme dispatch (http/https, data: URIs, local files), standardized abstract methods across Image/Audio/Video implementations, and integrates the factory into chat message parsing for all modality types and base64 audio decoding.

Changes

Media I/O Refactoring

Layer / File(s)	Summary
Generic Media I/O Architecture `tensorrt_llm/inputs/media_io.py`	Adds generic typing (`TypeVar _MediaT`) and transforms `BaseMediaIO` to `BaseMediaIO(ABC, Generic[_MediaT])` with factory method `create(...)` and `merge_kwargs(...)` for shallow config merging.
Async Load Dispatch & Helpers `tensorrt_llm/inputs/media_io.py`	Implements `async_load(url)` with URL-scheme dispatch: http/https fetch, `data:` base64 URI parsing, local file/bare-path delegation. Adds abstract methods `load_bytes`, `load_base64`, `load_file` and async helper `load_base64_async(...)`.
Image Media I/O Implementation `tensorrt_llm/inputs/media_io.py`	Implements `load_bytes`, `load_base64`, `load_file` entry points and centralizes output via `_postprocess()`: returns torch tensor (with `torchvision.transforms.ToTensor` on configured device) when `format="pt"`, otherwise returns PIL image.
Audio Media I/O Implementation `tensorrt_llm/inputs/media_io.py`	Implements decode entry points using `soundfile` for bytes/base64/files. Overrides `async_load(url)` to accept only http/https and local paths, rejecting other schemes. Normalizes `file://` URIs via `_normalize_file_uri`.
Video Media I/O Implementation `tensorrt_llm/inputs/media_io.py`	Adds constructor parameters `num_frames`, `fps`, `format`, `device`, `extract_audio`. Implements `load_bytes` (writes to temp `.mp4`), `load_base64`, `load_file` entry points. Preserves `merge_kwargs` coupling: dropping `fps` or `num_frames` defaults when only one is overridden by request.
Media I/O Factory & Lookup `tensorrt_llm/serve/chat_utils.py`	Introduces `_make_media_io(modality, server_config, request_kwargs)` to select `BaseMediaIO` subclass from `MEDIA_IO_REGISTRY` and construct via `create(...)` with merged server/request configuration. Replaces removed `resolve_media_io_kwargs(...)` function.
URL-Based Content Loading `tensorrt_llm/serve/chat_utils.py`	Updates `parse_chat_message_content_part` for `image_url`, `video_url`, `audio_url` parts: replaces separate `async_load_*` calls with `_make_media_io(...)` factory and `media_io.async_load(url)`, logging merged kwargs before load.
Base64 Audio Decoding `tensorrt_llm/serve/chat_utils.py`	Updates `input_audio` branch to decode base64 via `audio_io.load_base64_async("audio/raw", audio_data)` instead of `async_load_audio(..., is_base64=True)`. Removes stale imports for `async_load_*` functions.
Documentation Reference `tensorrt_llm/inputs/utils.py`	Updates comment in `MultimodalDataTracker.__init__` to reference `merge_media_io_kwargs` (replacing `resolve_media_io_kwargs`) for where per-request `media_io_kwargs` overrides merge with server defaults.
Test Coverage & Validation `tests/unittest/llmapi/apps/test_chat_utils.py`	Updates imports to use `_make_media_io`. Refactors exception-propagation tests to patch `*MediaIO.async_load` methods. Adds configuration tests validating shallow-merge behavior and video fps/num_frames coupling. Rewrites loader-forwarding tests to spy on `_make_media_io` and assert merged config baked into constructed `MediaIO` instances.

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically describes the main refactoring: moving media IO modality loading logic into the MediaIO interfaces.
Description check	✅ Passed	The description adequately explains the context and changes, referencing the prior PR, describing the refactoring approach, and listing test coverage updates.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

Generate code and open pull requests
Plan features and break down work
Investigate incidents and troubleshoot customer tickets together
Automate recurring tasks and respond to alerts with triggers
Summarize progress and report instantly

Built for teams:

Shared memory across your entire org—no repeating context
Per-thread sandboxes to safely plan and execute work
Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (3)

tensorrt_llm/serve/chat_utils.py (1)

94-105: 💤 Low value

Consider using MediaModality Literal type for the modality parameter.

The modality parameter accepts any string, but MEDIA_IO_REGISTRY only contains specific keys ("image", "video", "audio"). Using the MediaModality Literal type (already defined in tensorrt_llm.inputs.media_io) would provide compile-time type safety.

🔧 Proposed fix

+from tensorrt_llm.inputs.media_io import MEDIA_IO_REGISTRY, BaseMediaIO, MediaModality

 def _make_media_io(
-    modality: str,
+    modality: MediaModality,
     server_config: Optional[MultimodalServerConfig],
     request_kwargs: Optional[Dict[str, Dict[str, Any]]],
 ) -> BaseMediaIO:

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/serve/chat_utils.py` around lines 94 - 105, The _make_media_io
function currently types modality as str but should use the narrower
MediaModality literal type to match MEDIA_IO_REGISTRY keys; update the signature
of _make_media_io to accept modality: MediaModality, import MediaModality from
tensorrt_llm.inputs.media_io, and adjust any callers if needed to satisfy the
stronger type (or add explicit casts where caller values are validated) so the
registry lookup and create call remain unchanged.

tensorrt_llm/inputs/media_io.py (2)

124-128: 💤 Low value

Consider renaming format parameter to avoid shadowing the builtin.

The format parameter shadows Python's built-in format() function. Consider renaming to image_format, output_format, or data_format for clarity and to avoid the shadowing.

♻️ Proposed fix

-    def __init__(self, format: str = "pt", device: str = "cpu", **kwargs: Any) -> None:
-        assert format in ["pt", "pil"], "format must be either Pytorch or PIL"
-        super().__init__(format=format, device=device, **kwargs)
-        self.format = format
+    def __init__(self, output_format: str = "pt", device: str = "cpu", **kwargs: Any) -> None:
+        assert output_format in ["pt", "pil"], "format must be either Pytorch or PIL"
+        super().__init__(format=output_format, device=device, **kwargs)
+        self.format = output_format
         self.device = device

As per coding guidelines, avoid shadowing builtins in Python code.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/inputs/media_io.py` around lines 124 - 128, The constructor
parameter name format shadows Python's built-in format()—rename it (e.g.,
image_format) in the __init__ signature, update the assertion to check
image_format in ["pt","pil"], pass image_format as the keyword to
super().__init__(...) instead of format, and set self.image_format (or chosen
name) instead of self.format; also update any other references to self.format or
the parameter elsewhere in the class/module to the new name to avoid the builtin
shadowing.

186-208: 💤 Low value

Consider renaming format parameter to avoid shadowing the builtin.

The format parameter shadows Python's built-in format() function. Consider renaming to video_format, output_format, or data_format.

♻️ Proposed fix

     def __init__(
         self,
         num_frames: int = 10,
         fps: int = 30,
-        format: str = "pt",
+        output_format: str = "pt",
         device: str = "cpu",
         extract_audio: bool = False,
         **kwargs: Any,
     ) -> None:
-        assert format in ["pt", "pil"], "format must be either Pytorch or PIL"
+        assert output_format in ["pt", "pil"], "format must be either Pytorch or PIL"
         super().__init__(
             num_frames=num_frames,
             fps=fps,
-            format=format,
+            format=output_format,
             device=device,
             extract_audio=extract_audio,
             **kwargs,
         )
         self.num_frames = num_frames
         self.fps = fps
-        self.format = format
+        self.format = output_format
         self.device = device
         self.extract_audio = extract_audio

As per coding guidelines, avoid shadowing builtins in Python code.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/inputs/media_io.py` around lines 186 - 208, The constructor
currently uses the parameter name `format`, shadowing Python's built-in
`format()`; rename this parameter to something like `video_format` (or
`output_format`) in the __init__ signature, update the assert to check
`video_format in ["pt","pil"]`, pass `video_format=video_format` to the
super().__init__ call (replacing the old format=...), and update the instance
attribute assignment (self.video_format = video_format) and any internal uses of
`self.format` to `self.video_format`; also update any callers/tests that
construct this class to use the new parameter name.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@tensorrt_llm/inputs/media_io.py`:
- Around line 124-128: The constructor parameter name format shadows Python's
built-in format()—rename it (e.g., image_format) in the __init__ signature,
update the assertion to check image_format in ["pt","pil"], pass image_format as
the keyword to super().__init__(...) instead of format, and set
self.image_format (or chosen name) instead of self.format; also update any other
references to self.format or the parameter elsewhere in the class/module to the
new name to avoid the builtin shadowing.
- Around line 186-208: The constructor currently uses the parameter name
`format`, shadowing Python's built-in `format()`; rename this parameter to
something like `video_format` (or `output_format`) in the __init__ signature,
update the assert to check `video_format in ["pt","pil"]`, pass
`video_format=video_format` to the super().__init__ call (replacing the old
format=...), and update the instance attribute assignment (self.video_format =
video_format) and any internal uses of `self.format` to `self.video_format`;
also update any callers/tests that construct this class to use the new parameter
name.

In `@tensorrt_llm/serve/chat_utils.py`:
- Around line 94-105: The _make_media_io function currently types modality as
str but should use the narrower MediaModality literal type to match
MEDIA_IO_REGISTRY keys; update the signature of _make_media_io to accept
modality: MediaModality, import MediaModality from tensorrt_llm.inputs.media_io,
and adjust any callers if needed to satisfy the stronger type (or add explicit
casts where caller values are validated) so the registry lookup and create call
remain unchanged.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0670a46b-59a9-41e4-b7d5-fc1a3e8bcf9b

📥 Commits

Reviewing files that changed from the base of the PR and between e197b69 and d64377d.

📒 Files selected for processing (4)

tensorrt_llm/inputs/media_io.py
tensorrt_llm/inputs/utils.py
tensorrt_llm/serve/chat_utils.py
tests/unittest/llmapi/apps/test_chat_utils.py

2ez4bz · 2026-05-12T03:27:10Z

+
+from tensorrt_llm.inputs.utils import (
+    VideoData,
+    _get_aiohttp_session,


Do you think these should moved in here? Conceptually, it makes more sense, since media_io is a more descriptive module name for media loading code than utils.

It also means we don't have to import protected functions.

2ez4bz · 2026-05-12T03:27:25Z

+import soundfile
+from PIL import Image
+
+from tensorrt_llm.inputs.utils import (


#13800 moved this to multimodal_data instead of utils.

2ez4bz · 2026-05-12T03:28:18Z

+
+        Receives the original URL string (not a parsed path) so that
+        subclasses can match the modality-specific normalization done by the
+        legacy free functions.


Nit: let's not mention these "legacy free functions".

2ez4bz · 2026-05-12T03:28:42Z

+        """
+        ...
+
+    async def async_load(self, url: str) -> _MediaT:


2ez4bz · 2026-05-12T03:30:16Z

+        else:
+            raise ValueError(f"Unsupported URL scheme: {parsed.scheme!r}")
+
+    async def load_base64_async(self, media_type: str, data: str) -> _MediaT:


Is this necessary at the interface level, or can callers take care of this?

2ez4bz · 2026-05-12T03:37:38Z

+        super().__init__(**kwargs)
+
+    async def async_load(self, url: str) -> Tuple[np.ndarray, int]:
+        # Override to match `async_load_audio`, which does not accept `data:`


Sorry, I don't understand this part - why do we need to overload this?

2ez4bz · 2026-05-12T03:37:56Z

+            extract_audio=extract_audio,
+            **kwargs,
+        )
+        self.num_frames = num_frames


Similar question here re whether these need to be public.

2ez4bz · 2026-05-12T03:40:55Z

    """

+    def __init__(self, **kwargs: Any) -> None:
+        self._kwargs = kwargs


It seems the only real use for this is to log debug statements in chat_utils.py - could the logging be moved to create instead? It can include the cls.__name__ in the debug string.

2ez4bz · 2026-05-12T06:20:09Z

    media_io_cls = MEDIA_IO_REGISTRY.get(modality, BaseMediaIO)
-    return media_io_cls.merge_kwargs(default_kwargs, runtime_kwargs)
+    return media_io_cls.create(
+        (server_kwargs or {}).get(modality),


Nit: could just instantiate server_kwargs = ... else {} at line 100.

2ez4bz · 2026-05-12T06:20:51Z

-            "tensorrt_llm.serve.chat_utils.async_load_image", return_value=MagicMock()
-        ) as mock_loader:
+
+        from tensorrt_llm.serve import chat_utils as _chat_utils


Nit: move to module-level?

[TRTLLM-12533][refactor] Move Media IO modality loading into MediaIO …

d64377d

…interfaces Signed-off-by: Aswin Visva <31215515+aswinvisva@users.noreply.github.com>

aswinvisva requested review from a team as code owners May 11, 2026 22:29

aswinvisva requested review from 2ez4bz, Wanli-Jiang and hchings May 11, 2026 22:29

github-actions Bot assigned aswinvisva May 11, 2026

coderabbitai Bot reviewed May 11, 2026

View reviewed changes

2ez4bz reviewed May 12, 2026

View reviewed changes

Conversation

aswinvisva commented May 11, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Release Notes

Description

Test Coverage

PR Checklist

GitHub Bot Help

Uh oh!

coderabbitai Bot commented May 11, 2026

Walkthrough

Changes

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

aswinvisva commented May 11, 2026 •

edited by coderabbitai Bot

Loading