Skip to content

Implemented generic multimodal chat handler.#125

Open
alcoftTAO wants to merge 11 commits into
JamePeng:mainfrom
TAO71-AI:mtmd
Open

Implemented generic multimodal chat handler.#125
alcoftTAO wants to merge 11 commits into
JamePeng:mainfrom
TAO71-AI:mtmd

Conversation

@alcoftTAO
Copy link
Copy Markdown

@alcoftTAO alcoftTAO commented May 4, 2026

  • Implemented a generic/global multimodal chat handler.

What does it do?

It automatically uses the model's chat template and replaces all of the model's multimodal tags with the media_marker tag.

This allows a much easier implementation for multimodal models, since the chat template doesn't need to be hard-coded for each model.

How to use it?

It is as simple as passing the clip_model_path parameter to the Llama class when created.

Note

Using the previous implementation (e.g. Qwen35ChatHandler) still works.

I'm also looking forward to implement more model architectures. Please, reply if you want me to implement any.

@JamePeng
Copy link
Copy Markdown
Owner

JamePeng commented May 5, 2026

You can take a look at how to improve the injection process. #110

@JamePeng
Copy link
Copy Markdown
Owner

JamePeng commented May 13, 2026

It seems there's no work on how to perform URL injection for multimedia; simply replacing it with a media marker isn't enough.

This code also needs to be removed:

if hasattr(llama, 'input_ids'):
    llama.input_ids.fill(0)

Architecture-based tag guessing should not default unknown models to Qwen-style tags. Prefer detecting media tags from the actual chat template, or better, avoid tag guessing by normalizing OpenAI content parts into placeholders before rendering.

KNOWN_MEDIA_TAGS = [
    "<|image_pad|>",
    "<|audio_pad|>",
    "<|video_pad|>",
    "<|image|>",
    "<|audio|>",
    "<|video|>",
    "[IMG]",
]

and

self._chat_format_parser_tags = [
    tag for tag in KNOWN_MEDIA_TAGS
    if tag in self.chat_format
]

In addition, a check is needed to ensure that the number of replacement markers matches the number of incoming media.

@alcoftTAO alcoftTAO marked this pull request as draft May 14, 2026 15:19
@alcoftTAO
Copy link
Copy Markdown
Author

@JamePeng What do you think of this code?

@JamePeng JamePeng force-pushed the main branch 8 times, most recently from e1caafb to 628373c Compare May 16, 2026 12:09
@JamePeng
Copy link
Copy Markdown
Owner

You can test the multimodal usage of qwen3vl, qwen3.5/3.6, and gemma4.
In particular, check if the omni function of gemma4 is affected.

JamePeng and others added 3 commits May 19, 2026 19:25
Signed-off-by: JamePeng <jame_peng@sina.com>
- Add a PowerShell step to the Windows CI workflow to locate and copy
  `libomp140.x86_64.dll` from the Visual Studio redistributables.
- Place the runtime DLL into the `llama_cpp\lib` package directory.

This ensures that the dynamically loaded `ggml-cpu-*.dll` variants
(which are built with LLVM OpenMP on Windows) have their required
dependencies packaged in the wheel. Without this,
`ggml_backend_load_all_from_path()` can silently fail to load the CPU
backends at runtime on end-user machines.

Signed-off-by: JamePeng <jame_peng@sina.com>
@JamePeng JamePeng force-pushed the main branch 2 times, most recently from 78ead7c to 615e45a Compare May 23, 2026 05:58
@JamePeng
Copy link
Copy Markdown
Owner

@alcoftTAO Hello, could you resolve some file conflicts? It seems like adding unrelated files...

@alcoftTAO
Copy link
Copy Markdown
Author

@alcoftTAO Hello, could you resolve some file conflicts? It seems like adding unrelated files...

I'm working on this.

@alcoftTAO
Copy link
Copy Markdown
Author

I think it's fixed. Please let me know if anything is wrong.

@alcoftTAO
Copy link
Copy Markdown
Author

You can test the multimodal usage of qwen3vl, qwen3.5/3.6, and gemma4. In particular, check if the omni function of gemma4 is affected.

Both image and audio capabilities work as expected.

@alcoftTAO alcoftTAO marked this pull request as ready for review May 27, 2026 23:55
@JamePeng
Copy link
Copy Markdown
Owner

JamePeng commented May 28, 2026

It's suggested that clip_model_path: Optional[str] = None be uniformly changed to mmproj_path: Optional[str] = None, because it's no longer just mtmd that's used for simple clipping.

It's also worth considering adding the chat_template_override: Optional[str] = None feature. I've tested it with embedding or rerank models that don't have a chat template, allowing users to easily write their own based on their token ID and pass in a temporary chat template to achieve this functionality.

Alternatively, it's unnecessary to pass in the entire metadata. You can pass in the pre-processed template_choices at the end of llama.init to speed up template retrieval, or use the chat template wrapper I added to LlamaModel last night, with name=None to retrieve the default template. Both methods can speed up and reduce unnecessary memory transfers and throughput, thus improving performance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants