SignalRT commented Jan 19, 2026

Summary:
This PR delivers a full multimodal chat pipeline in LLama.Web: PDF and Word document ingestion with text extraction, image and audio uploads, native in‑browser audio recording (preview/attach/discard), and streaming response rendering with Markdown support.

Key Features:

  • Streaming chat responses rendered incrementally.
  • Markdown rendering in the UI (including code blocks, lists, etc.).
  • Multimodal inference pipeline with MTMD support wired into session execution.
  • PDF ingestion with text extraction and truncation safeguards.
  • Word (DOCX) ingestion with text extraction from document XML.
  • Image uploads supported end‑to‑end (validation, storage, rendering in chat).
  • Audio uploads supported end‑to‑end (validation, storage, playback in chat).
  • In‑browser audio recording (MediaRecorder) with preview + attach/discard workflow.
  • Capability‑aware UI (shows whether text/vision/audio are supported per model).
  • Automatic model download with progress display.
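The PR itself carries the implementation; as a rough illustration of the DOCX ingestion and truncation-safeguard bullets above, here is a minimal TypeScript sketch. It assumes the `word/document.xml` entry has already been unzipped to a string; the function name `extractDocxText` and the character cap are hypothetical, not taken from the PR.

```typescript
// Hypothetical sketch: pull plain text out of a DOCX document.xml string.
// WordprocessingML stores text in <w:t> runs grouped into <w:p> paragraphs.
function extractDocxText(xml: string, maxChars = 10_000): string {
  // Each </w:p> closes a paragraph; join its <w:t> runs into one line.
  const lines = xml
    .split(/<\/w:p>/)
    .map(p =>
      [...p.matchAll(/<w:t[^>]*>([^<]*)<\/w:t>/g)]
        .map(m => m[1])
        .join("")
    )
    .filter(line => line.length > 0);

  const text = lines.join("\n");

  // Truncation safeguard: cap the extracted text so a huge document
  // cannot overflow the model's context window (maxChars is illustrative).
  return text.length > maxChars ? text.slice(0, maxChars) : text;
}
```

A real implementation also has to unzip the DOCX container and handle tables, hyperlinks, and escaped XML entities, which this sketch ignores.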

Implementation Highlights:

  • Attachment service handles file validation, storage, and extraction (PDF/DOCX).
  • Model session builds prompts with attached media and enforces capability checks.
  • Chat UI renders images/audio and guides users on supported inputs.
  • Audio capture converts recordings into a browser File so they reuse the existing upload flow.
  • Streaming tokens update the UI while Markdown is rendered on the fly.
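To make the "enforces capability checks" highlight concrete, here is a small TypeScript sketch of a capability-aware attachment gate. The shape of `ModelCapabilities` and the function name `canAttach` are assumptions for illustration, not the PR's actual API.

```typescript
// Hypothetical capability check run before an attachment is accepted.
interface ModelCapabilities {
  vision: boolean; // can the loaded model consume images?
  audio: boolean;  // can it consume audio clips?
}

function canAttach(file: { mimeType: string }, caps: ModelCapabilities): boolean {
  if (file.mimeType.startsWith("image/")) return caps.vision;
  if (file.mimeType.startsWith("audio/")) return caps.audio;
  // PDF/DOCX attachments are reduced to extracted text,
  // which every text model accepts.
  return true;
}
```

The UI can use the same check to grey out upload buttons per model, which matches the "capability-aware UI" feature listed above.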

Capability to upload images and ask about the images:
[screenshot]

Model auto-download plus capability to upload files and ask about the files:
[screenshot]
