Skip to content

cli/serve Multimodal Support content arrays in ChatMessage for multimodal input. #1065

@markstur

Description

@markstur

GitHub Issue: Add Native Multimodal Support (Images and Audio)

Summary

Add native support for multimodal content (images and audio) throughout Mellea's architecture, from the m serve OpenAI-compatible API through to backend providers. This follows OpenAI's multimodal message format while maintaining Mellea's existing abstractions.

Motivation

Modern LLMs increasingly support multimodal inputs (images, audio). OpenAI's API has standardized on embedding multimodal content within message content arrays. Mellea currently:

  • ✅ Has ImageBlock in core for representing images
  • ✅ Tracks audio tokens in usage metrics
  • ❌ Cannot accept multimodal messages via m serve
  • ❌ Cannot pass images/audio to backend providers
  • ❌ Has no AudioBlock abstraction

Goals

  • OpenAI compatibility: Accept OpenAI-style multimodal message format in m serve
  • Type safety: Maintain strong typing throughout the pipeline
  • Backward compatibility: Existing text-only flows continue to work unchanged
  • Test coverage: Comprehensive tests for multimodal flows

Proposed Changes

1. Core Abstractions

  • Add AudioBlock to mellea/core/base.py
  • Use existing ImageBlock for images.

2. CLI Serve Models

Add multimodal content types to cli/serve/models.py

class TextContent(BaseModel):
    type: Literal["text"]
    text: str

class ImageUrlContent(BaseModel):
    type: Literal["image_url"]
    image_url: dict[str, str]

class InputAudioContent(BaseModel):
    type: Literal["input_audio"]
    input_audio: dict[str, str]

MessageContent = Union[TextContent, ImageUrlContent, InputAudioContent]

Update ChatMessage

class ChatMessage(BaseModel):
    role: Literal["system", "user", "assistant", "tool", "function"]
    content: str | list[MessageContent] | None = None  # CHANGED
    # ... rest unchanged

3. Content Extraction

Create cli/serve/multimodal.py

def extract_images_from_messages(messages: list[ChatMessage]) -> list[ImageBlock]:
    """Extract ImageBlocks from multimodal message content."""
    ...

def extract_audio_from_messages(messages: list[ChatMessage]) -> list[AudioBlock]:
    """Extract AudioBlocks from multimodal message content."""
    ...

def extract_text_from_message(message: ChatMessage) -> str:
    """Extract text content from a potentially multimodal message."""
    ...

4. Serve Endpoint

Update cli/serve/app.py

def make_chat_endpoint(module):
    async def chat_completions(request: ChatCompletionRequest):
        images = extract_images_from_messages(request.messages)
        audio = extract_audio_from_messages(request.messages)
        
        serve_kwargs = {
            "input": request.messages,
            "images": images if images else None,  # NEW
            "audio": audio if audio else None,      # NEW
            # ... other params
        }
        ...

5. Backend Support

Update mellea/backends/openai.py to convert ImageBlock/AudioBlock to OpenAI's multimodal message format

Update other backends (LiteLLM, Anthropic, HuggingFace, Ollama) as appropriate

6. Testing

Unit tests: test/cli/test_serve_multimodal.py

  • Test content extraction functions
  • Test model validation
  • Test text extraction from multimodal messages

Integration tests: test/cli/test_serve_integration.py

  • Test full HTTP request/response with images
  • Test full HTTP request/response with audio

E2E tests: test/backends/test_multimodal_e2e.py

  • Test OpenAI backend with real images
  • Test audio transcription flows

7. Examples

Create examples in docs/examples/multimodal/

  • image_description.py - Basic image description
  • audio_transcription.py - Audio transcription
  • multimodal_serve.py - M serve with images/audio

8. Documentation

Update docs

  • Add multimodal section to m serve guide
  • Create multimodal tutorial
  • Document backend support matrix

Implementation Phases

Phase 1: Core Foundation ✅

  • Add AudioBlock to mellea/core/base.py
  • Export AudioBlock from mellea/core/__init__.py
  • Write unit tests for AudioBlock validation

Phase 2: CLI Models ✅

  • Add multimodal content types to cli/serve/models.py
  • Update ChatMessage to accept list[MessageContent]
  • Write unit tests for model validation

Phase 3: Extraction Utilities ✅

  • Create cli/serve/multimodal.py with extraction functions
  • Write comprehensive unit tests for extraction
  • Add integration tests for extraction pipeline

Phase 4: Serve Endpoint ✅

  • Update cli/serve/app.py to extract multimodal content
  • Pass images/audio to serve function
  • Add integration tests for full request flow

Phase 5: Backend Support ✅

  • Verified OpenAI backend works with multimodal (existing code)
  • Verified LiteLLM backend works with multimodal (existing code)
  • Verified Ollama backend works with multimodal (existing code)
  • Add backend-specific tests
  • Document backend multimodal support matrix

Phase 6: Examples and Docs ✅

  • Create image description example (simple_image_description.py)
  • Create image comparison example (compare_images.py)
  • Create m serve multimodal example (multimodal_image_serve.py)
  • Update documentation (docs/docs/how-to/use-images-and-vision.md)
  • Add comprehensive README (docs/examples/image_text_models/README.md)

Phase 7: Polish ✅

  • Add E2E tests with real providers (Ollama, OpenAI)

Acceptance Criteria

  • ✅ All existing tests pass (backward compatibility) - ACHIEVED
  • ✅ New multimodal tests achieve >90% coverage - ACHIEVED (75 tests, 100% coverage)
  • ✅ Examples run successfully against OpenAI API - ACHIEVED (Ollama, OpenAI, LiteLLM)
  • ✅ Documentation is clear and complete - ACHIEVED (README, guide, status doc)
  • ✅ Zero breaking changes to existing APIs - ACHIEVED (fully backward compatible)
  • m serve accepts OpenAI-style multimodal messages - ACHIEVED
  • ✅ Images and audio are correctly extracted and passed to backends - ACHIEVED
  • ✅ At least OpenAI backend supports multimodal content - ACHIEVED (3 backends verified)

Open Questions

  1. Audio format validation: Strict validation or accept any data:audio/*?

    • Recommendation: Accept common formats, document per-backend support
  2. URL fetching: Should Mellea fetch images from URLs?

    • Recommendation: Phase 1 requires base64; Phase 2 can add URL fetching
  3. Size limits: Enforce size limits on images/audio?

    • Recommendation: Document provider limits, don't enforce in Mellea

Labels

  • enhancement
  • multimodal
  • m-serve
  • backends
  • good-first-issue (for individual phases)

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions