- Overview
- Features
- Tech Stack
- Quick Start
- Architecture
- Project Structure
- API Documentation
- ML Pipeline
- Deployment
- Contributing
- License
- Acknowledgments
UtsukushiiAI (Japanese: 美しい - "Beautiful") is an open-source, local-first generative media platform designed to automate the creation of high-energy, beat-synced "Manga Music Videos" (MMVs).
Designed for transparency and privacy, UtsukushiiAI runs entirely on your local hardware. It utilizes Computer Vision (CV) to segment static manga pages and Generative AI to animate them—without ever requiring your sensitive creative assets to leave your machine.
- Local-First Processing: Keep your storage and compute on your own machine.
- Environment-Driven Configuration: Set up your own MongoDB URIs and third-party API keys (YouTube, OpenAI, etc.) strictly via `.env` files.
- Rhythm Intelligence: Precisely sync visual transitions to audio beats automatically.
- Privacy by Design: User data stays on their machine unless explicitly sent to third-party APIs.
- Efficiency: Reduce manual video editing time from 4 hours to < 2 minutes using local ML pipelines.
- Multi-modal drag-and-drop for Manga (PDF, PNG, JPG) and Music (MP3, WAV)
- Automatic manga page extraction from PDF files
- YouTube audio download via yt-dlp for "Vibe-Matching" music
- Interactive Bounding Box (BBox) adjustment for AI-detected panels
- Real-time preview of panel segmentation
- Layer management for foreground/background elements
- Visual waveform display with automated beat-marker detection
- Millisecond-accurate sync point placement
- Drag-and-drop transition effects
- YOLOv12 panel detection with Manga109 fine-tuning
- MiDaS depth estimation for 3D "Wiggle" parallax effects
- Stable Video Diffusion (SVD) for character animation
- FFmpeg/MoviePy video composition
- Multiple aspect ratios: 9:16 (Stories/Reels), 16:9 (YouTube), 1:1 (Instagram)
- Quality presets: Draft, Standard, High, Ultra
- Direct upload to social platforms
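The export presets above boil down to an aspect-ratio crop decision. A minimal sketch of that math, assuming a simple centered crop (illustrative only, not the project's actual implementation):

```python
# Sketch: center-crop a source frame to one of the export aspect ratios.
# Ratio names mirror the presets above; the function itself is an
# illustrative assumption, not UtsukushiiAI code.

ASPECT_RATIOS = {
    "9:16": (9, 16),   # Stories / Reels
    "16:9": (16, 9),   # YouTube
    "1:1": (1, 1),     # Instagram
}

def center_crop(width: int, height: int, preset: str):
    """Return (x, y, w, h) of the largest centered crop with the given ratio."""
    num, den = ASPECT_RATIOS[preset]
    if width * den > height * num:          # source wider than target: trim sides
        w, h = height * num // den, height
    else:                                   # source taller: trim top/bottom
        w, h = width, width * den // num
    w -= w % 2                              # keep dimensions even for encoders
    h -= h % 2
    return ((width - w) // 2, (height - h) // 2, w, h)

print(center_crop(1920, 1080, "9:16"))  # → (657, 0, 606, 1080)
```

Integer arithmetic avoids float-rounding surprises, and the even-dimension clamp matches what H.264 encoders expect.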
| Technology | Purpose |
|---|---|
| Next.js 15 | App Router, React Server Components |
| TypeScript | Type safety across full stack |
| Zustand | Global state management |
| Remotion | Frame-accurate video preview |
| Framer Motion | Cinematic UI animations |
| Tailwind CSS | Utility-first styling |
| Technology | Purpose |
|---|---|
| Express.js | REST API server |
| Socket.io | Real-time progress updates |
| JWT | Authentication |
| File System | Local storage management |
| Technology | Purpose |
|---|---|
| FastAPI | High-performance ML API |
| Python 3.11+ | ML runtime |
| PyTorch | Deep learning framework |
| YOLOv12 | Panel detection |
| SAM 2 | Instance segmentation |
| MiDaS | Depth estimation |
| Stable Video Diffusion | Character animation |
| Librosa | Audio analysis |
| FFmpeg | Video processing |
| Technology | Purpose |
|---|---|
| MongoDB | Project metadata, panel JSON |
| Redis | Render queue, session cache |
| Local Storage | Object storage (disk) |
| Docker | Containerization |
- Node.js 20.x+
- Python 3.11+
- Docker & Docker Compose
- MongoDB 6.x
- Redis 7.x
- Local Storage (Disk Space)
- Clone the repository

```bash
git clone https://github.com/utsukushii/utsukushii-ai.git
cd utsukushii-ai
```

- Environment Configuration

```bash
# Copy environment files
cp apps/web/.env.example apps/web/.env.local
cp apps/api/.env.example apps/api/.env
cp apps/worker/.env.example apps/worker/.env

# Edit with your credentials
# ALL infrastructure must be configured here before starting!
# Required: MONGODB_URI, REDIS_URL, YOUTUBE_API_KEY, OPENAI_API_KEY, etc.
```

- Start with Docker Compose

```bash
# Start all services
docker-compose up -d

# Or start individual services
docker-compose up -d mongodb redis
```

- Install and Run Locally

```bash
# Frontend
cd apps/web
npm install
npm run dev

# Backend API
cd apps/api
npm install
npm run dev

# ML Worker
cd apps/worker
pip install -r requirements.txt
uvicorn main:app --reload
```

- Access the Application
- Frontend: http://localhost:3000
- API: http://localhost:4000
- ML Worker: http://localhost:8000
UtsukushiiAI follows a Polyglot Microservices architecture:
┌─────────────────────────────────────────────────────────────┐
│ Frontend (Next.js) │
│ http://localhost:3000 │
└─────────────────────────┬───────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ API Gateway (Express.js) │
│ http://localhost:4000 │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │
│ │ Auth API │ │ Project API │ │ Render API │ │
│ └─────────────┘ └─────────────┘ └─────────────────────┘ │
└─────────────────────────┬───────────────────────────────────┘
│
┌────────────────┼────────────────┐
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────────┐
│ MongoDB │ │ Redis │ │ Local Storage │
│ (State) │ │ (Queue) │ │ (Storage) │
└─────────────┘ └─────────────┘ └─────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ ML Worker (FastAPI + Python) │
│ http://localhost:8000 │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────────┐ │
│ │Detector │ │Segmenter │ │DepthEst │ │AudioAnalyzer│ │
│ └──────────┘ └──────────┘ └──────────┘ └────────────┘ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Video Composer (FFmpeg) │ │
│ └──────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
- Event-Driven: Real-time progress via WebSockets
- CQRS: Separate read/write models for project data
- Repository Pattern: Data abstraction layer
- Factory Pattern: Component creation for different video formats
- Observer Pattern: Job status updates
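The Observer pattern above can be sketched for job status updates; the class and method names here are illustrative assumptions, not taken from the codebase:

```python
# Sketch of the Observer pattern applied to render-job status updates:
# the job publishes status changes to every subscribed callback.

from typing import Callable

class RenderJob:
    """Publishes status changes to subscribed observers."""

    def __init__(self, job_id: str):
        self.job_id = job_id
        self._observers: list[Callable[[str, str], None]] = []

    def subscribe(self, observer: Callable[[str, str], None]) -> None:
        self._observers.append(observer)

    def set_status(self, status: str) -> None:
        # Notify every observer of the new status.
        for observer in self._observers:
            observer(self.job_id, status)

log = []
job = RenderJob("job-42")
job.subscribe(lambda jid, status: log.append(f"{jid}: {status}"))
job.set_status("processing")
job.set_status("complete")
print(log)  # → ['job-42: processing', 'job-42: complete']
```

In the real system the "observers" would be the Socket.io layer pushing `render:progress` events to clients.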
utsukushii-ai/
├── apps/
│ ├── web/ # Next.js 15 Frontend
│ │ ├── src/
│ │ │ ├── app/ # App Router pages
│ │ │ ├── components/ # React components
│ │ │ ├── hooks/ # Custom hooks
│ │ │ ├── lib/ # Utilities
│ │ │ ├── stores/ # Zustand stores
│ │ │ └── types/ # TypeScript types
│ │ └── public/ # Static assets
│ │
│ ├── api/ # Express.js Backend
│ │ ├── src/
│ │ │ ├── controllers/ # Route handlers
│ │ │ ├── middleware/ # Express middleware
│ │ │ ├── models/ # Business logic
│ │ │ ├── routes/ # API routes
│ │ │ ├── services/ # Core services
│ │ │ └── utils/ # Utilities
│ │ └── tests/ # API tests
│ │
│ └── worker/ # FastAPI ML Worker
│ ├── src/
│ │ ├── models/ # ML models
│ │ ├── pipelines/ # Processing pipelines
│ │ ├── services/ # ML services
│ │ └── utils/ # Utilities
│ ├── models/ # Downloaded model weights
│ └── tests/ # ML tests
│
├── packages/
│ ├── shared/ # Shared types & utilities
│ │ ├── types/ # TypeScript types
│ │ └── utils/ # Shared utilities
│ │
│ ├── database/ # MongoDB connection & schemas
│ │
│ ├── cache/ # Redis client & utilities
│ │
│ └── storage/ # Local storage utilities
│
├── tools/
│ ├── scripts/ # Build & deployment scripts
│ └── configs/ # Configuration files
│
├── docs/ # Documentation
│ ├── architecture/
│ ├── api/
│ └── assets/
│
├── docker-compose.yml # Local development
├── docker-compose.prod.yml # Production deployment
├── turbo.json # Turborepo config
├── package.json # Root package.json
└── README.md # This file
| Endpoint | Method | Description |
|---|---|---|
| /api/auth/register | POST | Register new user |
| /api/auth/login | POST | Login user |
| /api/auth/logout | POST | Logout user |
| /api/auth/refresh | POST | Refresh JWT token |
| /api/auth/me | GET | Get current user |
| Endpoint | Method | Description |
|---|---|---|
| /api/projects | GET | List user projects |
| /api/projects | POST | Create new project |
| /api/projects/:id | GET | Get project details |
| /api/projects/:id | PUT | Update project |
| /api/projects/:id | DELETE | Delete project |
| /api/projects/:id/panels | GET | Get project panels |
| /api/projects/:id/panels | POST | Add panel to project |
| /api/projects/:id/panels/:panelId | PUT | Update panel |
| /api/projects/:id/panels/:panelId | DELETE | Delete panel |
| Endpoint | Method | Description |
|---|---|---|
| /api/render/start | POST | Start render job |
| /api/render/:jobId | GET | Get render status |
| /api/render/:jobId | DELETE | Cancel render job |
| /api/render/:jobId/download | GET | Download rendered video |
| Endpoint | Method | Description |
|---|---|---|
| /api/upload/direct | POST | Get direct upload URL |
| /api/upload/complete | POST | Confirm upload complete |
| Event | Direction | Description |
|---|---|---|
| render:progress | Server→Client | Render progress update |
| render:complete | Server→Client | Render completed |
| render:error | Server→Client | Render error |
| panel:detected | Server→Client | Panel detection complete |
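To illustrate how a client might react to these events, here is a minimal handler registry. The payload fields (`jobId`, `percent`, `url`) are assumptions, and a real client would use a Socket.io library rather than this toy dispatcher:

```python
# Illustration of dispatching the Server→Client events listed above.
# Payload field names are assumptions, not the documented schema.

handlers = {}

def on(event):
    """Register a handler function for a named event."""
    def register(fn):
        handlers[event] = fn
        return fn
    return register

@on("render:progress")
def handle_progress(data):
    return f"job {data['jobId']}: {data['percent']}%"

@on("render:complete")
def handle_complete(data):
    return f"job {data['jobId']} done -> {data['url']}"

# Simulated incoming events:
print(handlers["render:progress"]({"jobId": "j1", "percent": 40}))
print(handlers["render:complete"]({"jobId": "j1", "url": "/videos/j1.mp4"}))
```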
Stage 1: Panel Detection (YOLOv12)
- Input: Manga image (PNG/JPG/PDF page)
- Output: Bounding box coordinates for each panel
- Model: Fine-tuned on Manga109 dataset
- Coordinates: Normalized (0.0 - 1.0) for scale invariance
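Because panel coordinates are normalized, consumers must map them back to pixel space for each page. A minimal sketch (the `(x, y, w, h)` tuple layout is an assumption about the panel JSON):

```python
# Sketch: converting normalized (0.0–1.0) panel coordinates back to
# pixel space for a given page size.

def denormalize_bbox(bbox, page_w, page_h):
    """bbox = (x, y, w, h) in 0.0–1.0 → pixel-space ints."""
    x, y, w, h = bbox
    return (round(x * page_w), round(y * page_h),
            round(w * page_w), round(h * page_h))

# A detected panel on an 827×1170 manga page:
print(denormalize_bbox((0.05, 0.1, 0.4, 0.3), 827, 1170))  # → (41, 117, 331, 351)
```

Normalization is what makes the same detection output usable at preview and render resolutions.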

Stage 2: Character Segmentation (SAM 2)
- Input: Panel bounding boxes + original image
- Output: Binary masks for characters/foreground
- Model: SAM 2 (Segment Anything 2)

Stage 3: Depth Estimation (MiDaS)
- Input: Original image
- Output: Depth map for parallax effect
- Effect: "Wiggle" 3D animation based on depth

Stage 4: Audio Analysis (Librosa)
- Input: Audio file (MP3/WAV)
- Output: BPM, onset timestamps, beat markers
- Algorithm: Harmonic-Percussive Source Separation (HPSS)
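Downstream, these onset timestamps drive the beat-synced cuts. A sketch of mapping them to 30 fps frame indices (the timestamps are hard-coded here, where the real pipeline would take Librosa's output):

```python
# Sketch: turning onset timestamps (seconds) into 30 fps frame indices
# for beat-synced transitions. Timestamps are hard-coded for illustration.

FPS = 30

def beats_to_frames(beat_times, fps=FPS):
    """Map beat timestamps (seconds) to the nearest video frame index."""
    return [round(t * fps) for t in beat_times]

# 120 BPM → a beat every 0.5 s:
print(beats_to_frames([0.0, 0.5, 1.0, 1.5]))  # → [0, 15, 30, 45]
```

Snapping transitions to frame indices rather than raw seconds is what keeps cuts millisecond-accurate at a fixed frame rate.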

Stage 5: Character Animation (Stable Video Diffusion)
- Input: Segmented character images
- Output: Animated frames with subtle movement
- Model: Stable Video Diffusion

Stage 6: Video Composition (FFmpeg)
- Input: All layers + audio + beat markers
- Output: Final MP4 video (9:16, 30fps)
- Effects: Glow, glitch, transitions synced to beats
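The final mux can be pictured as an FFmpeg invocation assembled programmatically. The paths and codec settings below are illustrative assumptions, and the command is only built here, not executed:

```python
# Sketch: assembling an FFmpeg command for the final mux, as the
# composition stage might do. Filenames are illustrative assumptions.

def build_ffmpeg_cmd(frames_pattern, audio_path, out_path, fps=30):
    """Build (but do not run) an FFmpeg argument list for frames + audio."""
    return [
        "ffmpeg", "-y",
        "-framerate", str(fps), "-i", frames_pattern,  # rendered frame sequence
        "-i", audio_path,                              # analyzed audio track
        "-c:v", "libx264", "-pix_fmt", "yuv420p",      # broadly compatible H.264
        "-c:a", "aac",
        "-shortest",                                   # stop at the shorter stream
        out_path,
    ]

cmd = build_ffmpeg_cmd("frames/%05d.png", "track.mp3", "out.mp4")
print(" ".join(cmd))
```

A real composition step would run this via `subprocess` (or delegate to MoviePy) after the effect layers are rendered.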
```bash
docker-compose up -d
```

See the .env.example files in each app directory for required variables.
We welcome contributions! Please read our Contributing Guide for details.
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
- Use TypeScript for all new code
- Follow ESLint and Prettier configurations
- Write unit tests for new features
- Document public APIs with JSDoc
This project is licensed under the MIT License - see the LICENSE file for details.
- Manga109 Dataset for training data
- Ultralytics for YOLOv12
- Meta AI for SAM 2
- Stability AI for Stable Video Diffusion
- The community for feedback and support
Made with ❤️ by the UtsukushiiAI Team