ComfyUI-FFMPEGA

The ultimate video editing suite for ComfyUI — edit with natural language or hands-on manual controls.

Use AI to describe edits in plain English, or take full manual control with the Effects Builder and text presets — no LLM required.

Features • Examples • Installation • Quick Start • Prompt Guide • Skills • LLM Setup • Troubleshooting • Contributing • Changelog

🚀 What's New in v2.9.1 (March 3, 2026)

AI Audio Generation, AI Lip Sync, FLUX Klein Editing, Modular Architecture, CI/CD & Type Safety

✨ AI Object Removal & Editing (FLUX Klein 4B): New auto_mask:effect=remove and auto_mask:effect=edit powered by FLUX Klein 4B (Apache 2.0). AI-powered per-frame object removal replaces LaMa with higher-quality results. New edit effect enables text-guided video changes (e.g. "change hair to red", "replace background with beach"). Reference-image conditioning, 4-step inference, temporal smoothing, ~8–13 GB VRAM.
🔊 AI Audio Generation (generate_audio): New skill powered by MMAudio — synthesizes synchronized audio/foley from video and/or text descriptions. Supports video-to-audio, text-to-audio, and automatic long video chunking with crossfade. 11 natural language aliases (foley, sound_effects, v2a, etc.).
👄 AI Lip Sync (lip_sync): New skill powered by MuseTalk V15 — synchronizes lip movements to match provided audio. Video+audio and image+audio inputs, multi-face support, batch inference. Zero new pip dependencies. 7 aliases (lipsync, dub, dubbing, sync_lips, talking_head, lip_dub, voice_sync). Subprocess isolation for CUDA memory safety.
🏗️ Modular Node Architecture: Broke up monolithic agent_node.py into 6 focused modules — input_resolver, execution_engine, output_handler, batch_processor, nollm_modes, and pipeline_assembler. Agent node is now a thin orchestrator.
🔒 Pyright Type Checking: Added pyrightconfig.json with type stubs for ComfyUI modules. Fixed 78 type errors across core/, skills/, mcp/.
🧱 Platform Abstraction (core/platform.py): All ComfyUI boundary interactions extracted into a single adapter module. Core logic no longer imports ComfyUI directly.
📋 Structured Logging (core/logging.py): JSONFormatter for machine-readable logs, get_logger() convenience function, and LogTimer context manager. Enable via FFMPEGA_LOG_JSON=1.
🧪 CI Workflow & 799 Tests: New .github/workflows/ci.yml with Pytest + Pyright on every PR/push. 12 new integration tests running real FFmpeg pipelines. Test suite expanded from 656 → 799 tests, 0 failures.
📖 CONTRIBUTING.md & Pre-commit: New contributor docs with architecture overview and PR guidelines. Pre-commit hooks with Ruff lint/format and Pyright.
🧠 Subprocess Isolation: Audio generation runs in a subprocess to prevent CUDA memory leaks — same proven pattern as SAM3. Falls back to in-process with VRAM offloading.
⚡ Native Safetensors & Memory-Efficient Loading: Uses comfy.utils.load_torch_file for direct .safetensors loading, plus accelerate's zero-copy model init — no 3× memory spike during model loading.
🪞 Model Mirror Repositories: MMAudio on AEmotionStudio/mmaudio-models (fp16, ~5.5 GB). MuseTalk UNet on AEmotionStudio/musetalk-models (fp16 1.6 GB / fp32 3.2 GB). Mirror-first download with upstream HuggingFace fallback.
🔗 Mask Points Chaining: SaveVideoNode and LoadVideoPathNode now pass mask_points data through the node chain — upstream segmentation points propagate to downstream processing.
♿ Point Selector Accessibility: ARIA roles, aria-modal, aria-live regions, and hidden decorative emojis for screen reader support.
🔧 Bug Fixes: LaMa cache false-negative, log propagation leak, token usage display, platform import robustness.

Previous: v2.8.0 — Effects Builder, Manual Mode, SAM3 Hardening & 15+ Bug Fixes

🏗️ Effects Builder Node: New companion node for manual effect composition — select up to 3 skills with params, combine with raw FFmpeg filters, and use presets. No LLM required.
🎬 Effects Builder Presets: Right-click the Effects Builder for quick access to all 18 built-in presets, plus save/load/delete custom presets and a "Clear All Effects" reset.
📝 Text Node Presets: Right-click the FFMPEGA Text node for 10 built-in presets (SRT Subtitle Example, Cinematic Subtitles, Watermark, Title Card, Social Caption, Meme Text, Lower Third, Copyright Notice, Credits Roll, Chapter Marker) with example text. Custom save/load/delete and "Clear Text" reset.
💬 No-LLM Text Support: Connect a Text node to the Agent in no-LLM manual mode (without an Effects Builder) and it auto-generates a text overlay or subtitle pipeline.
🏛️ Manual Mode (Default): New manual no-LLM mode — set llm_model to none and use the Effects Builder to edit videos without any AI. This is now the default no_llm_mode.
🎙️ Whisper No-LLM Modes: transcribe and karaoke_subtitles in the no_llm_mode dropdown — run Whisper directly without an LLM. Also available as Effects Builder presets.
🧠 SAM3 Subprocess Isolation: SAM3 now runs in a separate subprocess to prevent CUDA memory leaks. Multiple OOM fixes, point prompt support, and real-time progress streaming.
🔒 MCP Security: Fixed path traversal vulnerabilities in MCP tools. SkillComposer parameter hardening.
🔗 Effects Builder Multi-Input: Fixed concat, grid, xfade, and all multi-input skills in the Effects Builder. Extra inputs (video_b, etc.) now correctly injected.
📐 Concat Resolution Fix: Concat/xfade/slideshow now default to input resolution instead of hardcoded 1920×1080.
🔧 Dynamic Slot Root Cause Fix: video_b no longer disappears on page refresh.
⚡ Performance: Ultrafast temp video encoding, optimized pipe buffers, SAM3 VRAM offloading.
🐛 Template Placeholder Fix: Unsubstituted template placeholders no longer crash ffmpeg.

Previous: v2.7.1 — Advanced Options Toggle, Dynamic Input Fix & Default Refinements

⚙️ Advanced Options Toggle: New Simple/Advanced toggle hides power-user settings (preview_mode, crf, encoding_preset, video_path, subtitle_path, batch_mode) behind a single switch. The node is now much more compact by default.
🔧 Dynamic Input Persistence: Fixed dynamic input slots (e.g. video_b auto-appearing when video_a is connected) not restoring when loading saved workflows or images.
🎛️ Default Refinements: Whisper defaults to CPU (avoids VRAM pressure), token tracking enabled by default, PTC mode and SAM3 CPU inputs removed from UI.
⚠️ SAM3 Checkpoint Warnings: Detects wrong checkpoint format and logs reconversion instructions.
🧹 Code Cleanup: Removed ~1100 lines of dead code, added ValidationError for path validation, token log rotation for usage_log.jsonl.

Previous: v2.7.0 — SAM3 Auto-Mask, LaMa Inpainting & PTC

🎭 SAM3 Auto-Mask: New remove skill powered by Segment Anything Model 3 — describe what to remove in natural language (e.g. "remove the person") and SAM3 generates per-frame masks automatically.
🖌️ LaMa Inpainting: AI-powered video object removal using LaMa (Large Mask Inpainting). Temporal Gaussian smoothing reduces frame-to-frame flicker. Falls back to black fill if LaMa is not installed.
🟢 Greenscreen: New greenscreen skill — uses SAM3 masks to replace backgrounds with solid colors or transparency (WebM output).
⚡ Programmatic Tool Calling (PTC): New execute_code tool lets LLMs write a single Python script that orchestrates multiple tool calls in one pass, reducing round-trips from ~6 to 1. Three modes: off, auto, on.
🔒 PTC Sandbox: Hardened with 25+ escape vector blocks — dunder introspection, module access, traceback traversal, dynamic attribute access, and chr() construction bypasses.
🧪 656 Tests: Expanded test suite from 516 → 656 with 0 failures. 34 new PTC executor tests, security hardening tests, and corrected mock contracts.

Previous: v2.6.5 — Whisper Auto-Transcription, Karaoke Subtitles & Letterbox Fixes

🎙️ Auto-Transcribe: New auto_transcribe skill — transcribes video audio with OpenAI Whisper and burns SRT subtitles directly into the output. Supports multi-video concat with correct cross-clip timing.
🎤 Karaoke Subtitles: New karaoke_subtitles skill — word-by-word progressive-fill karaoke effect using Whisper's word-level timestamps and ASS \kf tags.
⚙️ Whisper Controls: Choose your Whisper model size (tiny → large-v3) and device (gpu/cpu) via node settings — trade off speed vs. accuracy, or offload to CPU on low-VRAM systems.
📐 Letterbox Fix: Replaced crop+pad with drawbox for letterboxing, preserving video content. Handles both letterbox and pillarbox cases correctly.
🧹 Code Dedup: Extracted shared ffmpeg_escape_path, color_to_ass_bgr, and _run_transcription helpers — 3 deduplication refactors reducing maintenance surface.
🔧 9 Bug Fixes: Whisper memory leak, xfade subtitle timing, ASS escaping, hex color validation, aspect ratio div-by-zero, and more.

Previous: v2.6.0 — HandlerResult, Compose Decomposition & 516 Tests

🏗️ HandlerResult Contract: All 9 handler modules now return a formal HandlerResult dataclass, replacing ad-hoc tuples. Backward-compatible with existing code.
🔧 Compose Decomposition: Extracted 5 orchestration methods from the 600+ line compose() into pure, testable static methods.
🎵 PiP Audio Mixing: picture_in_picture now supports audio_mix to blend both audio tracks via ffmpeg's amix.
🔄 CLI Retry: CLI connectors retry on transient failures with exponential backoff (3 attempts).
📝 TextInput Node: New node for subtitle and text overlay workflows with auto SRT detection.
🔒 Security: Sanitized text overlay enable parameter, fixed path traversal on output dirs, fixed weak UUID entropy.
🧪 516 Tests: Expanded test suite from 481 → 516 with 0 failures. New handler unit tests, skill combination tests, and orchestration helper tests.

Previous: v2.5.0 — PiP Overlay Fixes, VL Model Vision & Border Support

🖼️ PiP Border Support: picture_in_picture now accepts border and border_color parameters.
👁️ Ollama VL Auto-Embedding: Vision-language models automatically receive 3 video frames in the initial message.
🔗 PiP Alias Fix: Models using pip, picture-in-picture, etc. now correctly resolve to picture_in_picture.
🔧 Ollama VL Verification: Fixed 400 error when verifying output with Ollama VL models.

Previous: v2.4.0 — Pipeline Chaining, Animated Overlays & Zero-Memory Image Paths

🖼️ Zero-Memory Image Paths: Image inputs passed as file paths instead of decoded tensors.
🎯 Overlay Animation: overlay_image supports animation=bounce and more motion presets.
🔗 Pipeline Chaining Fixes: Fixed filter graph chaining for multi-skill pipelines.
🏗️ Handler Module Extraction: Skill handlers split into skills/handlers/ modules.
🔒 Security Hardening: Extended FFMPEG parameter sanitization.
⚡ Performance: frames_to_tensor pre-allocates memory.

Previous: v2.3.0 — Token Tracking, LUT Color Grading & Vision

📊 Token Usage Tracking: Opt-in track_tokens and log_usage toggles — monitor prompt/completion tokens, LLM calls, tool calls, and elapsed time.
🎨 LUT Color Grading: 8 bundled cinematic .cube LUT files. Drop custom LUTs into luts/ for automatic discovery.
🖼️ Vision System: Multimodal frame analysis — the agent extracts frames and "sees" the video.
🔊 Audio Analysis: Volume (dB), EBU R128 loudness (LUFS), and silence detection.
🤖 Real Token Stats: Gemini CLI and Claude CLI return native token counts via JSON output.

📄 See CHANGELOG.md for the complete version history.

NotebookLM Overview: Exploring the features and capabilities of ComfyUI-FFMPEGA. (Click to watch on YouTube)

✨ Features

🗣️ Natural Language Editing Describe edits in plain text: "Make it cinematic with a fade in", "Speed up 2x", "VHS look with grain". The AI agent interprets your prompt and builds the FFMPEG pipeline automatically.	🏗️ Manual Mode — No AI Required Use the Effects Builder to visually compose up to 3 effects with parameters. Add text overlays and subtitles via preset-powered Text nodes. Full editing control with zero LLM dependency.
🤖 Multi-LLM Support Works with Ollama (local, free), OpenAI, Anthropic, Google Gemini, and CLI tools (Gemini CLI, Claude Code, Cursor Agent, Qwen Code). Use any local model — Llama 3.1, Qwen3, Mistral, and more. Or skip the LLM entirely.	🎨 200+ Skills 200+ video editing skills across visual effects, audio processing, spatial transforms, temporal edits, encoding, cinematic presets, vintage looks, social media, creative effects, text animations, editing & composition, audio visualization, multi-input operations, transitions, concat, split screen, and AI-powered skills (Whisper transcription, SAM3 masking, MMAudio generation, MuseTalk lip sync).
🎨 Right-Click Presets 18 built-in Effects Builder presets and 10 Text node presets with example content. Save/load/delete your own custom presets. One-click clear to reset.	⚡ Batch & Preview Process multiple videos with the same instruction. Generate quick low-res previews before committing to full renders. Quality presets from draft to lossless.

🎬 Examples

See what FFMPEGA can do — each example shows the prompt or preset used, the input clip, and the result.

4×4 Video Grid

Prompt: Concatenate these clips in a 4x4 grid

Before	After

Crossfade Transitions + Bouncing Logo

Prompt: Concatenate these clips with a crossfade transition between each, add a fade in at the start and fade out at the end and a bouncing image_path_a at 10% size and 30% opacity

Before	After

Color Grade + Text Overlay + Compression

Prompt: Color grade with the cinematic teal orange LUT, normalize audio, add a text "water" in the bottom right corner, compress for web at 720p

Before	After

Picture-in-Picture with Audio Mix

Prompt: Place video_b in the bottom-right corner at 25% size with a white border over video_a and mix the audio

Before	After

Vintage Film Look + Subtitles

Prompt: Normalize audio to -14 LUFS, add a warm vintage film look with grain overlay, and burn in these subtitles

Before	After

Datamosh Glitch Effects

Prompt: Apply datamosh glitch effect, add chromatic aberration with strong RGB split, pixelate slightly, add ghost trails

Before	After

Cinematic Teal & Orange

Prompt: Add a cinematic teal and orange color grade, apply a subtle vignette, and fade in from black

Before	After

Neon Glow Edge Detection

Prompt: Apply a neon glow edge detection effect, add chromatic aberration, and slow the video to 0.5x speed with smooth motion

Before	After

Colorhold Noir

Prompt: Use colorhold to keep only the red, desaturate everything else, boost contrast to 1.5, add a strong vignette, apply noir style

Before	After

Green Screen Removal

Prompt: Remove the green screen with chroma key, despill the green edges, sharpen slightly

Before	After

📦 Installation

Requirements

ComfyUI (latest)
Python 3.10+
FFMPEG installed and in PATH (install guide)
Node.js 18+ (required for CLI tools: Gemini CLI, Claude CLI, Qwen CLI — download)
Ollama (optional, for local LLM inference — download)

Option 1: ComfyUI Manager (Recommended)

Open ComfyUI Manager
Search for ComfyUI-FFMPEGA
Click Install

Option 2: Manual Install

Linux / macOS

cd /path/to/ComfyUI/custom_nodes
git clone https://github.com/AEmotionStudio/ComfyUI-FFMPEGA.git
cd ComfyUI-FFMPEGA
pip install -r requirements.txt

Windows (PowerShell)

cd C:\path\to\ComfyUI\custom_nodes
git clone https://github.com/AEmotionStudio/ComfyUI-FFMPEGA.git
cd ComfyUI-FFMPEGA
pip install -r requirements.txt

Note: Use whichever Python package manager your ComfyUI venv uses (pip, uv pip, etc.). The above commands assume pip is available in your ComfyUI virtual environment.

Restart ComfyUI after installation.

🚀 Quick Start

Add an FFMPEG Agent node to your workflow
Connect a video path or use the input field
Enter a natural language prompt
Select your LLM model
Run the workflow

💡 Tip: Right-click the FFMPEG Agent node to open the FFMPEGA Presets context menu — 200+ categorized effects you can apply with a single click, no prompt typing needed. Great for quick edits or discovering what's available.

Example Prompts

Prompt	What It Does
`"Make it cinematic with a vignette"`	Adds letterbox, color grade, and edge darkening
`"Speed up 2x, keep the audio pitch"`	Doubles speed with pitch-corrected audio
`"Make it look like old VHS footage"`	Adds noise, color shift, scan lines
`"Trim first 5 seconds, resize to 720p"`	Cuts intro and scales down
`"Underwater look with echo on audio"`	Blue tint, blur, and audio echo
`"Pixelate it like an 8-bit game"`	Mosaic/pixel art effect
`"Add 'Subscribe!' text at the bottom"`	Text overlay with positioning
`"Cyberpunk style with neon glow"`	High-contrast neon aesthetic
`"Normalize audio, compress for web"`	Loudness normalization + web optimization
`"Spin the video clockwise"`	Continuous animated rotation
`"Add camera shake"`	Random shake/earthquake effect
`"Fade in from black and out at the end"`	Smooth intro/outro transitions
`"Wipe reveal from the left"`	Directional wipe reveal animation
`"Make it pulse like a heartbeat"`	Rhythmic zoom breathing effect
`"Show audio waveform at the bottom"`	Audio visualization overlay
`"Arrange these images in a grid"`	Multi-image grid collage
`"Create a slideshow with fades"`	Image slideshow with transitions
`"Overlay the logo in the corner"`	Picture-in-picture / watermark
`"Create a side-by-side comparison"`	Video next to image in 2-column grid
`"Create a slideshow starting with the video"`	Video first, then image slides
`"Overlay images in the corners"`	Multiple images auto-placed in corners
`"Split screen, use audio from audio_b"`	Side-by-side video with specific audio track
`"Split screen, mix both audio tracks"`	Side-by-side with both audio tracks blended

💬 Prompt Guide

When using an LLM, the AI agent interprets your natural language and maps it to skills with specific parameters. Here's how to get the best results. (For manual editing without an LLM, see the Effects Builder and Text Input sections.)

Specifying Exact Values

You can request specific parameter values and the agent will use them directly:

Prompt	What the Agent Does
`"Set brightness to 0.3"`	`brightness:value=0.3`
`"Blur with strength 20"`	`blur:radius=20`
`"Speed up to 3x"`	`speed:factor=3.0`
`"Crop to 1280x720"`	`crop:width=1280,height=720`
`"Deband with threshold 0.3 and range 32"`	`deband:threshold=0.3,range=32`
`"CRF 18, slow preset"`	`quality:crf=18,preset=slow`
`"Fade in for 3 seconds"`	`fade:type=in,duration=3`

What Works Well ✅

Explicit numbers: "brightness 0.2", "speed 1.5x", "CRF 20" — the agent maps these directly
Named presets: "VHS look", "cinematic style", "noir" — triggers multi-step preset pipelines
Chaining operations: "Trim first 5 seconds, resize to 720p, add vignette" — executes in order
Descriptive goals: "Make it look warmer", "Remove the green screen" — the agent picks the right skills
Technical terms: "denoise", "deband", "normalize audio" — maps to exact FFmpeg filters

What Might Not Work as Expected ⚠️

Vague intensity words: "Make it very blurry" or "a little brighter" — the agent has to guess what number "very" or "a little" means. Tip: use a specific value instead: "blur with radius 15"
Out-of-range values: Parameters are auto-clamped to their valid range. If you ask for "brightness 5.0" it caps at the max (1.0)
Complex compositing: Multi-layer effects with precise timing may need to be broken into separate passes
Format-dependent features: Some effects (like transparency) require specific output formats. H.264/MP4 doesn't support alpha channels

🧠 AI Models (Auto-Downloaded)

Some skills use AI models that auto-download on first use. You can disable automatic downloads with the allow_model_downloads toggle on the FFMPEG Agent node — runs requiring a missing model will fail with a clear message and a manual download link.

All models are mirrored to first-party AEmotionStudio HuggingFace repos for supply chain resilience. Downloads try the AEmotionStudio mirror first, then fall back to upstream sources.

Model	Size	Stored In	Triggered By	Manual Download
SAM3 (Segment Anything 3)	~300 MB	`ComfyUI/models/SAM3/`	`auto_mask` skill, `sam3_masking` no-LLM mode, Effects Builder SAM3 target	AEmotionStudio/sam3 — download `sam3.safetensors`
Whisper large-v3	~3 GB	`ComfyUI/models/whisper/`	`auto_transcribe`, `karaoke_subtitles` skills, `transcribe` / `karaoke_subtitles` no-LLM modes	AEmotionStudio/whisper-models
Whisper medium	~1.5 GB	`ComfyUI/models/whisper/`	Same as above (set `whisper_model` to `medium`)	Same as above
Whisper small	~500 MB	`ComfyUI/models/whisper/`	Same as above (set `whisper_model` to `small`)	Same as above
Whisper base	~150 MB	`ComfyUI/models/whisper/`	Same as above (set `whisper_model` to `base`)	Same as above
Whisper tiny	~75 MB	`ComfyUI/models/whisper/`	Same as above (set `whisper_model` to `tiny`)	Same as above
LaMa (Large Mask Inpainting)	~200 MB	`~/.cache/torch/hub/checkpoints/`	`auto_mask:effect=remove` (legacy fallback)	AEmotionStudio/lama-inpainting — download `big-lama.pt`
FLUX Klein 4B (Editing/Removal)	~15 GB (bf16)	`ComfyUI/models/flux_klein/`	`auto_mask:effect=remove`, `auto_mask:effect=edit`	AEmotionStudio/flux-klein
MMAudio (Video-to-Audio)	~5.5 GB	`ComfyUI/models/mmaudio/`	`generate_audio` skill	AEmotionStudio/mmaudio-models
MuseTalk (Lip Sync)	~1.6 GB (fp16)	`ComfyUI/models/musetalk/`	`lip_sync` skill	AEmotionStudio/musetalk-models
U²-Net (rembg)	~170 MB	`~/.u2net/`	`remove_background` skill	Install with `pip install 'comfyui-ffmpega[masking]'` — model auto-fetched by rembg

Note

Models are only downloaded when you use the corresponding skill for the first time. Core FFmpeg editing skills (200+ of them) require zero model downloads.

🎛️ Nodes

FFMPEGA provides 8 nodes that work together:

Tip

One task per run. Instead of cramming multiple edits into a single prompt, focus each run on one editing task — then feed the output back into FFMPEGA for the next. This keeps context low and model focus high, leading to significantly better results. Chain FFMPEGA Agent → Save Video → Load Video Path → FFMPEGA Agent for multi-step workflows.

Warning

Low VRAM? Skills that load AI models (SAM3 masking, Whisper transcription, LaMa inpainting) each consume significant VRAM. On GPUs with limited memory, limit each run to one model-loading task — e.g. do your SAM3 removal pass first, save the result, then run Whisper subtitles as a separate pass.

FFMPEG Agent — The main node. Translates natural language into FFMPEG commands.

Required Inputs

Input	Type	Description
`video_path`	STRING	Absolute path to source video. Used as ffmpeg input unless `images_a` is connected.
`prompt`	STRING	Natural language editing instruction (e.g. "Add cinematic letterbox", "Speed up 2x"). Not required in `manual` mode.
`llm_model`	DROPDOWN	AI model selection — local Ollama models, CLI tools, or cloud APIs. Select `none` for no-LLM mode.
`no_llm_mode`	DROPDOWN	Mode when `llm_model` is `none`: `manual` (Effects Builder, default), `sam3_masking`, `transcribe`, `karaoke_subtitles`.
`quality_preset`	DROPDOWN	Output quality: `draft`, `standard`, `high`, `lossless`.
`seed`	INT	Change to force re-execution with the same prompt. Supports randomize control.

Optional Inputs

Input	Type	Description
`images_a`	IMAGE	Video frames from upstream (e.g. Load Video). Auto-expands: `images_b`, `images_c`...
`image_a`	IMAGE	Extra image input for multi-input skills (grid, slideshow, overlay). Auto-expands: `image_b`, `image_c`...
`audio_a`	AUDIO	Audio input for muxing or multi-audio workflows. Auto-expands: `audio_b`, `audio_c`...
`video_a`	STRING	File path to extra video for concat, split screen, grid, xfade. Zero memory. Auto-expands: `video_b`, `video_c`...
`image_path_a`	STRING	File path to image for overlay, grid, slideshow. Zero memory. Auto-expands: `image_path_b`...
`text_a`	STRING	Text input from FFMPEGA Text node for subtitles, overlays, watermarks. Auto-expands: `text_b`...
`pipeline_json`	STRING	Connect from FFMPEGA Effects Builder. In `manual` mode the pipeline is executed directly; with an LLM it provides skill hints.
`subtitle_path`	STRING	Direct path to a `.srt` or `.ass` subtitle file.
`advanced_options`	BOOLEAN	Simple/Advanced toggle — shows preview mode, CRF, encoding preset, batch processing when enabled.
`preview_mode`	BOOLEAN	Quick low-res preview (480p, 10s) instead of full render.
`save_output`	BOOLEAN	Save video + workflow PNG to output folder.
`output_path`	STRING	Custom output file/folder path. Empty = ComfyUI default.
`ollama_url`	STRING	Ollama server URL (default: `http://localhost:11434`).
`api_key`	STRING	API key for cloud models (GPT, Claude, Gemini). Auto-redacted from outputs.
`custom_model`	STRING	Exact model name when `llm_model` is set to `custom`.
`crf`	INT	Override CRF (0 = lossless, 23 = default, 51 = worst). -1 uses `quality_preset`.
`encoding_preset`	DROPDOWN	Override x264/x265 speed preset (`ultrafast` → `veryslow`). `auto` follows `quality_preset`.
`use_vision`	BOOLEAN	Embed video frames as images for vision-capable LLMs. Off = numeric color analysis only.
`verify_output`	BOOLEAN	Agent inspects output after rendering and auto-corrects if it doesn't match intent.

Whisper / SAM3 Inputs

Input	Type	Description
`whisper_device`	DROPDOWN	Device for Whisper model: `cpu` (default, avoids VRAM pressure) or `gpu` (faster, ~3 GB VRAM).
`whisper_model`	DROPDOWN	Whisper model size: `large-v3` (default, most accurate), `medium`, `small`, `base`, `tiny`.
`sam3_max_objects`	INT	Max objects SAM3 tracks per frame (1–20, default 5). Lower = less VRAM.
`sam3_det_threshold`	FLOAT	Minimum detection confidence for SAM3 (0.0–1.0, default 0.70). Higher = fewer objects.
`mask_output_type`	DROPDOWN	`black_white` (raw mask for compositing) or `colored_overlay` (SAM3-style preview).
`mask_points`	STRING	JSON point selection data from Load Video Path's Point Selector. Guides SAM3 with click-to-select.

Batch Processing Inputs

Input	Type	Description
`batch_mode`	BOOLEAN	Process all matching videos in `video_folder` with the same prompt. Single LLM call.
`video_folder`	STRING	Folder containing videos to batch process.
`file_pattern`	DROPDOWN	File pattern to match (`.mp4`, `.mov`, `.`, etc.).
`max_concurrent`	INT	Maximum simultaneous encodes in batch mode (1–16, default 4).

Output	Description
`images`	All frames from the output video as a batched image tensor
`audio`	Audio extracted from output video (or passed through from `audio_a`)
`video_path`	Absolute path to the rendered output video file
`command_log`	The ffmpeg command(s) that were executed
`analysis`	LLM interpretation, pipeline steps, and warnings

FFMPEGA Effects Builder — Compose video effects visually without an LLM.

Select up to 3 skills with parameters, add raw FFmpeg filters, and use presets. Outputs a pipeline JSON that connects to the FFMPEG Agent's pipeline_json input.

Built-in presets: 🎬 Cinematic Look, 📼 VHS Retro, 🎥 High Quality Export, 🎵 Clean Audio, ✨ Glow + Saturation, 🎙️ Auto Subtitles, 🎤 Karaoke Subtitles, 🌑 Fade In + Out, 🪞 Mirror Horizontal, 🎬 Ken Burns Zoom.

Input	Type	Description
`preset`	DROPDOWN	Quick-start preset. Auto-fills effect slots and params. Set to `none` to build your own.
`effect_1`	DROPDOWN	First effect. Categorized by type (🎨 Visual, ⏱️ Temporal, 📐 Spatial, 🔊 Audio, 📦 Encoding, ✨ Outcome).
`effect_1_params`	STRING	JSON parameters for effect 1 (e.g. `{"strength": 5}`). Auto-filled from defaults.
`effect_2`	DROPDOWN	Second effect (chained after effect 1).
`effect_2_params`	STRING	JSON parameters for effect 2.
`effect_3`	DROPDOWN	Third effect (chained after effect 2).
`effect_3_params`	STRING	JSON parameters for effect 3.
`raw_ffmpeg`	STRING	Raw FFmpeg `-vf` filter string applied after skill effects.
`sam3_target`	STRING	SAM3 text target — apply effects only to the masked region. Leave empty for full-frame.
`sam3_effect`	DROPDOWN	Effect for SAM3-detected region: `blur`, `pixelate`, `remove`, `edit`, `grayscale`, `highlight`, `greenscreen`, `transparent`.

Output	Description
`pipeline_json`	JSON pipeline — connect to FFMPEG Agent's `pipeline_json` input

Frame Extract (FFMPEGA) — Extract individual frames from a video as image tensors.

Input	Type	Description
`video_path`	STRING	Absolute path to the video to extract frames from.
`fps`	FLOAT	Extraction rate (0.1–60.0). `1.0` = one frame per second.
`start_time`	FLOAT	(optional) Start time in seconds (default: 0).
`duration`	FLOAT	(optional) Duration to extract from, in seconds (default: 10).
`max_frames`	INT	(optional) Max frames to return (default: 100, max: 1000).

Output	Description
`frames`	Extracted video frames as a batched image tensor

Tip: Connect frames output to the FFMPEG Agent's images_a input to build pipelines that analyze frames before editing.

Load Image Path (FFMPEGA) — Zero-memory image loader, outputs a file path instead of a tensor.

Outputs the image file path as a STRING instead of decoding into a ~6 MB IMAGE tensor. Connect to FFMPEGA Agent's image_path_a / image_path_b / … slots so ffmpeg reads the file directly.

Input	Type	Description
`image`	FILE PICKER	Select an image from ComfyUI's input directory or upload a new one.

Output	Description
`image_path`	Absolute file path to the selected image

Load Video Path (FFMPEGA) — Zero-memory video input with inline preview and metadata.

Validates the video file exists and outputs the path as a STRING — loads ZERO frames into memory. Features inline video preview, metadata display (fps, duration, resolution), and VHS-style trim parameters.

Input	Type	Description
`video`	FILE PICKER	Select or upload a video file.
`force_rate`	FLOAT	Override FPS (0 = use source).
`skip_first_frames`	INT	Frames to skip from start.
`frame_load_cap`	INT	Max frames to use (0 = all).
`select_every_nth`	INT	Select every Nth frame (1 = every frame).

Output	Description
`video_path`	Validated video file path
`frame_count`	Total usable frames after trim
`fps`	Effective FPS
`duration`	Effective duration in seconds

Save Video (FFMPEGA) — Zero-memory video output with inline preview.

Takes a video path (from FFMPEGA Agent or Load Video Path), copies the file to ComfyUI's output directory, and shows a preview. No re-encoding — just a file copy.

Input	Type	Description
`video_path`	STRING	Path to video file (typically from FFMPEGA Agent's output).
`filename_prefix`	STRING	Prefix for saved filename. Supports `%date:yyyy-MM-dd%`.
`overwrite`	BOOLEAN	(optional) Overwrite existing file vs auto-increment counter.

Text Input (FFMPEGA) — Flexible text input for subtitles, overlays, and watermarks.

Auto-detects whether text is SRT subtitles, a short watermark, or overlay text. Outputs JSON-encoded metadata that the FFMPEGA Agent node parses for burn_subtitles, text_overlay, or watermark skills.

Right-click Presets: 10 built-in presets with example text (SRT Subtitle Example, Cinematic Subtitles, Bold Watermark, Title Card, Social Caption, Meme Text, Lower Third, Copyright Notice, Credits Roll, Chapter Marker). Save/load/delete custom presets. "Clear Text" resets all fields to defaults.

No-LLM Text Mode: Connect a Text node to the Agent's text_a input in no-LLM manual mode (without an Effects Builder). The Agent auto-generates a text overlay or subtitle pipeline from the Text node's mode, position, font size, and color settings.

Input	Type	Description
`text`	STRING	Text content — plain text, multi-line subtitles, or full SRT format with timestamps.
`auto_mode`	BOOLEAN	(optional) Auto-detect mode from content (default: on).
`mode`	DROPDOWN	(optional) Override mode: `subtitle`, `overlay`, `watermark`, `title_card`, `raw`.
`position`	DROPDOWN	(optional) Text placement: `center`, `top`, `bottom`, `bottom_right`, etc. `auto` = mode default.
`font_size`	INT	(optional) Font size in px (0 = auto: 24 subtitle, 48 overlay, 20 watermark).
`font_color`	STRING	(optional) Text color as hex (#RRGGBB, default: white).
`start_time`	FLOAT	(optional) Start time in seconds (default: 0).
`end_time`	FLOAT	(optional) End time in seconds (-1 = full duration).

Output	Description
`text_output`	JSON-encoded text with metadata — connect to `text_a`, `text_b`, etc. on the FFMPEGA Agent node

Video to Path (FFMPEGA) — Convert IMAGE tensor to a temp video file path.

Bridge node between Load Video nodes (which output IMAGE tensors) and FFMPEGA Agent's video_a/video_b/video_c slots. Encodes frames to a temp video file, then releases the tensor so ComfyUI can free the memory.

Input	Type	Description
`images`	IMAGE	Video frames from a Load Video node.
`fps`	INT	(optional) FPS for the output video (default: 24).

Output	Description
`video_path`	File path to the temp video

🎯 Skill System

FFMPEGA includes a comprehensive skill system with 200+ operations organized into categories. Use them in two ways: let the AI agent select skills from your prompt, or pick them yourself with the Effects Builder — no LLM needed.

📄 See SKILLS_REFERENCE.md for the complete skill reference with all parameters and example prompts.

🧪 See SKILL_TEST_PROMPTS.md for ready-to-use copy-and-paste test prompts for every skill.

🎨 Visual Effects (30 skills)

Skill	Description
`brightness`	Adjust brightness (-1.0 to 1.0)
`contrast`	Adjust contrast (0.0 to 3.0)
`saturation`	Adjust color saturation (0.0 to 3.0)
`hue`	Shift color hue (-180 to 180)
`sharpen`	Increase sharpness
`blur`	Apply blur effect
`denoise`	Reduce noise/grain (light, medium, strong)
`vignette`	Darken edges for cinematic focus
`fade`	Fade in/out to black
`colorbalance`	Adjust shadows/midtones/highlights
`noise`	Add film grain
`curves`	Apply color curve presets (vintage, cross_process, etc.)
`text_overlay`	Add text with position, color, size, font
`invert`	Invert colors (photo negative)
`edge_detect`	Edge detection / sketch look
`pixelate`	Mosaic / 8-bit pixel effect
`gamma`	Gamma correction
`exposure`	Exposure adjustment
`chromakey`	Green screen removal
`colorkey`	Key out any arbitrary color and replace with a background
`colorhold`	Keep only a selected color, desaturate everything else (spot color)
`lumakey`	Key out regions based on brightness (luma)
`despill`	Remove green/blue color spill from chroma-keyed edges
`deband`	Remove color banding artifacts
`white_balance`	Adjust color temperature (2000K–12000K)
`shadows_highlights`	Separately adjust shadows and highlights
`split_tone`	Warm highlights, cool shadows
`deflicker`	Remove fluorescent/timelapse flicker
`unsharp_mask`	Fine-grained luma/chroma sharpening
`remove_background`	Remove backgrounds using AI (rembg)

⏱️ Temporal (9 skills)

Skill	Description
`trim`	Cut a segment by time
`speed`	Change playback speed (0.1x to 10x)
`reverse`	Play backwards
`loop`	Repeat video
`fps`	Change frame rate (1 to 120)
`scene_detect`	Auto-detect scene changes
`silence_remove`	Remove silent segments
`time_remap`	Gradual speed ramp
`freeze_frame`	Freeze a frame at a timestamp

📐 Spatial (8 skills)

Skill	Description
`resize`	Scale to specific dimensions
`crop`	Crop video region
`rotate`	Rotate by degrees
`flip`	Mirror horizontal/vertical
`pad`	Add padding / letterbox
`aspect`	Change aspect ratio (16:9, 4:3, 1:1, 9:16, 21:9)
`auto_crop`	Detect and remove black borders
`scale_2x`	Quick upscale with algo choice (2x, 4x)

🔊 Audio (27 skills)

Skill	Description
`volume`	Adjust audio level
`normalize`	Normalize loudness
`fade_audio`	Audio fade in/out
`remove_audio`	Strip all audio
`extract_audio`	Extract audio only
`bass` / `treble`	Boost/cut frequencies
`pitch`	Shift pitch by semitones
`echo`	Add echo / reverb
`equalizer`	Adjust specific frequency band
`stereo_swap`	Swap L/R channels
`mono`	Convert to mono
`audio_speed`	Change audio speed only
`chorus`	Chorus thickening effect
`flanger`	Sweeping jet flanger
`lowpass` / `highpass`	Frequency filters
`audio_reverse`	Reverse audio track
`compress_audio`	Dynamic range compression
`noise_reduction`	Remove background noise
`audio_crossfade`	Smooth audio crossfade
`audio_delay`	Add delay/offset to audio
`ducking`	Audio dynamic compression
`dereverb`	Remove room echo/reverb
`split_audio`	Extract left/right channel
`audio_normalize_loudness`	EBU R128 loudness normalization
`replace_audio`	Replace original audio track
`mix_audio`	Mix/blend audio tracks from two inputs (both audible)
`audio_bitrate`	Set audio encoding bitrate

📦 Encoding (13 skills)

Skill	Description
`compress`	Reduce file size (light, medium, heavy)
`convert`	Change codec (h264, h265, vp9, av1)
`quality`	Set CRF and encoding preset
`bitrate`	Set video/audio bitrate
`web_optimize`	Fast-start for web streaming
`container`	Change format (mp4, mkv, avi, mov, webm)
`pixel_format`	Set pixel format (yuv420p, yuv444p, etc.)
`hwaccel`	Hardware acceleration (cuda, vaapi, qsv)
`audio_codec`	Set audio codec (aac, mp3, opus, flac)
`frame_rate_interpolation`	Motion-interpolated FPS conversion
`frame_interpolation`	Smooth slow motion via motion interpolation
`two_pass`	Two-pass encoding for better quality
`hls_package`	HLS adaptive streaming packaging

🎬 Cinematic Presets (14 skills)

Skill	Description
`cinematic`	Hollywood film look — teal-orange grading
`blockbuster`	Michael Bay style — high contrast, dramatic
`documentary`	Clean, natural documentary look
`indie_film`	Indie art-house — faded, low contrast
`commercial`	Bright, clean corporate video
`dream_sequence`	Dreamy, soft, ethereal atmosphere
`action`	Fast-paced action movie grading
`romantic`	Soft, warm romantic mood
`sci_fi`	Cool blue sci-fi atmosphere
`dark_moody`	Dark, atmospheric, moody feel
`color_grade`	Cinematic color grading (teal_orange, warm, cool)
`color_temperature`	Adjust color temperature (warm/cool)
`letterbox`	Cinematic widescreen letterbox bars
`film_grain`	Film grain texture (light, medium, heavy)

📼 Vintage & Retro (9 skills)

Skill	Description
`vintage`	Classic old film look (50s–90s)
`vhs`	VHS tape aesthetic
`sepia`	Classic sepia/brown tone
`super8`	Super 8mm film look
`polaroid`	Polaroid instant photo
`faded`	Washed-out, faded look
`old_tv`	CRT television aesthetic
`damaged_film`	Aged/weathered film
`noir`	Film noir — B&W, high contrast

📱 Social Media (9 skills)

Skill	Description
`social_vertical`	TikTok / Reels / Shorts (9:16)
`social_square`	Instagram feed (1:1)
`youtube`	YouTube optimized
`twitter`	Twitter/X optimized
`gif`	Convert to animated GIF
`thumbnail`	Extract thumbnail frame
`caption_space`	Add space for captions
`watermark`	Overlay logo/watermark
`intro_outro`	Add intro/outro segments

✨ Creative Effects (14 skills)

Skill	Description
`neon`	Neon glow — vibrant edges and colors
`horror`	Dark, desaturated, grainy horror atmosphere
`underwater`	Blue tint, blur, darker underwater look
`sunset`	Golden hour warm glow
`cyberpunk`	Neon tones, high contrast cyberpunk
`comic_book`	Bold colors, comic/pop art style
`miniature`	Tilt-shift toy model effect
`surveillance`	Security camera / CCTV look
`music_video`	Punchy colors, contrast, vignette
`anime`	Anime / cel-shaded cartoon
`lofi`	Lo-fi chill aesthetic
`thermal`	Thermal / heat vision camera
`posterize`	Reduce color palette / screen-print
`emboss`	Emboss / relief surface effect

🧪 Special Effects (36 skills)

Skill	Description
`meme`	Deep-fried meme aesthetic
`glitch`	Digital glitch / databend
`mirror`	Mirror / kaleidoscope effect
`slow_zoom`	Slow push-in zoom
`black_and_white`	B&W with style options
`day_for_night`	Simulate nighttime from daytime
`dreamy`	Soft, ethereal dream look
`hdr_look`	Simulated HDR dynamic range
`datamosh`	Glitch art / motion vector visualization
`radial_blur`	Radial / zoom blur effect
`grain_overlay`	Cinematic film grain with intensity control
`burn_subtitles`	Hardcode subtitles
`selective_color`	Isolate specific colors
`perspective`	Perspective transform
`lut_apply`	Apply LUT color grading
`lens_correction`	Fix lens distortion
`fill_borders`	Fill black borders
`deshake`	Quick stabilization
`deinterlace`	Remove interlacing
`halftone`	Newspaper dot pattern
`false_color`	Pseudocolor heat map
`frame_blend`	Temporal frame blending
`tilt_shift`	Tilt-shift miniature effect
`color_channel_swap`	Color channel remapping
`ghost_trail`	Temporal motion trails
`glow`	Bloom / soft glow effect
`sketch`	Pencil drawing / ink line art
`chromatic_aberration`	RGB channel offset / color fringing
`boomerang`	Looping boomerang effect
`ken_burns`	Slow zoom pan for photos
`slowmo`	Smooth slow motion
`stabilize`	Remove camera shake
`timelapse`	Dramatic speed-up for timelapse
`zoom`	Zoom in/out effect
`scroll`	Scroll video vertically/horizontally
`monochrome`	Monochrome with optional tint

🎬 Transitions (3 skills)

Skill	Description
`fade_to_black`	Fade in from + fade out to black
`fade_to_white`	Fade in from + fade out to white
`flash`	Camera flash at a specific timestamp

🌀 Motion (5 skills)

Skill	Description
`spin`	Continuous animated rotation
`shake`	Camera shake / earthquake (light, medium, heavy)
`pulse`	Rhythmic breathing zoom effect
`bounce`	Vertical bouncing animation
`drift`	Slow cinematic pan (left, right, up, down)

🔮 Reveal Effects (3 skills)

Skill	Description
`iris_reveal`	Circle expanding from center
`wipe`	Directional wipe from black
`slide_in`	Slide video in from edge

🎵 Audio Visualization (1 skill)

Skill	Description
`waveform`	Audio waveform overlay (line, point, cline modes)

🔗 Multi-Input & Composition (10 skills)

Skill	Description
`grid`	Arrange video + images in a grid layout (xstack). Auto-includes video as first cell.
`slideshow`	Create slideshow from images with fade transitions. Optionally starts with the main video.
`overlay_image`	Picture-in-picture / watermark overlay. Supports multiple overlays auto-placed in corners. Accepts `animation=bounce` for motion.
`concat`	Concatenate video segments sequentially. Connect multiple videos/images to join them.
`xfade`	Smooth transitions between segments — 18 types: fade, dissolve, wipe, pixelize, radial, etc.
`split_screen`	Side-by-side (horizontal) or top-bottom (vertical) multi-video layout.
`animated_overlay`	Moving image overlay with motion presets: scroll, float, bounce, slide.

✏️ Text & Graphics (9 skills)

Skill	Description
`animated_text`	Animated text overlay
`scrolling_text`	Scrolling credits-style text
`ticker`	News-style scrolling ticker bar
`lower_third`	Professional broadcast lower third
`countdown`	Countdown timer overlay
`typewriter_text`	Typewriter reveal effect
`bounce_text`	Bouncing animated text
`fade_text`	Text that fades in and out
`karaoke_text`	Karaoke-style fill text

✂️ Editing & Delivery (13 skills)

Skill	Description
`picture_in_picture`	PiP overlay window with optional border
`blend`	Blend two video inputs
`delogo`	Remove logo from a region
`remove_dup_frames`	Strip duplicate/stuttered frames
`mask_blur`	Blur a rectangular region for privacy
`extract_frames`	Export frames as image sequence
`jump_cut`	Auto-cut to high-energy moments
`beat_sync`	Sync cuts to a beat interval
`color_match`	Auto histogram equalization
`extract_subtitles`	Extract subtitle track
`preview_strip`	Filmstrip preview of key frames
`sprite_sheet`	Contact sheet of frames

🤖 AI-Powered (4 skills)

Skill	Description
`auto_transcribe`	Transcribe audio with Whisper AI and burn SRT subtitles
`karaoke_subtitles`	Word-by-word karaoke subtitles with progressive color fill (Whisper)
`auto_mask`	SAM3-powered object segmentation from text prompts
`generate_audio`	AI-generate synchronized audio/foley from video + text (MMAudio)

⚠️ License Notice: The generate_audio skill uses MMAudio model weights which are licensed under CC-BY-NC 4.0 (non-commercial use only). Model weights are downloaded on first use — by downloading them you accept the CC-BY-NC 4.0 license. The FFMPEGA code itself remains GPL-3.0.

🧠 Agentic Tools

Beyond the 200+ editing skills, the agent has built-in tools for analyzing media and making better decisions. In LLM mode, the agent calls these autonomously based on your prompt. Some (like analyze_video and search_skills) are also invoked directly by internal skills, the Effects Builder, and no-LLM modes.

🔍 Analysis & Discovery

Tool	What It Does
`analyze_video`	Probes resolution, duration, codec, FPS, bitrate — the agent calls this to understand your source
`extract_frames`	Extracts PNG frames for vision models to "see" the video content
`analyze_colors`	Numeric color metrics (luminance, saturation, color balance) via ffprobe signalstats — guides color grading decisions without vision
`analyze_audio`	Numeric audio metrics (volume dB, EBU R128 loudness LUFS, silence detection) — guides audio effect decisions
`search_skills`	Searches skills by keyword — the agent always calls this to find the right skills
`list_luts`	Lists available LUT files for color grading — called before `lut_apply` to discover available looks

🎨 LUT Color Grading System

8 bundled LUT files for cinematic color grading. The agent discovers these via list_luts and applies them with the lut_apply skill.

LUT	Style
`cinematic_teal_orange`	Hollywood teal-orange grade
`warm_vintage`	Warm retro film look
`cool_scifi`	Cool blue sci-fi tone
`film_noir`	Classic noir — desaturated, crushed
`golden_hour`	Warm golden sunlight
`cross_process`	Cross-processed film chemistry
`bleach_bypass`	Bleach bypass — low saturation, high contrast
`neutral_clean`	Subtle clarity enhancement

Adding your own LUTs: Drop .cube or .3dl files into the luts/ folder. The agent will discover them via list_luts automatically. Short names auto-resolve to full paths (e.g., cinematic_teal_orange → luts/cinematic_teal_orange.cube).

✅ Output Verification Loop

When verify_output is enabled (default: On), the agent inspects its own output after execution:

Extracts frames from the output video
Runs color and/or audio analysis on the result
Sends analysis to the LLM with the original prompt for quality assessment
If the LLM detects issues, it auto-corrects the pipeline and re-executes once

This closes the feedback loop — the agent can catch and fix mistakes like wrong color grades, failed effects, or audio issues without re-queuing.

🧩 Custom Skills

Create your own skills via YAML — no Python required. Drop a .yaml file in custom_skills/, restart ComfyUI, and the agent can use it immediately.

# custom_skills/dreamy_blur.yaml
name: dreamy_blur
description: "Soft dreamy blur with glow"
category: visual
tags: [dream, blur, soft, glow]

parameters:
  radius:
    type: int
    default: 5
    min: 1
    max: 30

ffmpeg_template: "gblur=sigma={radius},eq=brightness=0.06"

Skill packs — installable collections of related skills, optionally with Python handlers for complex logic:

# Linux / macOS / Windows (Git Bash or PowerShell)
cd custom_skills/
git clone https://github.com/someone/ffmpega-retro-pack retro-pack

Two example skills ship in custom_skills/examples/ — warm_glow.yaml (template) and film_burn.yaml (pipeline composite).

📄 See CUSTOM_SKILLS.md for the full schema reference, skill pack structure, Python handlers, and advanced examples.

🤖 LLM Configuration

Tested with: FFMPEGA has been primarily tested using Gemini CLI and Qwen3 8B (via Ollama). Results may vary with other models.

Author's pick: The CLI connectors (especially Gemini CLI) have been the most reliable option in my experience — they handle tool-calling, structured output, and long context exceptionally well. Highly recommended if you have access.

Choosing a Model

FFMPEGA works best with models that have strong JSON output and instruction-following abilities. The agent sends a structured prompt and expects a valid JSON pipeline back — models with tool-calling or function-calling capabilities tend to perform best.

Things to keep in mind:

Some models work better than others — larger models and those trained for structured output (JSON/tool-calling) produce more reliable results
Some models may need more retries — if the agent fails to parse the response, try running the same prompt again. Smaller models occasionally return malformed JSON on the first try
Find what works best for you — experiment with different models to find the right balance of speed, quality, and reliability for your hardware

Ollama (Local — Free)

The default option. Runs locally, no API key needed.

Install Ollama: Download from ollama.com/download (available for Linux, macOS, and Windows).

# Start Ollama (all platforms)
ollama serve

# Pull a model
ollama pull qwen3:8b

Windows users: After installing Ollama, the ollama command is available in both PowerShell and Command Prompt. Ollama also runs as a system tray app.

Recommended local models (≤30B, consumer GPU friendly):

Model	Size	Speed	Notes
`qwen3:8b`	8B	⚡ Fast	Tested — excellent structured output, native tool-calling
`qwen3-vl`	8B	⚡ Fast	Tested — multimodal vision-language model, sees video frames
`qwen3:14b`	14B	⚡ Fast	Sweet spot of speed and quality, tools + thinking tags
`qwen3:30b`	30B	🔄 Medium	Best Qwen under 30B, needs 16GB+ VRAM
`qwen2.5:14b`	14B	⚡ Fast	Top IFEval scores, strong instruction following
`deepseek-r1:14b`	14B	⚡ Fast	Reasoning model — verifies its own tool calls, very reliable
`mistral-nemo`	12B	⚡ Fast	NVIDIA + Mistral collab, 128k context, great reasoning
`mistral-small3.2`	24B	🔄 Medium	Native function calling, 128k context
`phi4`	14B	⚡ Fast	Microsoft reasoning SLM, rivals larger models in logic
`gemma3:12b`	12B	⚡ Fast	High reasoning scores, 128k context
`llama3.3:8b`	8B	⚡ Fast	Reliable tool-calling, large ecosystem

Tip: On the Ollama library, look for models with a tools tag — this indicates native tool/function-calling support, which produces the best results with FFMPEGA.

OpenAI

llm_model: gpt-5.2
api_key: your-openai-key

Gemini (Google API)

llm_model: gemini-3-flash
api_key: your-google-ai-key

Gemini CLI (Free with any Google Account)

Use the Gemini CLI to run Gemini models without an API key. Works with any Google account.

Install:

Platform	Command
Linux / macOS	`npm install -g @google/gemini-cli`
Windows (PowerShell)	`npm install -g @google/gemini-cli`

Tip: You can also use pnpm add -g or yarn global add if you prefer.

Authenticate (first time only):

gemini

This opens a browser to sign in with your Google account.

Use in FFMPEGA:

llm_model: gemini-cli

No API key is needed — authentication is handled by the CLI. Select gemini-cli from the model dropdown in the node.

Note: The Gemini CLI runs as a subprocess and is sandboxed to the custom node directory for security. On Windows, gemini.cmd is also detected automatically.

Plans & Usage Limits

Plan	Rate Limit	Daily Limit	Models	Cost
Free (Google login)	60 req/min	1,000 req/day	Gemini model family (Pro + Flash)	Free
Free (API key only)	10 req/min	250 req/day	Flash only	Free
Code Assist Standard	120 req/min	1,500 req/day	Gemini model family	Paid
Code Assist Enterprise	120 req/min	2,000 req/day	Gemini model family	Paid
Google AI Pro	Higher	Higher	Full Gemini family	$19.99/mo

Tip: Sign in with a Google account (free) for the best experience — 1,000 requests/day with access to Pro and Flash models. An unpaid API key limits you to 250/day on Flash only.

Available Models

The Gemini CLI auto-selects the best model, but the following are available:

Model	Best For
Gemini 2.5 Pro	Complex reasoning, creative tasks
Gemini 2.5 Flash	Fast responses, high throughput
Gemini 2.5 Flash-Lite	Maximum speed, lowest cost
Gemini 3 Pro	Most capable, advanced reasoning
Gemini 3 Flash	Fast + capable, good balance

Free tier may auto-switch to Flash models when Pro quota is exhausted.

Anthropic

llm_model: claude-sonnet-4-6
api_key: your-anthropic-key

Claude Code CLI (Free with Anthropic Account)

Use the Claude Code CLI as a local LLM backend. Uses its own authentication — no API key needed in FFMPEGA.

Install:

Platform	Command
Linux / macOS	`npm install -g @anthropic-ai/claude-code`
Windows (PowerShell)	`npm install -g @anthropic-ai/claude-code`

Tip: You can also use pnpm add -g or yarn global add if you prefer.

Authenticate (first time only):

claude

This opens a browser to sign in with your Anthropic account.

Use in FFMPEGA:

llm_model: claude-cli

Auto-detected on PATH. Select claude-cli from the model dropdown.

Cursor Agent CLI

Use Cursor's CLI in agent mode as an LLM backend.

Install (all platforms): Open Cursor IDE → Command Palette (Ctrl+Shift+P / Cmd+Shift+P) → "Install 'cursor' command"

Start the agent:

agent

Use in FFMPEGA:

llm_model: cursor-agent

Auto-detected on PATH. Select cursor-agent from the model dropdown.

Qwen Code CLI (Free — 2,000 requests/day)

Use Qwen Code as a free LLM backend. Powered by Qwen3-Coder with 2,000 free requests/day via OAuth — no credit card required.

Install:

Platform	Command
Linux / macOS	`npm install -g @qwen-code/qwen-code@latest`
Windows (PowerShell)	`npm install -g @qwen-code/qwen-code@latest`

Tip: You can also use pnpm add -g or yarn global add if you prefer.

Authenticate (first time only):

qwen

Select "Qwen OAuth (Free)" and follow the browser prompts to sign in.

Use in FFMPEGA:

llm_model: qwen-cli

Auto-detected on PATH. Select qwen-cli from the model dropdown.

👁️ CLI Agent Vision Support

When Frame Extraction is used, FFMPEGA saves extracted frames to a _vision_frames/ directory and passes the frame paths to the CLI agent. Agents with vision support can see and analyze the actual frame images to make better editing decisions.

CLI Agent	Vision Support	Notes
Gemini CLI	✅ Yes	`read_file` converts images to base64 for multimodal analysis
Claude Code CLI	✅ Yes	Native image reading and description
Cursor Agent CLI	✅ Yes	Native image reading and description
Qwen Code CLI	❌ Not yet	Known issue — `read_file` returns raw binary instead of interpreting images. Vision is listed as a planned feature.

Note: Agents without vision support still receive the frame file paths and can use video metadata (duration, resolution, FPS) from analyze_video to make editing decisions. When Qwen fixes their vision support upstream, it will work automatically since the frame paths are already passed correctly.

⚠️ Important: The _vision_frames/ directory must not be listed in .gitignore or .git/info/exclude — CLI agents respect these ignore patterns and will be unable to read the frames. FFMPEGA's cleanup_vision_frames() automatically deletes the directory after each pipeline run.

Custom Model

Select custom from the model dropdown and type any model name in the custom_model field. The provider is auto-detected from the name:

Prefix	Provider
`gpt-*`	OpenAI
`claude-*`	Anthropic
`gemini-*`	Google
Anything else	Ollama (local)

This lets you use any new model immediately without waiting for a code update.

🔒 API Key Security

Your API keys are automatically scrubbed and never stored in output files:

Error messages — keys are redacted before being shown in the UI (e.g. ****abcd)
Workflow metadata — ComfyUI embeds workflow data in output images/videos; FFMPEGA strips the api_key field from this metadata before saving
HTTP errors — keys are removed from network error messages that might include auth headers
Debug logs — LLMConfig redacts keys in all string representations

No configuration needed — this protection is always active when an API key is provided.

⚠️ Safety precaution: As with any software, always inspect your output files before sharing them publicly — in the unlikely event of a bug or edge case that bypasses the automatic scrubbing.

📊 Token Usage Tracking

Monitor your LLM token consumption with opt-in usage tracking. Enable via two toggles on the node:

Toggle	Default	What It Does
`track_tokens`	Off	Prints a formatted usage summary to the console after each run
`log_usage`	Off	Appends a JSON entry to `usage_log.jsonl` for cumulative tracking

Token data sources by connector:

Connector	Source	Estimated?
Ollama	Native API (`prompt_eval_count` / `eval_count`)	No
OpenAI / Gemini API	Native API (`usage` field)	No
Anthropic API	Native API (`usage.input_tokens`)	No
Gemini CLI	JSON output via `-o json`	No
Claude CLI	JSON output via `--output-format json`	No
Other CLIs	Character-based estimation (~4 chars/token)	Yes

When enabled, the analysis output includes a usage breakdown:

Token Usage:
  Prompt tokens:     4,200
  Completion tokens: 1,800
  Total tokens:      6,000
  LLM calls:         5
  Tool calls:        3
  Elapsed:           12.4s

The usage_log.jsonl file stores one JSON object per run for historical analysis. It is gitignored by default.

🐛 Troubleshooting

FFMPEG Not Found

Ensure FFMPEG is installed and in your system PATH:

ffmpeg -version

Install FFMPEG:

Platform	Command / Method
Ubuntu / Debian	`sudo apt install ffmpeg`
Arch / CachyOS	`sudo pacman -S ffmpeg`
Fedora	`sudo dnf install ffmpeg`
macOS	`brew install ffmpeg`
Windows (winget)	`winget install Gyan.FFmpeg`
Windows (choco)	`choco install ffmpeg`
Windows (scoop)	`scoop install ffmpeg`
Windows (manual)	Download from ffmpeg.org/download, extract, and add the `bin/` folder to your system PATH

Windows PATH tip: After installing, open a new terminal and run ffmpeg -version to verify. If not found, you may need to add ffmpeg's bin/ directory to your system PATH manually: Settings → System → About → Advanced system settings → Environment Variables → Edit Path.

Ollama Connection Failed

Make sure Ollama is running:

ollama serve

If using a custom URL, set it in the node's ollama_url field.

Model Not Found

Pull the required model first:

ollama pull qwen2.5:8b

LLM Returns Empty Response

This usually means:

The model is still loading (first request after start)
The prompt is too long for the model's context window
Try running the same prompt again
Try a different model

Parameter Validation Errors

FFMPEGA auto-coerces types (float→int) and clamps out-of-range values. If you still see errors, try simplifying your prompt or using a more capable model.

Cancelling a Running Request

If the LLM is taking too long or you want to abort mid-request, close the ComfyUI terminal or restart ComfyUI instead of using the interrupt button. The interrupt button waits for the current LLM response to complete, which can take a while — closing/restarting ComfyUI kills it immediately.

🤝 Contributing

Contributions are welcome! Whether it's bug reports, new skills, or improvements — your help is appreciated.

Fork the Project
Create your Feature Branch (git checkout -b feature/AmazingFeature)
Commit your Changes (git commit -m 'Add some AmazingFeature')
Push to the Branch (git push origin feature/AmazingFeature)
Open a Pull Request

📝 License

This project is licensed under the GPL-3.0 License — see the LICENSE file for details.

Developed by Æmotion Studio

Name		Name	Last commit message	Last commit date
Latest commit History 78 Commits
.github/workflows		.github/workflows
config		config
core		core
custom_skills		custom_skills
example_workflows		example_workflows
js		js
luts		luts
mcp		mcp
nodes		nodes
prompts		prompts
skills		skills
stubs		stubs
tests		tests
.coveragerc		.coveragerc
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
ADVANCED_TEST_PROMPTS.md		ADVANCED_TEST_PROMPTS.md
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
CUSTOM_SKILLS.md		CUSTOM_SKILLS.md
LICENSE		LICENSE
LLMS.md		LLMS.md
README.md		README.md
SKILLS_REFERENCE.md		SKILLS_REFERENCE.md
SKILL_TEST_PROMPTS.md		SKILL_TEST_PROMPTS.md
__init__.py		__init__.py
conftest.py		conftest.py
install.py		install.py
py.typed		py.typed
pyproject.toml		pyproject.toml
pyrightconfig.json		pyrightconfig.json
requirements.txt		requirements.txt
ruff.toml		ruff.toml
server.py		server.py

Folders and files

Latest commit

History

Repository files navigation

ComfyUI-FFMPEGA

🚀 What's New in v2.9.1 (March 3, 2026)

✨ Features

🗣️ Natural Language Editing

🏗️ Manual Mode — No AI Required

🤖 Multi-LLM Support

🎨 200+ Skills

🎨 Right-Click Presets

⚡ Batch & Preview

🎬 Examples

4×4 Video Grid

Crossfade Transitions + Bouncing Logo

Color Grade + Text Overlay + Compression

Picture-in-Picture with Audio Mix

Vintage Film Look + Subtitles

Datamosh Glitch Effects

Cinematic Teal & Orange

Neon Glow Edge Detection

Colorhold Noir

Green Screen Removal

📦 Installation

Requirements

Option 1: ComfyUI Manager (Recommended)

Option 2: Manual Install

🚀 Quick Start

Example Prompts

💬 Prompt Guide

Specifying Exact Values

What Works Well ✅

What Might Not Work as Expected ⚠️

🧠 AI Models (Auto-Downloaded)

🎛️ Nodes

🎯 Skill System

🧠 Agentic Tools

🤖 LLM Configuration

Choosing a Model

Ollama (Local — Free)

OpenAI

Gemini (Google API)

Gemini CLI (Free with any Google Account)

Plans & Usage Limits

Available Models

Anthropic

Claude Code CLI (Free with Anthropic Account)

Cursor Agent CLI

Qwen Code CLI (Free — 2,000 requests/day)

👁️ CLI Agent Vision Support

Custom Model

🔒 API Key Security

📊 Token Usage Tracking

🐛 Troubleshooting

🤝 Contributing

📝 License

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors 2

Languages

Packages