The ultimate video editing suite for ComfyUI — edit with natural language or hands-on manual controls.
Use AI to describe edits in plain English, or take full manual control with the Effects Builder and text presets — no LLM required.
Features • Examples • Installation • Quick Start • Prompt Guide • Skills • LLM Setup • Troubleshooting • Contributing • Changelog
AI Audio Generation, AI Lip Sync, FLUX Klein Editing, Modular Architecture, CI/CD & Type Safety
- ✨ AI Object Removal & Editing (FLUX Klein 4B): New `auto_mask:effect=remove` and `auto_mask:effect=edit`, powered by FLUX Klein 4B (Apache 2.0). AI-powered per-frame object removal replaces LaMa with higher-quality results. The new `edit` effect enables text-guided video changes (e.g. "change hair to red", "replace background with beach"). Reference-image conditioning, 4-step inference, temporal smoothing, ~8–13 GB VRAM.
- 🔊 AI Audio Generation (`generate_audio`): New skill powered by MMAudio — synthesizes synchronized audio/foley from video and/or text descriptions. Supports video-to-audio, text-to-audio, and automatic long-video chunking with crossfade. 11 natural language aliases (`foley`, `sound_effects`, `v2a`, etc.).
- 👄 AI Lip Sync (`lip_sync`): New skill powered by MuseTalk V15 — synchronizes lip movements to match provided audio. Video+audio and image+audio inputs, multi-face support, batch inference. Zero new pip dependencies. 7 aliases (`lipsync`, `dub`, `dubbing`, `sync_lips`, `talking_head`, `lip_dub`, `voice_sync`). Subprocess isolation for CUDA memory safety.
- 🏗️ Modular Node Architecture: Broke up the monolithic `agent_node.py` into 6 focused modules — `input_resolver`, `execution_engine`, `output_handler`, `batch_processor`, `nollm_modes`, and `pipeline_assembler`. The Agent node is now a thin orchestrator.
- 🔒 Pyright Type Checking: Added `pyrightconfig.json` with type stubs for ComfyUI modules. Fixed 78 type errors across `core/`, `skills/`, `mcp/`.
- 🧱 Platform Abstraction (`core/platform.py`): All ComfyUI boundary interactions extracted into a single adapter module. Core logic no longer imports ComfyUI directly.
- 📋 Structured Logging (`core/logging.py`): `JSONFormatter` for machine-readable logs, a `get_logger()` convenience function, and a `LogTimer` context manager. Enable via `FFMPEGA_LOG_JSON=1`.
- 🧪 CI Workflow & 799 Tests: New `.github/workflows/ci.yml` with Pytest + Pyright on every PR/push. 12 new integration tests running real FFmpeg pipelines. Test suite expanded from 656 → 799 tests, 0 failures.
- 📖 CONTRIBUTING.md & Pre-commit: New contributor docs with architecture overview and PR guidelines. Pre-commit hooks with Ruff lint/format and Pyright.
- 🧠 Subprocess Isolation: Audio generation runs in a subprocess to prevent CUDA memory leaks — the same proven pattern as SAM3. Falls back to in-process with VRAM offloading.
- ⚡ Native Safetensors & Memory-Efficient Loading: Uses `comfy.utils.load_torch_file` for direct `.safetensors` loading, plus `accelerate`'s zero-copy model init — no 3× memory spike during model loading.
- 🪞 Model Mirror Repositories: MMAudio on `AEmotionStudio/mmaudio-models` (fp16, ~5.5 GB). MuseTalk UNet on `AEmotionStudio/musetalk-models` (fp16 1.6 GB / fp32 3.2 GB). Mirror-first download with upstream HuggingFace fallback.
- 🔗 Mask Points Chaining: `SaveVideoNode` and `LoadVideoPathNode` now pass `mask_points` data through the node chain — upstream segmentation points propagate to downstream processing.
- ♿ Point Selector Accessibility: ARIA roles, `aria-modal`, `aria-live` regions, and hidden decorative emojis for screen reader support.
- 🔧 Bug Fixes: LaMa cache false-negative, log propagation leak, token usage display, platform import robustness.
Previous: v2.8.0 — Effects Builder, Manual Mode, SAM3 Hardening & 15+ Bug Fixes
- 🏗️ Effects Builder Node: New companion node for manual effect composition — select up to 3 skills with params, combine with raw FFmpeg filters, and use presets. No LLM required.
- 🎬 Effects Builder Presets: Right-click the Effects Builder for quick access to all 18 built-in presets, plus save/load/delete custom presets and a "Clear All Effects" reset.
- 📝 Text Node Presets: Right-click the FFMPEGA Text node for 10 built-in presets (SRT Subtitle Example, Cinematic Subtitles, Watermark, Title Card, Social Caption, Meme Text, Lower Third, Copyright Notice, Credits Roll, Chapter Marker) with example text. Custom save/load/delete and "Clear Text" reset.
- 💬 No-LLM Text Support: Connect a Text node to the Agent in no-LLM manual mode (without an Effects Builder) and it auto-generates a text overlay or subtitle pipeline.
- 🏛️ Manual Mode (Default): New `manual` no-LLM mode — set `llm_model` to `none` and use the Effects Builder to edit videos without any AI. This is now the default `no_llm_mode`.
- 🎙️ Whisper No-LLM Modes: New `transcribe` and `karaoke_subtitles` options in the `no_llm_mode` dropdown — run Whisper directly without an LLM. Also available as Effects Builder presets.
- 🧠 SAM3 Subprocess Isolation: SAM3 now runs in a separate subprocess to prevent CUDA memory leaks. Multiple OOM fixes, point prompt support, and real-time progress streaming.
- 🔒 MCP Security: Fixed path traversal vulnerabilities in MCP tools. SkillComposer parameter hardening.
- 🔗 Effects Builder Multi-Input: Fixed concat, grid, xfade, and all multi-input skills in the Effects Builder. Extra inputs (`video_b`, etc.) are now correctly injected.
- 📐 Concat Resolution Fix: Concat/xfade/slideshow now default to input resolution instead of hardcoded 1920×1080.
- 🔧 Dynamic Slot Root Cause Fix: `video_b` no longer disappears on page refresh.
- ⚡ Performance: Ultrafast temp video encoding, optimized pipe buffers, SAM3 VRAM offloading.
- 🐛 Template Placeholder Fix: Unsubstituted template placeholders no longer crash ffmpeg.
Previous: v2.7.1 — Advanced Options Toggle, Dynamic Input Fix & Default Refinements
- ⚙️ Advanced Options Toggle: New Simple/Advanced toggle hides power-user settings (`preview_mode`, `crf`, `encoding_preset`, `video_path`, `subtitle_path`, `batch_mode`) behind a single switch. The node is now much more compact by default.
- 🔧 Dynamic Input Persistence: Fixed dynamic input slots (e.g. `video_b` auto-appearing when `video_a` is connected) not restoring when loading saved workflows or images.
- 🎛️ Default Refinements: Whisper defaults to CPU (avoids VRAM pressure), token tracking enabled by default, PTC mode and SAM3 CPU inputs removed from UI.
- ⚠️ SAM3 Checkpoint Warnings: Detects wrong checkpoint format and logs reconversion instructions.
- 🧹 Code Cleanup: Removed ~1100 lines of dead code, added `ValidationError` for path validation, token log rotation for `usage_log.jsonl`.
Previous: v2.7.0 — SAM3 Auto-Mask, LaMa Inpainting & PTC
- 🎭 SAM3 Auto-Mask: New `remove` skill powered by Segment Anything Model 3 — describe what to remove in natural language (e.g. "remove the person") and SAM3 generates per-frame masks automatically.
- 🖌️ LaMa Inpainting: AI-powered video object removal using LaMa (Large Mask Inpainting). Temporal Gaussian smoothing reduces frame-to-frame flicker. Falls back to black fill if LaMa is not installed.
- 🟢 Greenscreen: New `greenscreen` skill — uses SAM3 masks to replace backgrounds with solid colors or transparency (WebM output).
- ⚡ Programmatic Tool Calling (PTC): New `execute_code` tool lets LLMs write a single Python script that orchestrates multiple tool calls in one pass, reducing round-trips from ~6 to 1. Three modes: `off`, `auto`, `on`.
- 🔒 PTC Sandbox: Hardened with 25+ escape vector blocks — dunder introspection, module access, traceback traversal, dynamic attribute access, and `chr()` construction bypasses.
- 🧪 656 Tests: Expanded test suite from 516 → 656 with 0 failures. 34 new PTC executor tests, security hardening tests, and corrected mock contracts.
Previous: v2.6.5 — Whisper Auto-Transcription, Karaoke Subtitles & Letterbox Fixes
- 🎙️ Auto-Transcribe: New `auto_transcribe` skill — transcribes video audio with OpenAI Whisper and burns SRT subtitles directly into the output. Supports multi-video concat with correct cross-clip timing.
- 🎤 Karaoke Subtitles: New `karaoke_subtitles` skill — word-by-word progressive-fill karaoke effect using Whisper's word-level timestamps and ASS `\kf` tags.
- ⚙️ Whisper Controls: Choose your Whisper model size (`tiny` → `large-v3`) and device (`gpu`/`cpu`) via node settings — trade off speed vs. accuracy, or offload to CPU on low-VRAM systems.
- 📐 Letterbox Fix: Replaced `crop`+`pad` with `drawbox` for letterboxing, preserving video content. Handles both letterbox and pillarbox cases correctly.
- 🧹 Code Dedup: Extracted shared `ffmpeg_escape_path`, `color_to_ass_bgr`, and `_run_transcription` helpers — 3 deduplication refactors reducing maintenance surface.
- 🔧 9 Bug Fixes: Whisper memory leak, xfade subtitle timing, ASS escaping, hex color validation, aspect ratio div-by-zero, and more.
Previous: v2.6.0 — HandlerResult, Compose Decomposition & 516 Tests
- 🏗️ HandlerResult Contract: All 9 handler modules now return a formal `HandlerResult` dataclass, replacing ad-hoc tuples. Backward-compatible with existing code.
- 🔧 Compose Decomposition: Extracted 5 orchestration methods from the 600+ line `compose()` into pure, testable static methods.
- 🎵 PiP Audio Mixing: `picture_in_picture` now supports `audio_mix` to blend both audio tracks via ffmpeg's `amix`.
- 🔄 CLI Retry: CLI connectors retry on transient failures with exponential backoff (3 attempts).
- 📝 TextInput Node: New node for subtitle and text overlay workflows with auto SRT detection.
- 🔒 Security: Sanitized text overlay `enable` parameter, fixed path traversal on output dirs, fixed weak UUID entropy.
- 🧪 516 Tests: Expanded test suite from 481 → 516 with 0 failures. New handler unit tests, skill combination tests, and orchestration helper tests.
Previous: v2.5.0 — PiP Overlay Fixes, VL Model Vision & Border Support
- 🖼️ PiP Border Support: `picture_in_picture` now accepts `border` and `border_color` parameters.
- 👁️ Ollama VL Auto-Embedding: Vision-language models automatically receive 3 video frames in the initial message.
- 🔗 PiP Alias Fix: Models using `pip`, `picture-in-picture`, etc. now correctly resolve to `picture_in_picture`.
- 🔧 Ollama VL Verification: Fixed 400 error when verifying output with Ollama VL models.
Previous: v2.4.0 — Pipeline Chaining, Animated Overlays & Zero-Memory Image Paths
- 🖼️ Zero-Memory Image Paths: Image inputs passed as file paths instead of decoded tensors.
- 🎯 Overlay Animation: `overlay_image` supports `animation=bounce` and more motion presets.
- 🔗 Pipeline Chaining Fixes: Fixed filter graph chaining for multi-skill pipelines.
- 🏗️ Handler Module Extraction: Skill handlers split into `skills/handlers/` modules.
- 🔒 Security Hardening: Extended FFmpeg parameter sanitization.
- ⚡ Performance: `frames_to_tensor` pre-allocates memory.
Previous: v2.3.0 — Token Tracking, LUT Color Grading & Vision
- 📊 Token Usage Tracking: Opt-in `track_tokens` and `log_usage` toggles — monitor prompt/completion tokens, LLM calls, tool calls, and elapsed time.
- 🎨 LUT Color Grading: 8 bundled cinematic `.cube` LUT files. Drop custom LUTs into `luts/` for automatic discovery.
- 🖼️ Vision System: Multimodal frame analysis — the agent extracts frames and "sees" the video.
- 🔊 Audio Analysis: Volume (dB), EBU R128 loudness (LUFS), and silence detection.
- 🤖 Real Token Stats: Gemini CLI and Claude CLI return native token counts via JSON output.
📄 See CHANGELOG.md for the complete version history.
NotebookLM Overview: Exploring the features and capabilities of ComfyUI-FFMPEGA. (Click to watch on YouTube)
Describe edits in plain text: "Make it cinematic with a fade in", "Speed up 2x", "VHS look with grain". The AI agent interprets your prompt and builds the FFMPEG pipeline automatically.

Use the Effects Builder to visually compose up to 3 effects with parameters. Add text overlays and subtitles via preset-powered Text nodes. Full editing control with zero LLM dependency.

Works with Ollama (local, free), OpenAI, Anthropic, Google Gemini, and CLI tools (Gemini CLI, Claude Code, Cursor Agent, Qwen Code). Use any local model — Llama 3.1, Qwen3, Mistral, and more. Or skip the LLM entirely.

200+ video editing skills across visual effects, audio processing, spatial transforms, temporal edits, encoding, cinematic presets, vintage looks, social media, creative effects, text animations, editing & composition, audio visualization, multi-input operations, transitions, concat, split screen, and AI-powered skills (Whisper transcription, SAM3 masking, MMAudio generation, MuseTalk lip sync).

18 built-in Effects Builder presets and 10 Text node presets with example content. Save/load/delete your own custom presets. One-click clear to reset.

Process multiple videos with the same instruction. Generate quick low-res previews before committing to full renders. Quality presets from draft to lossless.
See what FFMPEGA can do — each example shows the prompt or preset used, the input clip, and the result.
Prompt: `Concatenate these clips in a 4x4 grid`
| Before | After |
Prompt: `Concatenate these clips with a crossfade transition between each, add a fade in at the start and fade out at the end and a bouncing image_path_a at 10% size and 30% opacity`
| Before | After |
Prompt: `Color grade with the cinematic teal orange LUT, normalize audio, add a text "water" in the bottom right corner, compress for web at 720p`
| Before | After |
Prompt: `Place video_b in the bottom-right corner at 25% size with a white border over video_a and mix the audio`
| Before | After |
Prompt: `Normalize audio to -14 LUFS, add a warm vintage film look with grain overlay, and burn in these subtitles`
| Before | After |
Prompt: `Apply datamosh glitch effect, add chromatic aberration with strong RGB split, pixelate slightly, add ghost trails`
| Before | After |
Prompt: `Add a cinematic teal and orange color grade, apply a subtle vignette, and fade in from black`
| Before | After |
Prompt: `Apply a neon glow edge detection effect, add chromatic aberration, and slow the video to 0.5x speed with smooth motion`
| Before | After |
Prompt: `Use colorhold to keep only the red, desaturate everything else, boost contrast to 1.5, add a strong vignette, apply noir style`
| Before | After |
Prompt: `Remove the green screen with chroma key, despill the green edges, sharpen slightly`
| Before | After |
- ComfyUI (latest)
- Python 3.10+
- FFMPEG installed and in PATH (install guide)
- Node.js 18+ (required for CLI tools: Gemini CLI, Claude CLI, Qwen CLI — download)
- Ollama (optional, for local LLM inference — download)
- Open ComfyUI Manager
- Search for `ComfyUI-FFMPEGA`
- Click Install
Linux / macOS

```bash
cd /path/to/ComfyUI/custom_nodes
git clone https://github.com/AEmotionStudio/ComfyUI-FFMPEGA.git
cd ComfyUI-FFMPEGA
pip install -r requirements.txt
```

Windows (PowerShell)

```powershell
cd C:\path\to\ComfyUI\custom_nodes
git clone https://github.com/AEmotionStudio/ComfyUI-FFMPEGA.git
cd ComfyUI-FFMPEGA
pip install -r requirements.txt
```

Note: Use whichever Python package manager your ComfyUI venv uses (`pip`, `uv pip`, etc.). The above commands assume `pip` is available in your ComfyUI virtual environment.
Restart ComfyUI after installation.
- Add an FFMPEG Agent node to your workflow
- Connect a video path or use the input field
- Enter a natural language prompt
- Select your LLM model
- Run the workflow
💡 Tip: Right-click the FFMPEG Agent node to open the FFMPEGA Presets context menu — 200+ categorized effects you can apply with a single click, no prompt typing needed. Great for quick edits or discovering what's available.
| Prompt | What It Does |
|---|---|
"Make it cinematic with a vignette" |
Adds letterbox, color grade, and edge darkening |
"Speed up 2x, keep the audio pitch" |
Doubles speed with pitch-corrected audio |
"Make it look like old VHS footage" |
Adds noise, color shift, scan lines |
"Trim first 5 seconds, resize to 720p" |
Cuts intro and scales down |
"Underwater look with echo on audio" |
Blue tint, blur, and audio echo |
"Pixelate it like an 8-bit game" |
Mosaic/pixel art effect |
"Add 'Subscribe!' text at the bottom" |
Text overlay with positioning |
"Cyberpunk style with neon glow" |
High-contrast neon aesthetic |
"Normalize audio, compress for web" |
Loudness normalization + web optimization |
"Spin the video clockwise" |
Continuous animated rotation |
"Add camera shake" |
Random shake/earthquake effect |
"Fade in from black and out at the end" |
Smooth intro/outro transitions |
"Wipe reveal from the left" |
Directional wipe reveal animation |
"Make it pulse like a heartbeat" |
Rhythmic zoom breathing effect |
"Show audio waveform at the bottom" |
Audio visualization overlay |
"Arrange these images in a grid" |
Multi-image grid collage |
"Create a slideshow with fades" |
Image slideshow with transitions |
"Overlay the logo in the corner" |
Picture-in-picture / watermark |
"Create a side-by-side comparison" |
Video next to image in 2-column grid |
"Create a slideshow starting with the video" |
Video first, then image slides |
"Overlay images in the corners" |
Multiple images auto-placed in corners |
"Split screen, use audio from audio_b" |
Side-by-side video with specific audio track |
"Split screen, mix both audio tracks" |
Side-by-side with both audio tracks blended |
When using an LLM, the AI agent interprets your natural language and maps it to skills with specific parameters. Here's how to get the best results. (For manual editing without an LLM, see the Effects Builder and Text Input sections.)
You can request specific parameter values and the agent will use them directly:
| Prompt | What the Agent Does |
|---|---|
"Set brightness to 0.3" |
brightness:value=0.3 |
"Blur with strength 20" |
blur:radius=20 |
"Speed up to 3x" |
speed:factor=3.0 |
"Crop to 1280x720" |
crop:width=1280,height=720 |
"Deband with threshold 0.3 and range 32" |
deband:threshold=0.3,range=32 |
"CRF 18, slow preset" |
quality:crf=18,preset=slow |
"Fade in for 3 seconds" |
fade:type=in,duration=3 |
- Explicit numbers: "brightness 0.2", "speed 1.5x", "CRF 20" — the agent maps these directly
- Named presets: "VHS look", "cinematic style", "noir" — triggers multi-step preset pipelines
- Chaining operations: "Trim first 5 seconds, resize to 720p, add vignette" — executes in order
- Descriptive goals: "Make it look warmer", "Remove the green screen" — the agent picks the right skills
- Technical terms: "denoise", "deband", "normalize audio" — maps to exact FFmpeg filters
- Vague intensity words: "Make it very blurry" or "a little brighter" — the agent has to guess what number "very" or "a little" means. Tip: use a specific value instead: "blur with radius 15"
- Out-of-range values: Parameters are auto-clamped to their valid range. If you ask for "brightness 5.0" it caps at the max (1.0)
- Complex compositing: Multi-layer effects with precise timing may need to be broken into separate passes
- Format-dependent features: Some effects (like transparency) require specific output formats. H.264/MP4 doesn't support alpha channels
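The auto-clamping behaviour noted above can be sketched in a few lines. This is a minimal illustration only — the `PARAM_RANGES` table and `clamp_params` helper are hypothetical stand-ins, not FFMPEGA's actual internals:

```python
# Illustrative sketch of parameter auto-clamping. The ranges below are
# examples taken from this README's prompt tables, not a real schema.
PARAM_RANGES = {
    ("brightness", "value"): (-1.0, 1.0),
    ("contrast", "value"): (0.0, 3.0),
    ("speed", "factor"): (0.25, 4.0),
}

def clamp_params(skill: str, params: dict) -> dict:
    """Clamp each parameter into its declared valid range; pass unknowns through."""
    out = {}
    for name, value in params.items():
        lo, hi = PARAM_RANGES.get((skill, name), (float("-inf"), float("inf")))
        out[name] = min(max(value, lo), hi)
    return out

print(clamp_params("brightness", {"value": 5.0}))  # "brightness 5.0" caps at 1.0
```

So a request like "brightness 5.0" still runs — it just produces the maximum valid brightness rather than an ffmpeg error.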
Some skills use AI models that auto-download on first use. You can disable automatic downloads with the `allow_model_downloads` toggle on the FFMPEG Agent node — runs requiring a missing model will fail with a clear message and a manual download link.
All models are mirrored to first-party AEmotionStudio HuggingFace repos for supply chain resilience. Downloads try the AEmotionStudio mirror first, then fall back to upstream sources.
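A mirror-first download with upstream fallback is conceptually just an ordered list of sources tried in turn. A minimal sketch under that assumption — the helper name and plain-`urllib` approach are illustrative, not FFMPEGA's actual downloader:

```python
import urllib.error
import urllib.request

def download_with_fallback(dest: str, urls: list[str]) -> str:
    """Try each URL in order (mirror first, then upstream); return the one that worked."""
    last_err = None
    for url in urls:
        try:
            urllib.request.urlretrieve(url, dest)
            return url
        except urllib.error.URLError as err:
            last_err = err  # this source unavailable — fall through to the next
    raise RuntimeError(f"all sources failed: {last_err}")
```

The caller would pass the AEmotionStudio mirror URL first and the upstream HuggingFace URL second, so a mirror outage degrades to a slower download rather than a failure.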
| Model | Size | Stored In | Triggered By | Manual Download |
|---|---|---|---|---|
| SAM3 (Segment Anything 3) | ~300 MB | `ComfyUI/models/SAM3/` | `auto_mask` skill, `sam3_masking` no-LLM mode, Effects Builder SAM3 target | AEmotionStudio/sam3 — download `sam3.safetensors` |
| Whisper large-v3 | ~3 GB | `ComfyUI/models/whisper/` | `auto_transcribe`, `karaoke_subtitles` skills, `transcribe` / `karaoke_subtitles` no-LLM modes | AEmotionStudio/whisper-models |
| Whisper medium | ~1.5 GB | `ComfyUI/models/whisper/` | Same as above (set `whisper_model` to `medium`) | Same as above |
| Whisper small | ~500 MB | `ComfyUI/models/whisper/` | Same as above (set `whisper_model` to `small`) | Same as above |
| Whisper base | ~150 MB | `ComfyUI/models/whisper/` | Same as above (set `whisper_model` to `base`) | Same as above |
| Whisper tiny | ~75 MB | `ComfyUI/models/whisper/` | Same as above (set `whisper_model` to `tiny`) | Same as above |
| LaMa (Large Mask Inpainting) | ~200 MB | `~/.cache/torch/hub/checkpoints/` | `auto_mask:effect=remove` (legacy fallback) | AEmotionStudio/lama-inpainting — download `big-lama.pt` |
| FLUX Klein 4B (Editing/Removal) | ~15 GB (bf16) | `ComfyUI/models/flux_klein/` | `auto_mask:effect=remove`, `auto_mask:effect=edit` | AEmotionStudio/flux-klein |
| MMAudio (Video-to-Audio) | ~5.5 GB | `ComfyUI/models/mmaudio/` | `generate_audio` skill | AEmotionStudio/mmaudio-models |
| MuseTalk (Lip Sync) | ~1.6 GB (fp16) | `ComfyUI/models/musetalk/` | `lip_sync` skill | AEmotionStudio/musetalk-models |
| U²-Net (rembg) | ~170 MB | `~/.u2net/` | `remove_background` skill | Install with `pip install 'comfyui-ffmpega[masking]'` — model auto-fetched by rembg |
Note
Models are only downloaded when you use the corresponding skill for the first time. Core FFmpeg editing skills (200+ of them) require zero model downloads.
FFMPEGA provides 8 nodes that work together:
Tip
One task per run. Instead of cramming multiple edits into a single prompt, focus each run on one editing task — then feed the output back into FFMPEGA for the next. This keeps context low and model focus high, leading to significantly better results. Chain FFMPEGA Agent → Save Video → Load Video Path → FFMPEGA Agent for multi-step workflows.
Warning
Low VRAM? Skills that load AI models (SAM3 masking, Whisper transcription, LaMa inpainting) each consume significant VRAM. On GPUs with limited memory, limit each run to one model-loading task — e.g. do your SAM3 removal pass first, save the result, then run Whisper subtitles as a separate pass.
FFMPEG Agent — The main node. Translates natural language into FFMPEG commands.
Required Inputs
| Input | Type | Description |
|---|---|---|
| `video_path` | STRING | Absolute path to source video. Used as ffmpeg input unless `images_a` is connected. |
| `prompt` | STRING | Natural language editing instruction (e.g. "Add cinematic letterbox", "Speed up 2x"). Not required in manual mode. |
| `llm_model` | DROPDOWN | AI model selection — local Ollama models, CLI tools, or cloud APIs. Select `none` for no-LLM mode. |
| `no_llm_mode` | DROPDOWN | Mode when `llm_model` is `none`: `manual` (Effects Builder, default), `sam3_masking`, `transcribe`, `karaoke_subtitles`. |
| `quality_preset` | DROPDOWN | Output quality: `draft`, `standard`, `high`, `lossless`. |
| `seed` | INT | Change to force re-execution with the same prompt. Supports `randomize` control. |
Optional Inputs
| Input | Type | Description |
|---|---|---|
| `images_a` | IMAGE | Video frames from upstream (e.g. Load Video). Auto-expands: `images_b`, `images_c`... |
| `image_a` | IMAGE | Extra image input for multi-input skills (grid, slideshow, overlay). Auto-expands: `image_b`, `image_c`... |
| `audio_a` | AUDIO | Audio input for muxing or multi-audio workflows. Auto-expands: `audio_b`, `audio_c`... |
| `video_a` | STRING | File path to extra video for concat, split screen, grid, xfade. Zero memory. Auto-expands: `video_b`, `video_c`... |
| `image_path_a` | STRING | File path to image for overlay, grid, slideshow. Zero memory. Auto-expands: `image_path_b`... |
| `text_a` | STRING | Text input from FFMPEGA Text node for subtitles, overlays, watermarks. Auto-expands: `text_b`... |
| `pipeline_json` | STRING | Connect from FFMPEGA Effects Builder. In manual mode the pipeline is executed directly; with an LLM it provides skill hints. |
| `subtitle_path` | STRING | Direct path to a `.srt` or `.ass` subtitle file. |
| `advanced_options` | BOOLEAN | Simple/Advanced toggle — shows preview mode, CRF, encoding preset, batch processing when enabled. |
| `preview_mode` | BOOLEAN | Quick low-res preview (480p, 10s) instead of full render. |
| `save_output` | BOOLEAN | Save video + workflow PNG to output folder. |
| `output_path` | STRING | Custom output file/folder path. Empty = ComfyUI default. |
| `ollama_url` | STRING | Ollama server URL (default: `http://localhost:11434`). |
| `api_key` | STRING | API key for cloud models (GPT, Claude, Gemini). Auto-redacted from outputs. |
| `custom_model` | STRING | Exact model name when `llm_model` is set to `custom`. |
| `crf` | INT | Override CRF (0 = lossless, 23 = default, 51 = worst). -1 uses `quality_preset`. |
| `encoding_preset` | DROPDOWN | Override x264/x265 speed preset (`ultrafast` → `veryslow`). `auto` follows `quality_preset`. |
| `use_vision` | BOOLEAN | Embed video frames as images for vision-capable LLMs. Off = numeric color analysis only. |
| `verify_output` | BOOLEAN | Agent inspects output after rendering and auto-corrects if it doesn't match intent. |
Whisper / SAM3 Inputs
| Input | Type | Description |
|---|---|---|
| `whisper_device` | DROPDOWN | Device for Whisper model: `cpu` (default, avoids VRAM pressure) or `gpu` (faster, ~3 GB VRAM). |
| `whisper_model` | DROPDOWN | Whisper model size: `large-v3` (default, most accurate), `medium`, `small`, `base`, `tiny`. |
| `sam3_max_objects` | INT | Max objects SAM3 tracks per frame (1–20, default 5). Lower = less VRAM. |
| `sam3_det_threshold` | FLOAT | Minimum detection confidence for SAM3 (0.0–1.0, default 0.70). Higher = fewer objects. |
| `mask_output_type` | DROPDOWN | `black_white` (raw mask for compositing) or `colored_overlay` (SAM3-style preview). |
| `mask_points` | STRING | JSON point selection data from Load Video Path's Point Selector. Guides SAM3 with click-to-select. |
Batch Processing Inputs
| Input | Type | Description |
|---|---|---|
| `batch_mode` | BOOLEAN | Process all matching videos in `video_folder` with the same prompt. Single LLM call. |
| `video_folder` | STRING | Folder containing videos to batch process. |
| `file_pattern` | DROPDOWN | File pattern to match (`*.mp4`, `*.mov`, `*.*`, etc.). |
| `max_concurrent` | INT | Maximum simultaneous encodes in batch mode (1–16, default 4). |
| Output | Description |
|---|---|
| `images` | All frames from the output video as a batched image tensor |
| `audio` | Audio extracted from output video (or passed through from `audio_a`) |
| `video_path` | Absolute path to the rendered output video file |
| `command_log` | The ffmpeg command(s) that were executed |
| `analysis` | LLM interpretation, pipeline steps, and warnings |
FFMPEGA Effects Builder — Compose video effects visually without an LLM.
Select up to 3 skills with parameters, add raw FFmpeg filters, and use presets. Outputs a pipeline JSON that connects to the FFMPEG Agent's pipeline_json input.
Built-in presets: 🎬 Cinematic Look, 📼 VHS Retro, 🎥 High Quality Export, 🎵 Clean Audio, ✨ Glow + Saturation, 🎙️ Auto Subtitles, 🎤 Karaoke Subtitles, 🌑 Fade In + Out, 🪞 Mirror Horizontal, 🎬 Ken Burns Zoom.
| Input | Type | Description |
|---|---|---|
| `preset` | DROPDOWN | Quick-start preset. Auto-fills effect slots and params. Set to `none` to build your own. |
| `effect_1` | DROPDOWN | First effect. Categorized by type (🎨 Visual, ⏱️ Temporal, 📐 Spatial, 🔊 Audio, 📦 Encoding, ✨ Outcome). |
| `effect_1_params` | STRING | JSON parameters for effect 1 (e.g. `{"strength": 5}`). Auto-filled from defaults. |
| `effect_2` | DROPDOWN | Second effect (chained after effect 1). |
| `effect_2_params` | STRING | JSON parameters for effect 2. |
| `effect_3` | DROPDOWN | Third effect (chained after effect 2). |
| `effect_3_params` | STRING | JSON parameters for effect 3. |
| `raw_ffmpeg` | STRING | Raw FFmpeg `-vf` filter string applied after skill effects. |
| `sam3_target` | STRING | SAM3 text target — apply effects only to the masked region. Leave empty for full-frame. |
| `sam3_effect` | DROPDOWN | Effect for SAM3-detected region: `blur`, `pixelate`, `remove`, `edit`, `grayscale`, `highlight`, `greenscreen`, `transparent`. |

| Output | Description |
|---|---|
| `pipeline_json` | JSON pipeline — connect to FFMPEG Agent's `pipeline_json` input |
Frame Extract (FFMPEGA) — Extract individual frames from a video as image tensors.
| Input | Type | Description |
|---|---|---|
| `video_path` | STRING | Absolute path to the video to extract frames from. |
| `fps` | FLOAT | Extraction rate (0.1–60.0). 1.0 = one frame per second. |
| `start_time` | FLOAT | (optional) Start time in seconds (default: 0). |
| `duration` | FLOAT | (optional) Duration to extract from, in seconds (default: 10). |
| `max_frames` | INT | (optional) Max frames to return (default: 100, max: 1000). |

| Output | Description |
|---|---|
| `frames` | Extracted video frames as a batched image tensor |
Tip: Connect the `frames` output to the FFMPEG Agent's `images_a` input to build pipelines that analyze frames before editing.
Load Image Path (FFMPEGA) — Zero-memory image loader, outputs a file path instead of a tensor.
Outputs the image file path as a STRING instead of decoding into a ~6 MB IMAGE tensor. Connect to FFMPEGA Agent's image_path_a / image_path_b / … slots so ffmpeg reads the file directly.
| Input | Type | Description |
|---|---|---|
| `image` | FILE PICKER | Select an image from ComfyUI's input directory or upload a new one. |

| Output | Description |
|---|---|
| `image_path` | Absolute file path to the selected image |
Load Video Path (FFMPEGA) — Zero-memory video input with inline preview and metadata.
Validates the video file exists and outputs the path as a STRING — loads ZERO frames into memory. Features inline video preview, metadata display (fps, duration, resolution), and VHS-style trim parameters.
| Input | Type | Description |
|---|---|---|
| `video` | FILE PICKER | Select or upload a video file. |
| `force_rate` | FLOAT | Override FPS (0 = use source). |
| `skip_first_frames` | INT | Frames to skip from start. |
| `frame_load_cap` | INT | Max frames to use (0 = all). |
| `select_every_nth` | INT | Select every Nth frame (1 = every frame). |

| Output | Description |
|---|---|
| `video_path` | Validated video file path |
| `frame_count` | Total usable frames after trim |
| `fps` | Effective FPS |
| `duration` | Effective duration in seconds |
Save Video (FFMPEGA) — Zero-memory video output with inline preview.
Takes a video path (from FFMPEGA Agent or Load Video Path), copies the file to ComfyUI's output directory, and shows a preview. No re-encoding — just a file copy.
| Input | Type | Description |
|---|---|---|
| `video_path` | STRING | Path to video file (typically from FFMPEGA Agent's output). |
| `filename_prefix` | STRING | Prefix for saved filename. Supports `%date:yyyy-MM-dd%`. |
| `overwrite` | BOOLEAN | (optional) Overwrite existing file vs auto-increment counter. |
Text Input (FFMPEGA) — Flexible text input for subtitles, overlays, and watermarks.
Auto-detects whether text is SRT subtitles, a short watermark, or overlay text. Outputs JSON-encoded metadata that the FFMPEGA Agent node parses for burn_subtitles, text_overlay, or watermark skills.
Right-click Presets: 10 built-in presets with example text (SRT Subtitle Example, Cinematic Subtitles, Bold Watermark, Title Card, Social Caption, Meme Text, Lower Third, Copyright Notice, Credits Roll, Chapter Marker). Save/load/delete custom presets. "Clear Text" resets all fields to defaults.
No-LLM Text Mode: Connect a Text node to the Agent's text_a input in no-LLM manual mode (without an Effects Builder). The Agent auto-generates a text overlay or subtitle pipeline from the Text node's mode, position, font size, and color settings.
| Input | Type | Description |
|---|---|---|
| `text` | STRING | Text content — plain text, multi-line subtitles, or full SRT format with timestamps. |
| `auto_mode` | BOOLEAN | (optional) Auto-detect mode from content (default: on). |
| `mode` | DROPDOWN | (optional) Override mode: `subtitle`, `overlay`, `watermark`, `title_card`, `raw`. |
| `position` | DROPDOWN | (optional) Text placement: `center`, `top`, `bottom`, `bottom_right`, etc. `auto` = mode default. |
| `font_size` | INT | (optional) Font size in px (0 = auto: 24 subtitle, 48 overlay, 20 watermark). |
| `font_color` | STRING | (optional) Text color as hex (`#RRGGBB`, default: white). |
| `start_time` | FLOAT | (optional) Start time in seconds (default: 0). |
| `end_time` | FLOAT | (optional) End time in seconds (-1 = full duration). |

| Output | Description |
|---|---|
| `text_output` | JSON-encoded text with metadata — connect to `text_a`, `text_b`, etc. on the FFMPEGA Agent node |
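The auto-detection described above can be approximated with a small heuristic. This sketch is illustrative only — the regex and length thresholds are guesses at the idea, not FFMPEGA's actual detection logic:

```python
import re

# SRT cue timing looks like "00:00:01,000 --> 00:00:03,000".
SRT_TIMESTAMP = re.compile(
    r"\d{2}:\d{2}:\d{2},\d{3}\s*-->\s*\d{2}:\d{2}:\d{2},\d{3}"
)

def detect_text_mode(text: str) -> str:
    """Rough subtitle / watermark / overlay classification (illustrative)."""
    if SRT_TIMESTAMP.search(text):
        return "subtitle"   # SRT-style timestamps present
    if len(text) <= 20 and "\n" not in text:
        return "watermark"  # short single-line text
    return "overlay"        # everything else becomes overlay text
```

With `auto_mode` on, something along these lines would route an SRT block to `burn_subtitles`, a short brand string to `watermark`, and longer free text to `text_overlay`; the `mode` dropdown overrides the guess.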
Video to Path (FFMPEGA) — Convert IMAGE tensor to a temp video file path.
Bridge node between Load Video nodes (which output IMAGE tensors) and FFMPEGA Agent's video_a/video_b/video_c slots. Encodes frames to a temp video file, then releases the tensor so ComfyUI can free the memory.
| Input | Type | Description |
|---|---|---|
| `images` | IMAGE | Video frames from a Load Video node. |
| `fps` | INT | (optional) FPS for the output video (default: 24). |
| Output | Description |
|---|---|
| `video_path` | File path to the temp video |
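One common way to bridge an IMAGE tensor to a file, as this node does, is to pipe raw RGB frames into ffmpeg. The sketch below only builds the command line; the function name and exact flags are assumptions, not FFMPEGA's actual implementation:

```python
import subprocess

def build_ffmpeg_command(width: int, height: int, fps: int, out_path: str) -> list[str]:
    """Build an ffmpeg command that reads raw RGB24 frames from stdin."""
    return [
        "ffmpeg", "-y",
        "-f", "rawvideo", "-pix_fmt", "rgb24",
        "-s", f"{width}x{height}", "-r", str(fps),
        "-i", "-",                                  # frames arrive on stdin
        "-c:v", "libx264", "-pix_fmt", "yuv420p",   # widely compatible output
        out_path,
    ]

cmd = build_ffmpeg_command(1280, 720, 24, "/tmp/ffmpega_clip.mp4")
# frames_bytes would be the flattened uint8 tensor data:
# subprocess.run(cmd, input=frames_bytes, check=True)
```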
FFMPEGA includes a comprehensive skill system with 200+ operations organized into categories. Use them in two ways: let the AI agent select skills from your prompt, or pick them yourself with the Effects Builder — no LLM needed.
📄 See SKILLS_REFERENCE.md for the complete skill reference with all parameters and example prompts.
🧪 See SKILL_TEST_PROMPTS.md for ready-to-use copy-and-paste test prompts for every skill.
🎨 Visual Effects (30 skills)
| Skill | Description |
|---|---|
| `brightness` | Adjust brightness (-1.0 to 1.0) |
| `contrast` | Adjust contrast (0.0 to 3.0) |
| `saturation` | Adjust color saturation (0.0 to 3.0) |
| `hue` | Shift color hue (-180 to 180) |
| `sharpen` | Increase sharpness |
| `blur` | Apply blur effect |
| `denoise` | Reduce noise/grain (light, medium, strong) |
| `vignette` | Darken edges for cinematic focus |
| `fade` | Fade in/out to black |
| `colorbalance` | Adjust shadows/midtones/highlights |
| `noise` | Add film grain |
| `curves` | Apply color curve presets (vintage, cross_process, etc.) |
| `text_overlay` | Add text with position, color, size, font |
| `invert` | Invert colors (photo negative) |
| `edge_detect` | Edge detection / sketch look |
| `pixelate` | Mosaic / 8-bit pixel effect |
| `gamma` | Gamma correction |
| `exposure` | Exposure adjustment |
| `chromakey` | Green screen removal |
| `colorkey` | Key out any arbitrary color and replace with a background |
| `colorhold` | Keep only a selected color, desaturate everything else (spot color) |
| `lumakey` | Key out regions based on brightness (luma) |
| `despill` | Remove green/blue color spill from chroma-keyed edges |
| `deband` | Remove color banding artifacts |
| `white_balance` | Adjust color temperature (2000K–12000K) |
| `shadows_highlights` | Separately adjust shadows and highlights |
| `split_tone` | Warm highlights, cool shadows |
| `deflicker` | Remove fluorescent/timelapse flicker |
| `unsharp_mask` | Fine-grained luma/chroma sharpening |
| `remove_background` | Remove backgrounds using AI (rembg) |
⏱️ Temporal (9 skills)
| Skill | Description |
|---|---|
| `trim` | Cut a segment by time |
| `speed` | Change playback speed (0.1x to 10x) |
| `reverse` | Play backwards |
| `loop` | Repeat video |
| `fps` | Change frame rate (1 to 120) |
| `scene_detect` | Auto-detect scene changes |
| `silence_remove` | Remove silent segments |
| `time_remap` | Gradual speed ramp |
| `freeze_frame` | Freeze a frame at a timestamp |
📐 Spatial (8 skills)
| Skill | Description |
|---|---|
| `resize` | Scale to specific dimensions |
| `crop` | Crop video region |
| `rotate` | Rotate by degrees |
| `flip` | Mirror horizontal/vertical |
| `pad` | Add padding / letterbox |
| `aspect` | Change aspect ratio (16:9, 4:3, 1:1, 9:16, 21:9) |
| `auto_crop` | Detect and remove black borders |
| `scale_2x` | Quick upscale with algo choice (2x, 4x) |
🔊 Audio (27 skills)
| Skill | Description |
|---|---|
| `volume` | Adjust audio level |
| `normalize` | Normalize loudness |
| `fade_audio` | Audio fade in/out |
| `remove_audio` | Strip all audio |
| `extract_audio` | Extract audio only |
| `bass` / `treble` | Boost/cut frequencies |
| `pitch` | Shift pitch by semitones |
| `echo` | Add echo / reverb |
| `equalizer` | Adjust specific frequency band |
| `stereo_swap` | Swap L/R channels |
| `mono` | Convert to mono |
| `audio_speed` | Change audio speed only |
| `chorus` | Chorus thickening effect |
| `flanger` | Sweeping jet flanger |
| `lowpass` / `highpass` | Frequency filters |
| `audio_reverse` | Reverse audio track |
| `compress_audio` | Dynamic range compression |
| `noise_reduction` | Remove background noise |
| `audio_crossfade` | Smooth audio crossfade |
| `audio_delay` | Add delay/offset to audio |
| `ducking` | Audio dynamic compression |
| `dereverb` | Remove room echo/reverb |
| `split_audio` | Extract left/right channel |
| `audio_normalize_loudness` | EBU R128 loudness normalization |
| `replace_audio` | Replace original audio track |
| `mix_audio` | Mix/blend audio tracks from two inputs (both audible) |
| `audio_bitrate` | Set audio encoding bitrate |
📦 Encoding (13 skills)
| Skill | Description |
|---|---|
| `compress` | Reduce file size (light, medium, heavy) |
| `convert` | Change codec (h264, h265, vp9, av1) |
| `quality` | Set CRF and encoding preset |
| `bitrate` | Set video/audio bitrate |
| `web_optimize` | Fast-start for web streaming |
| `container` | Change format (mp4, mkv, avi, mov, webm) |
| `pixel_format` | Set pixel format (yuv420p, yuv444p, etc.) |
| `hwaccel` | Hardware acceleration (cuda, vaapi, qsv) |
| `audio_codec` | Set audio codec (aac, mp3, opus, flac) |
| `frame_rate_interpolation` | Motion-interpolated FPS conversion |
| `frame_interpolation` | Smooth slow motion via motion interpolation |
| `two_pass` | Two-pass encoding for better quality |
| `hls_package` | HLS adaptive streaming packaging |
🎬 Cinematic Presets (14 skills)
| Skill | Description |
|---|---|
| `cinematic` | Hollywood film look — teal-orange grading |
| `blockbuster` | Michael Bay style — high contrast, dramatic |
| `documentary` | Clean, natural documentary look |
| `indie_film` | Indie art-house — faded, low contrast |
| `commercial` | Bright, clean corporate video |
| `dream_sequence` | Dreamy, soft, ethereal atmosphere |
| `action` | Fast-paced action movie grading |
| `romantic` | Soft, warm romantic mood |
| `sci_fi` | Cool blue sci-fi atmosphere |
| `dark_moody` | Dark, atmospheric, moody feel |
| `color_grade` | Cinematic color grading (teal_orange, warm, cool) |
| `color_temperature` | Adjust color temperature (warm/cool) |
| `letterbox` | Cinematic widescreen letterbox bars |
| `film_grain` | Film grain texture (light, medium, heavy) |
📼 Vintage & Retro (9 skills)
| Skill | Description |
|---|---|
| `vintage` | Classic old film look (50s–90s) |
| `vhs` | VHS tape aesthetic |
| `sepia` | Classic sepia/brown tone |
| `super8` | Super 8mm film look |
| `polaroid` | Polaroid instant photo |
| `faded` | Washed-out, faded look |
| `old_tv` | CRT television aesthetic |
| `damaged_film` | Aged/weathered film |
| `noir` | Film noir — B&W, high contrast |
📱 Social Media (9 skills)
| Skill | Description |
|---|---|
| `social_vertical` | TikTok / Reels / Shorts (9:16) |
| `social_square` | Instagram feed (1:1) |
| `youtube` | YouTube optimized |
| `twitter` | Twitter/X optimized |
| `gif` | Convert to animated GIF |
| `thumbnail` | Extract thumbnail frame |
| `caption_space` | Add space for captions |
| `watermark` | Overlay logo/watermark |
| `intro_outro` | Add intro/outro segments |
✨ Creative Effects (14 skills)
| Skill | Description |
|---|---|
| `neon` | Neon glow — vibrant edges and colors |
| `horror` | Dark, desaturated, grainy horror atmosphere |
| `underwater` | Blue tint, blur, darker underwater look |
| `sunset` | Golden hour warm glow |
| `cyberpunk` | Neon tones, high contrast cyberpunk |
| `comic_book` | Bold colors, comic/pop art style |
| `miniature` | Tilt-shift toy model effect |
| `surveillance` | Security camera / CCTV look |
| `music_video` | Punchy colors, contrast, vignette |
| `anime` | Anime / cel-shaded cartoon |
| `lofi` | Lo-fi chill aesthetic |
| `thermal` | Thermal / heat vision camera |
| `posterize` | Reduce color palette / screen-print |
| `emboss` | Emboss / relief surface effect |
🧪 Special Effects (36 skills)
| Skill | Description |
|---|---|
| `meme` | Deep-fried meme aesthetic |
| `glitch` | Digital glitch / databend |
| `mirror` | Mirror / kaleidoscope effect |
| `slow_zoom` | Slow push-in zoom |
| `black_and_white` | B&W with style options |
| `day_for_night` | Simulate nighttime from daytime |
| `dreamy` | Soft, ethereal dream look |
| `hdr_look` | Simulated HDR dynamic range |
| `datamosh` | Glitch art / motion vector visualization |
| `radial_blur` | Radial / zoom blur effect |
| `grain_overlay` | Cinematic film grain with intensity control |
| `burn_subtitles` | Hardcode subtitles |
| `selective_color` | Isolate specific colors |
| `perspective` | Perspective transform |
| `lut_apply` | Apply LUT color grading |
| `lens_correction` | Fix lens distortion |
| `fill_borders` | Fill black borders |
| `deshake` | Quick stabilization |
| `deinterlace` | Remove interlacing |
| `halftone` | Newspaper dot pattern |
| `false_color` | Pseudocolor heat map |
| `frame_blend` | Temporal frame blending |
| `tilt_shift` | Tilt-shift miniature effect |
| `color_channel_swap` | Color channel remapping |
| `ghost_trail` | Temporal motion trails |
| `glow` | Bloom / soft glow effect |
| `sketch` | Pencil drawing / ink line art |
| `chromatic_aberration` | RGB channel offset / color fringing |
| `boomerang` | Looping boomerang effect |
| `ken_burns` | Slow zoom pan for photos |
| `slowmo` | Smooth slow motion |
| `stabilize` | Remove camera shake |
| `timelapse` | Dramatic speed-up for timelapse |
| `zoom` | Zoom in/out effect |
| `scroll` | Scroll video vertically/horizontally |
| `monochrome` | Monochrome with optional tint |
🎬 Transitions (3 skills)
| Skill | Description |
|---|---|
| `fade_to_black` | Fade in from + fade out to black |
| `fade_to_white` | Fade in from + fade out to white |
| `flash` | Camera flash at a specific timestamp |
🌀 Motion (5 skills)
| Skill | Description |
|---|---|
| `spin` | Continuous animated rotation |
| `shake` | Camera shake / earthquake (light, medium, heavy) |
| `pulse` | Rhythmic breathing zoom effect |
| `bounce` | Vertical bouncing animation |
| `drift` | Slow cinematic pan (left, right, up, down) |
🔮 Reveal Effects (3 skills)
| Skill | Description |
|---|---|
| `iris_reveal` | Circle expanding from center |
| `wipe` | Directional wipe from black |
| `slide_in` | Slide video in from edge |
🎵 Audio Visualization (1 skill)
| Skill | Description |
|---|---|
| `waveform` | Audio waveform overlay (line, point, cline modes) |
🔗 Multi-Input & Composition (10 skills)
| Skill | Description |
|---|---|
| `grid` | Arrange video + images in a grid layout (xstack). Auto-includes video as first cell. |
| `slideshow` | Create slideshow from images with fade transitions. Optionally starts with the main video. |
| `overlay_image` | Picture-in-picture / watermark overlay. Supports multiple overlays auto-placed in corners. Accepts animation=bounce for motion. |
| `concat` | Concatenate video segments sequentially. Connect multiple videos/images to join them. |
| `xfade` | Smooth transitions between segments — 18 types: fade, dissolve, wipe, pixelize, radial, etc. |
| `split_screen` | Side-by-side (horizontal) or top-bottom (vertical) multi-video layout. |
| `animated_overlay` | Moving image overlay with motion presets: scroll, float, bounce, slide. |
✏️ Text & Graphics (9 skills)
| Skill | Description |
|---|---|
| `animated_text` | Animated text overlay |
| `scrolling_text` | Scrolling credits-style text |
| `ticker` | News-style scrolling ticker bar |
| `lower_third` | Professional broadcast lower third |
| `countdown` | Countdown timer overlay |
| `typewriter_text` | Typewriter reveal effect |
| `bounce_text` | Bouncing animated text |
| `fade_text` | Text that fades in and out |
| `karaoke_text` | Karaoke-style fill text |
✂️ Editing & Delivery (13 skills)
| Skill | Description |
|---|---|
| `picture_in_picture` | PiP overlay window with optional border |
| `blend` | Blend two video inputs |
| `delogo` | Remove logo from a region |
| `remove_dup_frames` | Strip duplicate/stuttered frames |
| `mask_blur` | Blur a rectangular region for privacy |
| `extract_frames` | Export frames as image sequence |
| `jump_cut` | Auto-cut to high-energy moments |
| `beat_sync` | Sync cuts to a beat interval |
| `color_match` | Auto histogram equalization |
| `extract_subtitles` | Extract subtitle track |
| `preview_strip` | Filmstrip preview of key frames |
| `sprite_sheet` | Contact sheet of frames |
🤖 AI-Powered (4 skills)
| Skill | Description |
|---|---|
| `auto_transcribe` | Transcribe audio with Whisper AI and burn SRT subtitles |
| `karaoke_subtitles` | Word-by-word karaoke subtitles with progressive color fill (Whisper) |
| `auto_mask` | SAM3-powered object segmentation from text prompts |
| `generate_audio` | AI-generate synchronized audio/foley from video + text (MMAudio) |
⚠️ License Notice: The `generate_audio` skill uses MMAudio model weights, which are licensed under CC-BY-NC 4.0 (non-commercial use only). Model weights are downloaded on first use — by downloading them you accept the CC-BY-NC 4.0 license. The FFMPEGA code itself remains GPL-3.0.
Beyond the 200+ editing skills, the agent has built-in tools for analyzing media and making better decisions. In LLM mode, the agent calls these autonomously based on your prompt. Some (like analyze_video and search_skills) are also invoked directly by internal skills, the Effects Builder, and no-LLM modes.
🔍 Analysis & Discovery
| Tool | What It Does |
|---|---|
| `analyze_video` | Probes resolution, duration, codec, FPS, bitrate — the agent calls this to understand your source |
| `extract_frames` | Extracts PNG frames for vision models to "see" the video content |
| `analyze_colors` | Numeric color metrics (luminance, saturation, color balance) via ffprobe signalstats — guides color grading decisions without vision |
| `analyze_audio` | Numeric audio metrics (volume dB, EBU R128 loudness LUFS, silence detection) — guides audio effect decisions |
| `search_skills` | Searches skills by keyword — the agent always calls this to find the right skills |
| `list_luts` | Lists available LUT files for color grading — called before lut_apply to discover available looks |
🎨 LUT Color Grading System
8 bundled LUT files for cinematic color grading. The agent discovers these via list_luts and applies them with the lut_apply skill.
| LUT | Style |
|---|---|
| `cinematic_teal_orange` | Hollywood teal-orange grade |
| `warm_vintage` | Warm retro film look |
| `cool_scifi` | Cool blue sci-fi tone |
| `film_noir` | Classic noir — desaturated, crushed |
| `golden_hour` | Warm golden sunlight |
| `cross_process` | Cross-processed film chemistry |
| `bleach_bypass` | Bleach bypass — low saturation, high contrast |
| `neutral_clean` | Subtle clarity enhancement |
Adding your own LUTs: Drop .cube or .3dl files into the luts/ folder. The agent will discover them via list_luts automatically. Short names auto-resolve to full paths (e.g., cinematic_teal_orange → luts/cinematic_teal_orange.cube).
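The short-name resolution rule (e.g. cinematic_teal_orange → luts/cinematic_teal_orange.cube) can be sketched as below. This is an assumption about the mechanism — `resolve_lut` is an illustrative name, and the real code may also scan the folder for .3dl matches:

```python
from pathlib import Path

LUT_DIR = Path("luts")
LUT_EXTENSIONS = (".cube", ".3dl")

def resolve_lut(name: str) -> str:
    """Resolve a short LUT name to a file path under luts/."""
    if name.endswith(LUT_EXTENSIONS):
        return str(LUT_DIR / name)          # already has an extension
    return str(LUT_DIR / f"{name}.cube")    # default to .cube

resolve_lut("cinematic_teal_orange")  # ends with "cinematic_teal_orange.cube"
```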
✅ Output Verification Loop
When verify_output is enabled (default: On), the agent inspects its own output after execution:
- Extracts frames from the output video
- Runs color and/or audio analysis on the result
- Sends analysis to the LLM with the original prompt for quality assessment
- If the LLM detects issues, it auto-corrects the pipeline and re-executes once
This closes the feedback loop — the agent can catch and fix mistakes like wrong color grades, failed effects, or audio issues without re-queuing.
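The loop above can be sketched as a small control structure. All four callables are hypothetical stand-ins for the real frame-extraction, analysis, and LLM steps — this shows the shape of the loop, not FFMPEGA's actual code:

```python
def verify_and_correct(run_pipeline, analyze_output, assess, max_retries=1):
    """Run a pipeline, check its output, and re-run once if issues are found.

    run_pipeline(pipeline=None) -> output, analyze_output(output) -> metrics,
    assess(metrics) -> (ok, corrected_pipeline) are injected stand-ins.
    """
    output = run_pipeline()
    for _ in range(max_retries):
        analysis = analyze_output(output)          # frames + color/audio metrics
        ok, corrected_pipeline = assess(analysis)  # LLM quality judgment
        if ok:
            break
        output = run_pipeline(corrected_pipeline)  # single auto-correction pass
    return output
```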
🧩 Custom Skills
Create your own skills via YAML — no Python required. Drop a .yaml file in custom_skills/, restart ComfyUI, and the agent can use it immediately.
# custom_skills/dreamy_blur.yaml
name: dreamy_blur
description: "Soft dreamy blur with glow"
category: visual
tags: [dream, blur, soft, glow]
parameters:
radius:
type: int
default: 5
min: 1
max: 30
ffmpeg_template: "gblur=sigma={radius},eq=brightness=0.06"

Skill packs — installable collections of related skills, optionally with Python handlers for complex logic:
# Linux / macOS / Windows (Git Bash or PowerShell)
cd custom_skills/
git clone https://github.com/someone/ffmpega-retro-pack retro-pack

Two example skills ship in custom_skills/examples/ — warm_glow.yaml (template) and film_burn.yaml (pipeline composite).
📄 See CUSTOM_SKILLS.md for the full schema reference, skill pack structure, Python handlers, and advanced examples.
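As a rough illustration of how a YAML skill's ffmpeg_template could be expanded: clamp each parameter to its declared min/max, then substitute it into the template. The str.format-style expansion and `render_template` name are assumptions; the real engine may differ:

```python
def render_template(template: str, params: dict, spec: dict) -> str:
    """Clamp parameters to their declared min/max, then fill the template."""
    clamped = {}
    for name, value in params.items():
        meta = spec.get(name, {})
        lo, hi = meta.get("min", value), meta.get("max", value)
        clamped[name] = max(lo, min(hi, value))
    return template.format(**clamped)

render_template(
    "gblur=sigma={radius},eq=brightness=0.06",
    {"radius": 50},                          # out of range, clamped to 30
    {"radius": {"min": 1, "max": 30}},
)  # "gblur=sigma=30,eq=brightness=0.06"
```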
Tested with: FFMPEGA has been primarily tested using Gemini CLI and Qwen3 8B (via Ollama). Results may vary with other models.
Author's pick: The CLI connectors (especially Gemini CLI) have been the most reliable option in my experience — they handle tool-calling, structured output, and long context exceptionally well. Highly recommended if you have access.
FFMPEGA works best with models that have strong JSON output and instruction-following abilities. The agent sends a structured prompt and expects a valid JSON pipeline back — models with tool-calling or function-calling capabilities tend to perform best.
Things to keep in mind:
- Some models work better than others — larger models and those trained for structured output (JSON/tool-calling) produce more reliable results
- Some models may need more retries — if the agent fails to parse the response, try running the same prompt again. Smaller models occasionally return malformed JSON on the first try
- Find what works best for you — experiment with different models to find the right balance of speed, quality, and reliability for your hardware
The default option. Runs locally, no API key needed.
Install Ollama: Download from ollama.com/download (available for Linux, macOS, and Windows).
# Start Ollama (all platforms)
ollama serve
# Pull a model
ollama pull qwen3:8b

Windows users: After installing Ollama, the `ollama` command is available in both PowerShell and Command Prompt. Ollama also runs as a system tray app.
Recommended local models (≤30B, consumer GPU friendly):
| Model | Size | Speed | Notes |
|---|---|---|---|
| `qwen3:8b` | 8B | ⚡ Fast | Tested — excellent structured output, native tool-calling |
| `qwen3-vl` | 8B | ⚡ Fast | Tested — multimodal vision-language model, sees video frames |
| `qwen3:14b` | 14B | ⚡ Fast | Sweet spot of speed and quality, tools + thinking tags |
| `qwen3:30b` | 30B | 🔄 Medium | Best Qwen under 30B, needs 16GB+ VRAM |
| `qwen2.5:14b` | 14B | ⚡ Fast | Top IFEval scores, strong instruction following |
| `deepseek-r1:14b` | 14B | ⚡ Fast | Reasoning model — verifies its own tool calls, very reliable |
| `mistral-nemo` | 12B | ⚡ Fast | NVIDIA + Mistral collab, 128k context, great reasoning |
| `mistral-small3.2` | 24B | 🔄 Medium | Native function calling, 128k context |
| `phi4` | 14B | ⚡ Fast | Microsoft reasoning SLM, rivals larger models in logic |
| `gemma3:12b` | 12B | ⚡ Fast | High reasoning scores, 128k context |
| `llama3.1:8b` | 8B | ⚡ Fast | Reliable tool-calling, large ecosystem |
Tip: On the Ollama library, look for models with a tools tag — this indicates native tool/function-calling support, which produces the best results with FFMPEGA.
llm_model: gpt-5.2
api_key: your-openai-key
llm_model: gemini-3-flash
api_key: your-google-ai-key
Use the Gemini CLI to run Gemini models without an API key. Works with any Google account.
Install:
| Platform | Command |
|---|---|
| Linux / macOS | npm install -g @google/gemini-cli |
| Windows (PowerShell) | npm install -g @google/gemini-cli |
Tip: You can also use `pnpm add -g` or `yarn global add` if you prefer.
Authenticate (first time only):
gemini

This opens a browser to sign in with your Google account.
Use in FFMPEGA:
llm_model: gemini-cli
No API key is needed — authentication is handled by the CLI. Select gemini-cli from the model dropdown in the node.
Note: The Gemini CLI runs as a subprocess and is sandboxed to the custom node directory for security. On Windows, `gemini.cmd` is also detected automatically.
| Plan | Rate Limit | Daily Limit | Models | Cost |
|---|---|---|---|---|
| Free (Google login) | 60 req/min | 1,000 req/day | Gemini model family (Pro + Flash) | Free |
| Free (API key only) | 10 req/min | 250 req/day | Flash only | Free |
| Code Assist Standard | 120 req/min | 1,500 req/day | Gemini model family | Paid |
| Code Assist Enterprise | 120 req/min | 2,000 req/day | Gemini model family | Paid |
| Google AI Pro | Higher | Higher | Full Gemini family | $19.99/mo |
Tip: Sign in with a Google account (free) for the best experience — 1,000 requests/day with access to Pro and Flash models. An unpaid API key limits you to 250/day on Flash only.
The Gemini CLI auto-selects the best model, but the following are available:
| Model | Best For |
|---|---|
| Gemini 2.5 Pro | Complex reasoning, creative tasks |
| Gemini 2.5 Flash | Fast responses, high throughput |
| Gemini 2.5 Flash-Lite | Maximum speed, lowest cost |
| Gemini 3 Pro | Most capable, advanced reasoning |
| Gemini 3 Flash | Fast + capable, good balance |
Free tier may auto-switch to Flash models when Pro quota is exhausted.
llm_model: claude-sonnet-4-6
api_key: your-anthropic-key
Use the Claude Code CLI as a local LLM backend. Uses its own authentication — no API key needed in FFMPEGA.
Install:
| Platform | Command |
|---|---|
| Linux / macOS | npm install -g @anthropic-ai/claude-code |
| Windows (PowerShell) | npm install -g @anthropic-ai/claude-code |
Tip: You can also use `pnpm add -g` or `yarn global add` if you prefer.
Authenticate (first time only):
claude

This opens a browser to sign in with your Anthropic account.
Use in FFMPEGA:
llm_model: claude-cli
Auto-detected on PATH. Select claude-cli from the model dropdown.
Use Cursor's CLI in agent mode as an LLM backend.
Install (all platforms):
Open Cursor IDE → Command Palette (Ctrl+Shift+P / Cmd+Shift+P) → "Install 'cursor' command"
Start the agent:

agent

Use in FFMPEGA:
llm_model: cursor-agent
Auto-detected on PATH. Select cursor-agent from the model dropdown.
Use Qwen Code as a free LLM backend. Powered by Qwen3-Coder with 2,000 free requests/day via OAuth — no credit card required.
Install:
| Platform | Command |
|---|---|
| Linux / macOS | npm install -g @qwen-code/qwen-code@latest |
| Windows (PowerShell) | npm install -g @qwen-code/qwen-code@latest |
Tip: You can also use `pnpm add -g` or `yarn global add` if you prefer.
Authenticate (first time only):
qwen

Select "Qwen OAuth (Free)" and follow the browser prompts to sign in.
Use in FFMPEGA:
llm_model: qwen-cli
Auto-detected on PATH. Select qwen-cli from the model dropdown.
When Frame Extraction is used, FFMPEGA saves extracted frames to a _vision_frames/ directory and passes the frame paths to the CLI agent. Agents with vision support can see and analyze the actual frame images to make better editing decisions.
| CLI Agent | Vision Support | Notes |
|---|---|---|
| Gemini CLI | ✅ Yes | read_file converts images to base64 for multimodal analysis |
| Claude Code CLI | ✅ Yes | Native image reading and description |
| Cursor Agent CLI | ✅ Yes | Native image reading and description |
| Qwen Code CLI | ❌ Not yet | Known issue — read_file returns raw binary instead of interpreting images. Vision is listed as a planned feature. |
Note: Agents without vision support still receive the frame file paths and can use video metadata (duration, resolution, FPS) from `analyze_video` to make editing decisions. When Qwen fixes their vision support upstream, it will work automatically since the frame paths are already passed correctly.
⚠️ Important: The `_vision_frames/` directory must not be listed in `.gitignore` or `.git/info/exclude` — CLI agents respect these ignore patterns and will be unable to read the frames. FFMPEGA's `cleanup_vision_frames()` automatically deletes the directory after each pipeline run.
Select custom from the model dropdown and type any model name in the custom_model field. The provider is auto-detected from the name:
| Prefix | Provider |
|---|---|
| `gpt-*` | OpenAI |
| `claude-*` | Anthropic |
| `gemini-*` | Google |
| Anything else | Ollama (local) |
This lets you use any new model immediately without waiting for a code update.
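The prefix rule in the table above amounts to a simple dispatch. A minimal sketch (`detect_provider` is an illustrative name, not FFMPEGA's actual function):

```python
def detect_provider(model: str) -> str:
    """Map a model-name prefix to a provider, per the table above."""
    if model.startswith("gpt-"):
        return "openai"
    if model.startswith("claude-"):
        return "anthropic"
    if model.startswith("gemini-"):
        return "google"
    return "ollama"  # anything else runs locally

detect_provider("gpt-4o")    # "openai"
detect_provider("qwen3:8b")  # "ollama"
```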
Your API keys are automatically scrubbed and never stored in output files:
- Error messages — keys are redacted before being shown in the UI (e.g. `****abcd`)
- Workflow metadata — ComfyUI embeds workflow data in output images/videos; FFMPEGA strips the `api_key` field from this metadata before saving
- HTTP errors — keys are removed from network error messages that might include auth headers
- Debug logs — `LLMConfig` redacts keys in all string representations
No configuration needed — this protection is always active when an API key is provided.
⚠️ Safety precaution: As with any software, always inspect your output files before sharing them publicly — in the unlikely event of a bug or edge case that bypasses the automatic scrubbing.
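The `****abcd` redaction format shown above can be sketched as a one-liner (an illustrative helper, not FFMPEGA's actual code):

```python
def redact_key(key: str) -> str:
    """Show only the last four characters of an API key, as in ****abcd."""
    if len(key) <= 4:
        return "****"  # too short to reveal anything safely
    return "****" + key[-4:]

redact_key("sk-proj-1234abcd")  # "****abcd"
```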
Monitor your LLM token consumption with opt-in usage tracking. Enable via two toggles on the node:
| Toggle | Default | What It Does |
|---|---|---|
| `track_tokens` | Off | Prints a formatted usage summary to the console after each run |
| `log_usage` | Off | Appends a JSON entry to usage_log.jsonl for cumulative tracking |
Token data sources by connector:
| Connector | Source | Estimated? |
|---|---|---|
| Ollama | Native API (`prompt_eval_count` / `eval_count`) | No |
| OpenAI / Gemini API | Native API (`usage` field) | No |
| Anthropic API | Native API (`usage.input_tokens`) | No |
| Gemini CLI | JSON output via `-o json` | No |
| Claude CLI | JSON output via `--output-format json` | No |
| Other CLIs | Character-based estimation (~4 chars/token) | Yes |
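The ~4 chars/token fallback for CLIs without native usage reporting is simple enough to sketch directly (the function name is illustrative):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate at ~4 characters per token -- the fallback
    used when a connector reports no native usage data."""
    return max(1, len(text) // 4)

estimate_tokens("Describe the edit pipeline in JSON.")  # 35 chars -> 8
```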
When enabled, the analysis output includes a usage breakdown:
Token Usage:
Prompt tokens: 4,200
Completion tokens: 1,800
Total tokens: 6,000
LLM calls: 5
Tool calls: 3
Elapsed: 12.4s
The usage_log.jsonl file stores one JSON object per run for historical analysis. It is gitignored by default.
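Since the log is one JSON object per line, cumulative totals fall out of a short aggregation. The `total_tokens` field name is an assumption based on the summary shown above:

```python
import io
import json

def total_tokens(jsonl_stream) -> int:
    """Sum total_tokens across all runs in a usage_log.jsonl-style stream."""
    return sum(json.loads(line)["total_tokens"]
               for line in jsonl_stream if line.strip())

log = io.StringIO('{"total_tokens": 6000}\n{"total_tokens": 1200}\n')
total_tokens(log)  # 7200
```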
FFMPEG Not Found
Ensure FFMPEG is installed and in your system PATH:
ffmpeg -version

Install FFMPEG:
| Platform | Command / Method |
|---|---|
| Ubuntu / Debian | sudo apt install ffmpeg |
| Arch / CachyOS | sudo pacman -S ffmpeg |
| Fedora | sudo dnf install ffmpeg |
| macOS | brew install ffmpeg |
| Windows (winget) | winget install Gyan.FFmpeg |
| Windows (choco) | choco install ffmpeg |
| Windows (scoop) | scoop install ffmpeg |
| Windows (manual) | Download from ffmpeg.org/download, extract, and add the bin/ folder to your system PATH |
Windows PATH tip: After installing, open a new terminal and run `ffmpeg -version` to verify. If not found, you may need to add ffmpeg's `bin/` directory to your system PATH manually: Settings → System → About → Advanced system settings → Environment Variables → Edit `Path`.
Ollama Connection Failed
Make sure Ollama is running:
ollama serve

If using a custom URL, set it in the node's ollama_url field.
Model Not Found
Pull the required model first:
ollama pull qwen3:8b

LLM Returns Empty Response
This usually means:
- The model is still loading (first request after start)
- The prompt is too long for the model's context window
- Try running the same prompt again
- Try a different model
Parameter Validation Errors
FFMPEGA auto-coerces types (float→int) and clamps out-of-range values. If you still see errors, try simplifying your prompt or using a more capable model.
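The coercion and clamping described above can be sketched as follows (`coerce_param` is an illustrative name; the node's real validation handles more cases):

```python
def coerce_param(value, expected_type, lo=None, hi=None):
    """Coerce a value to the expected type and clamp it into range."""
    if expected_type is int and isinstance(value, float):
        value = int(value)  # float -> int coercion
    if lo is not None:
        value = max(lo, value)
    if hi is not None:
        value = min(hi, value)
    return value

coerce_param(3.7, int, 0, 10)  # 3
coerce_param(15, int, 0, 10)   # 10
```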
Cancelling a Running Request
If the LLM is taking too long or you want to abort mid-request, close the ComfyUI terminal or restart ComfyUI instead of using the interrupt button. The interrupt button waits for the current LLM response to complete, which can take a while — closing/restarting ComfyUI kills it immediately.
Contributions are welcome! Whether it's bug reports, new skills, or improvements — your help is appreciated.
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
This project is licensed under the GPL-3.0 License — see the LICENSE file for details.
Developed by Æmotion Studio