Skip to content

AEmotionStudio/ComfyUI-FFMPEGA

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

78 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ComfyUI-FFMPEGA

The ultimate video editing suite for ComfyUI — edit with natural language or hands-on manual controls.

ComfyUI Version License Dependencies Downloads Visitors Clones Last Commit Activity

FFMPEGA Showcase

Use AI to describe edits in plain English, or take full manual control with the Effects Builder and text presets — no LLM required.

FeaturesExamplesInstallationQuick StartPrompt GuideSkillsLLM SetupTroubleshootingContributingChangelog


🚀 What's New in v2.9.1 (March 3, 2026)

AI Audio Generation, AI Lip Sync, FLUX Klein Editing, Modular Architecture, CI/CD & Type Safety

  • ✨ AI Object Removal & Editing (FLUX Klein 4B): New auto_mask:effect=remove and auto_mask:effect=edit powered by FLUX Klein 4B (Apache 2.0). AI-powered per-frame object removal replaces LaMa with higher-quality results. New edit effect enables text-guided video changes (e.g. "change hair to red", "replace background with beach"). Reference-image conditioning, 4-step inference, temporal smoothing, ~8–13 GB VRAM.

  • 🔊 AI Audio Generation (generate_audio): New skill powered by MMAudio — synthesizes synchronized audio/foley from video and/or text descriptions. Supports video-to-audio, text-to-audio, and automatic long video chunking with crossfade. 11 natural language aliases (foley, sound_effects, v2a, etc.).

  • 👄 AI Lip Sync (lip_sync): New skill powered by MuseTalk V15 — synchronizes lip movements to match provided audio. Video+audio and image+audio inputs, multi-face support, batch inference. Zero new pip dependencies. 7 aliases (lipsync, dub, dubbing, sync_lips, talking_head, lip_dub, voice_sync). Subprocess isolation for CUDA memory safety.

  • 🏗️ Modular Node Architecture: Broke up monolithic agent_node.py into 6 focused modules — input_resolver, execution_engine, output_handler, batch_processor, nollm_modes, and pipeline_assembler. Agent node is now a thin orchestrator.

  • 🔒 Pyright Type Checking: Added pyrightconfig.json with type stubs for ComfyUI modules. Fixed 78 type errors across core/, skills/, mcp/.

  • 🧱 Platform Abstraction (core/platform.py): All ComfyUI boundary interactions extracted into a single adapter module. Core logic no longer imports ComfyUI directly.

  • 📋 Structured Logging (core/logging.py): JSONFormatter for machine-readable logs, get_logger() convenience function, and LogTimer context manager. Enable via FFMPEGA_LOG_JSON=1.

  • 🧪 CI Workflow & 799 Tests: New .github/workflows/ci.yml with Pytest + Pyright on every PR/push. 12 new integration tests running real FFmpeg pipelines. Test suite expanded from 656 → 799 tests, 0 failures.

  • 📖 CONTRIBUTING.md & Pre-commit: New contributor docs with architecture overview and PR guidelines. Pre-commit hooks with Ruff lint/format and Pyright.

  • 🧠 Subprocess Isolation: Audio generation runs in a subprocess to prevent CUDA memory leaks — same proven pattern as SAM3. Falls back to in-process with VRAM offloading.

  • ⚡ Native Safetensors & Memory-Efficient Loading: Uses comfy.utils.load_torch_file for direct .safetensors loading, plus accelerate's zero-copy model init — no 3× memory spike during model loading.

  • 🪞 Model Mirror Repositories: MMAudio on AEmotionStudio/mmaudio-models (fp16, ~5.5 GB). MuseTalk UNet on AEmotionStudio/musetalk-models (fp16 1.6 GB / fp32 3.2 GB). Mirror-first download with upstream HuggingFace fallback.

  • 🔗 Mask Points Chaining: SaveVideoNode and LoadVideoPathNode now pass mask_points data through the node chain — upstream segmentation points propagate to downstream processing.

  • ♿ Point Selector Accessibility: ARIA roles, aria-modal, aria-live regions, and hidden decorative emojis for screen reader support.

  • 🔧 Bug Fixes: LaMa cache false-negative, log propagation leak, token usage display, platform import robustness.

Previous: v2.8.0 — Effects Builder, Manual Mode, SAM3 Hardening & 15+ Bug Fixes
  • 🏗️ Effects Builder Node: New companion node for manual effect composition — select up to 3 skills with params, combine with raw FFmpeg filters, and use presets. No LLM required.
  • 🎬 Effects Builder Presets: Right-click the Effects Builder for quick access to all 18 built-in presets, plus save/load/delete custom presets and a "Clear All Effects" reset.
  • 📝 Text Node Presets: Right-click the FFMPEGA Text node for 10 built-in presets (SRT Subtitle Example, Cinematic Subtitles, Watermark, Title Card, Social Caption, Meme Text, Lower Third, Copyright Notice, Credits Roll, Chapter Marker) with example text. Custom save/load/delete and "Clear Text" reset.
  • 💬 No-LLM Text Support: Connect a Text node to the Agent in no-LLM manual mode (without an Effects Builder) and it auto-generates a text overlay or subtitle pipeline.
  • 🏛️ Manual Mode (Default): New manual no-LLM mode — set llm_model to none and use the Effects Builder to edit videos without any AI. This is now the default no_llm_mode.
  • 🎙️ Whisper No-LLM Modes: transcribe and karaoke_subtitles in the no_llm_mode dropdown — run Whisper directly without an LLM. Also available as Effects Builder presets.
  • 🧠 SAM3 Subprocess Isolation: SAM3 now runs in a separate subprocess to prevent CUDA memory leaks. Multiple OOM fixes, point prompt support, and real-time progress streaming.
  • 🔒 MCP Security: Fixed path traversal vulnerabilities in MCP tools. SkillComposer parameter hardening.
  • 🔗 Effects Builder Multi-Input: Fixed concat, grid, xfade, and all multi-input skills in the Effects Builder. Extra inputs (video_b, etc.) now correctly injected.
  • 📐 Concat Resolution Fix: Concat/xfade/slideshow now default to input resolution instead of hardcoded 1920×1080.
  • 🔧 Dynamic Slot Root Cause Fix: video_b no longer disappears on page refresh.
  • ⚡ Performance: Ultrafast temp video encoding, optimized pipe buffers, SAM3 VRAM offloading.
  • 🐛 Template Placeholder Fix: Unsubstituted template placeholders no longer crash ffmpeg.
Previous: v2.7.1 — Advanced Options Toggle, Dynamic Input Fix & Default Refinements
  • ⚙️ Advanced Options Toggle: New Simple/Advanced toggle hides power-user settings (preview_mode, crf, encoding_preset, video_path, subtitle_path, batch_mode) behind a single switch. The node is now much more compact by default.
  • 🔧 Dynamic Input Persistence: Fixed dynamic input slots (e.g. video_b auto-appearing when video_a is connected) not restoring when loading saved workflows or images.
  • 🎛️ Default Refinements: Whisper defaults to CPU (avoids VRAM pressure), token tracking enabled by default, PTC mode and SAM3 CPU inputs removed from UI.
  • ⚠️ SAM3 Checkpoint Warnings: Detects wrong checkpoint format and logs reconversion instructions.
  • 🧹 Code Cleanup: Removed ~1100 lines of dead code, added ValidationError for path validation, token log rotation for usage_log.jsonl.
Previous: v2.7.0 — SAM3 Auto-Mask, LaMa Inpainting & PTC
  • 🎭 SAM3 Auto-Mask: New remove skill powered by Segment Anything Model 3 — describe what to remove in natural language (e.g. "remove the person") and SAM3 generates per-frame masks automatically.
  • 🖌️ LaMa Inpainting: AI-powered video object removal using LaMa (Large Mask Inpainting). Temporal Gaussian smoothing reduces frame-to-frame flicker. Falls back to black fill if LaMa is not installed.
  • 🟢 Greenscreen: New greenscreen skill — uses SAM3 masks to replace backgrounds with solid colors or transparency (WebM output).
  • ⚡ Programmatic Tool Calling (PTC): New execute_code tool lets LLMs write a single Python script that orchestrates multiple tool calls in one pass, reducing round-trips from ~6 to 1. Three modes: off, auto, on.
  • 🔒 PTC Sandbox: Hardened with 25+ escape vector blocks — dunder introspection, module access, traceback traversal, dynamic attribute access, and chr() construction bypasses.
  • 🧪 656 Tests: Expanded test suite from 516 → 656 with 0 failures. 34 new PTC executor tests, security hardening tests, and corrected mock contracts.
Previous: v2.6.5 — Whisper Auto-Transcription, Karaoke Subtitles & Letterbox Fixes
  • 🎙️ Auto-Transcribe: New auto_transcribe skill — transcribes video audio with OpenAI Whisper and burns SRT subtitles directly into the output. Supports multi-video concat with correct cross-clip timing.
  • 🎤 Karaoke Subtitles: New karaoke_subtitles skill — word-by-word progressive-fill karaoke effect using Whisper's word-level timestamps and ASS \kf tags.
  • ⚙️ Whisper Controls: Choose your Whisper model size (tinylarge-v3) and device (gpu/cpu) via node settings — trade off speed vs. accuracy, or offload to CPU on low-VRAM systems.
  • 📐 Letterbox Fix: Replaced crop+pad with drawbox for letterboxing, preserving video content. Handles both letterbox and pillarbox cases correctly.
  • 🧹 Code Dedup: Extracted shared ffmpeg_escape_path, color_to_ass_bgr, and _run_transcription helpers — 3 deduplication refactors reducing maintenance surface.
  • 🔧 9 Bug Fixes: Whisper memory leak, xfade subtitle timing, ASS escaping, hex color validation, aspect ratio div-by-zero, and more.
Previous: v2.6.0 — HandlerResult, Compose Decomposition & 516 Tests
  • 🏗️ HandlerResult Contract: All 9 handler modules now return a formal HandlerResult dataclass, replacing ad-hoc tuples. Backward-compatible with existing code.
  • 🔧 Compose Decomposition: Extracted 5 orchestration methods from the 600+ line compose() into pure, testable static methods.
  • 🎵 PiP Audio Mixing: picture_in_picture now supports audio_mix to blend both audio tracks via ffmpeg's amix.
  • 🔄 CLI Retry: CLI connectors retry on transient failures with exponential backoff (3 attempts).
  • 📝 TextInput Node: New node for subtitle and text overlay workflows with auto SRT detection.
  • 🔒 Security: Sanitized text overlay enable parameter, fixed path traversal on output dirs, fixed weak UUID entropy.
  • 🧪 516 Tests: Expanded test suite from 481 → 516 with 0 failures. New handler unit tests, skill combination tests, and orchestration helper tests.
Previous: v2.5.0 — PiP Overlay Fixes, VL Model Vision & Border Support
  • 🖼️ PiP Border Support: picture_in_picture now accepts border and border_color parameters.
  • 👁️ Ollama VL Auto-Embedding: Vision-language models automatically receive 3 video frames in the initial message.
  • 🔗 PiP Alias Fix: Models using pip, picture-in-picture, etc. now correctly resolve to picture_in_picture.
  • 🔧 Ollama VL Verification: Fixed 400 error when verifying output with Ollama VL models.
Previous: v2.4.0 — Pipeline Chaining, Animated Overlays & Zero-Memory Image Paths
  • 🖼️ Zero-Memory Image Paths: Image inputs passed as file paths instead of decoded tensors.
  • 🎯 Overlay Animation: overlay_image supports animation=bounce and more motion presets.
  • 🔗 Pipeline Chaining Fixes: Fixed filter graph chaining for multi-skill pipelines.
  • 🏗️ Handler Module Extraction: Skill handlers split into skills/handlers/ modules.
  • 🔒 Security Hardening: Extended FFMPEG parameter sanitization.
  • ⚡ Performance: frames_to_tensor pre-allocates memory.
Previous: v2.3.0 — Token Tracking, LUT Color Grading & Vision
  • 📊 Token Usage Tracking: Opt-in track_tokens and log_usage toggles — monitor prompt/completion tokens, LLM calls, tool calls, and elapsed time.
  • 🎨 LUT Color Grading: 8 bundled cinematic .cube LUT files. Drop custom LUTs into luts/ for automatic discovery.
  • 🖼️ Vision System: Multimodal frame analysis — the agent extracts frames and "sees" the video.
  • 🔊 Audio Analysis: Volume (dB), EBU R128 loudness (LUFS), and silence detection.
  • 🤖 Real Token Stats: Gemini CLI and Claude CLI return native token counts via JSON output.

📄 See CHANGELOG.md for the complete version history.


NotebookLM Overview: Exploring the features and capabilities of ComfyUI-FFMPEGA. (Click to watch on YouTube)

✨ Features

🗣️ Natural Language Editing

Describe edits in plain text: "Make it cinematic with a fade in", "Speed up 2x", "VHS look with grain". The AI agent interprets your prompt and builds the FFMPEG pipeline automatically.

🏗️ Manual Mode — No AI Required

Use the Effects Builder to visually compose up to 3 effects with parameters. Add text overlays and subtitles via preset-powered Text nodes. Full editing control with zero LLM dependency.

🤖 Multi-LLM Support

Works with Ollama (local, free), OpenAI, Anthropic, Google Gemini, and CLI tools (Gemini CLI, Claude Code, Cursor Agent, Qwen Code). Use any local model — Llama 3.1, Qwen3, Mistral, and more. Or skip the LLM entirely.

🎨 200+ Skills

200+ video editing skills across visual effects, audio processing, spatial transforms, temporal edits, encoding, cinematic presets, vintage looks, social media, creative effects, text animations, editing & composition, audio visualization, multi-input operations, transitions, concat, split screen, and AI-powered skills (Whisper transcription, SAM3 masking, MMAudio generation, MuseTalk lip sync).

🎨 Right-Click Presets

18 built-in Effects Builder presets and 10 Text node presets with example content. Save/load/delete your own custom presets. One-click clear to reset.

⚡ Batch & Preview

Process multiple videos with the same instruction. Generate quick low-res previews before committing to full renders. Quality presets from draft to lossless.


🎬 Examples

See what FFMPEGA can do — each example shows the prompt or preset used, the input clip, and the result.


4×4 Video Grid

Prompt: Concatenate these clips in a 4x4 grid

Before After

Before - 4x4 Grid

After - 4x4 Grid


Crossfade Transitions + Bouncing Logo

Prompt: Concatenate these clips with a crossfade transition between each, add a fade in at the start and fade out at the end and a bouncing image_path_a at 10% size and 30% opacity

Before After

Before - Crossfade

After - Crossfade


Color Grade + Text Overlay + Compression

Prompt: Color grade with the cinematic teal orange LUT, normalize audio, add a text "water" in the bottom right corner, compress for web at 720p

Before After

Before - Color Grade

After - Color Grade


Picture-in-Picture with Audio Mix

Prompt: Place video_b in the bottom-right corner at 25% size with a white border over video_a and mix the audio

Before After

Before - PiP

After - PiP


Vintage Film Look + Subtitles

Prompt: Normalize audio to -14 LUFS, add a warm vintage film look with grain overlay, and burn in these subtitles

Before After

Before - Vintage

After - Vintage


Datamosh Glitch Effects

Prompt: Apply datamosh glitch effect, add chromatic aberration with strong RGB split, pixelate slightly, add ghost trails

Before After

Before - Datamosh

After - Datamosh


Cinematic Teal & Orange

Prompt: Add a cinematic teal and orange color grade, apply a subtle vignette, and fade in from black

Before After

Before - Cinematic

After - Cinematic


Neon Glow Edge Detection

Prompt: Apply a neon glow edge detection effect, add chromatic aberration, and slow the video to 0.5x speed with smooth motion

Before After

Before - Neon Glow

After - Neon Glow


Colorhold Noir

Prompt: Use colorhold to keep only the red, desaturate everything else, boost contrast to 1.5, add a strong vignette, apply noir style

Before After

Before - Colorhold

After - Colorhold


Green Screen Removal

Prompt: Remove the green screen with chroma key, despill the green edges, sharpen slightly

Before After

Before - Green Screen

After - Green Screen


📦 Installation

Requirements

  • ComfyUI (latest)
  • Python 3.10+
  • FFMPEG installed and in PATH (install guide)
  • Node.js 18+ (required for CLI tools: Gemini CLI, Claude CLI, Qwen CLI — download)
  • Ollama (optional, for local LLM inference — download)

Option 1: ComfyUI Manager (Recommended)

  1. Open ComfyUI Manager
  2. Search for ComfyUI-FFMPEGA
  3. Click Install

Option 2: Manual Install

Linux / macOS
cd /path/to/ComfyUI/custom_nodes
git clone https://github.com/AEmotionStudio/ComfyUI-FFMPEGA.git
cd ComfyUI-FFMPEGA
pip install -r requirements.txt
Windows (PowerShell)
cd C:\path\to\ComfyUI\custom_nodes
git clone https://github.com/AEmotionStudio/ComfyUI-FFMPEGA.git
cd ComfyUI-FFMPEGA
pip install -r requirements.txt

Note: Use whichever Python package manager your ComfyUI venv uses (pip, uv pip, etc.). The above commands assume pip is available in your ComfyUI virtual environment.

Restart ComfyUI after installation.


🚀 Quick Start

  1. Add an FFMPEG Agent node to your workflow
  2. Connect a video path or use the input field
  3. Enter a natural language prompt
  4. Select your LLM model
  5. Run the workflow

💡 Tip: Right-click the FFMPEG Agent node to open the FFMPEGA Presets context menu — 200+ categorized effects you can apply with a single click, no prompt typing needed. Great for quick edits or discovering what's available.

Example Prompts

Prompt What It Does
"Make it cinematic with a vignette" Adds letterbox, color grade, and edge darkening
"Speed up 2x, keep the audio pitch" Doubles speed with pitch-corrected audio
"Make it look like old VHS footage" Adds noise, color shift, scan lines
"Trim first 5 seconds, resize to 720p" Cuts intro and scales down
"Underwater look with echo on audio" Blue tint, blur, and audio echo
"Pixelate it like an 8-bit game" Mosaic/pixel art effect
"Add 'Subscribe!' text at the bottom" Text overlay with positioning
"Cyberpunk style with neon glow" High-contrast neon aesthetic
"Normalize audio, compress for web" Loudness normalization + web optimization
"Spin the video clockwise" Continuous animated rotation
"Add camera shake" Random shake/earthquake effect
"Fade in from black and out at the end" Smooth intro/outro transitions
"Wipe reveal from the left" Directional wipe reveal animation
"Make it pulse like a heartbeat" Rhythmic zoom breathing effect
"Show audio waveform at the bottom" Audio visualization overlay
"Arrange these images in a grid" Multi-image grid collage
"Create a slideshow with fades" Image slideshow with transitions
"Overlay the logo in the corner" Picture-in-picture / watermark
"Create a side-by-side comparison" Video next to image in 2-column grid
"Create a slideshow starting with the video" Video first, then image slides
"Overlay images in the corners" Multiple images auto-placed in corners
"Split screen, use audio from audio_b" Side-by-side video with specific audio track
"Split screen, mix both audio tracks" Side-by-side with both audio tracks blended

💬 Prompt Guide

When using an LLM, the AI agent interprets your natural language and maps it to skills with specific parameters. Here's how to get the best results. (For manual editing without an LLM, see the Effects Builder and Text Input sections.)

Specifying Exact Values

You can request specific parameter values and the agent will use them directly:

Prompt What the Agent Does
"Set brightness to 0.3" brightness:value=0.3
"Blur with strength 20" blur:radius=20
"Speed up to 3x" speed:factor=3.0
"Crop to 1280x720" crop:width=1280,height=720
"Deband with threshold 0.3 and range 32" deband:threshold=0.3,range=32
"CRF 18, slow preset" quality:crf=18,preset=slow
"Fade in for 3 seconds" fade:type=in,duration=3

What Works Well ✅

  • Explicit numbers: "brightness 0.2", "speed 1.5x", "CRF 20" — the agent maps these directly
  • Named presets: "VHS look", "cinematic style", "noir" — triggers multi-step preset pipelines
  • Chaining operations: "Trim first 5 seconds, resize to 720p, add vignette" — executes in order
  • Descriptive goals: "Make it look warmer", "Remove the green screen" — the agent picks the right skills
  • Technical terms: "denoise", "deband", "normalize audio" — maps to exact FFmpeg filters

What Might Not Work as Expected ⚠️

  • Vague intensity words: "Make it very blurry" or "a little brighter" — the agent has to guess what number "very" or "a little" means. Tip: use a specific value instead: "blur with radius 15"
  • Out-of-range values: Parameters are auto-clamped to their valid range. If you ask for "brightness 5.0" it caps at the max (1.0)
  • Complex compositing: Multi-layer effects with precise timing may need to be broken into separate passes
  • Format-dependent features: Some effects (like transparency) require specific output formats. H.264/MP4 doesn't support alpha channels

🧠 AI Models (Auto-Downloaded)

Some skills use AI models that auto-download on first use. You can disable automatic downloads with the allow_model_downloads toggle on the FFMPEG Agent node — runs requiring a missing model will fail with a clear message and a manual download link.

All models are mirrored to first-party AEmotionStudio HuggingFace repos for supply chain resilience. Downloads try the AEmotionStudio mirror first, then fall back to upstream sources.

Model Size Stored In Triggered By Manual Download
SAM3 (Segment Anything 3) ~300 MB ComfyUI/models/SAM3/ auto_mask skill, sam3_masking no-LLM mode, Effects Builder SAM3 target AEmotionStudio/sam3 — download sam3.safetensors
Whisper large-v3 ~3 GB ComfyUI/models/whisper/ auto_transcribe, karaoke_subtitles skills, transcribe / karaoke_subtitles no-LLM modes AEmotionStudio/whisper-models
Whisper medium ~1.5 GB ComfyUI/models/whisper/ Same as above (set whisper_model to medium) Same as above
Whisper small ~500 MB ComfyUI/models/whisper/ Same as above (set whisper_model to small) Same as above
Whisper base ~150 MB ComfyUI/models/whisper/ Same as above (set whisper_model to base) Same as above
Whisper tiny ~75 MB ComfyUI/models/whisper/ Same as above (set whisper_model to tiny) Same as above
LaMa (Large Mask Inpainting) ~200 MB ~/.cache/torch/hub/checkpoints/ auto_mask:effect=remove (legacy fallback) AEmotionStudio/lama-inpainting — download big-lama.pt
FLUX Klein 4B (Editing/Removal) ~15 GB (bf16) ComfyUI/models/flux_klein/ auto_mask:effect=remove, auto_mask:effect=edit AEmotionStudio/flux-klein
MMAudio (Video-to-Audio) ~5.5 GB ComfyUI/models/mmaudio/ generate_audio skill AEmotionStudio/mmaudio-models
MuseTalk (Lip Sync) ~1.6 GB (fp16) ComfyUI/models/musetalk/ lip_sync skill AEmotionStudio/musetalk-models
U²-Net (rembg) ~170 MB ~/.u2net/ remove_background skill Install with pip install 'comfyui-ffmpega[masking]' — model auto-fetched by rembg

Note

Models are only downloaded when you use the corresponding skill for the first time. Core FFmpeg editing skills (200+ of them) require zero model downloads.


🎛️ Nodes

FFMPEGA provides 8 nodes that work together:

Tip

One task per run. Instead of cramming multiple edits into a single prompt, focus each run on one editing task — then feed the output back into FFMPEGA for the next. This keeps context low and model focus high, leading to significantly better results. Chain FFMPEGA Agent → Save Video → Load Video Path → FFMPEGA Agent for multi-step workflows.

Warning

Low VRAM? Skills that load AI models (SAM3 masking, Whisper transcription, LaMa inpainting) each consume significant VRAM. On GPUs with limited memory, limit each run to one model-loading task — e.g. do your SAM3 removal pass first, save the result, then run Whisper subtitles as a separate pass.

FFMPEG Agent — The main node. Translates natural language into FFMPEG commands.
Required Inputs
Input Type Description
video_path STRING Absolute path to source video. Used as ffmpeg input unless images_a is connected.
prompt STRING Natural language editing instruction (e.g. "Add cinematic letterbox", "Speed up 2x"). Not required in manual mode.
llm_model DROPDOWN AI model selection — local Ollama models, CLI tools, or cloud APIs. Select none for no-LLM mode.
no_llm_mode DROPDOWN Mode when llm_model is none: manual (Effects Builder, default), sam3_masking, transcribe, karaoke_subtitles.
quality_preset DROPDOWN Output quality: draft, standard, high, lossless.
seed INT Change to force re-execution with the same prompt. Supports randomize control.
Optional Inputs
Input Type Description
images_a IMAGE Video frames from upstream (e.g. Load Video). Auto-expands: images_b, images_c...
image_a IMAGE Extra image input for multi-input skills (grid, slideshow, overlay). Auto-expands: image_b, image_c...
audio_a AUDIO Audio input for muxing or multi-audio workflows. Auto-expands: audio_b, audio_c...
video_a STRING File path to extra video for concat, split screen, grid, xfade. Zero memory. Auto-expands: video_b, video_c...
image_path_a STRING File path to image for overlay, grid, slideshow. Zero memory. Auto-expands: image_path_b...
text_a STRING Text input from FFMPEGA Text node for subtitles, overlays, watermarks. Auto-expands: text_b...
pipeline_json STRING Connect from FFMPEGA Effects Builder. In manual mode the pipeline is executed directly; with an LLM it provides skill hints.
subtitle_path STRING Direct path to a .srt or .ass subtitle file.
advanced_options BOOLEAN Simple/Advanced toggle — shows preview mode, CRF, encoding preset, batch processing when enabled.
preview_mode BOOLEAN Quick low-res preview (480p, 10s) instead of full render.
save_output BOOLEAN Save video + workflow PNG to output folder.
output_path STRING Custom output file/folder path. Empty = ComfyUI default.
ollama_url STRING Ollama server URL (default: http://localhost:11434).
api_key STRING API key for cloud models (GPT, Claude, Gemini). Auto-redacted from outputs.
custom_model STRING Exact model name when llm_model is set to custom.
crf INT Override CRF (0 = lossless, 23 = default, 51 = worst). -1 uses quality_preset.
encoding_preset DROPDOWN Override x264/x265 speed preset (ultrafastveryslow). auto follows quality_preset.
use_vision BOOLEAN Embed video frames as images for vision-capable LLMs. Off = numeric color analysis only.
verify_output BOOLEAN Agent inspects output after rendering and auto-corrects if it doesn't match intent.
Whisper / SAM3 Inputs
Input Type Description
whisper_device DROPDOWN Device for Whisper model: cpu (default, avoids VRAM pressure) or gpu (faster, ~3 GB VRAM).
whisper_model DROPDOWN Whisper model size: large-v3 (default, most accurate), medium, small, base, tiny.
sam3_max_objects INT Max objects SAM3 tracks per frame (1–20, default 5). Lower = less VRAM.
sam3_det_threshold FLOAT Minimum detection confidence for SAM3 (0.0–1.0, default 0.70). Higher = fewer objects.
mask_output_type DROPDOWN black_white (raw mask for compositing) or colored_overlay (SAM3-style preview).
mask_points STRING JSON point selection data from Load Video Path's Point Selector. Guides SAM3 with click-to-select.
Batch Processing Inputs
Input Type Description
batch_mode BOOLEAN Process all matching videos in video_folder with the same prompt. Single LLM call.
video_folder STRING Folder containing videos to batch process.
file_pattern DROPDOWN File pattern to match (*.mp4, *.mov, *.*, etc.).
max_concurrent INT Maximum simultaneous encodes in batch mode (1–16, default 4).
Output Description
images All frames from the output video as a batched image tensor
audio Audio extracted from output video (or passed through from audio_a)
video_path Absolute path to the rendered output video file
command_log The ffmpeg command(s) that were executed
analysis LLM interpretation, pipeline steps, and warnings
FFMPEGA Effects Builder — Compose video effects visually without an LLM.

Select up to 3 skills with parameters, add raw FFmpeg filters, and use presets. Outputs a pipeline JSON that connects to the FFMPEG Agent's pipeline_json input.

Built-in presets: 🎬 Cinematic Look, 📼 VHS Retro, 🎥 High Quality Export, 🎵 Clean Audio, ✨ Glow + Saturation, 🎙️ Auto Subtitles, 🎤 Karaoke Subtitles, 🌑 Fade In + Out, 🪞 Mirror Horizontal, 🎬 Ken Burns Zoom.

Input Type Description
preset DROPDOWN Quick-start preset. Auto-fills effect slots and params. Set to none to build your own.
effect_1 DROPDOWN First effect. Categorized by type (🎨 Visual, ⏱️ Temporal, 📐 Spatial, 🔊 Audio, 📦 Encoding, ✨ Outcome).
effect_1_params STRING JSON parameters for effect 1 (e.g. {"strength": 5}). Auto-filled from defaults.
effect_2 DROPDOWN Second effect (chained after effect 1).
effect_2_params STRING JSON parameters for effect 2.
effect_3 DROPDOWN Third effect (chained after effect 2).
effect_3_params STRING JSON parameters for effect 3.
raw_ffmpeg STRING Raw FFmpeg -vf filter string applied after skill effects.
sam3_target STRING SAM3 text target — apply effects only to the masked region. Leave empty for full-frame.
sam3_effect DROPDOWN Effect for SAM3-detected region: blur, pixelate, remove, edit, grayscale, highlight, greenscreen, transparent.
Output Description
pipeline_json JSON pipeline — connect to FFMPEG Agent's pipeline_json input
Frame Extract (FFMPEGA) — Extract individual frames from a video as image tensors.
Input Type Description
video_path STRING Absolute path to the video to extract frames from.
fps FLOAT Extraction rate (0.1–60.0). 1.0 = one frame per second.
start_time FLOAT (optional) Start time in seconds (default: 0).
duration FLOAT (optional) Duration to extract from, in seconds (default: 10).
max_frames INT (optional) Max frames to return (default: 100, max: 1000).
Output Description
frames Extracted video frames as a batched image tensor

Tip: Connect frames output to the FFMPEG Agent's images_a input to build pipelines that analyze frames before editing.

Load Image Path (FFMPEGA) — Zero-memory image loader, outputs a file path instead of a tensor.

Outputs the image file path as a STRING instead of decoding into a ~6 MB IMAGE tensor. Connect to FFMPEGA Agent's image_path_a / image_path_b / … slots so ffmpeg reads the file directly.

Input Type Description
image FILE PICKER Select an image from ComfyUI's input directory or upload a new one.
Output Description
image_path Absolute file path to the selected image
Load Video Path (FFMPEGA) — Zero-memory video input with inline preview and metadata.

Validates the video file exists and outputs the path as a STRING — loads ZERO frames into memory. Features inline video preview, metadata display (fps, duration, resolution), and VHS-style trim parameters.

Input Type Description
video FILE PICKER Select or upload a video file.
force_rate FLOAT Override FPS (0 = use source).
skip_first_frames INT Frames to skip from start.
frame_load_cap INT Max frames to use (0 = all).
select_every_nth INT Select every Nth frame (1 = every frame).
Output Description
video_path Validated video file path
frame_count Total usable frames after trim
fps Effective FPS
duration Effective duration in seconds
Save Video (FFMPEGA) — Zero-memory video output with inline preview.

Takes a video path (from FFMPEGA Agent or Load Video Path), copies the file to ComfyUI's output directory, and shows a preview. No re-encoding — just a file copy.

Input Type Description
video_path STRING Path to video file (typically from FFMPEGA Agent's output).
filename_prefix STRING Prefix for saved filename. Supports %date:yyyy-MM-dd%.
overwrite BOOLEAN (optional) Overwrite existing file vs auto-increment counter.
Text Input (FFMPEGA) — Flexible text input for subtitles, overlays, and watermarks.

Auto-detects whether text is SRT subtitles, a short watermark, or overlay text. Outputs JSON-encoded metadata that the FFMPEGA Agent node parses for burn_subtitles, text_overlay, or watermark skills.

Right-click Presets: 10 built-in presets with example text (SRT Subtitle Example, Cinematic Subtitles, Bold Watermark, Title Card, Social Caption, Meme Text, Lower Third, Copyright Notice, Credits Roll, Chapter Marker). Save/load/delete custom presets. "Clear Text" resets all fields to defaults.

No-LLM Text Mode: Connect a Text node to the Agent's text_a input in no-LLM manual mode (without an Effects Builder). The Agent auto-generates a text overlay or subtitle pipeline from the Text node's mode, position, font size, and color settings.

Input Type Description
text STRING Text content — plain text, multi-line subtitles, or full SRT format with timestamps.
auto_mode BOOLEAN (optional) Auto-detect mode from content (default: on).
mode DROPDOWN (optional) Override mode: subtitle, overlay, watermark, title_card, raw.
position DROPDOWN (optional) Text placement: center, top, bottom, bottom_right, etc. auto = mode default.
font_size INT (optional) Font size in px (0 = auto: 24 subtitle, 48 overlay, 20 watermark).
font_color STRING (optional) Text color as hex (#RRGGBB, default: white).
start_time FLOAT (optional) Start time in seconds (default: 0).
end_time FLOAT (optional) End time in seconds (-1 = full duration).
Output Description
text_output JSON-encoded text with metadata — connect to text_a, text_b, etc. on the FFMPEGA Agent node
Video to Path (FFMPEGA) — Convert IMAGE tensor to a temp video file path.

Bridge node between Load Video nodes (which output IMAGE tensors) and FFMPEGA Agent's video_a/video_b/video_c slots. Encodes frames to a temp video file, then releases the tensor so ComfyUI can free the memory.

Input Type Description
images IMAGE Video frames from a Load Video node.
fps INT (optional) FPS for the output video (default: 24).
Output Description
video_path File path to the temp video

🎯 Skill System

FFMPEGA includes a comprehensive skill system with 200+ operations organized into categories. Use them in two ways: let the AI agent select skills from your prompt, or pick them yourself with the Effects Builder — no LLM needed.

📄 See SKILLS_REFERENCE.md for the complete skill reference with all parameters and example prompts.

🧪 See SKILL_TEST_PROMPTS.md for ready-to-use copy-and-paste test prompts for every skill.

🎨 Visual Effects (30 skills)
Skill Description
brightness Adjust brightness (-1.0 to 1.0)
contrast Adjust contrast (0.0 to 3.0)
saturation Adjust color saturation (0.0 to 3.0)
hue Shift color hue (-180 to 180)
sharpen Increase sharpness
blur Apply blur effect
denoise Reduce noise/grain (light, medium, strong)
vignette Darken edges for cinematic focus
fade Fade in/out to black
colorbalance Adjust shadows/midtones/highlights
noise Add film grain
curves Apply color curve presets (vintage, cross_process, etc.)
text_overlay Add text with position, color, size, font
invert Invert colors (photo negative)
edge_detect Edge detection / sketch look
pixelate Mosaic / 8-bit pixel effect
gamma Gamma correction
exposure Exposure adjustment
chromakey Green screen removal
colorkey Key out any arbitrary color and replace with a background
colorhold Keep only a selected color, desaturate everything else (spot color)
lumakey Key out regions based on brightness (luma)
despill Remove green/blue color spill from chroma-keyed edges
deband Remove color banding artifacts
white_balance Adjust color temperature (2000K–12000K)
shadows_highlights Separately adjust shadows and highlights
split_tone Warm highlights, cool shadows
deflicker Remove fluorescent/timelapse flicker
unsharp_mask Fine-grained luma/chroma sharpening
remove_background Remove backgrounds using AI (rembg)
⏱️ Temporal (9 skills)
Skill Description
trim Cut a segment by time
speed Change playback speed (0.1x to 10x)
reverse Play backwards
loop Repeat video
fps Change frame rate (1 to 120)
scene_detect Auto-detect scene changes
silence_remove Remove silent segments
time_remap Gradual speed ramp
freeze_frame Freeze a frame at a timestamp
📐 Spatial (8 skills)
Skill Description
resize Scale to specific dimensions
crop Crop video region
rotate Rotate by degrees
flip Mirror horizontal/vertical
pad Add padding / letterbox
aspect Change aspect ratio (16:9, 4:3, 1:1, 9:16, 21:9)
auto_crop Detect and remove black borders
scale_2x Quick upscale with algo choice (2x, 4x)
🔊 Audio (27 skills)
Skill Description
volume Adjust audio level
normalize Normalize loudness
fade_audio Audio fade in/out
remove_audio Strip all audio
extract_audio Extract audio only
bass / treble Boost/cut frequencies
pitch Shift pitch by semitones
echo Add echo / reverb
equalizer Adjust specific frequency band
stereo_swap Swap L/R channels
mono Convert to mono
audio_speed Change audio speed only
chorus Chorus thickening effect
flanger Sweeping jet flanger
lowpass / highpass Frequency filters
audio_reverse Reverse audio track
compress_audio Dynamic range compression
noise_reduction Remove background noise
audio_crossfade Smooth audio crossfade
audio_delay Add delay/offset to audio
ducking Audio dynamic compression
dereverb Remove room echo/reverb
split_audio Extract left/right channel
audio_normalize_loudness EBU R128 loudness normalization
replace_audio Replace original audio track
mix_audio Mix/blend audio tracks from two inputs (both audible)
audio_bitrate Set audio encoding bitrate
📦 Encoding (13 skills)
Skill Description
compress Reduce file size (light, medium, heavy)
convert Change codec (h264, h265, vp9, av1)
quality Set CRF and encoding preset
bitrate Set video/audio bitrate
web_optimize Fast-start for web streaming
container Change format (mp4, mkv, avi, mov, webm)
pixel_format Set pixel format (yuv420p, yuv444p, etc.)
hwaccel Hardware acceleration (cuda, vaapi, qsv)
audio_codec Set audio codec (aac, mp3, opus, flac)
frame_rate_interpolation Motion-interpolated FPS conversion
frame_interpolation Smooth slow motion via motion interpolation
two_pass Two-pass encoding for better quality
hls_package HLS adaptive streaming packaging
🎬 Cinematic Presets (14 skills)
Skill Description
cinematic Hollywood film look — teal-orange grading
blockbuster Michael Bay style — high contrast, dramatic
documentary Clean, natural documentary look
indie_film Indie art-house — faded, low contrast
commercial Bright, clean corporate video
dream_sequence Dreamy, soft, ethereal atmosphere
action Fast-paced action movie grading
romantic Soft, warm romantic mood
sci_fi Cool blue sci-fi atmosphere
dark_moody Dark, atmospheric, moody feel
color_grade Cinematic color grading (teal_orange, warm, cool)
color_temperature Adjust color temperature (warm/cool)
letterbox Cinematic widescreen letterbox bars
film_grain Film grain texture (light, medium, heavy)
📼 Vintage & Retro (9 skills)
Skill Description
vintage Classic old film look (50s–90s)
vhs VHS tape aesthetic
sepia Classic sepia/brown tone
super8 Super 8mm film look
polaroid Polaroid instant photo
faded Washed-out, faded look
old_tv CRT television aesthetic
damaged_film Aged/weathered film
noir Film noir — B&W, high contrast
📱 Social Media (9 skills)
Skill Description
social_vertical TikTok / Reels / Shorts (9:16)
social_square Instagram feed (1:1)
youtube YouTube optimized
twitter Twitter/X optimized
gif Convert to animated GIF
thumbnail Extract thumbnail frame
caption_space Add space for captions
watermark Overlay logo/watermark
intro_outro Add intro/outro segments
✨ Creative Effects (14 skills)
Skill Description
neon Neon glow — vibrant edges and colors
horror Dark, desaturated, grainy horror atmosphere
underwater Blue tint, blur, darker underwater look
sunset Golden hour warm glow
cyberpunk Neon tones, high contrast cyberpunk
comic_book Bold colors, comic/pop art style
miniature Tilt-shift toy model effect
surveillance Security camera / CCTV look
music_video Punchy colors, contrast, vignette
anime Anime / cel-shaded cartoon
lofi Lo-fi chill aesthetic
thermal Thermal / heat vision camera
posterize Reduce color palette / screen-print
emboss Emboss / relief surface effect
🧪 Special Effects (36 skills)
Skill Description
meme Deep-fried meme aesthetic
glitch Digital glitch / databend
mirror Mirror / kaleidoscope effect
slow_zoom Slow push-in zoom
black_and_white B&W with style options
day_for_night Simulate nighttime from daytime
dreamy Soft, ethereal dream look
hdr_look Simulated HDR dynamic range
datamosh Glitch art / motion vector visualization
radial_blur Radial / zoom blur effect
grain_overlay Cinematic film grain with intensity control
burn_subtitles Hardcode subtitles
selective_color Isolate specific colors
perspective Perspective transform
lut_apply Apply LUT color grading
lens_correction Fix lens distortion
fill_borders Fill black borders
deshake Quick stabilization
deinterlace Remove interlacing
halftone Newspaper dot pattern
false_color Pseudocolor heat map
frame_blend Temporal frame blending
tilt_shift Tilt-shift miniature effect
color_channel_swap Color channel remapping
ghost_trail Temporal motion trails
glow Bloom / soft glow effect
sketch Pencil drawing / ink line art
chromatic_aberration RGB channel offset / color fringing
boomerang Looping boomerang effect
ken_burns Slow zoom pan for photos
slowmo Smooth slow motion
stabilize Remove camera shake
timelapse Dramatic speed-up for timelapse
zoom Zoom in/out effect
scroll Scroll video vertically/horizontally
monochrome Monochrome with optional tint
🎬 Transitions (3 skills)
Skill Description
fade_to_black Fade in from + fade out to black
fade_to_white Fade in from + fade out to white
flash Camera flash at a specific timestamp
🌀 Motion (5 skills)
Skill Description
spin Continuous animated rotation
shake Camera shake / earthquake (light, medium, heavy)
pulse Rhythmic breathing zoom effect
bounce Vertical bouncing animation
drift Slow cinematic pan (left, right, up, down)
🔮 Reveal Effects (3 skills)
Skill Description
iris_reveal Circle expanding from center
wipe Directional wipe from black
slide_in Slide video in from edge
🎵 Audio Visualization (1 skill)
Skill Description
waveform Audio waveform overlay (line, point, cline modes)
🔗 Multi-Input & Composition (10 skills)
Skill Description
grid Arrange video + images in a grid layout (xstack). Auto-includes video as first cell.
slideshow Create slideshow from images with fade transitions. Optionally starts with the main video.
overlay_image Picture-in-picture / watermark overlay. Supports multiple overlays auto-placed in corners. Accepts animation=bounce for motion.
concat Concatenate video segments sequentially. Connect multiple videos/images to join them.
xfade Smooth transitions between segments — 18 types: fade, dissolve, wipe, pixelize, radial, etc.
split_screen Side-by-side (horizontal) or top-bottom (vertical) multi-video layout.
animated_overlay Moving image overlay with motion presets: scroll, float, bounce, slide.
✏️ Text & Graphics (9 skills)
Skill Description
animated_text Animated text overlay
scrolling_text Scrolling credits-style text
ticker News-style scrolling ticker bar
lower_third Professional broadcast lower third
countdown Countdown timer overlay
typewriter_text Typewriter reveal effect
bounce_text Bouncing animated text
fade_text Text that fades in and out
karaoke_text Karaoke-style fill text
✂️ Editing & Delivery (13 skills)
Skill Description
picture_in_picture PiP overlay window with optional border
blend Blend two video inputs
delogo Remove logo from a region
remove_dup_frames Strip duplicate/stuttered frames
mask_blur Blur a rectangular region for privacy
extract_frames Export frames as image sequence
jump_cut Auto-cut to high-energy moments
beat_sync Sync cuts to a beat interval
color_match Auto histogram equalization
extract_subtitles Extract subtitle track
preview_strip Filmstrip preview of key frames
sprite_sheet Contact sheet of frames
🤖 AI-Powered (4 skills)
Skill Description
auto_transcribe Transcribe audio with Whisper AI and burn SRT subtitles
karaoke_subtitles Word-by-word karaoke subtitles with progressive color fill (Whisper)
auto_mask SAM3-powered object segmentation from text prompts
generate_audio AI-generate synchronized audio/foley from video + text (MMAudio)

⚠️ License Notice: The generate_audio skill uses MMAudio model weights which are licensed under CC-BY-NC 4.0 (non-commercial use only). Model weights are downloaded on first use — by downloading them you accept the CC-BY-NC 4.0 license. The FFMPEGA code itself remains GPL-3.0.

🧠 Agentic Tools

Beyond the 200+ editing skills, the agent has built-in tools for analyzing media and making better decisions. In LLM mode, the agent calls these autonomously based on your prompt. Some (like analyze_video and search_skills) are also invoked directly by internal skills, the Effects Builder, and no-LLM modes.

🔍 Analysis & Discovery
Tool What It Does
analyze_video Probes resolution, duration, codec, FPS, bitrate — the agent calls this to understand your source
extract_frames Extracts PNG frames for vision models to "see" the video content
analyze_colors Numeric color metrics (luminance, saturation, color balance) via ffprobe signalstats — guides color grading decisions without vision
analyze_audio Numeric audio metrics (volume dB, EBU R128 loudness LUFS, silence detection) — guides audio effect decisions
search_skills Searches skills by keyword — the agent always calls this to find the right skills
list_luts Lists available LUT files for color grading — called before lut_apply to discover available looks
🎨 LUT Color Grading System

8 bundled LUT files for cinematic color grading. The agent discovers these via list_luts and applies them with the lut_apply skill.

LUT Style
cinematic_teal_orange Hollywood teal-orange grade
warm_vintage Warm retro film look
cool_scifi Cool blue sci-fi tone
film_noir Classic noir — desaturated, crushed
golden_hour Warm golden sunlight
cross_process Cross-processed film chemistry
bleach_bypass Bleach bypass — low saturation, high contrast
neutral_clean Subtle clarity enhancement

Adding your own LUTs: Drop .cube or .3dl files into the luts/ folder. The agent will discover them via list_luts automatically. Short names auto-resolve to full paths (e.g., cinematic_teal_orangeluts/cinematic_teal_orange.cube).

✅ Output Verification Loop

When verify_output is enabled (default: On), the agent inspects its own output after execution:

  1. Extracts frames from the output video
  2. Runs color and/or audio analysis on the result
  3. Sends analysis to the LLM with the original prompt for quality assessment
  4. If the LLM detects issues, it auto-corrects the pipeline and re-executes once

This closes the feedback loop — the agent can catch and fix mistakes like wrong color grades, failed effects, or audio issues without re-queuing.

🧩 Custom Skills

Create your own skills via YAML — no Python required. Drop a .yaml file in custom_skills/, restart ComfyUI, and the agent can use it immediately.

# custom_skills/dreamy_blur.yaml
name: dreamy_blur
description: "Soft dreamy blur with glow"
category: visual
tags: [dream, blur, soft, glow]

parameters:
  radius:
    type: int
    default: 5
    min: 1
    max: 30

ffmpeg_template: "gblur=sigma={radius},eq=brightness=0.06"

Skill packs — installable collections of related skills, optionally with Python handlers for complex logic:

# Linux / macOS / Windows (Git Bash or PowerShell)
cd custom_skills/
git clone https://github.com/someone/ffmpega-retro-pack retro-pack

Two example skills ship in custom_skills/examples/warm_glow.yaml (template) and film_burn.yaml (pipeline composite).

📄 See CUSTOM_SKILLS.md for the full schema reference, skill pack structure, Python handlers, and advanced examples.


🤖 LLM Configuration

Tested with: FFMPEGA has been primarily tested using Gemini CLI and Qwen3 8B (via Ollama). Results may vary with other models.

Author's pick: The CLI connectors (especially Gemini CLI) have been the most reliable option in my experience — they handle tool-calling, structured output, and long context exceptionally well. Highly recommended if you have access.

Choosing a Model

FFMPEGA works best with models that have strong JSON output and instruction-following abilities. The agent sends a structured prompt and expects a valid JSON pipeline back — models with tool-calling or function-calling capabilities tend to perform best.

Things to keep in mind:

  • Some models work better than others — larger models and those trained for structured output (JSON/tool-calling) produce more reliable results
  • Some models may need more retries — if the agent fails to parse the response, try running the same prompt again. Smaller models occasionally return malformed JSON on the first try
  • Find what works best for you — experiment with different models to find the right balance of speed, quality, and reliability for your hardware

Ollama (Local — Free)

The default option. Runs locally, no API key needed.

Install Ollama: Download from ollama.com/download (available for Linux, macOS, and Windows).

# Start Ollama (all platforms)
ollama serve

# Pull a model
ollama pull qwen3:8b

Windows users: After installing Ollama, the ollama command is available in both PowerShell and Command Prompt. Ollama also runs as a system tray app.

Recommended local models (≤30B, consumer GPU friendly):

Model Size Speed Notes
qwen3:8b 8B ⚡ Fast Tested — excellent structured output, native tool-calling
qwen3-vl 8B ⚡ Fast Tested — multimodal vision-language model, sees video frames
qwen3:14b 14B ⚡ Fast Sweet spot of speed and quality, tools + thinking tags
qwen3:30b 30B 🔄 Medium Best Qwen under 30B, needs 16GB+ VRAM
qwen2.5:14b 14B ⚡ Fast Top IFEval scores, strong instruction following
deepseek-r1:14b 14B ⚡ Fast Reasoning model — verifies its own tool calls, very reliable
mistral-nemo 12B ⚡ Fast NVIDIA + Mistral collab, 128k context, great reasoning
mistral-small3.2 24B 🔄 Medium Native function calling, 128k context
phi4 14B ⚡ Fast Microsoft reasoning SLM, rivals larger models in logic
gemma3:12b 12B ⚡ Fast High reasoning scores, 128k context
llama3.3:8b 8B ⚡ Fast Reliable tool-calling, large ecosystem

Tip: On the Ollama library, look for models with a tools tag — this indicates native tool/function-calling support, which produces the best results with FFMPEGA.

OpenAI

llm_model: gpt-5.2
api_key: your-openai-key

Gemini (Google API)

llm_model: gemini-3-flash
api_key: your-google-ai-key

Gemini CLI (Free with any Google Account)

Use the Gemini CLI to run Gemini models without an API key. Works with any Google account.

Install:

Platform Command
Linux / macOS npm install -g @google/gemini-cli
Windows (PowerShell) npm install -g @google/gemini-cli

Tip: You can also use pnpm add -g or yarn global add if you prefer.

Authenticate (first time only):

gemini

This opens a browser to sign in with your Google account.

Use in FFMPEGA:

llm_model: gemini-cli

No API key is needed — authentication is handled by the CLI. Select gemini-cli from the model dropdown in the node.

Note: The Gemini CLI runs as a subprocess and is sandboxed to the custom node directory for security. On Windows, gemini.cmd is also detected automatically.

Plans & Usage Limits

Plan Rate Limit Daily Limit Models Cost
Free (Google login) 60 req/min 1,000 req/day Gemini model family (Pro + Flash) Free
Free (API key only) 10 req/min 250 req/day Flash only Free
Code Assist Standard 120 req/min 1,500 req/day Gemini model family Paid
Code Assist Enterprise 120 req/min 2,000 req/day Gemini model family Paid
Google AI Pro Higher Higher Full Gemini family $19.99/mo

Tip: Sign in with a Google account (free) for the best experience — 1,000 requests/day with access to Pro and Flash models. An unpaid API key limits you to 250/day on Flash only.

Available Models

The Gemini CLI auto-selects the best model, but the following are available:

Model Best For
Gemini 2.5 Pro Complex reasoning, creative tasks
Gemini 2.5 Flash Fast responses, high throughput
Gemini 2.5 Flash-Lite Maximum speed, lowest cost
Gemini 3 Pro Most capable, advanced reasoning
Gemini 3 Flash Fast + capable, good balance

Free tier may auto-switch to Flash models when Pro quota is exhausted.

Anthropic

llm_model: claude-sonnet-4-6
api_key: your-anthropic-key

Claude Code CLI (Free with Anthropic Account)

Use the Claude Code CLI as a local LLM backend. Uses its own authentication — no API key needed in FFMPEGA.

Install:

Platform Command
Linux / macOS npm install -g @anthropic-ai/claude-code
Windows (PowerShell) npm install -g @anthropic-ai/claude-code

Tip: You can also use pnpm add -g or yarn global add if you prefer.

Authenticate (first time only):

claude

This opens a browser to sign in with your Anthropic account.

Use in FFMPEGA:

llm_model: claude-cli

Auto-detected on PATH. Select claude-cli from the model dropdown.

Cursor Agent CLI

Use Cursor's CLI in agent mode as an LLM backend.

Install (all platforms): Open Cursor IDE → Command Palette (Ctrl+Shift+P / Cmd+Shift+P) → "Install 'cursor' command"

Start the agent:

agent

Use in FFMPEGA:

llm_model: cursor-agent

Auto-detected on PATH. Select cursor-agent from the model dropdown.

Qwen Code CLI (Free — 2,000 requests/day)

Use Qwen Code as a free LLM backend. Powered by Qwen3-Coder with 2,000 free requests/day via OAuth — no credit card required.

Install:

Platform Command
Linux / macOS npm install -g @qwen-code/qwen-code@latest
Windows (PowerShell) npm install -g @qwen-code/qwen-code@latest

Tip: You can also use pnpm add -g or yarn global add if you prefer.

Authenticate (first time only):

qwen

Select "Qwen OAuth (Free)" and follow the browser prompts to sign in.

Use in FFMPEGA:

llm_model: qwen-cli

Auto-detected on PATH. Select qwen-cli from the model dropdown.

👁️ CLI Agent Vision Support

When Frame Extraction is used, FFMPEGA saves extracted frames to a _vision_frames/ directory and passes the frame paths to the CLI agent. Agents with vision support can see and analyze the actual frame images to make better editing decisions.

CLI Agent Vision Support Notes
Gemini CLI ✅ Yes read_file converts images to base64 for multimodal analysis
Claude Code CLI ✅ Yes Native image reading and description
Cursor Agent CLI ✅ Yes Native image reading and description
Qwen Code CLI ❌ Not yet Known issueread_file returns raw binary instead of interpreting images. Vision is listed as a planned feature.

Note: Agents without vision support still receive the frame file paths and can use video metadata (duration, resolution, FPS) from analyze_video to make editing decisions. When Qwen fixes their vision support upstream, it will work automatically since the frame paths are already passed correctly.

⚠️ Important: The _vision_frames/ directory must not be listed in .gitignore or .git/info/exclude — CLI agents respect these ignore patterns and will be unable to read the frames. FFMPEGA's cleanup_vision_frames() automatically deletes the directory after each pipeline run.

Custom Model

Select custom from the model dropdown and type any model name in the custom_model field. The provider is auto-detected from the name:

Prefix Provider
gpt-* OpenAI
claude-* Anthropic
gemini-* Google
Anything else Ollama (local)

This lets you use any new model immediately without waiting for a code update.

🔒 API Key Security

Your API keys are automatically scrubbed and never stored in output files:

  • Error messages — keys are redacted before being shown in the UI (e.g. ****abcd)
  • Workflow metadata — ComfyUI embeds workflow data in output images/videos; FFMPEGA strips the api_key field from this metadata before saving
  • HTTP errors — keys are removed from network error messages that might include auth headers
  • Debug logsLLMConfig redacts keys in all string representations

No configuration needed — this protection is always active when an API key is provided.

⚠️ Safety precaution: As with any software, always inspect your output files before sharing them publicly — in the unlikely event of a bug or edge case that bypasses the automatic scrubbing.

📊 Token Usage Tracking

Monitor your LLM token consumption with opt-in usage tracking. Enable via two toggles on the node:

Toggle Default What It Does
track_tokens Off Prints a formatted usage summary to the console after each run
log_usage Off Appends a JSON entry to usage_log.jsonl for cumulative tracking

Token data sources by connector:

Connector Source Estimated?
Ollama Native API (prompt_eval_count / eval_count) No
OpenAI / Gemini API Native API (usage field) No
Anthropic API Native API (usage.input_tokens) No
Gemini CLI JSON output via -o json No
Claude CLI JSON output via --output-format json No
Other CLIs Character-based estimation (~4 chars/token) Yes

When enabled, the analysis output includes a usage breakdown:

Token Usage:
  Prompt tokens:     4,200
  Completion tokens: 1,800
  Total tokens:      6,000
  LLM calls:         5
  Tool calls:        3
  Elapsed:           12.4s

The usage_log.jsonl file stores one JSON object per run for historical analysis. It is gitignored by default.


🐛 Troubleshooting

FFMPEG Not Found

Ensure FFMPEG is installed and in your system PATH:

ffmpeg -version

Install FFMPEG:

Platform Command / Method
Ubuntu / Debian sudo apt install ffmpeg
Arch / CachyOS sudo pacman -S ffmpeg
Fedora sudo dnf install ffmpeg
macOS brew install ffmpeg
Windows (winget) winget install Gyan.FFmpeg
Windows (choco) choco install ffmpeg
Windows (scoop) scoop install ffmpeg
Windows (manual) Download from ffmpeg.org/download, extract, and add the bin/ folder to your system PATH

Windows PATH tip: After installing, open a new terminal and run ffmpeg -version to verify. If not found, you may need to add ffmpeg's bin/ directory to your system PATH manually: Settings → System → About → Advanced system settings → Environment Variables → Edit Path.

Ollama Connection Failed

Make sure Ollama is running:

ollama serve

If using a custom URL, set it in the node's ollama_url field.

Model Not Found

Pull the required model first:

ollama pull qwen2.5:8b
LLM Returns Empty Response

This usually means:

  • The model is still loading (first request after start)
  • The prompt is too long for the model's context window
  • Try running the same prompt again
  • Try a different model
Parameter Validation Errors

FFMPEGA auto-coerces types (float→int) and clamps out-of-range values. If you still see errors, try simplifying your prompt or using a more capable model.

Cancelling a Running Request

If the LLM is taking too long or you want to abort mid-request, close the ComfyUI terminal or restart ComfyUI instead of using the interrupt button. The interrupt button waits for the current LLM response to complete, which can take a while — closing/restarting ComfyUI kills it immediately.


🤝 Contributing

Contributions are welcome! Whether it's bug reports, new skills, or improvements — your help is appreciated.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📝 License

This project is licensed under the GPL-3.0 License — see the LICENSE file for details.


Developed by Æmotion Studio

YouTube Discord Ko-fi