@Dan-Flores Dan-Flores commented Nov 26, 2025

This PR creates a benchmark to compare VideoEncoder against FFmpeg CLI. These tools aren't one-to-one, so some assumptions are made:

For VideoEncoder, we use this simple workflow:

encoder = VideoEncoder(frames=frames, frame_rate=30)
encoder.to_file(dest=output_path, codec="h264_nvenc", extra_options={"qp": 1})

For FFmpeg CLI, we also count the time spent writing frames from a tensor to a raw file on disk, unless the --skip-write-frames flag is passed:

if write_frames:
    raw_frames = frames.permute(0, 2, 3, 1).contiguous()[:num_frames]
    with open(raw_path, "wb") as f:
        f.write(raw_frames.cpu().numpy().tobytes())

ffmpeg_cmd = [...]
subprocess.run(ffmpeg_cmd, check=True, capture_output=True)
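The actual `ffmpeg_cmd` is elided above. For illustration only, here is one plausible shape such a command list could take, assuming raw RGB input and the NVENC settings discussed in this PR; the paths and the 480x270 frame size are assumptions taken from the benchmark output, not from the PR's code:

```python
# Illustrative only: one plausible shape for the elided ffmpeg_cmd,
# assuming headerless raw RGB frames and h264_nvenc encoding.
width, height, fps = 480, 270, 30          # size assumed from the output below
raw_path, output_path = "frames.rgb", "out.mp4"  # hypothetical paths

ffmpeg_cmd = [
    "ffmpeg", "-y",
    "-f", "rawvideo",              # input is a headerless raw video stream
    "-pix_fmt", "rgb24",           # matches the NHWC uint8 RGB tensor bytes
    "-s", f"{width}x{height}",     # ffmpeg expects WxH
    "-r", str(fps),
    "-i", raw_path,
    "-c:v", "h264_nvenc",
    "-qp", "0",
    output_path,
]
```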

Result Summary:

  • VideoEncoder shows better performance on both GPU and CPU.
    • When the time required to write frames to bytes is included, FFmpeg CLI is much slower.
  • On GPU, VideoEncoder shows a significant speed improvement, up to 3.5x faster than FFmpeg CLI for encoding, even without counting the time required to write frames to bytes.
    • NVENC utilization is higher for VideoEncoder, while median GPU memory usage is the same.
  • On CPU, FFmpeg CLI has a slight edge when the time required to write frames to bytes is excluded; otherwise, VideoEncoder is significantly faster.
Details

All benchmarks are run using a 1280x720 video, generated with:

`ffmpeg -f lavfi -i testsrc2=duration=600:size=1280x720:rate=30 -c:v libx264 -pix_fmt yuv420p test/resources/testsrc2_10min.mp4`

Benchmarking nasa_13013.mp4, writing frames in FFmpeg

$ python benchmarks/encoders/benchmark_encoders.py

Benchmarking 390 frames from nasa_13013.mp4 over 30 runs:
Decoded 390 frames of size 270x480

VideoEncoder on GPU   med = 119.26 ms, max = 122.06 ms, fps = 3270.1
GPU memory used:      med = 1231.0 MB, max = 1231.0 MB
NVENC utilization:    med = 30.0%,     max = 38.0%

FFmpeg CLI on GPU     med = 1174.55 ms, max = 1524.59 ms, fps = 332.0
GPU memory used:      med = 1231.0 MB, max = 1231.0 MB
NVENC utilization:    med = 15.0%,     max = 22.0%

VideoEncoder on CPU   med = 408.43 ms, max = 454.66 ms, fps = 954.9

FFmpeg CLI on CPU     med = 1184.47 ms, max = 1219.28 ms, fps = 329.3

Benchmarking nasa_13013.mp4, with --skip-write-frames

$ python benchmarks/encoders/benchmark_encoders.py --skip-write-frames

Benchmarking 390 frames from nasa_13013.mp4 over 30 runs:
Decoded 390 frames of size 270x480

VideoEncoder on GPU   med = 120.21 ms, max = 122.40 ms, fps = 3244.4
GPU memory used:      med = 1231.0 MB, max = 1231.0 MB
NVENC utilization:    med = 26.0%,     max = 39.0%

FFmpeg CLI on GPU     med = 419.66 ms, max = 1189.17 ms, fps = 929.3
GPU memory used:      med = 1231.0 MB, max = 1231.0 MB
NVENC utilization:    med = 18.0%,     max = 23.0%

VideoEncoder on CPU   med = 408.86 ms, max = 449.01 ms, fps = 953.9

FFmpeg CLI on CPU     med = 383.65 ms, max = 410.91 ms, fps = 1016.5
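As a sanity check on the summary above, the GPU speedups can be recomputed directly from the median times reported in the two runs:

```python
# Median times (ms) copied from the benchmark output above.
gpu_encoder = 119.26             # VideoEncoder on GPU, frame writing included
gpu_ffmpeg_with_write = 1174.55  # FFmpeg CLI on GPU, frame writing included
gpu_encoder_skip = 120.21        # VideoEncoder on GPU, --skip-write-frames
gpu_ffmpeg_skip = 419.66         # FFmpeg CLI on GPU, --skip-write-frames

# With the raw-frame write included, FFmpeg CLI is ~9.8x slower on GPU.
print(round(gpu_ffmpeg_with_write / gpu_encoder, 1))  # → 9.8
# Even excluding the write, VideoEncoder is ~3.5x faster.
print(round(gpu_ffmpeg_skip / gpu_encoder_skip, 1))   # → 3.5
```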

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Nov 26, 2025
@Dan-Flores Dan-Flores force-pushed the test_gpu_benchmarking branch from e4b6d52 to 743b664 Compare December 2, 2025 14:23
@Dan-Flores Dan-Flores changed the title [wip] benchmark encoding Benchmark encoding against ffmpeg cli Dec 18, 2025
@Dan-Flores Dan-Flores marked this pull request as ready for review December 18, 2025 14:46
def encode_torchcodec(frames, output_path, device="cpu"):
    encoder = VideoEncoder(frames=frames, frame_rate=30)
    if device == "cuda":
        encoder.to_file(dest=output_path, codec="h264_nvenc", extra_options={"qp": 1})
Contributor

Are we currently using qp=1 for torchcodec encoder vs qp=0 for ffmpeg cli? (line 155)

Contributor Author

Yes, and we should not be, thanks for catching this!


mollyxu commented Dec 18, 2025

Great work on the benchmarks @Dan-Flores! I liked the detailed analysis of the results. I left two clarifying questions.

@NicolasHug NicolasHug left a comment

Thanks @Dan-Flores , this looks good!

self.metrics = {
    "utilization": [s["utilization"] for s in samples],
    "memory_used": [s["memory_used"] for s in samples],
}
Contributor

On NVENCMonitor above, I think we might want to use pynvml instead, as done e.g. in P1984513849.

The main reason is that NVENCMonitor is sampling the utilization value every 50ms, which isn't exactly in sync with the number of iterations in the loop. That is, the returned nvenc_tensor doesn't contain the same number of values as the times tensor, so their reported values aren't averaged over the same number of experiments either.
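To illustrate the mismatch with made-up numbers (not taken from this PR): a fixed 50 ms sampling interval produces a sample count driven by wall-clock duration, not by the number of benchmark iterations:

```python
# Hypothetical numbers illustrating the sampling mismatch: 30 timed
# iterations of ~120 ms each, monitored at a fixed 50 ms interval.
num_iterations = 30
iteration_ms = 120
sample_interval_ms = 50

total_ms = num_iterations * iteration_ms             # 3600 ms of wall clock
num_utilization_samples = total_ms // sample_interval_ms

print(num_utilization_samples)  # → 72, vs. 30 entries in the times tensor
```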

Contributor Author

Thanks for the example, I'll update to use pynvml.

I see how arbitrarily selecting 50ms will not yield the same number of values as the times tensor, but I don't completely understand how pynvml.nvmlDeviceGetDecoderUtilization manages it. Is it always sampling the device for usage, and then, when called, returning a single value averaged over an automatically determined sampling period?
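For reference, NVML's utilization queries (`nvmlDeviceGetEncoderUtilization` and its decoder counterpart in pynvml) return a pair: the utilization percentage and the sampling period, in microseconds, that the driver itself chose. A minimal sketch of the call shape, with the `nvml` module passed in as a parameter so it can be exercised without a GPU:

```python
def sample_nvenc_utilization(nvml, device_index=0):
    """Sketch of an NVML encoder-utilization query.

    `nvml` stands in for the pynvml module (dependency-injected here so
    the call shape can be shown and tested without a GPU). The driver
    reports the sampling period it used, which is how NVML sidesteps a
    hand-picked interval like the 50 ms one discussed above.
    """
    handle = nvml.nvmlDeviceGetHandleByIndex(device_index)
    utilization, sampling_period_us = nvml.nvmlDeviceGetEncoderUtilization(handle)
    return utilization, sampling_period_us
```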

min = unit_times.min().item()
max = unit_times.max().item()
print(
    f"\n{prefix} {med = :.2f}, {mean = :.2f} +- {std:.2f}, {min = :.2f}, {max = :.2f} - in {unit}, fps = {fps:.1f}"
)
Contributor

Nit, no strong opinion but I find that reporting the mean isn't super useful when the std is small enough. Same for min and max (one is enough). It makes the logs slightly easier to read.

Comment on lines 104 to 105
util_nonzero = nvenc_metrics["utilization"][nvenc_metrics["utilization"] > 0]
util_median = util_nonzero.median().item() if len(util_nonzero) > 0 else 0.0
Contributor

I think we'll want the complete view and not exclude the zero values. That the utilization is sometimes zero is actually relevant information!
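To see why this matters (illustrative numbers only, not from this PR's runs): when the encoder is idle for most samples, dropping the zeros can move the median substantially:

```python
from statistics import median

# Hypothetical utilization samples from a short run where NVENC is
# often idle between encode bursts.
samples = [0, 0, 0, 30, 0, 35, 0, 0]

print(median(samples))                        # → 0.0 (complete view)
print(median([s for s in samples if s > 0]))  # → 32.5 (zeros excluded)
```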

Contributor Author

The reason I skipped zero values was that, for shorter benchmarks with the previous NVENCMonitor approach, NVENC was not active most of the time, so the median was often zero. It seems the pynvml library is smarter about sampling frequency, so this is no longer a problem!

print("CUDA not available. GPU benchmarks will be skipped.")

decoder = VideoDecoder(str(args.path))
valid_max_frames = min(args.max_frames, len(decoder))
Contributor

Do we need this? I think we should be able to remove the corresponding parameter in write_raw_frames and just use len(frames)?

Comment on lines 220 to 221
else:
    print("Skipping VideoEncoder GPU benchmark (CUDA not available)")
Contributor

Nit, something similar is already printed above. We only need one.

Comment on lines 239 to 240
else:
    print("Skipping FFmpeg CLI GPU benchmark (CUDA not available)")
Contributor

Same here. We should only need a single if cuda_available: block and benchmark both torchcodec and the ffmpeg cli in there.
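A minimal sketch of the suggested restructure; the function and helper names here are hypothetical stand-ins for the PR's actual benchmark code, shown only to illustrate collapsing the two branches into one:

```python
# Sketch only: a single cuda_available check covering both GPU runs.
# The names are hypothetical, not from the PR.
def run_gpu_benchmarks(cuda_available, benchmarks):
    if not cuda_available:
        print("CUDA not available. GPU benchmarks will be skipped.")
        return []
    # One branch runs both the VideoEncoder and FFmpeg CLI GPU benchmarks.
    return [bench() for bench in benchmarks]

results = run_gpu_benchmarks(
    False, [lambda: "torchcodec_gpu", lambda: "ffmpeg_cli_gpu"]
)
```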
