Benchmark encoding against ffmpeg cli #1074
Conversation
```python
def encode_torchcodec(frames, output_path, device="cpu"):
    encoder = VideoEncoder(frames=frames, frame_rate=30)
    if device == "cuda":
        encoder.to_file(dest=output_path, codec="h264_nvenc", extra_options={"qp": 1})
```
Are we currently using qp=1 for the torchcodec encoder vs qp=0 for the ffmpeg cli? (line 155)
Yes, and we should not be, thanks for catching this!
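A minimal sketch of the corrected helper, assuming the `torchcodec.encoders` import path and mirroring the diff above; the only change is passing qp=0 so it matches the FFmpeg CLI setting on line 155:

```python
from torchcodec.encoders import VideoEncoder  # assumed import path

def encode_torchcodec(frames, output_path, device="cpu"):
    encoder = VideoEncoder(frames=frames, frame_rate=30)
    if device == "cuda":
        # qp=0 (not qp=1) so both encoders target the same lossless setting
        encoder.to_file(dest=output_path, codec="h264_nvenc", extra_options={"qp": 0})
```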
Great work on the benchmarks @Dan-Flores! I liked the detailed analysis of the results. I left two clarifying questions.
NicolasHug left a comment:
Thanks @Dan-Flores, this looks good!
```python
self.metrics = {
    "utilization": [s["utilization"] for s in samples],
    "memory_used": [s["memory_used"] for s in samples],
}
```
On NVENCMonitor above, I think we might want to use pynvml instead, as done e.g. in P1984513849.
The main reason is that NVENCMonitor samples the utilization value every 50ms, which isn't exactly in sync with the number of iterations in the loop. That is, the returned nvenc_tensor doesn't contain the same number of values as the times tensor, so their reported values aren't averaged over the same number of experiments either.
Thanks for the example, I'll update to use pynvml.
I see how arbitrarily selecting 50ms will not produce the same number of values as the times tensor, but I don't completely understand how pynvml.nvmlDeviceGetDecoderUtilization manages it. It seems like it is always sampling the device for usage, and when called it returns a single median/max/average over an automatically determined sampling period?
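For reference, a minimal sketch of the pynvml query (shown here for the encoder counterpart): NVML returns the utilization together with the sampling period the driver chose, so the caller never picks an interval itself.

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# NVML returns a (utilization_percent, sampling_period_us) pair: the driver
# averages utilization over a window it selects internally, rather than the
# caller polling at a fixed interval like the 50ms in NVENCMonitor.
utilization, sampling_period_us = pynvml.nvmlDeviceGetEncoderUtilization(handle)
print(f"NVENC utilization: {utilization}% (sampled over {sampling_period_us}us)")

pynvml.nvmlShutdown()
```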
```python
min = unit_times.min().item()
max = unit_times.max().item()
print(
    f"\n{prefix} {med = :.2f}, {mean = :.2f} +- {std:.2f}, {min = :.2f}, {max = :.2f} - in {unit}, fps = {fps:.1f}"
)
```
Nit, no strong opinion but I find that reporting the mean isn't super useful when the std is small enough. Same for min and max (one is enough). It makes the logs slightly easier to read.
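Something like this, as a sketch of the trimmed log line (prefix, unit, and fps are taken from the surrounding function in the snippet above):

```python
# Sketch: report median, std, and max only; drop mean and min for readability.
med = unit_times.median().item()
std = unit_times.std().item()
max_t = unit_times.max().item()
print(f"\n{prefix} {med = :.2f} +- {std:.2f}, {max_t = :.2f} - in {unit}, fps = {fps:.1f}")
```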
```python
util_nonzero = nvenc_metrics["utilization"][nvenc_metrics["utilization"] > 0]
util_median = util_nonzero.median().item() if len(util_nonzero) > 0 else 0.0
```
I think we'll want the complete view and not exclude the zero values. That the utilization is sometimes zero is actually relevant information!
The reason I skipped any zero value was that for shorter benchmarks with the previous NVENCMonitor approach, the NVENC was not active most of the time, so the median was often zero. It seems the pynvml library is smarter about sampling frequency, so this is not a problem anymore!
| print("CUDA not available. GPU benchmarks will be skipped.") | ||
|
|
||
| decoder = VideoDecoder(str(args.path)) | ||
| valid_max_frames = min(args.max_frames, len(decoder)) |
Do we need this? I think we should be able to remove the corresponding parameter in write_raw_frames and just use len(frames)?
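A sketch of what that simplification could look like (how write_raw_frames consumes the frames is an assumption based on this thread):

```python
decoder = VideoDecoder(str(args.path))
# Slicing already clamps to the available frame count, so no min() against
# len(decoder) is needed; write_raw_frames can use len(frames) internally.
frames = decoder[: args.max_frames]
write_raw_frames(frames)
```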
```python
else:
    print("Skipping VideoEncoder GPU benchmark (CUDA not available)")
```
Nit, something similar is already printed above. We only need one.
```python
else:
    print("Skipping FFmpeg CLI GPU benchmark (CUDA not available)")
```
Same here. We should only need a single `if cuda_available:` block and benchmark both torchcodec and the ffmpeg cli in there.
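i.e., a sketch of the consolidated structure (the two helper names are placeholders for the existing benchmark code paths):

```python
if cuda_available:
    # One CUDA gate around both GPU benchmarks...
    run_videoencoder_gpu_benchmark(frames)  # placeholder for the VideoEncoder GPU path
    run_ffmpeg_cli_gpu_benchmark(frames)    # placeholder for the FFmpeg CLI GPU path
else:
    # ...so the "skipped" message is printed exactly once.
    print("CUDA not available. GPU benchmarks will be skipped.")
```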
This PR creates a benchmark to compare VideoEncoder against FFmpeg CLI. These tools aren't one-to-one, so some assumptions are made:
For VideoEncoder, we use this simple workflow:
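A minimal sketch of that workflow, mirroring the encode_torchcodec helper in the diff above (the import paths are assumptions):

```python
from torchcodec.decoders import VideoDecoder  # assumed import paths
from torchcodec.encoders import VideoEncoder

# Decode a batch of frames to a tensor, then encode straight to a file;
# on CUDA the NVENC codec is selected explicitly.
frames = VideoDecoder("input.mp4")[:30]
encoder = VideoEncoder(frames=frames, frame_rate=30)
encoder.to_file(dest="output.mp4", codec="h264_nvenc")
```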
For FFmpeg CLI, if the `--write_frames` flag is used, we also count the time taken to write frames from a tensor to a file (see the sketch below).
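A hedged sketch of the FFmpeg CLI side under those assumptions; the raw-frame format, size, and flags here are illustrative, not the exact command in the benchmark:

```python
import subprocess

# Frames are first written from the tensor to frames.raw (timed only when
# --write_frames is passed), then FFmpeg encodes that file with NVENC at qp=0.
subprocess.run(
    [
        "ffmpeg", "-y",
        "-f", "rawvideo", "-pix_fmt", "rgb24", "-s", "1280x720", "-r", "30",
        "-i", "frames.raw",
        "-c:v", "h264_nvenc", "-qp", "0",
        "output.mp4",
    ],
    check=True,
)
```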
Result Summary:
- VideoEncoder shows better performance on GPU + CPU.
- VideoEncoder shows a significant speed improvement, up to 3.5x faster than FFmpeg CLI for encoding 30 frames, without adding the time required to write frames to bytes.
- … for VideoEncoder, while median GPU memory used values are the same.
- … VideoEncoder is significantly faster.

Details
All benchmarks are run using a 1280x720 video. Command to generate the video:

`ffmpeg -f lavfi -i testsrc2=duration=600:size=1280x720:rate=30 -c:v libx264 -pix_fmt yuv420p test/resources/testsrc2_10min.mp4`

Benchmarking nasa_13013.mp4, writing frames in FFmpeg:

`$ python benchmarks/encoders/benchmark_encoders.py`

Benchmarking nasa_13013.mp4, with `--skip-write-frames`:

`$ python benchmarks/encoders/benchmark_encoders.py --skip-write-frames`