Rotation metadata support #1173
Conversation
```cpp
// Extracts the rotation angle in degrees from the stream's display matrix
// side data. The display matrix is used to specify how the video should be
// rotated for correct display.
std::optional<double> getRotationFromStream(const AVStream* avStream);
```
Not sure there is a huge difference between using `optional` rather than 0 to represent no rotation.
Would it be useful to distinguish a video with no rotation metadata from a video that is known not to be rotated (i.e., a rotation of 0 degrees)?
I'm not sure this distinction is important to users or FFmpeg, but `optional` seems like a reasonable type to use here. Correct me if I'm wrong, but the only difference is that we need to check that rotation is not None?
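For illustration, a minimal sketch of what `optional` buys at a call site (hypothetical caller, not code from this PR):

```cpp
#include <optional>

// Hypothetical caller: std::optional distinguishes "no display matrix side
// data at all" from "metadata explicitly encodes a 0-degree rotation".
void handleRotation(std::optional<double> rotation) {
  if (!rotation.has_value()) {
    // No rotation metadata in the stream (most videos).
  } else if (*rotation == 0.0) {
    // Metadata present, explicitly 0 degrees.
  } else {
    // Non-trivial rotation to apply before display.
  }
}
```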
```cpp
#elif LIBAVFORMAT_VERSION_MAJOR >= 60 // FFmpeg 6.0
  // FFmpeg 6.0: Use av_stream_get_side_data (deprecated but still available)
  // Suppress deprecation warning for this specific call
  #pragma GCC diagnostic push
```
I'm not sure if there is a better way to handle FFmpeg 6 deprecating `av_stream_get_side_data` while its replacement, `codecpar->coded_side_data`, is not available until FFmpeg 6.1.
This is awful 😄. But I think you're doing the right thing!
I'd just add a comment at the top of the function, with a high-level explanation of what's going on. Something like this (please check me on the versions and the API names):
`av_stream_get_side_data` was deprecated in FFmpeg 6.0, but its replacement (`av_packet_side_data_get` + `codecpar->coded_side_data`) is only available from version 6.1. We need some `#pragma` magic to silence the deprecation warning, which our compile chain would otherwise treat as an error.
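For concreteness, a sketch of the version dispatch under discussion; the version cutoffs and branch bodies are assumptions to be checked against the FFmpeg headers:

```cpp
// Assumed mapping: LIBAVFORMAT_VERSION_MAJOR 61+ ~ FFmpeg 7, 60 ~ FFmpeg 6.
#if LIBAVFORMAT_VERSION_MAJOR >= 61
  // FFmpeg 7+: the non-deprecated replacement API.
  const AVPacketSideData* sideData = av_packet_side_data_get(
      avStream->codecpar->coded_side_data,
      avStream->codecpar->nb_coded_side_data,
      AV_PKT_DATA_DISPLAYMATRIX);
#elif LIBAVFORMAT_VERSION_MAJOR >= 60
  // FFmpeg 6.0: av_stream_get_side_data is deprecated, but its replacement
  // only arrived in 6.1, so silence the warning for this one call.
  #pragma GCC diagnostic push
  #pragma GCC diagnostic ignored "-Wdeprecated-declarations"
  uint8_t* data =
      av_stream_get_side_data(avStream, AV_PKT_DATA_DISPLAYMATRIX, nullptr);
  #pragma GCC diagnostic pop
#else
  // FFmpeg 4/5: same call; the size parameter type differs across these
  // versions (int* vs size_t*), so pass nullptr since we don't need it.
  uint8_t* data =
      av_stream_get_side_data(avStream, AV_PKT_DATA_DISPLAYMATRIX, nullptr);
#endif
```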
(CC @scotts, I'm sure you will enjoy this).
test/test_decoders.py (Outdated)

```python
    dimension_order=dimension_order,
)

frame = decoder[0]
```
Let's modify this test to check accuracy for multiple frames.
Also, the nasa_13013 video contains multiple video streams, so we could parametrize over `stream_index` as well.
Agreed on Daniel's suggestions. Kinda related: do we actually need to check against FFmpeg? I'm hoping we can get away with just comparing the outputs of NASA_VIDEO and NASA_VIDEO_ROTATED (where we manually apply the 90° rotation to one of them, in order to compare them)?
To be clear, I do agree that comparing against FFmpeg is the "golden standard" test... but I think we can get away with the simpler approach described above.
On parametrizing over streams: if we're going to check in JSON files for both stream 0 and stream 3, I agree, we might as well parametrize over both. But do we need to? Would a single stream be enough here?
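A rough sketch of the simpler comparison (decoder construction and resource paths are assumptions, not the PR's actual test):

```python
import torch

from torchcodec.decoders import VideoDecoder


def test_rotated_matches_unrotated():
    # Hypothetical paths standing in for NASA_VIDEO and NASA_VIDEO_ROTATED.
    plain = VideoDecoder("nasa_13013.mp4")
    rotated = VideoDecoder("nasa_13013_rotated.mp4")
    for i in range(5):
        # Manually apply the 90-degree CCW rotation to the unrotated frames
        # and compare; dims=(-2, -1) selects (H, W) on CHW frames.
        expected = torch.rot90(plain[i], k=1, dims=(-2, -1))
        torch.testing.assert_close(rotated[i], expected)
```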
test/test_decoders.py (Outdated)

```python
# Frame is (H, W, C)
frame_h, frame_w = frame.shape[0], frame.shape[1]

# Pixel-exact comparison with ffmpeg CLI (autorotate is on by default)
```
Does autorotate refer to an ffmpeg CLI option? It's not used in the command below, so this comment is a bit confusing.
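For context: autorotate is ffmpeg's default behavior of applying display-matrix rotation on decode, not a flag that needs to be passed; the opt-out flag is `-noautorotate`. A sketch of the two behaviors (file names are placeholders):

```python
import subprocess

# Default behavior: ffmpeg honors the display matrix ("autorotate"), so the
# extracted frame comes out rotated without any extra flag.
subprocess.run(
    ["ffmpeg", "-y", "-i", "rotated.mp4", "-frames:v", "1", "rotated.png"],
    check=True,
)

# With -noautorotate, ffmpeg ignores the display matrix and emits the frame
# in its stored (unrotated) orientation.
subprocess.run(
    ["ffmpeg", "-y", "-noautorotate", "-i", "rotated.mp4",
     "-frames:v", "1", "unrotated.png"],
    check=True,
)
```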
```python
if self._dimension_order == "NCHW":
    dims = (-2, -1)
else:
    dims = (-3, -2)
```
Is there a reason we use negative indices here? It works, but I had to think about it for a second.
I was using negative indices to account for batched vs. unbatched calls, but it might not be applicable after moving the logic to C++.
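A standalone illustration of the batched vs. unbatched point (not the PR's code):

```python
import torch

# With NCHW ordering, dims=(-2, -1) always targets (H, W), whether or not a
# leading batch dimension exists.
chw = torch.zeros(3, 270, 480)      # unbatched: (C, H, W)
nchw = torch.zeros(5, 3, 270, 480)  # batched:   (N, C, H, W)
print(torch.rot90(chw, k=1, dims=(-2, -1)).shape)   # (3, 480, 270)
print(torch.rot90(nchw, k=1, dims=(-2, -1)).shape)  # (5, 3, 480, 270)

# With NHWC ordering, (H, W) live at (-3, -2) instead.
nhwc = torch.zeros(5, 270, 480, 3)  # (N, H, W, C)
print(torch.rot90(nhwc, k=1, dims=(-3, -2)).shape)  # (5, 480, 270, 3)
```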
```python
# Precompute rotation parameters for torch.rot90
# k is the number of 90-degree counter-clockwise rotations
rotation = self.metadata.rotation
if rotation is not None and rotation != 0:
```
Checking `rotation != 0` shouldn't matter here; we will set `self._rotation_k` to 0 either way.
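A sketch of the `k` computation the diff is precomputing (the rounding to a multiple of 90 is an assumption based on the review discussion below):

```python
def rotation_to_k(rotation):
    # Number of 90-degree counter-clockwise rotations for torch.rot90.
    # None (no metadata) and an explicit 0 both map to k=0, which is why
    # the extra `rotation != 0` check is redundant.
    if rotation is None:
        return 0
    return round(rotation / 90) % 4


assert rotation_to_k(None) == 0
assert rotation_to_k(0) == 0
assert rotation_to_k(90) == 1
assert rotation_to_k(-90) == 3  # 90 degrees clockwise == three CCW turns
```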
```cpp
// Extracted from the display matrix side data. This indicates how the video
// should be rotated for correct display. Common for videos recorded on mobile
// devices.
std::optional<double> rotation;
```
The information from this comment is also in _metadata.py; it's probably not necessary here.
```cpp
    AV_PKT_DATA_DISPLAYMATRIX);
if (sideData != nullptr && sideData->size >= 9 * sizeof(int32_t)) {
  displayMatrix = reinterpret_cast<const int32_t*>(sideData->data);
}
```
Why do we verify the size of sideData here to be >= 9 * sizeof(int32_t)? I found one example in FFmpeg where there is only a nullptr check, but it's possible I am missing something.
Thanks for the suggestion! I think I was being overly cautious in checking that the display matrix has 9 elements.
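For reference, a sketch of the extraction path around this check; `av_display_rotation_get` is the libavutil helper, the rest is an approximation of the PR's logic:

```cpp
#include <cmath>
#include <optional>

extern "C" {
#include <libavutil/display.h>
}

// Turn a 3x3 display matrix (9 int32s, as stored in the side data) into a
// rotation angle. The >= 9 * sizeof(int32_t) size guard discussed above is
// belt-and-braces: FFmpeg writes a full matrix for this side-data type, so
// a nullptr check alone is likely sufficient.
std::optional<double> angleFromDisplayMatrix(const int32_t* displayMatrix) {
  if (displayMatrix == nullptr) {
    return std::nullopt;
  }
  // Returns degrees of counter-clockwise rotation, or NaN for a degenerate
  // matrix.
  double angle = av_display_rotation_get(displayMatrix);
  if (std::isnan(angle)) {
    return std::nullopt;
  }
  return angle;
}
```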
Since we test that rotation is present in the metadata, it makes sense to check in a video.
test_rotation_applied_to_frames only needs a few frames; perhaps we can reduce the size of the video file and the associated all_frames_info JSONs?
```diff
 frame_data, *_ = core.get_frame_at_index(self._decoder, frame_index=key)
-return frame_data
+return self._apply_rotation(frame_data)
```
A side effect of applying the rotation at the Python level is that we have to update each frame retrieval API.
I wonder if this should be handled in C++, similar to maybePermuteHWC2CHW. We would still need to add it to every SingleStreamDecoder::getFrame... method, but it keeps the complexity in the C++ implementation.
Let me know your thoughts @mollyxu @NicolasHug
Thanks for the suggestion! I think it would make the public API cleaner to do something similar to maybePermuteHWC2CHW.
I was also considering applying the FFmpeg transpose filter as an option, to consolidate the calls in one place. The con of that would be that the pixels get moved, instead of it just being a view operation.
That's a good call-out, I agree that we should try to implement this logic in C++. The main reason is that, while we don't want to have a public C++ decoding API, we should still aim to have a fully-featured C++ decoder. Ideally, we want the python layer to be as slim as possible (with no logic). We can't always do that, but that's what we should aim for.
It makes a lot of sense to me to treat this rotation logic in the same way we treat maybePermuteHWC2CHW.
On the implementation, fully agreed that we should prefer using torch::rot90 (which should still exist in C++) so that we can return a view instead of incurring a copy through filtergraph.
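A sketch of what that could look like, modeled on maybePermuteHWC2CHW (the function name and call site are hypothetical):

```cpp
#include <torch/torch.h>

// Hypothetical helper, applied at the same point in the pipeline as
// maybePermuteHWC2CHW. rotationK is the precomputed number of 90-degree
// counter-clockwise rotations derived from the stream metadata.
torch::Tensor maybeRotate(const torch::Tensor& frame, int64_t rotationK) {
  if (rotationK == 0) {
    return frame;
  }
  // torch::rot90 stays inside the tensor library rather than round-tripping
  // through an FFmpeg transpose filter; dims {-2, -1} selects (H, W) for
  // both batched and unbatched NCHW tensors.
  return torch::rot90(frame, /*k=*/rotationK, /*dims=*/{-2, -1});
}
```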
test/test_decoders.py (Outdated)

```python
    Verifies pixel-exact match with ffmpeg CLI.
    """
    import subprocess
```
If we keep the test against FFmpeg, this should be a top-level import.
src/torchcodec/_core/_metadata.py (Outdated)

```python
    display. Common values are 0, 90, 180, or -90, -180 degrees. Videos recorded on
    mobile devices often have rotation metadata. TorchCodec automatically
    applies this rotation during decoding, so the returned frames are
    in the correct orientation (float or None)."""
```
Great comment! We should indicate that we're rounding to the closest multiple of 90.
Also: the height and width that we report, are they pre-rotation or post-rotation? I think we'll want them to be post-rotation, since that is what we'll return from our decoding methods? Whatever we choose, we should make sure to document it!
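To make the post-rotation convention concrete, a small sketch (the helper name is hypothetical):

```python
def reported_dims(stored_height, stored_width, rotation):
    # Hypothetical helper: if metadata reports post-rotation dimensions,
    # an odd number of 90-degree turns swaps height and width.
    if rotation is None:
        return stored_height, stored_width
    k = round(rotation / 90) % 4  # round to the nearest multiple of 90
    if k % 2 == 1:
        return stored_width, stored_height
    return stored_height, stored_width


# A stream stored as 1080x1920 (HxW) with 90-degree rotation metadata
# decodes to 1920x1080 frames:
assert reported_dims(1080, 1920, 90) == (1920, 1080)
```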
Add rotation metadata support
Context: #1084
Videos recorded on mobile devices often store rotation metadata indicating how the video should be displayed. TorchCodec now extracts this metadata and applies the rotation during decoding so frames are returned in the correct orientation.
This PR adds support for extracting and applying rotation metadata from video streams. The rotation angle (in degrees, counter-clockwise) is extracted from the display matrix side data and exposed as `VideoStreamMetadata.rotation`.

The `rotation` property is:
- `None`: the video has no rotation metadata (most videos)
- `float`: the rotation angle in degrees (counter-clockwise) needed for correct display. Common values are 0, 90, 180, -90, or -180.

Implementation

Metadata extraction (`FFMPEGCommon.cpp`):
- Added `getRotationFromStream()`, which extracts the rotation from `AV_PKT_DATA_DISPLAYMATRIX` side data
- Uses `av_packet_side_data_get()` with `codecpar->coded_side_data` where available
- Falls back to `av_stream_get_side_data()` with deprecation warning suppression for FFmpeg 6
- Uses `av_stream_get_side_data()` with an `int*` size parameter on older FFmpeg versions
- Uses `av_display_rotation_get()` to compute the angle from the display matrix

Metadata propagation:
- The `StreamMetadata` struct now includes a `std::optional<double> rotation` field (`Metadata.h`)
- `SingleStreamDecoder` populates this during stream scanning
- The `VideoStreamMetadata` dataclass exposes `rotation: float | None` (`_metadata.py`)
- The field is surfaced through the core ops (`custom_ops.cpp`)

Test coverage
- Added a `nasa_13013_rotated.mp4` test resource with 90° rotation metadata
- `test_rotation_metadata` (in `test_metadata.py`): verifies rotation is correctly extracted for rotated videos and is `None` for non-rotated videos
- `test_rotation_applied_to_frames` (in `test_decoders.py`): performs a pixel-exact comparison with ffmpeg CLI output