
Analysis & Discussion: Jpeg & Resize processing pipelines, improvement opportunities #1064

@antonfirsov


Introduction

Apart from the API simplification, the main intent of #907 was to enable new optimizations: it's possible to eliminate a bunch of unnecessary processing steps from the most common YCbCr Jpeg thumbnail making use-case. As it turned out in #1062, simply changing the pixel type to Rgb24 is not sufficient; we need to implement the processing pipeline optimizations enabled by the .NET Core 3.0 Hardware Intrinsic API, especially by the shuffle and permutation intrinsics, which allow fast conversion between different pixel type representations and component orders (eg. Rgba32 <--> Rgb24), as well as fast conversion between Planar/SOA and Packed/AOS pixel representations. The latter is important because raw Jpeg data consists of 3 planes representing the YCbCr data, while an ImageSharp Image is always packed.
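To make the Planar/SOA vs. Packed/AOS distinction concrete, here is a minimal scalar sketch (plain Python, not ImageSharp code; the function name is illustrative). The shuffle/permute intrinsics essentially perform this interleaving within registers instead of element by element:

```python
def interleave_planes(y, cb, cr):
    """Convert 3 planar component buffers (SOA, how raw Jpeg YCbCr data
    arrives) into one packed, interleaved buffer (AOS, how an ImageSharp
    Image stores its pixels)."""
    packed = []
    for a, b, c in zip(y, cb, cr):
        packed += [a, b, c]
    return packed

# A 4-sample example: three planes become one Y,Cb,Cr,Y,Cb,Cr,... buffer.
y  = [10, 20, 30, 40]
cb = [1, 2, 3, 4]
cr = [5, 6, 7, 8]
print(interleave_planes(y, cb, cr))
# [10, 1, 5, 20, 2, 6, 30, 3, 7, 40, 4, 8]
```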

This analysis:

  1. Kicks off by explaining the causes of the Rgb24 slowdown reported in #1062
  2. Defines Processing Pipelines as chains of Data States and Transformations
  3. Presents a deep overview of the current floating point Jpeg and Resize pipelines, showing incremental improvement opportunities. Note: the Resize pipeline is still TODO, and it will remain so for a couple of days/weeks. This should not prevent you from getting the big picture though.
  4. Roughly explains the challenges of adding integer SIMD operations to the Jpeg pipeline

Please let me know if some pieces are still hard to follow. It's worth checking out all URL-s while reading.

TLDR
If you want to hear some good news before reading through the whole thing, jump to the Conclusion part 😄

Why is Rgb24 post processing slow in our current code?

YCbCr -> TPixel conversions, the generic case

JpegImagePostprocessor processes the YCbCr data in two steps:

  1. Color convert AND pack the Y + Cb + Cr image planes to Vector4 RGBA buffers. The two operations are carried out together by the matching JpegColorConverter. With the YCbCr colorspace, which has only 3 components, this is already sub-optimal, since the 4th alpha component (Vector4.W) is redundant. Vector4 packing is done with non-vectorized code.
  2. Convert the Vector4 buffer to pixel buffer, using the pixel specific implementation.
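For reference, here is a scalar sketch of what step 1 produces for a single pixel, using the standard JFIF/BT.601 full-range conversion constants (this is an illustration of the math, not the actual vectorized JpegColorConverter code):

```python
def ycbcr_to_rgba(y, cb, cr):
    """Color convert one YCbCr sample (0-255, JFIF full range) and pack it
    as a normalized (R, G, B, A) tuple -- the Vector4-per-pixel layout.
    The redundant 4th component (Vector4.W / alpha) is fixed at 1.0."""
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    clamp = lambda v: min(max(v, 0.0), 255.0)
    return (clamp(r) / 255, clamp(g) / 255, clamp(b) / 255, 1.0)

print(ycbcr_to_rgba(128, 128, 128))  # mid gray -> (~0.5, ~0.5, ~0.5, 1.0)
```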

Rgba32 vs Rgb24

The difference is that PixelOperations<Rgba32>.FromVector4() does not need to do any component shuffling, only converting the float values to byte-s, while in PixelOperations<Rgb24>.FromVector4() we first convert the float buffers to Rgba32 buffers (fast), which is followed by an Rgba32 -> Rgb24 conversion using the sub-optimal default conversion implementation. This operation:

  • Could be significantly optimized by utilizing byte shuffling SIMD intrinsics.
  • Is in fact unnecessary. By extending JpegColorConverter with a method to pack data into Vector3 buffers, we could convert Vector3 data into Rgb24 data exactly the same way we do the Vector4 -> Rgba32 conversion.
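The two paths can be contrasted with a scalar sketch (hypothetical helper names, plain Python): the current route produces the same bytes as the proposed Vector3 route, but needs an extra pass that drops every 4th (alpha) byte:

```python
def via_rgba32(rgba):
    """Current path: normalized float RGBA -> Rgba32 bytes (fast homogeneous
    pass), then an extra Rgba32 -> Rgb24 step dropping every 4th byte."""
    rgba32 = [round(v * 255) for v in rgba]
    return [b for i, b in enumerate(rgba32) if i % 4 != 3]

def via_vector3(rgb):
    """Proposed path: pack Vector3 data, convert straight to Rgb24 bytes."""
    return [round(v * 255) for v in rgb]

# Two pixels, once as RGBA floats and once as RGB floats:
rgba = [1.0, 0.5, 0.0, 1.0,  0.0, 0.5, 1.0, 1.0]
rgb  = [1.0, 0.5, 0.0,       0.0, 0.5, 1.0]
assert via_rgba32(rgba) == via_vector3(rgb)  # same result, one less pass
```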

Definition of Processing Pipelines

Personally, my memory is terrible and I always need to reverse engineer my own code when I want to understand what's happening and make decisions. The lack of comments and confusing terminology is also misleading. To get a good overview, it's really important to step back and abstract away implementation details by thinking about our algorithms as PIPELINES composed of Data States and Transformations, where

  • [D] Data States (nodes) are representations of pixel data buffers in a specific form
  • (T) Transformations (edges) are specific SIMD or scalar implementations of algorithms

This representation is only good for analyzing data flow for a specific configuration, eg. a well defined input image format + decoder configuration + output pixel type. To visualize the junctions, we need DAG-s 🤓.
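As a toy illustration of the notation used in the sections below, a linear pipeline is just an alternating sequence of `[D]` nodes and `(T)` edges (the names here are illustrative, not code):

```python
# A pipeline written out as alternating Data States [D] and Transformations (T),
# mirroring the notation used in the sections below.
pipeline = [
    ("D", "3 x Buffer2D<Block8x8> quantized Int16 spectral planes"),
    ("T", "widen Int16 -> Int32, convert to float"),
    ("D", "3 x Buffer2D<Block8x8F> quantized float spectral planes"),
    ("T", "dequantize (SSE2 multiply)"),
    ("D", "3 x Buffer2D<Block8x8F> dequantized float spectral planes"),
]

# A well-formed linear pipeline starts and ends with data, strictly alternating:
assert all(kind == ("D" if i % 2 == 0 else "T")
           for i, (kind, _) in enumerate(pipeline))
```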

Current floating point YCbCr Jpeg Color Processing & Resize pipelines, improvement opportunities

Presumptions:

  • The executing runtime is >= netcoreapp2.1 (enables Vector.Widen)
  • The executing CPU supports the AVX2 instruction set, implying that Vector<T>-s are in fact AVX2 registers and Vector<T> intrinsics are JIT-ed to AVX2 instructions
  • Vector4 operations are JIT-ed to SSE2 instructions

(I.) Converting raw jpeg spectral data to YCbCr planes

Converting raw jpeg spectral data to YCbCr planes, done by CopyBlocksToColorBuffer
[D] 3 planes of quantized spectral Int16 jpeg components (3 x Buffer2D<Block8x8>, Y+Cb+Cr)
(T) AVX2 Int16 -> Int32 widening and Int32 -> float conversion, both using Vector<T>, implemented in Block8x8F.LoadFrom(Block8x8)
[D] 3 planes of quantized spectral float jpeg components (3 x Buffer2D<Block8x8F>, Y+Cb+Cr)
(T) Dequantization by SSE2 multiplication: Block8x8F.MultiplyInplace(DequantizationTable)
[D] 3 planes of DEquantized spectral float jpeg components (3 x Buffer2D<Block8x8F>, Y+Cb+Cr)
(T) SSE2 floating point IDCT
[D] 3 planes of float jpeg color channels (3 x Buffer2D<Block8x8F>, Y+Cb+Cr)
(T) AVX2 normalization and rounding using Vector<T>. Rounding is needed for better libjpeg compatibility
[D] 3 planes of SUBSAMPLED float jpeg color channels normalized to 0-255 (3 x Buffer2D<Block8x8F>, Y+Cb+Cr)
(T) Chroma supersampling. No SIMD, fully scalar code, full of ugly optimizations to make it at least cache friendly. Done by Block8x8.CopyTo() (super misleading name!)
[D] 3 planes of W*H sized float jpeg color channels normalized to 0-255 (3 x Buffer2D<float>, Y+Cb+Cr)
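Two of the (T) steps above are numerically trivial and can be sketched in scalar form (plain Python, operating on flat lists instead of 8x8 blocks; this shows the math, not the SIMD implementation):

```python
def dequantize(block, qtable):
    """Per-coefficient multiply by the quantization table, i.e. what
    Block8x8F.MultiplyInplace does with one SSE2 multiply per batch."""
    return [c * q for c, q in zip(block, qtable)]

def normalize(samples):
    """IDCT output is centered on 0; level-shift by +128, round, and clamp
    into the 0-255 range -- the normalization and rounding pass."""
    return [min(max(round(s + 128.0), 0), 255) for s in samples]

block  = [2, -1, 0, 3]        # a few quantized coefficients
qtable = [16, 11, 10, 16]
print(dequantize(block, qtable))          # [32, -11, 0, 48]
print(normalize([-200.0, 0.0, 200.0]))    # [0, 128, 255]
```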

(II. a) Converting the Y+Cb+Cr planes to an Rgba32 buffer

Y+Cb+Cr planes -> Rgba32 buffer, done by ConvertColorsInto
[D] 3 Planes of W*H sized float jpeg color channels normalized to 0-255 (3 x Buffer2D<float>, Y+Cb+Cr)
(T) Color convert and pack into a single Vector4 buffer
[D] Floating point RGBA data as Memory<Vector4>
(T) Convert the Vector4 buffer to an Rgba32 buffer. In the Rgba32 case, the input buffer can be handled as a homogeneous float buffer, where all individual float values should be converted to byte-s. The conversion is implemented in BulkConvertNormalizedFloatToByteClampOverflows, utilizing AVX2 conversion and narrowing operations through Vector<T>
[D] The result image as an Rgba32 buffer
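A scalar sketch of what that final conversion does per value (the actual method processes whole AVX2 vectors; this just shows the scale-round-clamp semantics):

```python
def normalized_float_to_byte(values):
    """Scalar equivalent of the normalized-float-to-byte bulk conversion:
    scale 0..1 floats to 0..255, round, and clamp overflowing values."""
    return bytes(min(max(round(v * 255.0), 0), 255) for v in values)

# One Rgba32 pixel -- the Vector4 buffer is just a homogeneous float sequence:
print(list(normalized_float_to_byte([0.0, 0.5, 1.0, 1.5])))  # [0, 128, 255, 255]
```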

(II. b) Converting the Y+Cb+Cr planes to an Rgb24 buffer, current sub-optimal pipeline

Y+Cb+Cr planes -> Rgb24 buffer, done by ConvertColorsInto
[D] 3 Planes of W*H sized float jpeg color channels normalized to 0-255 (3 x Buffer2D<float>, Y+Cb+Cr)
(T) Color convert and pack into a single Vector4 buffer
[D] Floating point RGBA data as Memory<Vector4>
(T) Convert the Vector4 buffer to an Rgba32 buffer via BulkConvertNormalizedFloatToByteClampOverflows, utilizing AVX2 conversion and narrowing operations through Vector<T>
[D] Temporary Rgba32 buffer
(T) PixelOperations<Rgb24>.FromRgba32() (sub-optimal, extra transformation!)
[D] The result image as an Rgb24 buffer

(II. b++) Converting the Y+Cb+Cr planes to an Rgb24 buffer, IMPROVEMENT PROPOSAL

See #1121

(III. a) Resize Image<Rgba32>, current pipeline

TODO

(III. b) Resize Image<Rgb24>, current pipeline

TODO.
Without any change, the current code should run faster than for Image<Rgba32>.

(III. b++) Resize Image<Rgb24>, IMPROVEMENT PROPOSAL

TODO

Integer-based SIMD pipelines

Although the Hardware Intrinsic API removes all theoretical barriers to achieving a 1:1 match with other high performance imaging libraries, for both the Jpeg Decoder and Resize, by utilizing AVX2 and SSE2 integer algorithms, there is a big practical challenge: it's very hard to introduce these improvements in an iterative manner.

It's not possible to exchange the elements of the Jpeg pipeline at arbitrary points, because that would lead to the insertion of extra float <-> Int16/32 conversions. To overcome this, we should start introducing integer transformations and data states at the beginning and/or at the end of the pipeline. This can be done by replacing the transformations and data states in subsequent PR-s, moving the Int16 -> float conversion towards the bottom (when starting from the beginning), and the float -> byte conversion towards the top (when starting from the end). Eg:

  • At the beginning of the pipeline, first replace dequantization, then IDCT, then normalization etc.
  • At the end of the pipeline, we should implement a full integer YCbCr24 -> Rgb24 SIMD conversion first
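For the last bullet, a fully integer YCbCr -> Rgb24 conversion would typically use fixed-point constants, in the spirit of libjpeg's integer color conversion. A hedged scalar sketch (plain Python; the SIMD version performs the same multiply-add-shift on whole vectors of samples):

```python
SCALE = 16  # conversion constants pre-scaled by 2**16
FIX_1_402   = round(1.402    * (1 << SCALE))
FIX_0_34414 = round(0.344136 * (1 << SCALE))
FIX_0_71414 = round(0.714136 * (1 << SCALE))
FIX_1_772   = round(1.772    * (1 << SCALE))
HALF = 1 << (SCALE - 1)  # rounding bias for the right shift

def ycbcr_to_rgb24_int(y, cb, cr):
    """Integer-only YCbCr (0-255) -> Rgb24 conversion: multiply by the
    scaled constants, add the rounding bias, shift back down, clamp."""
    cb -= 128
    cr -= 128
    r = y + ((FIX_1_402 * cr + HALF) >> SCALE)
    g = y - ((FIX_0_34414 * cb + FIX_0_71414 * cr + HALF) >> SCALE)
    b = y + ((FIX_1_772 * cb + HALF) >> SCALE)
    clamp = lambda v: min(max(v, 0), 255)
    return (clamp(r), clamp(g), clamp(b))

print(ycbcr_to_rgb24_int(128, 128, 128))  # (128, 128, 128)
```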

Conclusion

If we aim for low-hanging fruit, I would start by implementing (II. b++) and (III. b++). After that, we can continue by introducing integer SIMD operations, starting at the beginning or at the end of the Jpeg pipeline.

I would also suggest keeping the current floating point pipeline in the codebase as-is, to avoid perf regressions for pre-3.0 users. I believe those platforms will still be relevant for many customers for a couple more years.
