
Speed improvements to resize convolution (no vpermps w/ FMA) #2793

Merged
JimBobSquarePants merged 18 commits into main from js/resize-map-optimizations on Feb 4, 2026

Conversation

@JimBobSquarePants (Member) commented on Aug 15, 2024:

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

Fixes #1515

This is a replacement for #1518 by @Sergio0694 with most of the work based upon his implementation. I've modernized some of the code and fixed the precision issues.

Follow-up to #1513. This PR does a few things:

  • Switch the resize kernel processing to float
  • Add an AVX2 vectorized method to normalize the kernel
  • Vectorize the kernel copy when not using FMA, using Span<T>.CopyTo instead
  • Remove the permute8x32 when using FMA by creating a convolution kernel 4x the size (see the sketch below)
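
To make the last point concrete, here is a minimal sketch of the pre-duplication step (hypothetical names, not the actual ImageSharp code): each scalar weight is written out four times, once per RGBA channel, so the convolution loop can load weights with a plain unaligned load instead of broadcasting each one with vpermps.

```csharp
using System;

// Hypothetical sketch: expand a kernel of N scalar weights into 4 * N floats,
// repeating each weight once per RGBA channel. With this layout the weights
// line up with the pixel data, so the FMA loop needs no per-weight permute.
static float[] PreDuplicateKernel(ReadOnlySpan<float> weights)
{
    float[] expanded = new float[weights.Length * 4];
    for (int i = 0; i < weights.Length; i++)
    {
        float w = weights[i];
        int j = i * 4;
        expanded[j + 0] = w; // R
        expanded[j + 1] = w; // G
        expanded[j + 2] = w; // B
        expanded[j + 3] = w; // A
    }

    return expanded;
}
```

The trade-off is 4x the kernel storage in exchange for removing a shuffle from the hot loop.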

Resize convolution codegen diff

Before:

```asm
vmovsd xmm2, [rax]
vpermps ymm2, ymm1, ymm2
vfmadd231ps ymm0, ymm2, [r8]
```

After:

```asm
vmovupd ymm2, [r8]
vfmadd231ps ymm0, ymm2, [rax]
```
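
For readers following along, a rough sketch of the kind of inner loop that produces the "after" codegen (hypothetical names and shapes, not the actual ImageSharp implementation): because the weights are pre-duplicated, each iteration is a straight load plus a fused multiply-add.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical sketch of FMA accumulation over a pre-duplicated kernel.
// 'weights' holds the 4x-expanded kernel values and 'pixels' the RGBA
// source floats; both spans cover the same number of elements.
static Vector256<float> ConvolveFma(ReadOnlySpan<float> weights, ReadOnlySpan<float> pixels)
{
    Vector256<float> acc = Vector256<float>.Zero;
    for (int i = 0; i <= weights.Length - Vector256<float>.Count; i += Vector256<float>.Count)
    {
        // Plain unaligned loads: the pre-duplicated layout already matches
        // the RGBA pixel layout, so no vpermps broadcast is required.
        Vector256<float> w = Vector256.Create(weights.Slice(i, Vector256<float>.Count));
        Vector256<float> p = Vector256.Create(pixels.Slice(i, Vector256<float>.Count));
        acc = Fma.MultiplyAdd(w, p, acc); // vfmadd231ps
    }

    // Each 128-bit half of 'acc' holds a partial RGBA sum; the caller adds
    // the two halves to produce the final pixel.
    return acc;
}
```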

Benchmarks

Main

```
BenchmarkDotNet v0.15.8, Windows 11 (10.0.26200.7628/25H2/2025Update/HudsonValley2)
AMD RYZEN AI MAX+ 395 w/ Radeon 8060S 3.00GHz, 1 CPU, 32 logical and 16 physical cores
.NET SDK 10.0.102
  [Host] : .NET 8.0.23 (8.0.23, 8.0.2325.60607), X64 RyuJIT x86-64-v4

Runtime=.NET 8.0  Arguments=/p:DebugType=portable  Toolchain=InProcessEmitToolchain
```

| Method | Mean | Error | StdDev | Ratio | Allocated | Alloc Ratio |
|--------|-----:|------:|-------:|------:|----------:|------------:|
| SystemDrawing | 4.432 ms | 0.0089 ms | 0.0083 ms | 1.00 | 96 B | 1.00 |
| 'ImageSharp, MaxDegreeOfParallelism = 1' | 2.032 ms | 0.0044 ms | 0.0037 ms | 0.46 | 54664 B | 569.42 |

PR

```
BenchmarkDotNet v0.15.8, Windows 11 (10.0.26200.7628/25H2/2025Update/HudsonValley2)
AMD RYZEN AI MAX+ 395 w/ Radeon 8060S 3.00GHz, 1 CPU, 32 logical and 16 physical cores
.NET SDK 10.0.102
  [Host] : .NET 8.0.23 (8.0.23, 8.0.2325.60607), X64 RyuJIT x86-64-v4

Runtime=.NET 8.0  Arguments=/p:DebugType=portable  Toolchain=InProcessEmitToolchain
```

| Method | Mean | Error | StdDev | Ratio | Gen0 | Allocated | Alloc Ratio |
|--------|-----:|------:|-------:|------:|-----:|----------:|------------:|
| SystemDrawing | 4.429 ms | 0.0069 ms | 0.0061 ms | 1.00 | - | 96 B | 1.00 |
| 'ImageSharp, MaxDegreeOfParallelism = 1' | 1.892 ms | 0.0103 ms | 0.0086 ms | 0.43 | 1.9531 | 54617 B | 568.93 |

Performance in the Playground Benchmarks looks really, really good.

CC @antonfirsov @saucecontrol

```
BenchmarkDotNet v0.15.8, Windows 11 (10.0.26200.7628/25H2/2025Update/HudsonValley2)
AMD RYZEN AI MAX+ 395 w/ Radeon 8060S 3.00GHz, 1 CPU, 32 logical and 16 physical cores
.NET SDK 10.0.102
  [Host]   : .NET 10.0.2 (10.0.2, 10.0.225.61305), X64 RyuJIT x86-64-v4
  ShortRun : .NET 10.0.2 (10.0.2, 10.0.225.61305), X64 RyuJIT x86-64-v4

Job=ShortRun  IterationCount=5  LaunchCount=1
WarmupCount=5
```

| Method | Mean | Error | StdDev | Ratio |
|--------|-----:|------:|-------:|------:|
| 'MagicScaler Load, Resize, Save' | 31.95 ms | 0.377 ms | 0.098 ms | 0.17 |
| 'ImageSharp TD Load, Resize, Save' | 39.72 ms | 0.842 ms | 0.219 ms | 0.21 |
| 'NetVips Load, Resize, Save' | 58.67 ms | 1.063 ms | 0.276 ms | 0.32 |
| 'ImageSharp Load, Resize, Save' | 59.62 ms | 3.248 ms | 0.503 ms | 0.32 |
| 'SkiaSharp Load, Resize, Save' | 77.58 ms | 1.987 ms | 0.516 ms | 0.42 |
| 'ImageFree Load, Resize, Save' | 132.15 ms | 1.994 ms | 0.518 ms | 0.71 |
| 'System.Drawing Load, Resize, Save' | 185.61 ms | 4.171 ms | 1.083 ms | 1.00 |
| 'ImageFlow Load, Resize, Save' | 189.55 ms | 2.393 ms | 0.622 ms | 1.02 |
| 'ImageMagick Load, Resize, Save' | 200.84 ms | 2.369 ms | 0.615 ms | 1.08 |

@saucecontrol (Contributor) commented:

Brain's not fully awake yet today, but I'll give the maths a look soon.

@JimBobSquarePants (Member, Author) commented:

> Brain's not fully awake yet today, but I'll give the maths a look soon.

Thanks for the review so far. I still haven't figured out what is going on with PeriodicKernelMap. It all looks correct to me.

@saucecontrol (Contributor) commented:

It looks to me like the only differences are due to the change to single precision for kernel normalization and for calculation of the distances passed to the interpolation function. You'll definitely give up some accuracy there, and I'm not sure it's worth it since the kernels only have to be built once per resize. You can see here that @antonfirsov changed the precision to double from the initial implementation some years back.
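
To illustrate the precision point, here is a minimal sketch of normalization done at double precision (a hypothetical helper, not either implementation): the sum and division happen in double, and the result is rounded to float only once at the end.

```csharp
using System;

// Hypothetical sketch: sum and divide at double precision, rounding to float
// only once at the end. Accumulating in float instead lets rounding error
// creep into the sum, so nominally identical kernel windows can differ in
// their last bits — which matters if windows are compared for exact equality.
static void NormalizeToFloat(ReadOnlySpan<double> weights, Span<float> destination)
{
    double sum = 0;
    foreach (double w in weights)
    {
        sum += w;
    }

    for (int i = 0; i < weights.Length; i++)
    {
        destination[i] = (float)(weights[i] / sum);
    }
}
```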

Since the periodic kernel map relies on each repetition of the kernel weights being exact, I can see how precision loss might lead to some differences when compared with a separate calculation per interval. I've actually never looked at your implementation of the kernel map before, and now my curiosity is piqued, because I arrived at something similar myself. My implementation, though, calculates each kernel window separately and only replaces an interval with the periodic version if they match exactly. Part of that was down to a lack of confidence in the maths on my part, as I only discovered the periodicity of the kernel weights by observation and kind of intuited my way to a solution.
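
As a rough illustration of that exact-match strategy (entirely hypothetical code, not @saucecontrol's or ImageSharp's implementation): compute every kernel window from scratch, and treat an interval as periodic only when its window is bitwise identical to the window one period earlier.

```csharp
using System;

// Hypothetical sketch of the exact-match approach described above: an
// interval is only replaced with the periodic version when each window
// repeats the window one period back exactly.
static bool IsExactlyPeriodic(float[][] windows, int period)
{
    for (int i = period; i < windows.Length; i++)
    {
        // SequenceEqual compares the floats element by element, so any
        // precision drift between repetitions disables the reuse.
        if (!windows[i].AsSpan().SequenceEqual(windows[i - period]))
        {
            return false;
        }
    }

    return true;
}
```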

@antonfirsov would you mind filling in some gaps on the theory behind your periodic kernel map implementation? Did you use some paper or other implementation as inspiration, or did you arrive at it observationally like I did?

@JimBobSquarePants (Member, Author) commented:

I thought I'd update this to match the latest main. I don't quite understand what is happening with the sampling here, and I'm not sure it's worth me taking the time to figure it out. @antonfirsov, if you do have any insight I'd appreciate it; otherwise I think I'll scrap this.

(screenshot attached)

@antonfirsov (Member) commented:

> @antonfirsov if you do have any insight I'd appreciate

Not without going deep down the rabbit hole :(

@JimBobSquarePants (Member, Author) commented:

> > @antonfirsov if you do have any insight I'd appreciate
>
> Not without going deep down the rabbit hole :(

I thought that might be the case. I'll leave this hanging around for a bit longer, but I don't know if it's worth it. I can do a few smaller things instead (like vectorizing normalization).

@JimBobSquarePants JimBobSquarePants marked this pull request as ready for review February 3, 2026 08:07
@JimBobSquarePants JimBobSquarePants changed the title from "WIP - Speed improvements to resize convolution (no vpermps w/ FMA)" to "Speed improvements to resize convolution (no vpermps w/ FMA)" on Feb 3, 2026
@JimBobSquarePants (Member, Author) commented:

@Sergio0694 Only took me 5 years to figure out the precision issue!! 😛

@JimBobSquarePants JimBobSquarePants merged commit ad816ed into main Feb 4, 2026
12 checks passed
@JimBobSquarePants JimBobSquarePants deleted the js/resize-map-optimizations branch February 4, 2026 01:53
Linked issue: Pre-duplicate kernel values in ResizeKernelMap for faster FMA convolution