
Conversation


@pl752 pl752 commented Dec 11, 2025

I want to propose a set of changes aimed at improving performance, which I have implemented and used for some time in my (private) projects.
The main goal of these changes is to significantly reduce heap allocations by using stack allocations and an array pool, and by avoiding unnecessary allocations in the first place.
I have created a topic on the mailing list.
I would appreciate opinions and help with testing. I have used these changes for a while without any anomalies, though I haven't run thorough tests against all server versions (I am using a Firebird 3 server). The changes should not alter observable behavior.
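The general pattern applied throughout can be sketched roughly like this (my own illustration, not the driver's actual code; the helper name and the 256-char threshold are made up for the example): small buffers go on the stack via `stackalloc`, larger ones are rented from `ArrayPool<T>.Shared` instead of being heap-allocated per call.

```csharp
using System;
using System.Buffers;
using System.Text;

static class BufferHelper
{
    // Hypothetical helper illustrating the pattern: decode bytes to a string
    // using a stack buffer for small payloads and a pooled array otherwise,
    // so the only heap allocation left is the resulting string itself.
    public static string GetString(Encoding encoding, ReadOnlySpan<byte> payload)
    {
        int maxChars = encoding.GetMaxCharCount(payload.Length);
        char[]? rented = null;
        // Threshold is illustrative; real code would pick it based on profiling.
        Span<char> chars = maxChars <= 256
            ? stackalloc char[256]
            : (rented = ArrayPool<char>.Shared.Rent(maxChars));
        try
        {
            int written = encoding.GetChars(payload, chars);
            return new string(chars[..written]);
        }
        finally
        {
            if (rented is not null)
                ArrayPool<char>.Shared.Return(rented);
        }
    }
}
```

The conditional-expression form is needed because a `stackalloc` inside an `if` block cannot be assigned to a `Span<char>` declared outside that block.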

@niekschoemaker (Contributor)

Personally, most of the changes seem to make sense, but I would argue that the Auth part becomes way too complex with these changes (and I'm also not sure how often that code even runs; I suppose it runs once per connection, so it's probably not too hot a path).

The other parts do seem to make sense, especially the ReaderWriter optimizations, as those run for each query.

Did you, however, happen to run the benchmarks against this to see what actual difference it makes to performance?


pl752 commented Dec 12, 2025

Unfortunately, I haven't gotten around to running benchmarks yet. However, the changes resulted in a significant reduction in CPU time and allocations in application profiling runs. I will try to perform more thorough benchmarks and correctness tests soon, when I have some free time.


pl752 commented Dec 12, 2025

I also agree that the auth part is a case of over-optimization and can be omitted. I simply applied the same change pattern to everything that allocates temporary buffers and that I happened to notice, so optimizations for things that run once per session/connection aren't necessary.


pl752 commented Dec 12, 2025

Upd: I have run the Perf project I found in the solution (I don't know how representative it is). The speed difference is pretty negligible, but a clear reduction in allocations can be observed:

BenchmarkDotNet v0.15.8, Windows 10 (10.0.19044.6691/21H2/November2021Update)
AMD Ryzen 7 5800H with Radeon Graphics 3.20GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 10.0.101
  [Host]  : .NET 8.0.22 (8.0.22, 8.0.2225.52707), X64 RyuJIT x86-64-v3
  NuGet   : .NET 8.0.22 (8.0.22, 8.0.2225.52707), X64 RyuJIT x86-64-v3
  Project : .NET 8.0.22 (8.0.22, 8.0.2225.52707), X64 RyuJIT x86-64-v3

Jit=RyuJit  Platform=X64  Toolchain=.NET 8.0
WarmupCount=3

| Method  | Job     | BuildConfiguration | DataType             | Count | Mean        | Error     | StdDev    | Ratio | Gen0    | Allocated | Alloc Ratio |
|-------- |-------- |------------------- |--------------------- |------ |------------:|----------:|----------:|------:|--------:|----------:|------------:|
| Execute | NuGet   | ReleaseNuGet       | bigint               | 100   | 20,322.3 us | 212.53 us | 188.40 us |  1.00 | 31.2500 |  307.4 KB |        1.00 |
| Execute | Project | Release            | bigint               | 100   | 20,160.8 us | 175.47 us | 146.52 us |  0.99 |       - | 237.61 KB |        0.77 |
|         |         |                    |                      |       |             |           |           |       |         |           |             |
| Fetch   | NuGet   | ReleaseNuGet       | bigint               | 100   |    482.7 us |   4.17 us |   3.90 us |  1.00 |  6.8359 |  56.64 KB |        1.00 |
| Fetch   | Project | Release            | bigint               | 100   |    484.2 us |   3.33 us |   2.78 us |  1.00 |  4.8828 |  40.35 KB |        0.71 |
|         |         |                    |                      |       |             |           |           |       |         |           |             |
| Execute | NuGet   | ReleaseNuGet       | varch(...) utf8 [30] | 100   | 20,406.9 us | 217.86 us | 193.12 us |  1.00 | 31.2500 | 311.34 KB |        1.00 |
| Execute | Project | Release            | varch(...) utf8 [30] | 100   | 20,251.5 us | 118.63 us | 110.97 us |  0.99 |       - | 238.43 KB |        0.77 |
|         |         |                    |                      |       |             |           |           |       |         |           |             |
| Fetch   | NuGet   | ReleaseNuGet       | varch(...) utf8 [30] | 100   |    490.7 us |   3.71 us |   3.47 us |  1.00 |  6.8359 |  60.51 KB |        1.00 |
| Fetch   | Project | Release            | varch(...) utf8 [30] | 100   |    494.8 us |   6.60 us |   5.85 us |  1.01 |  4.8828 |   41.1 KB |        0.68 |

// * Hints *
Outliers
  CommandBenchmark.Execute: NuGet   -> 1 outlier  was  removed (21.28 ms)
  CommandBenchmark.Execute: Project -> 2 outliers were removed (20.81 ms, 21.15 ms)
  CommandBenchmark.Fetch: Project   -> 2 outliers were removed (499.44 us, 507.61 us)
  CommandBenchmark.Execute: NuGet   -> 1 outlier  was  removed (21.71 ms)
  CommandBenchmark.Fetch: Project   -> 1 outlier  was  removed (528.00 us)

Firebird 3 was used; the disk is an OEM Samsung 2 TB NVMe (PM9A1, a.k.a. the OEM 980 Pro), with 32 GB of dual-channel DDR4 RAM at 3200 MT/s (JEDEC timings).


pl752 commented Dec 12, 2025

Upd2: I ran the test suite against Firebird 3 (not embedded), so it still needs further testing with other versions (especially embedded mode and batch operations on modern Firebird). There was an issue with boolean reading because `_smallbuffer` was used both for reading the useful bytes and for the padding (which doesn't affect types that don't get padded). After the fix, a small reduction in test run time was observed (24.1 → 23.5 minutes, though without repeatability checks) and no changes in the passed/failed/skipped counts were noticed.
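For context, fixed-size values in the Firebird wire protocol are padded to 4-byte boundaries. The failure mode described above can be sketched like this (names and structure are hypothetical, not the driver's actual code): when one scratch buffer serves both the value bytes and the pad bytes, the fix is to decode the value before the buffer is reused for the pad read.

```csharp
using System;
using System.IO;

static class PaddedReader
{
    // Hypothetical sketch: a 1-byte boolean padded to a 4-byte boundary on
    // the wire. Reading the pad into the same scratch buffer would clobber
    // the value byte, so the value is decoded before the buffer is reused.
    public static bool ReadBoolean(Stream stream, byte[] scratch)
    {
        stream.ReadExactly(scratch, 0, 1);   // the value byte
        bool value = scratch[0] != 0;        // decode before reuse...
        stream.ReadExactly(scratch, 0, 3);   // ...then consume the 3 pad bytes
        return value;
    }
}
```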


pl752 commented Dec 12, 2025

Upd3: performed the tests with the embedded engine; all passed.


pl752 commented Dec 12, 2025

Upd4:
TL;DR: I wrote some benchmarks specific to my (unfortunately private) solution's queries. Changes in query execution timing are sometimes hard to register because the Firebird 3 engine is the main bottleneck in the test scenarios, even under ideal conditions (localhost with a fast CPU and NVMe). However, query creation/preparation seems to have benefited significantly, string operations got a massive boost from the rune-conversion rework, and positive side effects in memory use and local CPU time can be observed.

Benchmark results:

//Update multiple: Optimized (local_opt2)
| Method                                      | UpdateRows | Mean        | Error     | StdDev    | Gen0    | Allocated |
|-------------------------------------------- |----------- |------------:|----------:|----------:|--------:|----------:|
| Update_MainDeliveryById_Merge_RollbackAsync | 25         |  1,695.5 us |  33.42 us |  39.78 us |  3.9063 |  42.78 KB |
| Update_MainDeliveryById_Merge_RollbackAsync | 1000       | 42,000.3 us | 481.56 us | 426.89 us | 83.3333 | 867.56 KB |


//Update multiple: Original (master)
| Method                                      | UpdateRows | Mean        | Error     | StdDev    | Gen0    | Allocated |
|-------------------------------------------- |----------- |------------:|----------:|----------:|--------:|----------:|
| Update_MainDeliveryById_Merge_RollbackAsync | 25         |  1,704.9 us |  33.17 us |  52.62 us |  3.9063 |  46.98 KB |
| Update_MainDeliveryById_Merge_RollbackAsync | 1000       | 42,416.1 us | 634.02 us | 593.06 us | 83.3333 |  985.1 KB |


//Single insert/upsert: Optimized
| Method                                      | Rows      | Mean        | Error     | StdDev    | Gen0    | Allocated |
|-------------------------------------------- |---------- |------------:|----------:|----------:|--------:|----------:|
| Select_LoadWBSellerAccountsAsync            | -         |    717.2 us |  14.08 us |  14.46 us |  1.9531 |  29.26 KB |
| Insert_Upsert_WbDocCache_RollbackAsync      | -         |    708.6 us |  12.61 us |  11.18 us |  3.9063 |  33.58 KB |


//Single insert/upsert: Original
| Method                                      | Rows      | Mean        | Error     | StdDev    | Gen0    | Allocated |
|-------------------------------------------- |---------- |------------:|----------:|----------:|--------:|----------:|
| Select_LoadWBSellerAccountsAsync            | -         |    741.5 us |  14.50 us |  18.86 us |  3.9063 |   33.5 KB |
| Insert_Upsert_WbDocCache_RollbackAsync      | -         |    724.0 us |  13.94 us |  18.13 us |  3.9063 |  37.22 KB |


//Select multiple mixed (3 int, 1 literal char string): Optimized
| Method                              | Rows   | Mean           | Error        | StdDev        | Gen0       | Gen1      | Allocated    |
|------------------------------------ |------- |---------------:|-------------:|--------------:|-----------:|----------:|-------------:|
| SelectAndMap_Main_ReusedBufferAsync | 10     |       741.1 us |     19.82 us |      57.83 us |          - |         - |     47.52 KB |
| SelectAndMap_Main_ReusedBufferAsync | 100    |     3,507.2 us |     66.56 us |      81.74 us |          - |         - |     421.2 KB |
| SelectAndMap_Main_ReusedBufferAsync | 1000   |    31,362.5 us |  5,421.76 us |  15,986.17 us |          - |         - |   4078.15 KB |
| SelectAndMap_Main_ReusedBufferAsync | 10000  |   321,426.4 us | 17,596.63 us |  51,884.06 us |  4000.0000 |         - |  40710.48 KB |
| SelectAndMap_Main_ReusedBufferAsync | 100000 | 3,394,208.6 us | 67,371.50 us | 193,301.47 us | 49000.0000 | 9000.0000 | 407078.14 KB |


//Select multiple mixed (3 int, 1 literal char string): Original
//	(Yes: 1.09x to >2x in speed, 10x in allocation volume,
//	and, when profiling, an actual ~100x difference in allocate/free event counts)
| Method                              | Rows   | Mean         | Error      | StdDev      | Median       | Gen0        | Gen1        | Allocated     |
|------------------------------------ |------- |-------------:|-----------:|------------:|-------------:|------------:|------------:|--------------:|
| SelectAndMap_Main_ReusedBufferAsync | 10     |     1.611 ms |  0.0506 ms |   0.1453 ms |     1.604 ms |           - |           - |     457.27 KB |
| SelectAndMap_Main_ReusedBufferAsync | 100    |    11.138 ms |  0.1882 ms |   0.1760 ms |    11.129 ms |           - |           - |    4511.55 KB |
| SelectAndMap_Main_ReusedBufferAsync | 1000   |    34.017 ms |  4.9421 ms |  14.4162 ms |    25.360 ms |   5000.0000 |   1000.0000 |   44988.63 KB |
| SelectAndMap_Main_ReusedBufferAsync | 10000  |   346.085 ms | 23.2037 ms |  68.4167 ms |   337.300 ms |  55000.0000 |  11000.0000 |  449544.78 KB |
| SelectAndMap_Main_ReusedBufferAsync | 100000 | 3,709.593 ms | 73.9280 ms | 194.7560 ms | 3,695.932 ms | 550000.0000 | 110000.0000 | 4494710.22 KB |

//Select multiple int only (3 int): Optimized
| Method                              | Rows    | Mean           | Error       | StdDev      | Gen0       | Gen1      | Allocated    |
|------------------------------------ |-------- |---------------:|------------:|------------:|-----------:|----------:|-------------:|
| SelectAndMap_Main_ReusedBufferAsync | 10      |       376.7 us |    17.69 us |    50.19 us |          - |         - |     11.73 KB |
| SelectAndMap_Main_ReusedBufferAsync | 100     |     1,102.8 us |    54.87 us |   160.06 us |          - |         - |     63.99 KB |
| SelectAndMap_Main_ReusedBufferAsync | 1000    |     4,537.8 us |   689.68 us | 2,033.53 us |          - |         - |    497.61 KB |
| SelectAndMap_Main_ReusedBufferAsync | 10000   |    18,131.8 us |   137.98 us |   115.22 us |          - |         - |   4927.88 KB |
| SelectAndMap_Main_ReusedBufferAsync | 100000  |   176,431.1 us |   956.29 us |   798.54 us |  6000.0000 |         - |  49230.36 KB |
| SelectAndMap_Main_ReusedBufferAsync | 1000000 | 1,743,465.8 us | 7,326.00 us | 6,494.31 us | 60000.0000 | 6000.0000 | 497846.96 KB |

//Select multiple int only (3 int): Original
| Method                              | Rows    | Mean           | Error       | StdDev      | Median         | Gen0       | Gen1      | Allocated    |
|------------------------------------ |-------- |---------------:|------------:|------------:|---------------:|-----------:|----------:|-------------:|
| SelectAndMap_Main_ReusedBufferAsync | 10      |       357.9 us |     9.24 us |    25.44 us |       355.4 us |          - |         - |     12.89 KB |
| SelectAndMap_Main_ReusedBufferAsync | 100     |     1,182.7 us |    49.32 us |   142.29 us |     1,158.6 us |          - |         - |     70.78 KB |
| SelectAndMap_Main_ReusedBufferAsync | 1000    |     4,541.1 us |   718.80 us | 2,119.41 us |     3,485.7 us |          - |         - |    561.27 KB |
| SelectAndMap_Main_ReusedBufferAsync | 10000   |    18,280.0 us |   340.21 us |   454.17 us |    18,246.9 us |          - |         - |   5561.08 KB |
| SelectAndMap_Main_ReusedBufferAsync | 100000  |   173,885.9 us |   916.01 us |   764.91 us |   173,896.6 us |  6000.0000 |         - |  55558.87 KB |
| SelectAndMap_Main_ReusedBufferAsync | 1000000 | 1,745,630.3 us | 4,109.23 us | 3,642.73 us | 1,745,432.2 us | 68000.0000 | 7000.0000 | 561128.53 KB |

It was a bit tricky to obtain measurements that actually show the improvements, but some interesting observations can be made.
The main explanation for the small timing improvements: although my benchmarks do pretty much nothing besides opening a connection, opening a configured transaction, creating queries, filling in parameters, preparing when run multiple times in a row, executing/reading, mapping the selected fields to a single structure instance (to minimize noise), rolling back the transaction, and closing the connection, the DB engine uses a whole CPU core while the application thread idles most of the time.
String reading, however, benefited heavily: the optimizations reduced the total number of allocated objects 10-100x. The original rune char enumerator allocated every (!) rune as a separate char array, producing tens of millions of short-lived char[1] and char[2] objects, while the new methods avoid allocation as much as possible. The situation was made worse by the original rune-counting method, which performed a full enumeration, creating all of those char arrays only to count them without ever using the char data itself. Restricting allocations to the final buffers and strings saves a lot of CPU time, since heap allocation even in .NET is not a cheap operation, and during string conversion the client library actually becomes the bottleneck instead of the engine.
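The direction of that rework can be sketched like this (my own illustration, not the driver's actual code): modern .NET exposes `System.Text.Rune`, which lets you walk scalar values over a char span without materializing each rune as its own array.

```csharp
using System;
using System.Text;

static class RuneOps
{
    // Allocation-free rune counting over a char span: each scalar value is
    // decoded in place via Rune.DecodeFromUtf16, so no per-rune char[] is
    // ever created (unlike enumerating runes as individual char arrays).
    public static int CountRunes(ReadOnlySpan<char> text)
    {
        int count = 0;
        int i = 0;
        while (i < text.Length)
        {
            Rune.DecodeFromUtf16(text[i..], out _, out int consumed);
            i += consumed;
            count++;
        }
        return count;
    }
}
```

`string.EnumerateRunes()` gives the same allocation-free traversal when the rune values themselves are needed.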
The 10x difference in allocated memory when working with strings comes from each char[1]/char[2] array carrying not just 2-4 bytes of useful raw data, but also 0-6 bytes of padding (in some cases) and 8-16 bytes of array object header (effectively a pointer to the data plus a length), and that's before counting type and GC bookkeeping data.
Queries over small numbers of rows usually showed larger percentage improvements (1-4% vs. 9-100+%), which I attribute to the better string processing aiding the query and parameter preparation phase.
The timings are not the whole story, as the changes have some beneficial side effects. Fewer allocations naturally mean the GC runs less often, and stackalloc is essentially free (it is not a complex allocator call, just a tiny `sub esp, size` ... `add esp, size`). There is also a reduction in CPU time that is visible even without a profiler: I could clearly see the main thread using 2-3% of total CPU (5-6% during selects with char columns), while the optimized version consumed only 1-3%. In theory this means that on low-end client systems, or when the application uses the thread pool heavily, the DB reading task occupies its thread for less time, leaving more time for other tasks when the pool is exhausted and its queue is in use, and for other programs on loaded machines.
The lack of proper benchmark/test coverage is because this rework started as a small experiment out of curiosity, after I noticed that the Firebird client was a top 1-2 consumer of CPU time in my application. The experiment turned out well enough that I decided the contribution might be useful for other developers and their solutions, so I reached out with this proposal.
