Dear Charles Frasch,
I really enjoyed your CppCon talk and have reworked the different queue designs you presented in the Nim language, just to figure out how close to C++ one can get.
My results are mainly in line with your findings: Rigtorp's cached cursors really make a difference. Your queue design (I'm tempted to call it the 'Frasch queue' :) since it sounds like "fresh" in German) beats the best-performing SPSC design in Nim by an impressive factor of 5.
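For readers following along, here is a minimal sketch of the cached-cursor idea being discussed: each side keeps a stale private copy of the other side's cursor and only reloads the shared atomic when the cached value says the queue looks full or empty. All names here (`SpscQueue`, `cachedHead_`, etc.) are illustrative, not taken from either codebase.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal SPSC ring buffer with Rigtorp-style cached cursors.
// Cursors grow without bound; the index is taken modulo capacity.
template <typename T>
class SpscQueue {
public:
    explicit SpscQueue(std::size_t capacity)
        : capacity_(capacity), slots_(capacity) {}

    bool push(const T& v) {
        auto tail = tail_.load(std::memory_order_relaxed);
        // Hot path: test against the cached consumer cursor first and
        // refresh it from the shared atomic only when the queue looks full.
        if (tail - cachedHead_ == capacity_) {
            cachedHead_ = head_.load(std::memory_order_acquire);
            if (tail - cachedHead_ == capacity_) return false;  // truly full
        }
        slots_[tail % capacity_] = v;
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }

    bool pop(T& out) {
        auto head = head_.load(std::memory_order_relaxed);
        // Same trick on the consumer side with the cached producer cursor.
        if (head == cachedTail_) {
            cachedTail_ = tail_.load(std::memory_order_acquire);
            if (head == cachedTail_) return false;  // truly empty
        }
        out = slots_[head % capacity_];
        head_.store(head + 1, std::memory_order_release);
        return true;
    }

private:
    std::size_t capacity_;
    std::vector<T> slots_;
    // Aligned to separate cache lines to avoid false sharing.
    alignas(64) std::atomic<std::size_t> head_{0};  // consumer cursor
    alignas(64) std::atomic<std::size_t> tail_{0};  // producer cursor
    alignas(64) std::size_t cachedHead_{0};  // producer's stale copy of head_
    alignas(64) std::size_t cachedTail_{0};  // consumer's stale copy of tail_
};
```

The payoff is that in the common case neither thread touches the cache line the other thread is writing, which is where the large speedup comes from.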
But at the point where you change the index calculation from `cursor mod capacity` to `cursor and mask`, gaining 4x in performance, I can only measure 8-10 percent. Maybe my compiler (Clang) is smart enough to detect the access to a fixed-size array and optimize the modulo itself; I'm not sure, but maybe.
Another possibly interesting variation: using thread-local storage for the cached cursors (and the mask). I tried this with both aligned and plain (unaligned) declarations. Surprisingly, these two variations became my best performers. They clock in at 340-350 million ops/s; that's about half the throughput of your 4a version. Your i7 seems to be a bit faster and has larger caches than my elderly i5 (2017), but I doubt that is the full story. I expected to see around 60-75 percent of your throughput.
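To make the thread-local variation concrete, here is a sketch of the producer side with the cached consumer cursor moved into TLS. Since each thread only ever reads its own cached copy, false sharing between the two caches disappears without padding the queue object. Everything here (`g_head`, `g_tail`, `cachedHead`, the free-standing `push`) is a hypothetical reconstruction, not code from either repository.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t capacity = 8;
std::vector<int> slots(capacity);
std::atomic<std::size_t> g_head{0};  // consumer cursor (shared)
std::atomic<std::size_t> g_tail{0};  // producer cursor (shared)

// Each thread gets its own instance; only the producer thread uses this one,
// so it never shares a cache line with anything the consumer writes.
thread_local std::size_t cachedHead = 0;

bool push(int v) {
    auto tail = g_tail.load(std::memory_order_relaxed);
    if (tail - cachedHead == capacity) {
        cachedHead = g_head.load(std::memory_order_acquire);
        if (tail - cachedHead == capacity) return false;  // truly full
    }
    slots[tail % capacity] = v;
    g_tail.store(tail + 1, std::memory_order_release);
    return true;
}
```

One plausible design note: TLS access usually costs an extra segment-relative load per access, so whether this beats in-object padded caches likely depends on how the compiler hoists the TLS address out of the hot loop.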
My port of your 4a version maxes out at ~270 million ops/s, and I have no idea why, as this should be the superior design.
A naive question: your benchmark counts a pair of successful push/pop operations as a single operation, and failed pushes/pops are not considered, so the ops/sec are simply 400 million divided by the runtime? When I measure e.g. a cache or a hash map, I count every operation separately, even failed ones.
Since you like performant data structures: if you ever need a quick LRU cache, you might enjoy taking a look at Hiroshi Inoue's "Multistep LRU". It is brilliant and vectorized, and therefore blazingly fast. His excellent paper is available online.
CU, maybe at CppCon. Greets, Andreas