perf: implement fast Get for integral types #216
TerrorJack wants to merge 1 commit into haskell:master
Bodigrim left a comment
(I'm not a maintainer here)
    (fromIntegral (s `B.unsafeIndex` 1))
    {-# INLINE[2] getWord16be #-}
    {-# INLINE word16be #-}
    #if defined(WORDS_BIGENDIAN)
Is it feasible to add an s390x job to CI? See https://github.com/haskell/bytestring/blob/master/.github/workflows/ci.yml#L121 for instance. Otherwise `#if defined(WORDS_BIGENDIAN)` tends to bit rot really quickly.
that'll be an extra source of flakiness before https://gitlab.haskell.org/ghc/ghc/-/issues/25541 is sorted out
I think it's a good idea to run CI on a big endian arch, but that can be done in a later PR.
This patch implements fast `Get` logic for integral types based on:

- Use a single load operation when loading with the same endianness as the host; otherwise do a host load and a byteSwap. This avoids the overhead of multiple single-byte loads in the previous implementation.
- Use the unaligned Addr# load/store primops added in GHC 9.10 when available; otherwise do a plain peek. This ensures the GHC backends see the right AlignmentSpec at the Cmm level and can correctly emit unaligned load instructions.

There's no need to change the `Put` logic, since it is backed by the `FixedPrim` logic in `Data.ByteString.Builder.Prim.Binary`, which already does a similar optimization.
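The first bullet can be illustrated with a minimal sketch. This is not the code from the patch: the names `word32beSlow`, `word32host`, and `word32be` are hypothetical, the endianness check uses `GHC.ByteOrder.targetByteOrder` rather than the `WORDS_BIGENDIAN` CPP macro, and the load is a plain `peek` (the fallback path described above), not the unaligned `Addr#` primops.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Unsafe as BU
import Data.Bits (shiftL, (.|.))
import Data.Word (Word32, byteSwap32)
import Foreign.Ptr (castPtr)
import Foreign.Storable (peek)
import GHC.ByteOrder (ByteOrder (..), targetByteOrder)
import System.IO.Unsafe (unsafeDupablePerformIO)

-- Reference version: four single-byte loads combined with shifts.
-- The caller must ensure the input has at least 4 bytes.
word32beSlow :: B.ByteString -> Word32
word32beSlow s =
  (fromIntegral (s `BU.unsafeIndex` 0) `shiftL` 24) .|.
  (fromIntegral (s `BU.unsafeIndex` 1) `shiftL` 16) .|.
  (fromIntegral (s `BU.unsafeIndex` 2) `shiftL` 8) .|.
  fromIntegral (s `BU.unsafeIndex` 3)

-- One host-order load via a plain peek (hypothetical helper).
-- The caller must ensure the input has at least 4 bytes.
word32host :: B.ByteString -> Word32
word32host s = unsafeDupablePerformIO $
  BU.unsafeUseAsCString s $ \p -> peek (castPtr p)

-- Big-endian read: a host load, plus a byte swap on little-endian hosts.
word32be :: B.ByteString -> Word32
word32be s = case targetByteOrder of
  BigEndian    -> word32host s
  LittleEndian -> byteSwap32 (word32host s)

main :: IO ()
main = do
  let bs = B.pack [0x01, 0x02, 0x03, 0x04]
  print (word32be bs == word32beSlow bs)
  print (word32be bs == 0x01020304)
```

Both comparisons should print `True` regardless of host endianness. The plain `peek` here assumes the platform tolerates a possibly unaligned 32-bit load, which is the concern the second bullet addresses with the dedicated primops.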
I tried to benchmark the use of the unaligned load primops, but I couldn't notice any difference. Do you have any idea why that might be? This is my benchmark:

    import Control.Monad
    import Data.Binary.Get
    import Data.Binary.Put
    import Test.Tasty.Bench

    main :: IO ()
    main = defaultMain
      [ bench "" $ whnf (runGet getData) bs
      ]
      where
        n = 100000
        getData = fmap sum $ replicateM n $ do
          w8 <- getWord8
          w16 <- getWord16host
          w32 <- getWord32host
          w64 <- getWord64host
          pure $! fromIntegral w8 + fromIntegral w16 + fromIntegral w32 + w64
        bs = runPut $ replicateM_ n $ do
          putWord8 42
          putWord16host 12
          putWord32host 0xff00ff00
          putWord64host 0x0123456789abcdef
Closes #215.