Add a variable-length integer encoder/decoder #744
Open: lhecker wants to merge 2 commits into main from dev/lhecker/varint
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -7,6 +7,7 @@ | |
|
|
||
| pub mod arena; | ||
| pub mod sys; | ||
| pub mod varint; | ||
|
|
||
| mod helpers; | ||
| pub use helpers::*; | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,140 @@ | ||
| // Copyright (c) Microsoft Corporation. | ||
| // Licensed under the MIT License. | ||
|
|
||
| //! Variable-length `u32` encoding and decoding, with efficient storage of `u32::MAX`. | ||
| //! `u32::MAX` is a common value in Microsoft Edit's syntax highlighter bytecode. | ||
| //! | ||
| //! # Format | ||
| //! | ||
| //! ```text | ||
| //! 0-127 ( 7 bits): xxxxxxx0 | ||
| //! 128-16383 (14 bits): xxxxxx01 yyyyyyyx | ||
| //! 16384-2097151 (21 bits): xxxxx011 yyyyyyxx zzzzzzzy | ||
| //! 2097152-268435455 (28 bits): xxxx0111 yyyyyxxx zzzzzzyy wwwwwwwz | ||
| //! 4294967295 (32 bits): ....1111 | ||
| //! ``` | ||
| //! | ||
| //! The least significant bits indicate the length, similar to how UTF-8 marks it in the most significant bits | ||
| //! of its lead byte. The remaining bits store the value, in little-endian order. Little endian was chosen, | ||
| //! as most architectures today use that. | ||
| //! | ||
| //! On x86, `tzcnt` (which is what `trailing_ones()` compiles to, applied to the inverted value) has the benefit | ||
| //! that its encoding is identical to `rep bsf`. Older CPUs without BMI1 will ignore the `rep` prefix and use `bsf`, | ||
| //! while modern CPUs will use the faster `tzcnt`. So on x86 we not only avoid a `bswap`, but also speed up | ||
| //! the bit count calculation. This is intended to make decoding faster than LEB128, Google Varint, and others. | ||
|
|
||
| pub fn encode(val: u32) -> Vec<u8> { | ||
| let mut result = Vec::with_capacity(5); | ||
| let shift = match val { | ||
| 0..0x80 => 0, | ||
| 0x80..0x4000 => 1, | ||
| 0x4000..0x200000 => 2, | ||
| 0x200000..0x10000000 => 3, | ||
| // Everything from 2^28 upward, not just `u32::MAX`, collapses to the one-byte 0xff form. | ||
| _ => { | ||
| result.push(0xff); | ||
| return result; | ||
| } | ||
| }; | ||
| let marker = (1u32 << shift) - 1; | ||
| let encoded = (val << (shift + 1)) | marker; | ||
| let bytes = encoded.to_le_bytes(); | ||
| result.extend_from_slice(&bytes[..=shift]); | ||
| result | ||
| } | ||
|
|
||
| /// # Safety | ||
| /// | ||
| /// The caller must ensure that `data..data+4` is valid memory. | ||
| /// It doesn't need to be a valid value, but it must be readable. | ||
| pub unsafe fn decode(data: *const u8) -> (u32, usize) { | ||
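| // For inputs such as: | ||
| // [0xff, 0xff, 0xff, 0xff] | ||
| // `len` exceeds 4, so `32 - 8 * len` underflows and the shifts below are by more than 31 bits. In Rust this is | ||
| // not undefined behavior: it panics in builds with overflow checks (debug) and wraps or masks in release builds. | ||
| // *We explicitly want the release behavior here*, so the early return below exists only for debug builds. | ||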
| // | ||
| // If we write an if condition here (like this one), LLVM will turn that into a proper branch. Since our inputs | ||
| // are relatively random, that branch will mispredict, hurting performance. The if condition at the end | ||
| // gets turned into conditional moves (good!), but that only works because it comes after the shifts. | ||
| // Unfortunately, there's no way to ask Rust for "platform-defined behavior" (`unchecked_shl/shr` is not it). | ||
| #[cfg(debug_assertions)] | ||
| unsafe { | ||
| if (*data & 0x0f) == 0x0f { | ||
| return (u32::MAX, 1); | ||
| } | ||
| } | ||
|
|
||
| unsafe { | ||
| let val = u32::from_le((data as *const u32).read_unaligned()); | ||
| let ones = val.trailing_ones(); | ||
|
|
||
| let mut len = ones as usize + 1; | ||
| let mut res = 'bextr: { | ||
| // Give LLVM a helping hand for x86 CPUs with BMI1. It's not smart enough to figure out that `bextr` can | ||
| // be used here. To be fair, it's not faster, so maybe that's why. It is _a lot_ more compact, however. | ||
| #[cfg(all(target_arch = "x86_64", target_feature = "bmi1"))] | ||
| break 'bextr std::arch::x86_64::_bextr_u32(val, len as u32, (7 * len) as u32); | ||
|
|
||
| // This is where you'd put more architecture-specific optimizations. | ||
| // In fact this is where I'd put my ARM optimizations, but it doesn't have anything like `bextr`. :( | ||
|
|
||
| let mut res = val; | ||
| // Shift out the bytes we read but don't need. | ||
| res <<= 32 - 8 * len; | ||
| // Shift back down and remove the trailing 0/10/110/1110/1111 length bits. | ||
| res >>= 32 - 7 * len; | ||
| break 'bextr res; | ||
| }; | ||
|
|
||
| // If the lead byte indicates >28 bits, assume `u32::MAX`. | ||
| // This doubles as a simple form of error correction. | ||
| if len > 4 { | ||
| res = u32::MAX; | ||
| len = 1; | ||
| } | ||
|
|
||
| (res, len) | ||
| } | ||
| } | ||
|
|
||
| #[cfg(test)] | ||
| mod tests { | ||
| use super::*; | ||
|
|
||
| #[test] | ||
| fn test_encode_decode_roundtrip() { | ||
| // Test various boundary values | ||
| let test_values = [ | ||
| 0u32, | ||
| 1, | ||
| 123, | ||
| 127, // Max 1 byte | ||
| 128, // Min 2 bytes | ||
| 1234, | ||
| 16383, // Max 2 bytes | ||
| 16384, // Min 3 bytes | ||
| 2097151, // Max 3 bytes | ||
| 2097152, // Min 4 bytes | ||
| 268435455, // Max 4 bytes | ||
| u32::MAX, // Special case | ||
| ]; | ||
|
|
||
| for &val in &test_values { | ||
| let encoded = encode(val); | ||
| println!("Value {} encoded as: {:02X?}", val, encoded); | ||
| let (decoded, len) = unsafe { decode(encoded.as_ptr()) }; | ||
| println!(" Decoded as: {} with length {}", decoded, len); | ||
| assert_eq!(decoded, val, "Failed roundtrip for value {}", val); | ||
| assert_eq!(len, encoded.len(), "Length mismatch for value {}", val); | ||
| } | ||
| } | ||
|
|
||
| #[test] | ||
| fn test_specific_encodings() { | ||
| // Test specific byte patterns | ||
| unsafe { | ||
| assert_eq!((0, 1), decode([0, 0xbb, 0xcc, 0xdd].as_ptr())); | ||
| assert_eq!((123, 1), decode([0xf6, 0xbb, 0xcc, 0xdd].as_ptr())); | ||
| assert_eq!((1234, 2), decode([0x49, 0x13, 0xcc, 0xdd].as_ptr())); | ||
| assert_eq!((u32::MAX, 1), decode([0xff, 0xbb, 0xcc, 0xdd].as_ptr())); | ||
| } | ||
| } | ||
| } | ||
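
For comparison with the pointer-based `decode` above, here is a minimal sketch of a bounds-checked decoder over a slice. It is not part of this PR: the name `decode_slice` and the `Option` return type are assumptions made for illustration. It follows the format table in the module docs, but trades the branchless 4-byte read for explicit length checks, so it says nothing about the performance claims.

```rust
/// Hypothetical safe counterpart to the PR's `decode` (illustration only).
/// Returns the decoded value and the number of bytes consumed,
/// or `None` if `data` is empty or shorter than the lead byte announces.
fn decode_slice(data: &[u8]) -> Option<(u32, usize)> {
    let first = *data.first()?;

    // The number of trailing one bits in the lead byte selects the length:
    // 0 ones = 1 byte, 1 one = 2 bytes, 2 ones = 3 bytes, 3 ones = 4 bytes.
    let ones = first.trailing_ones() as usize;
    if ones >= 4 {
        // `....1111` is the one-byte form of `u32::MAX`.
        return Some((u32::MAX, 1));
    }

    let len = ones + 1;
    let bytes = data.get(..len)?;

    // Copy into a zero-padded buffer so that no byte past `len` is ever read.
    let mut buf = [0u8; 4];
    buf[..len].copy_from_slice(bytes);
    let raw = u32::from_le_bytes(buf);

    // Drop the `len` length bits; the remaining 7 * len bits are the value.
    Some((raw >> len, len))
}
```

With the byte patterns from `test_specific_encodings`, `decode_slice(&[0x49, 0x13])` yields `Some((1234, 2))`, and a truncated input such as `decode_slice(&[0x49])` yields `None` instead of reading past the end.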
I guess I don't understand - it's not a u32 encoding, it's a u28 encoding with a special case for u32::MAX and a pretty significant gap between 268435455 and 4294967295
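
To make the gap concrete, the sketch below restates the PR's `encode` in compact form so that it compiles on its own, and adds a helper that reports whether a value survives a roundtrip. The helper `is_representable` is hypothetical and not part of the PR.

```rust
// Compact restatement of the PR's `encode`, repeated here only so that
// this snippet is self-contained.
fn encode(val: u32) -> Vec<u8> {
    let shift = match val {
        0..=0x7f => 0,
        0x80..=0x3fff => 1,
        0x4000..=0x1f_ffff => 2,
        0x20_0000..=0x0fff_ffff => 3,
        // Every value from 2^28 upward lands here, not just `u32::MAX`.
        _ => return vec![0xff],
    };
    let marker = (1u32 << shift) - 1;
    let encoded = (val << (shift + 1)) | marker;
    encoded.to_le_bytes()[..=shift].to_vec()
}

/// Hypothetical helper: true if `val` decodes back to itself.
/// Values in 2^28..u32::MAX encode as 0xff and decode as `u32::MAX`.
fn is_representable(val: u32) -> bool {
    val < (1 << 28) || val == u32::MAX
}
```

For example, `encode(268_435_456)` and `encode(u32::MAX)` both produce `[0xff]`, so the former decodes as `u32::MAX` rather than as the original value.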
Yeah, that's fair. Perhaps I should move this into the lsh project now that I made it a library. 🤔 The reason it's a "u28" is that lsh really doesn't need values >2^28, while an efficient compression for a >2^28 value is still useful (it's used for setting the input offset to the maximum when matching a .*).