GH-533: Add ALP (Adaptive Lossless floating-Point) encoding specification #557
prtkgaur wants to merge 1 commit into apache:master
Add the encoding specification for ALP (encoding value 10) to Encodings.md. ALP compresses FLOAT and DOUBLE columns by converting values to integers via decimal scaling, then applying Frame of Reference encoding and bit-packing. Values that cannot be losslessly round-tripped are stored as exceptions.

The spec covers:
- Page layout: 7-byte header, offset array, compressed vectors
- Vector format: AlpInfo, ForInfo, packed values, exception data
- Encoding math: two-step multiplication for cross-language consistency
- Parameter selection, exception detection, and decoding steps

Based on the paper "ALP: Adaptive Lossless floating-Point Compression" (Afroozeh and Boncz, SIGMOD 2024). Wire format matches the C++ Arrow and Java parquet-java implementations.
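
To make the description concrete, here is a minimal Python sketch of the pipeline for DOUBLE values. It is not taken from the PR: the names `POW10`, `NEG_POW10`, and `alp_encode_sketch` are hypothetical, the decode formula `n * 10^f * 10^(-e)` follows the ALP paper, Python's built-in `round` stands in for the spec's rounding step, and the tables are truncated for illustration. How exception slots are filled in the packed data is not shown in this excerpt, so the sketch simply leaves them out of the deltas.

```python
# Literal power-of-ten tables (hypothetical names), indexed by exponent.
POW10 = [1e0, 1e1, 1e2, 1e3, 1e4, 1e5, 1e6]
NEG_POW10 = [1e0, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6]


def alp_encode_sketch(values, e, f):
    """Scale finite doubles to integers; values that fail the round trip become exceptions."""
    encoded, exceptions = [], []
    for position, value in enumerate(values):
        n = round(value * POW10[e] * NEG_POW10[f])  # decimal scaling, two multiplications
        decoded = n * POW10[f] * NEG_POW10[e]       # inverse scaling (ALP paper formula)
        if decoded == value:
            encoded.append(n)
        else:
            exceptions.append((position, value))    # original value kept verbatim
    base = min(encoded) if encoded else 0           # frame of reference
    deltas = [n - base for n in encoded]            # non-negative offsets from the base
    bit_width = max(deltas).bit_length() if deltas else 0
    return base, deltas, bit_width, exceptions
```

For example, `alp_encode_sketch([1.5, 2.25, 3.0, 0.333], 2, 0)` keeps the first three values as integers and reports `0.333` at position 3 as an exception, because two decimal digits are not enough to reproduce it.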
> ```
> Bytes AA 00 A3 BB 11 B4 CC 22 C5 DD 33 D6
> ```

I think there are now links at the top of the file summarizing the applicability of encodings.
> ##### Header (7 bytes)
>
> All multi-byte values are little-endian.

Suggested change:
- All multi-byte values are little-endian.
+ All multi-byte values are stored in little-endian order.
> |--------|-------|------|------|-------------|
> | 0 | compression_mode | 1 byte | uint8 | Compression mode (must be 0 = ALP) |
> | 1 | integer_encoding | 1 byte | uint8 | Integer encoding (must be 0 = FOR + bit-packing) |
> | 2 | log_vector_size | 1 byte | uint8 | log2(vector\_size). Must be in \[3, 15\]. Default: 10 (vector size 1024) |

Suggested change:
- | 2 | log_vector_size | 1 byte | uint8 | log2(vector\_size). Must be in \[3, 15\]. Default: 10 (vector size 1024) |
+ | 2 | log_vector_size | 1 byte | uint8 | log2(vector\_size). Must be in \[3, 15\]. Recommended default: 10 (vector size 1024) |
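
A reader-side check of the three header fields quoted in this hunk could look like the sketch below. The function name `parse_alp_header_prefix` is hypothetical. Bytes 3 to 6 of the 7-byte header are not shown in this excerpt, so the sketch deliberately leaves them unparsed.

```python
def parse_alp_header_prefix(page: bytes) -> int:
    """Validate header bytes 0..2 and return vector_size (hypothetical helper)."""
    if len(page) < 7:
        raise ValueError("ALP page header is 7 bytes")
    compression_mode, integer_encoding, log_vector_size = page[0], page[1], page[2]
    if compression_mode != 0:
        raise ValueError("compression_mode must be 0 (ALP)")
    if integer_encoding != 0:
        raise ValueError("integer_encoding must be 0 (FOR + bit-packing)")
    if not 3 <= log_vector_size <= 15:
        raise ValueError("log_vector_size must be in [3, 15]")
    # Bytes 3..6 are not described in the quoted hunk and are ignored here.
    return 1 << log_vector_size
```

With the default `log_vector_size` of 10 this returns a vector size of 1024.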
> | Offset | Field | Size | Type | Description |
> |--------|-------|------|------|-------------|
> | 0 | compression_mode | 1 byte | uint8 | Compression mode (must be 0 = ALP) |

Do we think ALP-RD fits in well here? I forget what the extension point is, and why we were OK keeping this field but not version.
> |--------|-------|------|------|-------------|
> | 0 | compression_mode | 1 byte | uint8 | Compression mode (must be 0 = ALP) |
> | 1 | integer_encoding | 1 byte | uint8 | Integer encoding (must be 0 = FOR + bit-packing) |
> | 2 | log_vector_size | 1 byte | uint8 | log2(vector\_size). Must be in \[3, 15\]. Default: 10 (vector size 1024) |

Suggested change:
- | 2 | log_vector_size | 1 byte | uint8 | log2(vector\_size). Must be in \[3, 15\]. Default: 10 (vector size 1024) |
+ | 2 | log_vector_size | 1 byte | uint8 | log2(vector\_size). Must be in the inclusive range: \[3, 15\]. Default: 10 (vector size 1024) |
> **Note:** The number of elements per vector and the packed data size are NOT stored
> in the header. They are derived:
> * Elements per vector: `vector_size` for all vectors except the last, which may be smaller.
> * Packed data size: `ceil(num_elements_in_vector * bit_width / 8)`.

bit_width isn't in the header either, so it is a little strange to call this out here?
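
The two derived quantities in the quoted note can be sketched as follows. The function names are hypothetical, and the total element count of the page is passed in as a parameter because this excerpt does not show where it is stored.

```python
def vector_lengths(num_elements: int, log_vector_size: int) -> list:
    """Element count of each vector: full vectors, then one shorter tail if needed."""
    vector_size = 1 << log_vector_size
    full, rest = divmod(num_elements, vector_size)
    return [vector_size] * full + ([rest] if rest else [])


def packed_data_size(num_elements_in_vector: int, bit_width: int) -> int:
    """ceil(num_elements_in_vector * bit_width / 8) using integer arithmetic."""
    return (num_elements_in_vector * bit_width + 7) // 8
```

For 2500 values and the default vector size of 1024, this yields two full vectors and a last vector of 452 elements; at a bit width of 7 that last vector needs 396 bytes of packed data.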
> **Note:** The number of elements per vector and the packed data size are NOT stored
> in the header. They are derived:
> * Elements per vector: `vector_size` for all vectors except the last, which may be smaller.

This seems like a bit of a strange callout, since it is covered explicitly on line 457, and log_vector_size is stored in the header?
> values. Each offset gives the byte position of the corresponding vector's data,
> measured from the start of the offset array itself.
>
> The first offset equals `num_vectors * 4` (pointing just past the offset array).

Suggested change:
- The first offset equals `num_vectors * 4` (pointing just past the offset array).
+ The first offset always equals `num_vectors * 4` (pointing just past the offset array).

Let's be explicit here that we don't support padding.
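
A reader could enforce the "no padding" reading of this rule roughly as in the sketch below. It assumes each offset is a 4-byte little-endian unsigned integer (the 4-byte width follows from `num_vectors * 4`; the unsigned type is an assumption), that `buf` starts at the offset array, and the name `read_vector_offsets` is hypothetical.

```python
import struct


def read_vector_offsets(buf: bytes, num_vectors: int) -> list:
    """Read num_vectors little-endian 4-byte offsets and check the first one."""
    offsets = list(struct.unpack_from("<%dI" % num_vectors, buf, 0))
    # The first vector must start immediately after the offset array (no padding).
    if offsets and offsets[0] != num_vectors * 4:
        raise ValueError("first offset must equal num_vectors * 4")
    return offsets
```

Since offsets are measured from the start of the offset array, the data for vector `i` would begin at `buf[offsets[i]]` under these assumptions.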
> Data section sizes:
> | Section | Size Formula | Description |
> |---------------------|-----------------------------|------------------------------|
> | PackedValues | ceil(N * bit\_width / 8) | Bit-packed delta values |

Suggested change:
- | PackedValues | ceil(N * bit\_width / 8) | Bit-packed delta values |
+ | PackedValues | ceil(`vector_size` * bit\_width / 8) | Bit-packed delta values |
> |---------------------|-----------------------------|------------------------------|
> | PackedValues | ceil(N * bit\_width / 8) | Bit-packed delta values |
> | ExceptionPositions | num\_exceptions * 2 bytes | uint16 indices of exceptions |
> | ExceptionValues | num\_exceptions * sizeof(T) | Original float/double values |

Suggested change:
- | ExceptionValues | num\_exceptions * sizeof(T) | Original float/double values |
+ | ExceptionValues | num\_exceptions * sizeof(encoded type) (float=4 and double=8) | Original float/double values |
> The FOR-encoded deltas, bit-packed into `ceil(num_elements_in_vector * bit_width / 8)` bytes.
> Values are packed from the least significant bit of each byte to the most significant bit,
> in groups of 8 values, using the same bit-packing order as the

Where does the group of 8 values come in? Wouldn't this mess up the number-of-bytes math?
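
For reference, LSB-first packing as a single contiguous bit stream can be sketched as below. This is one reading of the quoted text, in which the byte count is exactly `ceil(n * bit_width / 8)` and no padding to groups of 8 values is added; whether the spec intends that is the open question in this thread. The function names are hypothetical.

```python
def bit_pack_lsb_first(values, bit_width):
    """Pack non-negative ints, value i occupying bits [i*w, (i+1)*w) of the stream."""
    acc = 0
    for i, v in enumerate(values):
        if v < 0 or v >> bit_width:
            raise ValueError("value does not fit in bit_width bits")
        acc |= v << (i * bit_width)
    # Little-endian byte order puts stream bit 0 at the LSB of byte 0.
    return acc.to_bytes((len(values) * bit_width + 7) // 8, "little")


def bit_unpack_lsb_first(data, count, bit_width):
    """Inverse of bit_pack_lsb_first for `count` values."""
    acc = int.from_bytes(data, "little")
    mask = (1 << bit_width) - 1
    return [(acc >> (i * bit_width)) & mask for i in range(count)]
```

Packing the values 0 through 7 at a bit width of 3 gives the bytes `88 C6 FA`, while packing only three values at that width gives 2 bytes rather than 3, which is where a "groups of 8" rule would change the size math.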
> The encoding uses two separate multiplications (not a single multiplication by
> `10^(e-f)`, and not division) to ensure that implementations produce identical
> floating-point rounding across languages. The powers of 10 MUST be stored as
> precomputed floating-point constants (i.e., literal values like `1e-3f`), not

Why can't they be precomputed at runtime?
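
The distinction the quoted paragraph draws can be sketched for DOUBLE as follows. The table and function names are hypothetical and the tables are truncated for illustration. `scale_single_step` is the form the spec text rules out; `count_mismatches` is a small helper for checking on a given platform how often the two forms disagree, and no particular count is claimed here.

```python
# Powers of ten written as literals, as the quoted spec text requires.
POW10 = [1e0, 1e1, 1e2, 1e3, 1e4, 1e5, 1e6]
NEG_POW10 = [1e0, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6]


def scale_two_step(value, e, f):
    """value * 10^e * 10^(-f): two multiplications, each rounded separately."""
    return value * POW10[e] * NEG_POW10[f]


def scale_single_step(value, e, f):
    """value * 10^(e-f) in one multiplication (not what the spec prescribes)."""
    return value * (POW10[e - f] if e >= f else NEG_POW10[f - e])


def count_mismatches(samples, e, f):
    """Number of samples where the two forms give different doubles."""
    return sum(scale_two_step(v, e, f) != scale_single_step(v, e, f) for v in samples)
```

Because every intermediate product is rounded, the two forms are not guaranteed to agree bit for bit, which is why a spec that wants identical output across implementations has to pin down one of them.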
> | Type | Magic Number | Formula |
> |--------|-----------------------------------|----------------------------------|
> | FLOAT | 2^22 + 2^23 = 12,582,912 | `(int)((value + magic) - magic)` |
> | DOUBLE | 2^51 + 2^52 = 6,755,399,441,055,744 | `(long)((value + magic) - magic)` |

Suggested change:
- | DOUBLE | 2^51 + 2^52 = 6,755,399,441,055,744 | `(long)((value + magic) - magic)` |
+ | DOUBLE | 2^51 + 2^52 = 6,755,399,441,055,744 | `(int64_t)((value + magic) - magic)` |
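
The DOUBLE row of this table can be tried out directly in Python, whose `float` is an IEEE 754 double on mainstream platforms. The sketch below assumes the default round-to-nearest-even mode and inputs well inside ±2^51; the name `fast_round_double` is hypothetical. The FLOAT row works the same way in 32-bit arithmetic, which plain Python floats do not model.

```python
MAGIC_DOUBLE = float(2**51 + 2**52)  # 6,755,399,441,055,744


def fast_round_double(value: float) -> int:
    """Round to the nearest integer by pushing the fraction out of the mantissa."""
    # Near the magic number, adjacent doubles are exactly 1 apart, so the
    # addition itself rounds `value` to an integer; the subtraction is exact.
    return int((value + MAGIC_DOUBLE) - MAGIC_DOUBLE)
```

Ties go to the even neighbor under this scheme, so `fast_round_double(2.5)` is 2 and `fast_round_double(3.5)` is 4.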
> | Type | Magic Number | Formula |
> |--------|-----------------------------------|----------------------------------|
> | FLOAT | 2^22 + 2^23 = 12,582,912 | `(int)((value + magic) - magic)` |

Suggested change:
- | FLOAT | 2^22 + 2^23 = 12,582,912 | `(int)((value + magic) - magic)` |
+ | FLOAT | 2^22 + 2^23 = 12,582,912 | `(int32_t)((value + magic) - magic)` |
> ```
> +-------------------------------------------------------------------+
> | |
> | encoded = round( value * 10^e * 10^(-f) ) |

Suggested change:
- | encoded = round( value * 10^e * 10^(-f) ) |
+ | encoded = fast_round( value * 10^e * 10^(-f) ) |
See rendered preview here: https://github.com/prtkgaur/parquet-format/blob/alpEncoding/Encodings.md#adaptive-lossless-floating-point-alp--10
Rationale for this change
What changes are included in this PR?
Do these changes have PoC implementations?