
GH-533: Add ALP (Adaptive Lossless floating-Point) encoding specification#557

Open
prtkgaur wants to merge 1 commit into apache:master from prtkgaur:alpEncoding

Conversation


@prtkgaur prtkgaur commented Mar 11, 2026

Add the encoding specification for ALP (encoding value 10) to Encodings.md. ALP compresses FLOAT and DOUBLE columns by converting values to integers via decimal scaling, then applying Frame of Reference encoding and bit-packing. Values that cannot be losslessly round-tripped are stored as exceptions.

See rendered preview here: https://github.com/prtkgaur/parquet-format/blob/alpEncoding/Encodings.md#adaptive-lossless-floating-point-alp--10

The spec covers:

  • Page layout: 7-byte header, offset array, compressed vectors
  • Vector format: AlpInfo, ForInfo, packed values, exception data
  • Encoding math: two-step multiplication for cross-language consistency
  • Parameter selection, exception detection, and decoding steps

Based on the paper "ALP: Adaptive Lossless floating-Point Compression" (Afroozeh and Boncz, SIGMOD 2024). Wire format matches the C++ Arrow and Java parquet-java implementations.
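
To make the scheme concrete, here is a rough sketch of encoding one vector, assuming the exponent/factor parameters `e` and `f` have already been selected. This is not the spec's exact algorithm: Python's `round` stands in for the spec's `fast_round`, and a `0` placeholder stands in for whatever exception-slot policy the spec defines.

```python
def alp_encode_vector(values, e, f):
    """Sketch of ALP encoding for one vector of doubles, with parameters
    e and f assumed already chosen by parameter selection."""
    encoded, exc_pos, exc_val = [], [], []
    for i, v in enumerate(values):
        # Decimal scaling via two separate multiplications
        enc = round(v * 10.0 ** e * 10.0 ** (-f))
        # Round-trip check: decode mirrors the encode direction
        dec = enc * 10.0 ** f * 10.0 ** (-e)
        if dec == v:
            encoded.append(enc)
        else:
            encoded.append(0)        # hypothetical placeholder, patched on decode
            exc_pos.append(i)
            exc_val.append(v)
    # Frame of Reference: subtract the minimum, then bit-pack the deltas
    base = min(encoded)
    deltas = [x - base for x in encoded]
    return base, deltas, exc_pos, exc_val
```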

@alamb changed the title from "Add ALP (Adaptive Lossless floating-Point) encoding specification" to "GH-533: Add ALP (Adaptive Lossless floating-Point) encoding specification" on Mar 11, 2026.
```
Bytes AA 00 A3 BB 11 B4 CC 22 C5 DD 33 D6
```

Contributor:

I think there are now links at the top of file summarizing applicability of encoding.


##### Header (7 bytes)

All multi-byte values are little-endian.
Contributor:

Suggested change
All multi-byte values are little-endian.
All multi-byte values are stored in little-endian order.
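
As an illustration of the little-endian layout, a minimal parse of the three header fields quoted in this thread. The remaining bytes of the 7-byte header are not shown in this excerpt, so this sketch only validates what the quoted rows define; `parse_alp_header_prefix` is a hypothetical name.

```python
import struct

def parse_alp_header_prefix(buf: bytes):
    # The three fields quoted above, all single-byte, read from a
    # little-endian buffer (trivially so for one-byte fields).
    compression_mode, integer_encoding, log_vector_size = struct.unpack_from("<BBB", buf, 0)
    if compression_mode != 0:
        raise ValueError("only compression_mode 0 (ALP) is defined")
    if integer_encoding != 0:
        raise ValueError("only integer_encoding 0 (FOR + bit-packing) is defined")
    if not 3 <= log_vector_size <= 15:
        raise ValueError("log_vector_size must be in [3, 15]")
    return compression_mode, integer_encoding, 1 << log_vector_size
```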

| Offset | Field | Size | Type | Description |
|--------|-------|------|------|-------------|
| 0 | compression_mode | 1 byte | uint8 | Compression mode (must be 0 = ALP) |
| 1 | integer_encoding | 1 byte | uint8 | Integer encoding (must be 0 = FOR + bit-packing) |
| 2 | log_vector_size | 1 byte | uint8 | log2(vector\_size). Must be in \[3, 15\]. Default: 10 (vector size 1024) |
Contributor:

Suggested change
| 2 | log_vector_size | 1 byte | uint8 | log2(vector\_size). Must be in \[3, 15\]. Default: 10 (vector size 1024) |
| 2 | log_vector_size | 1 byte | uint8 | log2(vector\_size). Must be in \[3, 15\]. Recommended default: 10 (vector size 1024) |


| Offset | Field | Size | Type | Description |
|--------|-------|------|------|-------------|
| 0 | compression_mode | 1 byte | uint8 | Compression mode (must be 0 = ALP) |
Contributor:

We think ALP-RD fits in well here? I forget what the extension point is, and why we were OK keeping this field but not version.

| Offset | Field | Size | Type | Description |
|--------|-------|------|------|-------------|
| 0 | compression_mode | 1 byte | uint8 | Compression mode (must be 0 = ALP) |
| 1 | integer_encoding | 1 byte | uint8 | Integer encoding (must be 0 = FOR + bit-packing) |
| 2 | log_vector_size | 1 byte | uint8 | log2(vector\_size). Must be in \[3, 15\]. Default: 10 (vector size 1024) |
Contributor:

Suggested change
| 2 | log_vector_size | 1 byte | uint8 | log2(vector\_size). Must be in \[3, 15\]. Default: 10 (vector size 1024) |
| 2 | log_vector_size | 1 byte | uint8 | log2(vector\_size). Must be in the inclusive range: \[3, 15\]. Default: 10 (vector size 1024) |

**Note:** The number of elements per vector and the packed data size are NOT stored
in the header. They are derived:
* Elements per vector: `vector_size` for all vectors except the last, which may be smaller.
* Packed data size: `ceil(num_elements_in_vector * bit_width / 8)`.
Contributor:

bit_width isn't in the header either, so it is a little strange to call this out here?
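
The derived-size arithmetic from the note above, as a sketch (consistent with the reviewer's point, `bit_width` is taken here as a per-vector property supplied by the caller, not a header field):

```python
import math

def packed_size_bytes(num_elements_in_vector: int, bit_width: int) -> int:
    # Packed data size as derived in the note:
    # ceil(num_elements_in_vector * bit_width / 8)
    return math.ceil(num_elements_in_vector * bit_width / 8)
```

For example, a full default-sized vector of 1024 values packed at 13 bits occupies 1664 bytes.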


**Note:** The number of elements per vector and the packed data size are NOT stored
in the header. They are derived:
* Elements per vector: `vector_size` for all vectors except the last, which may be smaller.
Contributor:

This seems like a bit of a strange callout, since it is covered on line 457 explicitly, and log_vector_size is stored in the header?

values. Each offset gives the byte position of the corresponding vector's data,
measured from the start of the offset array itself.

The first offset equals `num_vectors * 4` (pointing just past the offset array).
Contributor:

Suggested change
The first offset equals `num_vectors * 4` (pointing just past the offset array).
The first offset always equals `num_vectors * 4` (pointing just past the offset array).

Let's be explicit here that we don't support padding.
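
A sketch of the offset layout under the no-padding reading discussed here (the per-vector byte sizes are a hypothetical input for illustration; in the real format a reader takes the offsets directly from the page):

```python
def vector_offsets(vector_byte_sizes):
    # Offsets are measured from the start of the uint32 offset array
    # itself; the first offset points just past the array, and vectors
    # are laid out back to back with no padding between them.
    pos = len(vector_byte_sizes) * 4
    offsets = []
    for size in vector_byte_sizes:
        offsets.append(pos)
        pos += size
    return offsets
```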

Data section sizes:
| Section | Size Formula | Description |
|---------------------|-----------------------------|------------------------------|
| PackedValues | ceil(N * bit\_width / 8) | Bit-packed delta values |
Contributor:

Suggested change
| PackedValues | ceil(N * bit\_width / 8) | Bit-packed delta values |
| PackedValues | ceil(`vector_size` * bit\_width / 8) | Bit-packed delta values |

| Section | Size Formula | Description |
|---------------------|-----------------------------|------------------------------|
| PackedValues | ceil(N * bit\_width / 8) | Bit-packed delta values |
| ExceptionPositions | num\_exceptions * 2 bytes | uint16 indices of exceptions |
| ExceptionValues | num\_exceptions * sizeof(T) | Original float/double values |
Contributor:

Suggested change
| ExceptionValues | num\_exceptions * sizeof(T) | Original float/double values |
| ExceptionValues | num\_exceptions * sizeof(encoded type) (float=4 and double=8) | Original float/double values |


The FOR-encoded deltas, bit-packed into `ceil(num_elements_in_vector * bit_width / 8)` bytes.
Values are packed from the least significant bit of each byte to the most significant bit,
in groups of 8 values, using the same bit-packing order as the
Contributor:

Where does the group of 8 values come in? Wouldn't this mess up the number-of-bytes math?
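
For reference, a continuous LSB-first packing (each value's least significant bit first, filling each byte from bit 0 upward) can be sketched as below; whether the "groups of 8" wording implies anything beyond this byte layout is exactly the open question in this thread.

```python
def bit_pack(values, bit_width):
    # Pack unsigned deltas LSB-first into ceil(n * bit_width / 8) bytes,
    # with each byte filled from its least significant bit upward.
    out = bytearray((len(values) * bit_width + 7) // 8)
    bitpos = 0
    for v in values:
        for i in range(bit_width):
            if (v >> i) & 1:
                out[bitpos >> 3] |= 1 << (bitpos & 7)
            bitpos += 1
    return bytes(out)
```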

The encoding uses two separate multiplications (not a single multiplication by
`10^(e-f)`, and not division) to ensure that implementations produce identical
floating-point rounding across languages. The powers of 10 MUST be stored as
precomputed floating-point constants (i.e., literal values like `1e-3f`), not
Contributor:

Why can't they be precomputed at runtime?
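
A sketch of the two-multiplication scheme being quoted (`EXP10`/`IFRAC10` are hypothetical table names; this sketch fills them with runtime `pow()` for brevity, which is precisely what the quoted text forbids in a real implementation, and it uses Python's `round` rather than the spec's `fast_round`):

```python
EXP10 = [10.0 ** i for i in range(19)]      # 10^e   (spec: literal constants)
IFRAC10 = [10.0 ** -i for i in range(19)]   # 10^-f  (spec: literal constants)

def encode_value(value: float, e: int, f: int) -> int:
    # Two separate multiplications, never a single value * 10**(e - f)
    # and never a division, so every implementation performs the same
    # intermediate floating-point roundings.
    return round(value * EXP10[e] * IFRAC10[f])
```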

| Type | Magic Number | Formula |
|--------|-----------------------------------|----------------------------------|
| FLOAT | 2^22 + 2^23 = 12,582,912 | `(int)((value + magic) - magic)` |
| DOUBLE | 2^51 + 2^52 = 6,755,399,441,055,744 | `(long)((value + magic) - magic)` |
Contributor:

Suggested change
| DOUBLE | 2^51 + 2^52 = 6,755,399,441,055,744 | `(long)((value + magic) - magic)` |
| DOUBLE | 2^51 + 2^52 = 6,755,399,441,055,744 | `(int64_t)((value + magic) - magic)` |
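
The magic-number trick for DOUBLE can be demonstrated directly (a sketch: it is valid only while |value| is small relative to 2^51, and it inherits the FPU's round-ties-to-even behavior):

```python
MAGIC_DOUBLE = float(2**51 + 2**52)  # 6,755,399,441,055,744

def fast_round_double(value: float) -> int:
    # Adding the magic constant pushes the sum into a range where a
    # double's ulp is exactly 1.0, so the addition itself rounds to
    # the nearest integer; subtracting recovers that integer exactly.
    return int((value + MAGIC_DOUBLE) - MAGIC_DOUBLE)
```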


| Type | Magic Number | Formula |
|--------|-----------------------------------|----------------------------------|
| FLOAT | 2^22 + 2^23 = 12,582,912 | `(int)((value + magic) - magic)` |
Contributor:

Suggested change
| FLOAT | 2^22 + 2^23 = 12,582,912 | `(int)((value + magic) - magic)` |
| FLOAT | 2^22 + 2^23 = 12,582,912 | `(int32_t)((value + magic) - magic)` |

```
+-------------------------------------------------------------------+
| |
| encoded = round( value * 10^e * 10^(-f) ) |
Contributor:

Suggested change
| encoded = round( value * 10^e * 10^(-f) ) |
| encoded = fast_round( value * 10^e * 10^(-f) ) |


Successfully merging this pull request may close these issues.

[Proposal] Add ALP encoding support in parquet file format
