
Sampling somehow compresses into 0 bytes #7268

@connortsui20


We have this code in our sampling compressor (which has existed for a long time, even before the most recent changes):

let after = scheme
    .compress(compressor, &mut sample_data, sample_ctx)?
    .nbytes();
let before = sample_data.array().nbytes();

let ratio = before as f64 / after as f64;

tracing::debug!("estimate_compression_ratio_with_sampling(compressor={scheme:#?}) = {ratio}",);

Ok(ratio)

If I add some checks:

let mut sample_data = ArrayAndStats::new(sample_array, scheme.stats_options());
let cascade_history = ctx.cascade_history().to_vec();
let sample_ctx = ctx.with_sampling();

let before = sample_data.array().nbytes();
let after = scheme
    .compress(compressor, &mut sample_data, sample_ctx)?
    .nbytes();

if after == 0 {
    tracing::warn!(
        scheme = %scheme.id(),
        ?cascade_history,
        "sample compressed to 0 bytes, which should only happen for constant arrays",
    );
}

With these checks in place, we get hundreds of warnings saying that bitpacking compresses the sample to 0 bytes. Since after is then 0, the ratio before / after evaluates to infinity, which we interpret as invalid.
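The infinite ratio follows directly from floating-point division: a positive f64 divided by 0.0 is infinity rather than a panic, and 0.0 / 0.0 is NaN. The sketch below shows that behaviour and one possible guard. `checked_ratio` is a hypothetical helper written for illustration, not an existing function in vortex-compressor, and the `u64` size type is an assumption.

```rust
/// Unguarded ratio, mirroring the division in the snippet above.
/// The `u64` byte-count type is an assumption for this sketch.
fn unguarded_ratio(before: u64, after: u64) -> f64 {
    before as f64 / after as f64
}

/// Hypothetical guard (not part of vortex-compressor): report `None`
/// when the compressed size is zero instead of returning infinity or NaN.
fn checked_ratio(before: u64, after: u64) -> Option<f64> {
    if after == 0 {
        return None;
    }
    Some(before as f64 / after as f64)
}
```

With this shape, `unguarded_ratio(1024, 0)` is infinite, while `checked_ratio(1024, 0)` is `None` and the caller has to decide explicitly what a zero-byte result means.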

Here are some examples. Each cascade_history is listed in order of descent, so the first entry in the list is the outermost (parent) scheme.

WARN vortex_compressor::estimate: vortex-compressor/src/estimate.rs:129: sample compressed to 0 bytes, which should only happen for constant arrays scheme=vortex.int.bitpacking
cascade_history=[
(SchemeId { name: "vortex.string.dict" }, 0),
(SchemeId { name: "vortex.string.fsst" }, 0),
(SchemeId { name: "vortex.int.dict" }, 1)]

Here, utf8 is encoded as dict with the values as fsst, the fsst lengths as dict, and the codes as bitpacked.

WARN vortex_compressor::estimate: vortex-compressor/src/estimate.rs:129: sample compressed to 0 bytes, which should only happen for constant arrays scheme=vortex.int.bitpacking
cascade_history=[
(SchemeId { name: "vortex.string.fsst" }, 0),
(SchemeId { name: "vortex.int.rle" }, 1)]

Here, utf8 is encoded as fsst with the lengths as rle, and then the rle indices as bitpacked.
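To illustrate the warning's claim that a zero-byte result should only happen for constant arrays, here is a generic sketch of how a bitpacked payload size depends on bit width. This is not Vortex's bitpacking implementation, and the function names are made up for this example. Whether nbytes() counts only the packed payload is not stated in this issue, so treat this purely as an illustration of the zero-width case.

```rust
/// Minimum number of bits needed to represent every value in `values`.
/// An empty or all-zero slice needs 0 bits.
fn required_bit_width(values: &[u32]) -> u32 {
    let max = values.iter().copied().max().unwrap_or(0);
    u32::BITS - max.leading_zeros()
}

/// Size in bytes of the packed payload alone, ignoring metadata,
/// validity, and patches (which a real encoding may also store).
fn packed_payload_bytes(values: &[u32]) -> usize {
    let bits = values.len() * required_bit_width(values) as usize;
    (bits + 7) / 8
}
```

In this model only an all-zero (or empty) input packs to 0 bytes: `packed_payload_bytes(&[0; 1024])` is 0, while `packed_payload_bytes(&[0, 1, 2, 3])` is 1.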

Labels: bug
