Fixes #877, #876, #884: prevent OOM and unknown chunks in classify.py dask paths #895

Merged
brendancol merged 3 commits into master from fix/classify-dask-oom-877-876
Feb 25, 2026

brendancol commented on Feb 25, 2026

Summary

All four classification functions in classify.py had dask code paths that
could materialise the entire array into RAM or create unknown chunk sizes.
This PR fixes them all with the same pattern: lazy sampling via
_generate_sample_indices() + indexed access, and da.where instead of
boolean fancy indexing.
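The sampling half of that pattern can be sketched roughly as follows. This is a minimal illustration, not the code from classify.py: `sample_indices` is a hypothetical stand-in for `_generate_sample_indices()`, whose real signature and sampling strategy may differ, and the seed value is arbitrary.

```python
import dask.array as da
import numpy as np


def sample_indices(num_data, num_sample, seed=0):
    # Hypothetical stand-in for _generate_sample_indices(); the real
    # helper in classify.py may differ.
    num_sample = min(num_sample, num_data)  # cap the sample at the data size
    rng = np.random.default_rng(seed)
    idx = rng.choice(num_data, size=num_sample, replace=False)
    return np.sort(idx)


def lazy_sample(data, num_sample=20_000):
    # Build the index on the host, then let dask pull only those
    # elements; the full array is never returned by compute().
    idx = sample_indices(data.size, num_sample)
    return data.ravel()[idx].compute()


data = da.random.random((2_000, 2_000), chunks=(500, 500))
sample = lazy_sample(data)
print(sample.shape)
```

Only the sampled values come back as a NumPy array; the chunks themselves are still read one by one as the graph executes.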

  • natural_breaks (#877, "classify.natural_breaks: fallback .ravel().compute() when no sampling"): Removed the else branch in
    _run_dask_natural_break and _run_dask_cupy_natural_break that called
    data.ravel().compute() when num_sample is None or >= data.size.
    Now caps num_sample to data.size and always uses indexed access
    (data.ravel()[sample_idx].compute()).

  • maximum_breaks (#876, "classify.maximum_breaks: unconditional .ravel().compute() on dask"): Added num_sample parameter (default 20_000,
    matching natural_breaks). Both dask functions now use
    _generate_sample_indices() + indexed access instead of unconditional
    .ravel().compute().

  • quantile & percentiles (#884, "classify.quantile: boolean fancy indexing on dask creates unknown chunks"): Replaced boolean fancy indexing
    (data[da.isfinite(data)]), which creates unknown dask chunk sizes,
    with dedicated _run_dask_* functions that use da.where to clean
    inf→nan (preserving known chunks), then sample lazily via indexed access
    and compute percentiles with np.percentile + np.unique. Added
    num_sample parameter (default 20_000) to both quantile() and
    percentiles(). The numpy/cupy in-memory paths accept and ignore
    num_sample.
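The chunk-size difference behind #884 can be seen on a toy array. This is a sketch under assumptions: the array is small enough to compute whole (the PR samples lazily first), and dropping NaNs with `np.isfinite` before `np.percentile` is my assumption about how the remaining NaNs are handled, since the description does not spell out that step.

```python
import dask.array as da
import numpy as np

raw = np.array([[1.0, np.inf, 3.0, 4.0],
                [5.0, 6.0, -np.inf, np.nan]])
data = da.from_array(raw, chunks=(1, 2))

# Boolean fancy indexing: the number of kept elements per chunk is only
# known after computing, so dask records the chunk sizes as nan.
masked = data[da.isfinite(data)]
print(masked.chunks)

# da.where is elementwise, so shape and chunks match the input.
clean = da.where(da.isinf(data), np.nan, data)
print(clean.chunks)

# Assumption: NaNs are dropped in NumPy once the values are in memory.
values = clean.ravel().compute()
finite = values[np.isfinite(values)]
q = np.unique(np.percentile(finite, [25, 50, 75, 100]))
print(q)
```

Because `clean` keeps known chunks, later steps such as `ravel()` and integer indexing can be planned by dask without computing anything first.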

Test plan

  • All 84 existing + new tests in test_classify.py pass
  • Verify on a large dask array that memory stays bounded
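One way the second item could be checked by hand is sketched below, using `tracemalloc` to record peak traced allocations while a sample is computed. Everything here is illustrative: `peak_bytes_for_sample` is a hypothetical helper, the synchronous scheduler is chosen only to make the measurement less noisy, and the shape should be scaled up well beyond this toy size for a meaningful check.

```python
import tracemalloc

import dask
import dask.array as da
import numpy as np


def peak_bytes_for_sample(shape, chunks, num_sample, seed=0):
    # Hypothetical helper: compute a sample and report peak traced memory.
    data = da.ones(shape, chunks=chunks, dtype="float64")
    n = min(num_sample, data.size)
    rng = np.random.default_rng(seed)
    idx = np.sort(rng.choice(data.size, size=n, replace=False))
    tracemalloc.start()
    try:
        # Single-threaded execution keeps few chunks alive at once.
        with dask.config.set(scheduler="synchronous"):
            sample = data.ravel()[idx].compute()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return sample, peak


shape = (2_000, 2_000)
sample, peak = peak_bytes_for_sample(shape, (500, 500), 20_000)
nominal = shape[0] * shape[1] * 8  # bytes if the array were fully in RAM
print(f"sample: {sample.size} values, peak traced: {peak / 1e6:.1f} MB, "
      f"full array: {nominal / 1e6:.1f} MB")
```

Comparing the peak against the nominal size at several array sizes gives a rough indication of whether memory grows with the sample and chunk size rather than with the full array.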

…ify.py

natural_breaks and maximum_breaks dask code paths called .ravel().compute()
on the full array, materialising the entire dataset into RAM. Replace with
capped sampling via _generate_sample_indices() + indexed access so only the
sample is ever computed. Add num_sample parameter to maximum_breaks (default
20_000, matching natural_breaks).
…path

quantile() and percentiles() used data[module.isfinite(data)] on dask
arrays, which creates unknown chunk sizes that degrade scheduling and
can force unexpected materialisations.  Replace with dedicated dask
functions that use da.where to clean inf→nan (preserving known chunks),
compute to numpy, then use np.nanpercentile + np.unique.
brendancol changed the title from "Fixes #877, #876: prevent OOM from full dask materialisation in classify.py" to "Fixes #877, #876, #884: prevent OOM and unknown chunks in classify.py dask paths" on Feb 25, 2026
…aths

The previous commit eliminated unknown dask chunks but still materialised
the full array via .ravel().compute().  Now both functions accept
num_sample (default 20_000, matching natural_breaks/maximum_breaks) and
use _generate_sample_indices() + indexed access so only the sample is
ever computed on dask backends.
brendancol merged commit fd78352 into master on Feb 25, 2026
10 checks passed
brendancol deleted the fix/classify-dask-oom-877-876 branch on February 26, 2026 14:46