Fixes #877, #876, #884: prevent OOM and unknown chunks in classify.py dask paths by brendancol · Pull Request #895 · xarray-contrib/xarray-spatial

brendancol · 2026-02-25T15:22:30Z

Summary

All four classification functions in classify.py had dask code paths that
could materialise the entire array into RAM or create unknown chunk sizes.
This PR fixes them all with the same pattern: lazy sampling via
_generate_sample_indices() + indexed access, and da.where instead of
boolean fancy indexing.
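
As a rough illustration of why the second half of that pattern matters, the sketch below compares the chunk metadata produced by boolean fancy indexing with that produced by da.where. The array contents and chunk size are made up for the example; only da.isfinite, da.isinf and da.where come from the description above.

```python
import numpy as np
import dask.array as da

# Small illustrative 1-D array containing one inf and one nan (made-up data).
values = np.arange(8, dtype="float64")
values[0] = np.inf
values[5] = np.nan
data = da.from_array(values, chunks=4)

# Boolean fancy indexing: the output length depends on the data itself,
# so dask cannot know the chunk sizes without computing them.
masked = data[da.isfinite(data)]
print(masked.chunks)   # chunk sizes are reported as nan (unknown)

# da.where keeps the original shape, so the chunk sizes stay known.
cleaned = da.where(da.isinf(data), np.nan, data)
print(cleaned.chunks)  # same chunking as `data`
```

The cleaned array still contains nan values, so whatever consumes it has to be nan-aware or drop them after the sample has been computed.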

natural_breaks (#877, "classify.natural_breaks: fallback .ravel().compute()
when no sampling"): Removed the else branch in _run_dask_natural_break and
_run_dask_cupy_natural_break that called data.ravel().compute() when
num_sample is None or >= data.size. num_sample is now capped at data.size
and the sample is always taken with indexed access
(data.ravel()[sample_idx].compute()).
maximum_breaks (#876, "classify.maximum_breaks: unconditional
.ravel().compute() on dask"): Added a num_sample parameter (default 20_000,
matching natural_breaks). Both dask functions now use
_generate_sample_indices() + indexed access instead of an unconditional
.ravel().compute().
quantile & percentiles (#884, "classify.quantile: boolean fancy indexing on
dask creates unknown chunks"): Replaced boolean fancy indexing
(data[da.isfinite(data)]), which creates unknown dask chunk sizes, with
dedicated _run_dask_* functions. These use da.where to convert inf to nan
(preserving known chunks), sample lazily via indexed access, and compute
percentiles with np.percentile + np.unique. Added a num_sample parameter
(default 20_000) to both quantile() and percentiles(). The numpy/cupy
in-memory paths accept and ignore num_sample.
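
The sampling side of the changes above could look roughly like the sketch below. It is not the code from this PR: sample_indices() is a hypothetical stand-in for _generate_sample_indices() (whose real signature is not shown here), sampled_quantile_bins() is an invented name, and both the percentile grid and the nan-dropping step are assumptions made for illustration.

```python
import numpy as np
import dask.array as da


def sample_indices(size, num_sample, seed=0):
    """Hypothetical stand-in for _generate_sample_indices(); the real helper may differ."""
    # Cap the request so we never ask for more elements than exist.
    num_sample = size if num_sample is None else min(num_sample, size)
    rng = np.random.default_rng(seed)
    # Sorted flat indices, drawn without replacement.
    return np.sort(rng.choice(size, size=num_sample, replace=False))


def sampled_quantile_bins(data, k=4, num_sample=20_000):
    """Illustrative only: bin edges estimated from a lazily drawn sample."""
    # Convert inf to nan while keeping the shape, so chunk sizes stay known.
    cleaned = da.where(da.isinf(data), np.nan, data)
    idx = sample_indices(cleaned.size, num_sample)
    # Indexed access: only the selected elements are returned to the caller.
    sample = cleaned.ravel()[idx].compute()
    # Assumption: nan values are dropped in memory before taking percentiles.
    sample = sample[~np.isnan(sample)]
    q = np.linspace(0, 100, k + 1)[1:]  # assumed grid, e.g. 25, 50, 75, 100 for k=4
    return np.unique(np.percentile(sample, q))


# Made-up input: 100 values in a 10x10 array split into 5x5 chunks.
data = da.from_array(np.arange(100, dtype="float64").reshape(10, 10), chunks=(5, 5))
bins = sampled_quantile_bins(data, k=4, num_sample=1_000)
print(bins)
```

Here num_sample exceeds data.size, so the cap applies and every element ends up in the sample; with a genuinely large array only num_sample elements would be brought into memory as the result.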

Test plan

All 84 existing + new tests in test_classify.py pass
Verify on a large dask array that memory stays bounded

…ify.py natural_breaks and maximum_breaks dask code paths called .ravel().compute() on the full array, materialising the entire dataset into RAM. Replace with capped sampling via _generate_sample_indices() + indexed access so only the sample is ever computed. Add num_sample parameter to maximum_breaks (default 20_000, matching natural_breaks).

…path quantile() and percentiles() used data[module.isfinite(data)] on dask arrays, which creates unknown chunk sizes that degrade scheduling and can force unexpected materialisations. Replace with dedicated dask functions that use da.where to clean inf→nan (preserving known chunks), compute to numpy, then use np.nanpercentile + np.unique.

…aths The previous commit eliminated unknown dask chunks but still materialised the full array via .ravel().compute(). Now both functions accept num_sample (default 20_000, matching natural_breaks/maximum_breaks) and use _generate_sample_indices() + indexed access so only the sample is ever computed on dask backends.

brendancol added 2 commits February 25, 2026 07:22

brendancol changed the title from "Fixes #877, #876: prevent OOM from full dask materialisation in classify.py" to "Fixes #877, #876, #884: prevent OOM and unknown chunks in classify.py dask paths" on Feb 25, 2026

brendancol merged commit fd78352 into master Feb 25, 2026
10 checks passed

brendancol deleted the fix/classify-dask-oom-877-876 branch February 26, 2026 14:46

brendancol merged 3 commits into master from fix/classify-dask-oom-877-876
