🔨 Fix `SemanticSegmentor` Memory Spill Issue #990

Jiaqi-Lv · 2026-01-29T12:43:37Z

This pull request introduces significant improvements to memory management and efficiency in the semantic segmentation engine, especially for large whole-slide image (WSI) processing. The main changes focus on incremental processing and disk-backed storage to avoid excessive RAM usage, as well as more robust cleanup of temporary files. There are also adjustments to test tolerances and some bug fixes in array handling.

Memory management and efficiency improvements:

The prepare_full_batch function now dynamically decides whether to use in-memory NumPy arrays or disk-backed Zarr arrays for large batch outputs, based on available system memory and a configurable threshold. This prevents memory spikes when processing large WSIs. [1] [2] [3] [4]
The save_to_cache function has been refactored to incrementally write Dask array blocks to Zarr on disk, avoiding materializing large arrays in memory and reducing peak RAM usage. [1] [2]
Memory usage checks in infer_wsi now use up-to-date available memory rather than an initial snapshot, and intermediate results are spilled to disk when thresholds are exceeded. [1] [2] [3]

Robustness and cleanup:

Temporary Zarr directories and files created during processing are now properly cleaned up after use, preventing disk space leaks. [1] [2]

Bug fixes and test adjustments:

Fixed array type handling in merge_batch_to_canvas to ensure compatibility with both NumPy and Dask arrays.
Corrected logic in merge_horizontal to compute spans and concatenate outputs only for the current row, improving correctness and efficiency. [1] [2]
Relaxed the upper bound on mean prediction values in several tests to account for increased variability with the new memory management approach. [1] [2] [3]

Previous Problem:
I encountered out-of-memory issues and Python kept crashing when processing a relatively large WSI.
The example slide I was trying to run was: https://huggingface.co/datasets/TIACentre/TIAToolBox_Remote_Samples/blob/main/sample_wsis/D_P000019_PAS_CPG.tif. The code I was trying to run was:

segmentor = SemanticSegmentor(model="fcn_resnet50_unet-bcss")
out = segmentor.run(
    images=[Path(wsi_path)],
    patch_mode=False,
    device="cuda",
    save_dir=output_path,
    overwrite=True,
    output_type="annotationstore",
    auto_get_mask=True,
    memory_threshold=25,
    num_workers=0,
    batch_size=8,
)

Before this PR, the code kept crashing on my workstation, which has 32GBs of RAM, memory spiked to 100% just before it crashed.

codecov · 2026-01-29T12:59:43Z

Codecov Report

❌ Patch coverage is 69.49153% with 18 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (dev-define-engines-abc@ecd61dc). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
tiatoolbox/models/engine/semantic_segmentor.py	69.49%	12 Missing and 6 partials ⚠️

Additional details and impacted files

@@                    Coverage Diff                    @@
##             dev-define-engines-abc     #990   +/-   ##
=========================================================
  Coverage                          ?   95.12%           
=========================================================
  Files                             ?       79           
  Lines                             ?     9996           
  Branches                          ?     1288           
=========================================================
  Hits                              ?     9509           
  Misses                            ?      445           
  Partials                          ?       42

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copilot

Pull request overview

This PR targets out-of-memory crashes in SemanticSegmentor.infer_wsi() when processing large WSIs by reducing peak RAM usage and spilling intermediate results to disk more aggressively.

Changes:

Adds disk-backed intermediates (Zarr/temp dirs) to reduce RAM spikes during WSI merging and masked-output alignment.
Adjusts horizontal merging logic to bound per-row allocations.
Updates zarr-output tests to tolerate slightly higher mean prediction values.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File	Description
`tiatoolbox/models/engine/semantic_segmentor.py`	Refactors WSI inference/merging and introduces new disk-backed caching/temp-dir handling intended to prevent OOM.
`tests/engines/test_semantic_segmentor.py`	Relaxes mean-prediction assertion upper bounds for WSI zarr outputs.

Comments suppressed due to low confidence (3)

tiatoolbox/models/engine/semantic_segmentor.py:571

When intermediate spilling occurs (canvas_zarr is not None), the remaining in-memory canvas/count produced after the last spill are never appended to the Zarr datasets anymore. The code then overwrites canvas/count with da.from_zarr(...), effectively dropping the tail rows and producing truncated outputs. Restore a final flush into canvas_zarr/count_zarr before wrapping, or concatenate the existing Zarr-backed arrays with the remaining in-memory chunks before the vertical merge.

        zarr_group = None
        if canvas_zarr is not None:
            # Wrap zarr in dask array
            canvas = da.from_zarr(canvas_zarr, chunks=canvas_zarr.chunks)
            count = da.from_zarr(count_zarr, chunks=count_zarr.chunks)
            zarr_group = zarr.open(canvas_zarr.store.path, mode="a")

tiatoolbox/models/engine/semantic_segmentor.py:1091

merged_shape is computed using batch_xs/batch_xe, but merge_batch_to_canvas still uses the original global xs:xe values from output_locs_ when writing into the local row canvas. If batch_xs is non-zero (e.g., sparse/masked rows), this will either index out of bounds or place blocks at the wrong x-offset. Adjust output_locs_ to be relative to batch_xs (and ensure downstream vertical merging accounts for any x-offset), or keep a consistent global-width canvas per row.

        # Compute span only for the current row to avoid allocating a canvas
        # covering the entire slide width.
        batch_xs = np.min(output_locs_[:, 0], axis=0)
        batch_xe = np.max(output_locs_[:, 2], axis=0)

        merged_shape = (canvas_np_.shape[1], batch_xe - batch_xs, canvas_np.shape[3])

        canvas_merge, count_merge = merge_batch_to_canvas(
            blocks=canvas_np_,
            output_locations=output_locs_,
            merged_shape=merged_shape,
        )

tiatoolbox/models/engine/semantic_segmentor.py:1438

prepare_full_batch now returns a Zarr array (zarr.zeros(...)) but the function signature/docstring still claim it returns np.ndarray, and the caller later passes this into utilities that expect NumPy/Dask arrays. This can break concatenation and/or force full materialization via __array__. Either return a concrete NumPy array (only when safe) or return a Dask array (e.g., da.from_zarr) and update the downstream merge/concatenation path to handle it consistently.

    store = zarr.DirectoryStore(str(temp_dir))
    full_batch_output = zarr.zeros(
        shape=(total_size, *sample_shape),
        chunks=(len(batch_output), *sample_shape),
        dtype=batch_output.dtype,
        store=store,
        overwrite=True,
    )

    # Place matching outputs using matching indices
    full_batch_output[matches] = batch_output

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

tests/engines/test_semantic_segmentor.py

tiatoolbox/models/engine/semantic_segmentor.py

tests/engines/test_semantic_segmentor.py

tiatoolbox/models/engine/semantic_segmentor.py

shaneahmed · 2026-01-29T15:11:29Z

tiatoolbox/models/engine/semantic_segmentor.py

-    # Initialize full output array
-    full_batch_output = np.zeros(
-        shape=(total_size, *batch_output.shape[1:]),
+    store = zarr.DirectoryStore(str(temp_dir))


Wouldn't it be slow when you save to zarr instead of numpy?

This is actually the main cause of the memory issue.

File "/media/u1910100/data/GitHub/tiatoolbox/tiatoolbox/models/engine/semantic_segmentor.py", line 1381, in prepare_full_batch full_batch_output = np.zeros( ^^^^^^^^^ numpy._core._exceptions._ArrayMemoryError: Unable to allocate 33.0 GiB for an array with shape (6762, 512, 512, 5) and data type float32

This is the log when the program crashed. The numpy array is too large. We can't use numpy here.

Can we not check the size of the array before doing this? if the array is small in memory, it should just use numpy.

Yes, we can. I'll change it.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-01-29T20:18:40Z

tiatoolbox/models/engine/semantic_segmentor.py

+        pad_len = len(full_output_locs)
+        if not use_numpy:
+            # Resize zarr array to accommodate padding
+            full_batch_output.resize(total_size + pad_len, *sample_shape)


zarr.Array.resize expects the new shape as a single tuple. Calling full_batch_output.resize(total_size + pad_len, *sample_shape) will raise a TypeError (in Zarr v2) and break the last-batch padding path when use_numpy is false. Pass a single shape tuple instead (consistent with other resize calls in this file).

Suggested change

full_batch_output.resize(total_size + pad_len, *sample_shape)

full_batch_output.resize((total_size + pad_len, *sample_shape))

Copilot · 2026-01-29T20:18:41Z

tiatoolbox/models/engine/semantic_segmentor.py

+    # Calculate final size including potential padding
+    final_size = total_size
+    if is_last and len(full_output_locs):
+        final_size += len(full_output_locs)


final_size is computed using len(full_output_locs) before full_output_locs is sliced (full_output_locs = full_output_locs[total_size:]). For the last batch this overestimates the required padding and can significantly inflate the RAM size estimate / preallocated NumPy array. Compute padding based on the remaining full_output_locs after slicing (or compute pad_len = max(len(full_output_locs) - total_size, 0) up-front).

Suggested change

# Calculate final size including potential padding

final_size = total_size

if is_last and len(full_output_locs):

final_size += len(full_output_locs)

# Calculate final size including potential padding for remaining locations

pad_len = 0

if is_last and len(full_output_locs):

# Remaining locations after consuming indices up to total_size

pad_len = max(len(full_output_locs) - total_size, 0)

final_size = total_size + pad_len

Copilot · 2026-01-29T20:18:41Z

tiatoolbox/models/engine/semantic_segmentor.py

+    else:
+        # Array too large, use zarr backed by disk to avoid RAM spikes
+        # Use a unique temp subdirectory per call to avoid chunk-shape clashes
+        msg = (
+            f"Estimated peak memory usage for full batch output: "
+            f"{peak_bytes / (1024**3):.2f} GB exceeds threshold of "
+            f"{memory_available / (1024**3):.2f} GB."
+            f"Allocating full batch output of size "
+            f"{final_size}x{sample_shape} using Zarr on disk."
+        )
+        logger.info(msg)

-    total_size = np.max(matches).astype(np.uint32) + 1
+        save_path_dir = Path(save_path)
+        save_path_dir.mkdir(parents=True, exist_ok=True)
+        temp_dir = Path(
+            tempfile.mkdtemp(prefix="full_batch_tmp_", dir=str(save_path_dir))
+        )

-    # Initialize full output array
-    full_batch_output = np.zeros(
-        shape=(total_size, *batch_output.shape[1:]),
-        dtype=batch_output.dtype,
-    )
+        store = zarr.DirectoryStore(str(temp_dir))
+        full_batch_output = zarr.zeros(
+            shape=(total_size, *sample_shape),
+            chunks=(len(batch_output), *sample_shape),
+            dtype=batch_output.dtype,
+            store=store,
+            overwrite=True,
+        )


New behavior switches prepare_full_batch between in-memory NumPy and disk-backed Zarr based on available memory and also writes temp stores under full_batch_tmp. There isn’t a focused unit test covering the Zarr-backed branch (including correct shape/padding behavior and that temp directories get cleaned up). Consider adding a test that forces low available memory (mock psutil.virtual_memory().available) to exercise the Zarr path and verifies the returned array contents + cleanup.

fix memory spill issue

b3131e9

Jiaqi-Lv changed the title ~~:hamer Fix SemanticSegmentor Memory Spill Issue~~ :hammer Fix SemanticSegmentor Memory Spill Issue Jan 29, 2026

Jiaqi-Lv changed the title ~~:hammer Fix SemanticSegmentor Memory Spill Issue~~ 🔨 Fix SemanticSegmentor Memory Spill Issue Jan 29, 2026

Jiaqi-Lv mentioned this pull request Jan 29, 2026

🔨 Fix Semantic Segmentor Memory Spill Issue #988

Closed

Jiaqi-Lv self-assigned this Jan 29, 2026

Jiaqi-Lv added the enhancement New feature or request label Jan 29, 2026

Jiaqi-Lv added 2 commits January 29, 2026 13:34

revert some changes

f8e6a49

update docstring

7a2127a

Jiaqi-Lv marked this pull request as ready for review January 29, 2026 14:14

Jiaqi-Lv requested review from Copilot, measty and shaneahmed January 29, 2026 14:21

Copilot started reviewing on behalf of Jiaqi-Lv January 29, 2026 14:21 View session

Jiaqi-Lv requested a review from adamshephard January 29, 2026 14:26

Copilot AI reviewed Jan 29, 2026

View reviewed changes

shaneahmed added this to the Release v2.0.0 milestone Jan 29, 2026

shaneahmed reviewed Jan 29, 2026

View reviewed changes

tiatoolbox/models/engine/semantic_segmentor.py Show resolved Hide resolved

shaneahmed reviewed Jan 29, 2026

View reviewed changes

tiatoolbox/models/engine/semantic_segmentor.py Outdated Show resolved Hide resolved

shaneahmed reviewed Jan 29, 2026

View reviewed changes

tiatoolbox/models/engine/semantic_segmentor.py Outdated Show resolved Hide resolved

shaneahmed reviewed Jan 29, 2026

View reviewed changes

tiatoolbox/models/engine/semantic_segmentor.py Outdated Show resolved Hide resolved

shaneahmed reviewed Jan 29, 2026

View reviewed changes

tiatoolbox/models/engine/semantic_segmentor.py Outdated Show resolved Hide resolved

shaneahmed reviewed Jan 29, 2026

View reviewed changes

use numpy for full batch if possible

4c6c930

Jiaqi-Lv requested a review from Copilot January 29, 2026 20:13

Copilot started reviewing on behalf of Jiaqi-Lv January 29, 2026 20:13 View session

Copilot AI reviewed Jan 29, 2026

View reviewed changes

	full_batch_output.resize(total_size + pad_len, *sample_shape)
	full_batch_output.resize((total_size + pad_len, *sample_shape))

🔨 Fix SemanticSegmentor Memory Spill Issue #990

Are you sure you want to change the base?

🔨 Fix SemanticSegmentor Memory Spill Issue #990

Uh oh!

Conversation

Jiaqi-Lv commented Jan 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Jan 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

shaneahmed Jan 29, 2026

Choose a reason for hiding this comment

Uh oh!

Jiaqi-Lv Jan 29, 2026

Choose a reason for hiding this comment

Uh oh!

shaneahmed Jan 29, 2026

Choose a reason for hiding this comment

Uh oh!

Jiaqi-Lv Jan 29, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Jan 29, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 29, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 29, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

🔨 Fix `SemanticSegmentor` Memory Spill Issue #990

🔨 Fix `SemanticSegmentor` Memory Spill Issue #990

Jiaqi-Lv commented Jan 29, 2026 •

edited

Loading

codecov bot commented Jan 29, 2026 •

edited

Loading