Add the new readahead_v2 cache#750

Merged
ankitaluthra1 merged 6 commits into fsspec:main from ankitaluthra1:zonalcache
Jan 29, 2026

Conversation

@googlyrahman
Collaborator

@googlyrahman googlyrahman commented Jan 19, 2026

Adds the new readahead_v2 cache specifically for GCSFS.

The Problem: The existing readahead cache minimizes network connections by over-fetching (e.g., fetching 10MB when 5MB is requested). However, it stores this data as a single, contiguous bytes object. Serving a read request requires slicing this object, which in Python triggers a memory copy. For large block sizes (e.g., 100MB or 500MB), this copying operation becomes CPU-intensive and blocks the asyncio event loop, degrading performance.

The Solution: The new readahead_v2 cache implementation stores fetched chunks as separate bytes objects rather than concatenating them. When a read request matches a stored chunk, the cache returns a direct reference to the object. This "zero-copy" approach eliminates the overhead of slicing and prevents CPU/event loop blocking, ensuring efficient memory usage when the request size aligns with the block size.

The cache is currently integrated into the system only when the EXPERIMENTAL environment variable is set, making it an opt-in feature. We plan to make it the default in a subsequent release.

NOTE: Although we could optimize memory usage further, we chose to maintain the exact semantics of the original readahead cache. This ensures stability for unlikely but valid read patterns.
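A minimal, hypothetical sketch of the difference between the two strategies (illustrative names only, not the actual gcsfs implementation):

```python
# Hypothetical sketch: slicing a contiguous buffer (v1 style) copies,
# while returning a stored chunk directly (v2 style) does not.

MB = 2**20
size = 1 * MB

# v1 style: one contiguous over-fetched buffer; serving a read slices it.
block = b"x" * (10 * MB)
served_v1 = block[0:size]
assert served_v1 is not block        # a fresh object: memory was copied

# v2 style: each fetched chunk is kept as its own bytes object,
# keyed here by its start offset (a simplified stand-in for the cache).
chunks = {0: b"a" * size, size: b"b" * size}

def read(offset, length):
    """Serve a read; return the stored chunk itself when it aligns."""
    chunk = chunks.get(offset)
    if chunk is not None and len(chunk) == length:
        return chunk                 # zero-copy: same object, no slicing
    raise NotImplementedError("unaligned reads omitted from this sketch")

assert read(0, size) is chunks[0]    # identity, not just equality
```

The identity check (`is`) is the whole point: when the request size aligns with the block size, no new object is created on a hit.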

@martindurant
Member

There is clearly some thought and effort going into this implementation. Might I suggest that we try to keep to memoryview objects instead of bytes? They certainly allow zero-copy referencing and slicing. Unfortunately, the fsspec APIs generally return bytes in all cases, so I wonder whether we can make a memoryview-backed bytes-like object.

This is connected to the effort in Jamie-Chang/aiointerpreters#8 to bring true parallelism to fsspec: memoryview objects are zero-cost passing between interpreters, the only python object this applies to.

@googlyrahman
Collaborator Author

@martindurant, thanks for taking an early look and sharing your view.

I actually did consider a memoryview solution before arriving at this one, but my understanding is that it doesn't improve performance in this context. The main reason is exactly what you noted: "Unfortunately, the fsspec APIs generally return bytes in all cases." That is the core issue. We receive data as bytes from aiohttp (it allocates bytes internally even if we pass a buffer). We could wrap that in a memoryview without touching the CPU, but the moment we serve read calls to customers, we can't return the memoryview; we have to return bytes to preserve backward compatibility and the API contract. Doing so (memoryview.tobytes()) creates a fresh copy, which hits the CPU and hurts performance.

This particular solution avoids creating copies and blocking the CPU. It stores those 5MB aiohttp responses as-is and serves the user a reference rather than creating a new copy. This works because Python bytes are immutable, so even if we pass the reference, the user cannot change it.

As I understand it, using memoryview effectively would require changing the fsspec API contract, which likely isn't feasible at the moment, or would at best be a long-term effort in which users migrate their workloads from bytes to memoryview. This solution maintains backward compatibility and improves performance. With that said, please let me know if I missed anything or if there is a different way to approach this.
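For illustration, a small sketch of the trade-off above (plain Python, no gcsfs specifics): handing out a cached bytes object is a reference pass, while honouring a bytes-returning API from a memoryview forces a copy on every call.

```python
stored = b"payload" * 1024           # what the cache would hold after download

# Serving the stored bytes directly: same object, no CPU work, and safe
# to hand out because bytes are immutable.
served = stored
assert served is stored

# A memoryview over the buffer is itself zero-copy...
view = memoryview(stored)
assert view.obj is stored            # the view wraps the same buffer

# ...but satisfying the bytes contract requires tobytes(), a fresh copy.
as_bytes = view.tobytes()
assert as_bytes == stored
assert as_bytes is not stored        # new allocation on every conversion
```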

@googlyrahman
Collaborator Author

/gcbrun

@googlyrahman
Collaborator Author

/gcbrun

@googlyrahman googlyrahman marked this pull request as ready for review January 20, 2026 05:33
@googlyrahman googlyrahman force-pushed the zonalcache branch 5 times, most recently from c82bb37 to 57538b5 Compare January 20, 2026 09:18
@googlyrahman
Collaborator Author

/gcbrun

1 similar comment
@ankitaluthra1
Collaborator

/gcbrun

@martindurant
Member

Yeah, I did some reading around and indeed I see no way to ingest memoryviews directly out of aiohttp (or any other network client). This seems like a mistake to me! Maybe someone should implement the buffers in Rust to give true zero-copy behaviour...

@googlyrahman
Collaborator Author

Yeah, that's correct, but that's only half of the problem. Even if we hypothetically achieved zero-copy ingestion using memoryview internally, my understanding is that it would still be less efficient for a cache under the current API constraints. Since fsspec mandates returning bytes, a memoryview approach would force us to run .tobytes() on every cache hit to satisfy the contract. This means we would pay the CPU copy 'tax' repeatedly, every time the user reads the same chunk, i.e., on every cache hit.

By contrast, the readahead_v2 approach leverages the immutability of bytes. We pay the allocation tax exactly once (during the initial download). All subsequent reads, whether immediate or from cache, are just reference passing, which is truly zero-copy and zero-CPU. This also avoids a memory spike: with memoryview, serving a read would temporarily require 2x RAM (the internal buffer plus the new bytes copy). With the current approach, the cache and the user share the exact same memory object.
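The per-hit cost can be made concrete with a small illustrative sketch (not the actual cache code):

```python
data = b"z" * 2**20                  # pretend: one downloaded 1 MB block

view_cache = memoryview(data)        # hypothetical memoryview-backed cache
bytes_cache = data                   # bytes-backed cache (readahead_v2 style)

# Serve the same chunk three times under the bytes-returning contract.
mv_hits = [view_cache.tobytes() for _ in range(3)]     # one copy per hit
by_hits = [bytes_cache for _ in range(3)]              # one reference per hit

# memoryview path: three distinct allocations while all hits are alive.
assert len({id(b) for b in mv_hits}) == 3
# bytes path: every hit is the very same object; no extra memory at all.
assert len({id(b) for b in by_hits}) == 1
```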

To completely solve the problem for any request size, we would need two things:

  • Support in aiohttp (or the network client) for reading directly into a memoryview with zero copy (this does not currently exist).
  • An update to the fsspec contract to return memoryview (a breaking, backward-incompatible change).

Until then, the proposed solution optimizes performance specifically for cases where request_size == block_size.

Let me know if you have any questions, or if you have a better approach to this I would love to hear it!

Comment thread gcsfs/core.py Outdated
@Jamie-Chang

Not particularly useful for you right now, but it might interest you that in Python 3.15 there will be a zero-copy way to get bytes out of a bytearray: https://docs.python.org/3.15/library/stdtypes.html#bytearray.take_bytes

Comment thread gcsfs/core.py Outdated
Comment thread gcsfs/extended_gcsfs.py
Comment thread gcsfs/zonal_file.py
Collaborator Author

@googlyrahman googlyrahman left a comment


Addressed comments.

@googlyrahman
Collaborator Author

Added the before vs. after comparison to the description as well, for both regional and zonal. Please take a look when you have time :)

@ankitaluthra1
Collaborator

/gcbrun

Comment thread gcsfs/core.py Outdated
@ankitaluthra1
Collaborator

/gcbrun

2 similar comments
@ankitaluthra1
Collaborator

/gcbrun

@ankitaluthra1
Collaborator

/gcbrun

@googlyrahman
Collaborator Author

Had to rebase and force-push, since the branch had merge conflicts with fsspec/main.

@ankitaluthra1
Collaborator

/gcbrun

@ankitaluthra1 ankitaluthra1 merged commit fcefacb into fsspec:main Jan 29, 2026
7 checks passed