|
4 | 4 | "cell_type": "markdown", |
5 | 5 | "id": "b02d33be", |
6 | 6 | "metadata": {}, |
7 | | - "source": "# Working with Planetary Computer data using obstore\n\nThis notebook walks through reading Planetary Computer data with [obstore](https://developmentseed.org/obstore/) — a Python library that talks to cloud object stores (Azure Blob, S3, GCS) directly, without going through HTTP wrappers like fsspec. It's the foundation that higher-level libraries (async-geotiff, zarr-python, deck.gl-raster via Lonboard) sit on top of.\n\n**Five reasons to use it over the older `planetary_computer.sign() + fsspec` pattern:**\n\n1. **Reliability** — SAS tokens auto-refresh. No `TokenExpiredError` mid-job, no manual re-signing.\n2. **Cost** — range reads download only the bytes you need (e.g. a 16 KB COG header instead of a 100 MB file).\n3. **Speed** — async surface fires reads in parallel. Roughly N× faster than serial for multi-file workloads.\n4. **Composability** — the same store works with async-geotiff, zarr-python, Lonboard, etc.\n5. **Portability** — `AzureStore`, `S3Store`, `GCSStore` are interchangeable. Cloud-agnostic code.\n\nEach cell below calls out which of these it demonstrates. Speed-relevant cells use `%%time` so you can compare wall-clock numbers.\n\nThe companion [obstore tutorial](../overview/obstore.md) has the full narrative and migration reference." |
| 7 | + "source": "# Working with Planetary Computer data using obstore\n\nThis notebook walks through reading Planetary Computer data with [obstore](https://developmentseed.org/obstore/). Obstore is a Python library that talks to cloud object stores (Azure Blob, S3, GCS) directly, without going through HTTP wrappers like fsspec. This has a number of key benefits:\n\n1. **Reliability**: SAS tokens auto-refresh. No `TokenExpiredError` mid-job, no manual re-signing.\n2. **Cost**: range reads download only the bytes you need.\n3. **Speed**: the async API issues reads in parallel. Roughly N× faster than serial for multi-file workloads.\n4. **Composability**: any library that accepts an [obspec](https://github.com/developmentseed/obspec)-compatible store reads through your authenticated connection without re-doing auth.\n5. **Portability**: `AzureStore`, `S3Store`, and `GCSStore` are interchangeable, so your code stays cloud-agnostic.\n\nEach cell below calls out which of these it demonstrates. Speed-relevant cells use `%%time` so you can compare wall-clock numbers.\n\nThe companion [obstore tutorial](../overview/obstore.md) has the full narrative and migration reference." |
8 | 8 | }, |
9 | 9 | { |
10 | 10 | "cell_type": "markdown", |
|
33 | 33 | "source": [ |
34 | 34 | "## Authenticate from a STAC asset\n", |
35 | 35 | "\n", |
36 | | - "**Demonstrates: reliability.** `PlanetaryComputerCredentialProvider` handles SAS token acquisition and refresh under the hood — no manual `planetary_computer.sign()` calls anywhere in this notebook. If a token expires mid-job, the provider re-acquires it transparently. The old fsspec pattern required you to handle re-signing and retry logic yourself.\n", |
| 36 | + "`PlanetaryComputerCredentialProvider` handles SAS token acquisition and refresh under the hood — no manual `planetary_computer.sign()` calls anywhere in this notebook. If a token expires mid-job, the provider re-acquires it transparently. The old fsspec pattern required you to handle re-signing and retry logic yourself.\n", |
37 | 37 | "\n", |
38 | 38 | "**Expected result:** working `provider` object, no output printed." |
39 | 39 | ] |
|
62 | 62 | "id": "b81cc8c9", |
63 | 63 | "metadata": {}, |
64 | 64 | "source": [ |
65 | | - "Notice the asset href is unsigned — no SAS query string appended. The provider signs it for you at read time." |
| 65 | + "Notice the asset href is unsigned, with no SAS query string appended. The provider signs it for you at read time." |
66 | 66 | ] |
67 | 67 | }, |
68 | 68 | { |
|
82 | 82 | "source": [ |
83 | 83 | "## Build a store\n", |
84 | 84 | "\n", |
85 | | - "A *store* is obstore's connection to a specific cloud location. Once built, you hand it to any obstore read/write function — or to any higher-level library that accepts an obstore-compatible store.\n", |
| 85 | + "A *store* is obstore's connection to a specific cloud location. Once built, you hand it to any obstore read/write function, or to any higher-level library that accepts an obstore-compatible store.\n", |
86 | 86 | "\n", |
87 | 87 | "**Expected result:** working `store` object, no output printed." |
88 | 88 | ] |
|
106 | 106 | "source": [ |
107 | 107 | "## Read\n", |
108 | 108 | "\n", |
109 | | - "`from_asset()` scopes the store to that *specific blob* — the asset URL becomes the store's prefix. So every read uses an empty string as the path; obstore appends the path to the prefix, and you don't want it appending anything. (For multi-object access you'd build a container-scoped store instead — covered further down.)\n", |
110 | | - "\n", |
111 | | - "Three ways to read, each demonstrating a different value prop:\n", |
| 109 | + "There are three ways to read data, depending on your needs:\n", |
112 | 110 | "\n", |
113 | 111 | "### 1. Read the entire file\n", |
114 | 112 | "\n", |
115 | | - "**Demonstrates: baseline (the slow path you want to avoid).** Use when you actually want all the bytes. NAIP scenes range 100–500 MB.\n", |
116 | | - "\n", |
117 | | - "A surprise here: this cell is slow even on a fast connection. Azure Blob caps single-stream downloads at roughly 8–15 MB/s — your home bandwidth doesn't help. To go faster you'd need parallel range reads against the same file (which is what async-geotiff does internally when it reads COG tiles).\n", |
118 | | - "\n", |
119 | | - "This cell is the foil for everything below. The whole point of range reads and async is to avoid this scenario.\n", |
120 | | - "\n", |
| 113 | + "This is the slowest path. Use it when you actually want all the bytes. For large files, this can take a long time: these NAIP scenes range from 100–500 MB and can take a minute or more depending on your connection.\n\n", |
121 | 114 | "**Expected result:** 100–500 million bytes, 30–90 seconds depending on which NAIP scene came back." |
122 | 115 | ] |
123 | 116 | }, |
|
142 | 135 | "source": [ |
143 | 136 | "### 2. Read a byte range (16 KB)\n", |
144 | 137 | "\n", |
145 | | - "**Demonstrates: cost savings.** A Cloud Optimized GeoTIFF stores its header in the first few KB. Most libraries (async-geotiff, GDAL, rasterio) only need the header to start working — they don't need the pixel data until you ask for a specific window. Range reads make this possible.\n", |
| 138 | + "A Cloud Optimized GeoTIFF stores its header in the first few KB. Most libraries (async-geotiff, GDAL, rasterio) only need the header to start working. They don't need the pixel data until you ask for a specific window. Range reads make this possible.\n", |
146 | 139 | "\n", |
147 | | - "**Expected result:** 16,384 bytes, well under a second. Tens of thousands of times less data than the full file above (the exact multiple depends on your scene's size)." |
| 140 | + "**Expected result:** 16,384 bytes, well under a second. Much less data than the full file above." |
148 | 141 | ] |
149 | 142 | }, |
150 | 143 | { |
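To make the "header in the first few KB" point concrete, here is a minimal sketch of what a reader can learn from the first bytes of a TIFF alone. It uses only the standard library; `parse_tiff_header` is a hypothetical helper written for this illustration, and the `header` bytes are a synthetic stand-in for the 16 KB a range read would return, not data from a real NAIP scene.

```python
import struct


def parse_tiff_header(buf: bytes) -> dict:
    """Parse the fixed-size header at the start of a TIFF or BigTIFF file."""
    if len(buf) < 8:
        raise ValueError("need at least 8 bytes to read a TIFF header")

    # Bytes 0-1 declare the byte order: "II" = little-endian, "MM" = big-endian.
    order = buf[:2]
    if order == b"II":
        endian, name = "<", "little"
    elif order == b"MM":
        endian, name = ">", "big"
    else:
        raise ValueError("not a TIFF: unknown byte-order mark")

    # Bytes 2-3 hold the magic number: 42 for classic TIFF, 43 for BigTIFF.
    (magic,) = struct.unpack(endian + "H", buf[2:4])
    if magic == 42:
        # Classic TIFF: bytes 4-7 are the offset of the first IFD.
        (ifd_offset,) = struct.unpack(endian + "I", buf[4:8])
    elif magic == 43:
        # BigTIFF: bytes 8-15 are the (8-byte) offset of the first IFD.
        if len(buf) < 16:
            raise ValueError("need at least 16 bytes to read a BigTIFF header")
        (ifd_offset,) = struct.unpack(endian + "Q", buf[8:16])
    else:
        raise ValueError(f"not a TIFF: unexpected magic number {magic}")

    return {
        "byte_order": name,
        "bigtiff": magic == 43,
        "first_ifd_offset": ifd_offset,
    }


# Synthetic stand-in for the first 16 KB of a file (assumption: little-endian
# classic TIFF whose first IFD starts right after the 8-byte header).
header = b"II" + struct.pack("<H", 42) + struct.pack("<I", 8) + bytes(16_376)

info = parse_tiff_header(header)
print(info)
```

In the notebook, the bytes returned by the 16 KB range read could be passed to a function like this in place of the synthetic `header`; real readers such as async-geotiff go on to parse the IFD tags as well.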
|
167 | 160 | "source": [ |
168 | 161 | "### 3. Read multiple byte ranges in one request\n", |
169 | 162 | "\n", |
170 | | - "**Demonstrates: latency savings.** When you need several slices of the same file — say, multiple COG tiles — you could issue separate `get_range` calls. Each one is a round-trip to Azure. `get_ranges` batches them into a single HTTP request, cutting round-trip latency.\n", |
| 163 | + "When you need several slices of the same file, you could issue separate `get_range` calls, but each one is a round-trip to Azure. `get_ranges` takes them all in one call, merging nearby ranges into fewer requests and fetching the rest in parallel, which cuts round-trip latency.\n", |
171 | 164 | "\n", |
172 | 165 | "**Expected result:** two ranges of 16 KB each, similar wall time to a single `get_range`." |
173 | 166 | ] |
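One way a multi-range read can save round trips is by merging ranges that sit close together before any request is sent. The sketch below illustrates that idea only; it is not obstore's implementation, and `coalesce` and `max_gap` are hypothetical names chosen for this example.

```python
def coalesce(ranges, max_gap):
    """Merge half-open (start, end) byte ranges whose gap is at most max_gap."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it instead of
            # issuing a separate request.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


# Three requested slices: two near the start of the file, one far away.
wanted = [(0, 16_384), (20_000, 36_384), (50_000_000, 50_016_384)]
requests = coalesce(wanted, max_gap=1_000_000)
print(requests)  # [(0, 36384), (50000000, 50016384)]
```

Here three requested slices collapse into two fetches. The trade-off is that a merged fetch also downloads the bytes in the gap, so the gap threshold balances extra bytes against extra round trips.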
|
193 | 186 | "source": [ |
194 | 187 | "## Listing requires a container-scoped store\n", |
195 | 188 | "\n", |
196 | | - "**Demonstrates: reach beyond a single asset.** Up to here we've worked with one blob. To enumerate objects under a prefix (\"show me every NAIP scene in Montana in 2023\"), the store needs to be scoped to the container *and* the credential provider needs container-level `List` permission. The asset-derived provider above only signs the single blob — it can't list — so we build a fresh provider against the container URL.\n", |
| 189 | + "To enumerate objects under a prefix (\"show me every NAIP scene in Montana in 2023\"), the store needs to be scoped to the container *and* the credential provider needs container-level `List` permission. The asset-derived provider above only signs the single blob, so we build a fresh provider against the container URL.\n", |
197 | 190 | "\n", |
198 | 191 | "**Expected result:** three lines printed, each a blob path and its size in bytes." |
199 | 192 | ] |
|
225 | 218 | "id": "93160f86", |
226 | 219 | "metadata": {}, |
227 | 220 | "source": [ |
228 | | - "## Concurrent reads (async) — the speed payoff\n", |
| 221 | + "## Concurrent reads (async)\n", |
229 | 222 | "\n", |
230 | | - "**Demonstrates: speed via parallelism.** Up to here, reads happen one at a time. For multi-file workloads, running them in parallel is dramatically faster than serial. Below we read the same 4 KB header four times — first serially, then concurrently — and compare wall times.\n", |
| 223 | + "For multi-file workloads, running reads in parallel is dramatically faster than running them serially. Below we read the same 4 KB header four times, first serially and then concurrently, and compare wall times.\n", |
231 | 224 | "\n", |
232 | 225 | "Async needs its own credential provider class (`PlanetaryComputerAsyncCredentialProvider`) backed by `aiohttp` instead of `requests`." |
233 | 226 | ] |
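The serial-versus-concurrent comparison can be sketched without touching the network. In the example below, `fake_range_read` is a hypothetical stand-in for one async range read: it sleeps for an assumed round-trip latency instead of calling obstore, so the numbers show the shape of the speedup rather than real Azure timings.

```python
import asyncio
import time

LATENCY_S = 0.05  # assumed per-request round trip, for illustration only


async def fake_range_read(i: int) -> bytes:
    """Stand-in for one async range read; sleeps instead of hitting the network."""
    await asyncio.sleep(LATENCY_S)
    return bytes([i]) * 4096


async def read_serial(n: int) -> list:
    # Each read waits for the previous one to finish.
    return [await fake_range_read(i) for i in range(n)]


async def read_concurrent(n: int) -> list:
    # All reads are in flight at once; gather preserves input order.
    return list(await asyncio.gather(*(fake_range_read(i) for i in range(n))))


async def main():
    t0 = time.perf_counter()
    serial = await read_serial(4)
    serial_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    concurrent = await read_concurrent(4)
    concurrent_s = time.perf_counter() - t0
    return serial, serial_s, concurrent, concurrent_s


serial, serial_s, concurrent, concurrent_s = asyncio.run(main())
print(f"serial: {serial_s:.3f}s  concurrent: {concurrent_s:.3f}s")
```

Inside a Jupyter cell an event loop is already running, so you would write `await main()` rather than `asyncio.run(main())`.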
|
251 | 244 | "id": "79c84f80", |
252 | 245 | "metadata": {}, |
253 | 246 | "source": [ |
254 | | - "We'll time both with `time.perf_counter()` for an apples-to-apples comparison (the `%%time` magic doesn't play well with top-level `await`, so we measure manually).\n", |
255 | 247 | "\n", |
256 | 248 | "**Warm up the async store.** The first call has to acquire a SAS token from Planetary Computer, which is a separate HTTP round-trip. We do one throwaway read so that overhead doesn't pollute the timing below. (The sync store was already warmed up by cells 1/2/3, which is why we only need to warm the async store.)" |
257 | 249 | ] |
|
336 | 328 | "source": [ |
337 | 329 | "## Hand the store to async-geotiff\n", |
338 | 330 | "\n", |
339 | | - "**Demonstrates: composability.** The whole point of obstore is that *other libraries* sit on top of it. Once you have a working `AzureStore`, you can hand it to any library that accepts an [obspec](https://github.com/developmentseed/obspec)-compatible store — async-geotiff, zarr-python, and others — and they'll read through your authenticated connection. No re-auth, no double signing." |
| 331 | + "Once you have a working `AzureStore`, you can hand it to any library that accepts an [obspec](https://github.com/developmentseed/obspec)-compatible store and they'll read through your authenticated connection." |
340 | 332 | ] |
341 | 333 | }, |
342 | 334 | { |
|
354 | 346 | "id": "1c801295", |
355 | 347 | "metadata": {}, |
356 | 348 | "source": [ |
357 | | - "Open the NAIP scene as a COG and read its metadata. `geotiff.transform` is the affine that maps pixel coordinates to geographic coordinates. `geotiff.crs` is the coordinate reference system.\n", |
| 349 | + "Open the NAIP scene as a COG and read its metadata. `geotiff.transform` tells you the scene's pixel size and corner position on the ground, and `geotiff.crs` is the coordinate reference system.\n", |
358 | 350 | "\n", |
359 | | - "**Expected result:** the affine transform and the CRS name (e.g. `NAD83 / UTM zone 11N` — varies by scene)." |
| 351 | + "**Expected result:** a transform and a CRS name (e.g. `NAD83 / UTM zone 11N`)." |
360 | 352 | ] |
361 | 353 | }, |
362 | 354 | { |
|
379 | 371 | "id": "f6d207db", |
380 | 372 | "metadata": {}, |
381 | 373 | "source": [ |
382 | | - "If you want the full CRS details (datum, axis order, area of use), just evaluate `geotiff.crs` on its own — pyproj prints a detailed dump.\n", |
| 374 | + "If you want the full CRS details (datum, axis order, area of use), just evaluate `geotiff.crs` on its own.\n", |
383 | 375 | "\n", |
384 | 376 | "Notice async-geotiff only fetched ~16 KB to get this metadata, not the full file. The range-read win compounds at every level of the stack." |
385 | 377 | ] |
|
389 | 381 | "id": "9dd739eb", |
390 | 382 | "metadata": {}, |
391 | 383 | "source": [ |
392 | | - "## Bonus: portability\n", |
| 384 | + "## Portability\n", |
393 | 385 | "\n", |
394 | | - "**Demonstrates: cloud-agnostic code.** The same `obstore.get(store, ...)` call works against S3 or GCS — only the store constructor changes. This cell isn't runnable (we don't have S3 creds), but it shows the shape:\n", |
| 386 | + "The same `obstore.get(store, ...)` call works against S3 or GCS; only the store constructor changes. The example below shows the shape of the call for S3:\n", |
395 | 387 | "\n", |
396 | 388 | "```python\n", |
397 | 389 | "from obstore.store import S3Store\n", |
398 | 390 | "\n", |
399 | 391 | "s3_store = S3Store(bucket=\"my-bucket\", region=\"us-west-2\")\n", |
400 | 392 | "buf = obstore.get(s3_store, \"path/to/object\").bytes() # same call\n", |
401 | | - "```\n", |
402 | | - "\n", |
403 | | - "Any library that accepts an obspec-compatible store benefits automatically. async-geotiff opening a COG works identically against Azure, S3, or GCS." |
| 393 | + "```\n" |
404 | 394 | ] |
405 | 395 | }, |
406 | 396 | { |
407 | 397 | "cell_type": "markdown", |
408 | 398 | "id": "b2d0385b", |
409 | 399 | "metadata": {}, |
410 | | - "source": "## You're done\n\nIf you got expected output on every cell above, the obstore stack is wired up end-to-end:\n\n- **Reliability** — authenticated against Planetary Computer with auto-refreshing SAS tokens\n- **Cost** — read 16 KB of a multi-hundred-MB file via range reads (tens of thousands of times less egress)\n- **Latency** — batched multi-range read in a single round-trip\n- **Speed** — parallel reads several times faster than serial\n- **Composability** — handed the same store to async-geotiff, opened a COG with no re-auth\n- **Portability** — same API shape works for S3 and GCS\n\nFrom here, any obspec-compatible library plugs in the same way. Check the companion [Lonboard tutorial](../overview/lonboard.md) for interactive visualization or the [async-geotiff tutorial](../overview/async-geotiff.md) for pixel-level analysis." |
| 400 | + "source": "## You're done\n\nIf you got expected output on every cell above, the obstore stack is wired up end-to-end.\n\nFrom here, any obspec-compatible library plugs in the same way. Check the companion [Lonboard tutorial](../overview/lonboard.md) for interactive visualization or the [async-geotiff tutorial](../overview/async-geotiff.md) for pixel-level analysis." |
411 | 401 | } |
412 | 402 | ], |
413 | 403 | "metadata": { |
|