feat(assets): extract image dimensions at ingest and emit on asset responses#13991
feat(assets): extract image dimensions at ingest and emit on asset responses#13991mattmillerai wants to merge 7 commits into
Conversation
…sponses
Image assets now carry width/height under the existing `metadata` field on
asset responses, shaped as `{"kind": "image", "width": W, "height": H}`.
This lets consumers get original dimensions (e.g. for clients that render
server-side thumbnails and can't recover them from naturalWidth/Height)
without an extra round-trip.
Dimensions are written to AssetReference.system_metadata across three
ingest paths:
- Direct file ingest (upload, in-place registration): Pillow reads the
image header right after hashing, while the file is still in OS page
cache. Non-image MIME types are skipped without touching the file.
- From-hash registration: this path never reads the file bytes, so
dimensions are best-effort copied from any prior sibling reference of
the same asset that already carries kind=image metadata. Missing
siblings, non-image siblings, or absent dimension keys leave the new
reference's metadata unchanged.
- Scanner enrichment: extends the existing system_metadata write in
enrich_asset so scanner-registered images get the same treatment as
uploaded ones.
Existing system_metadata keys (e.g. safetensors fields written by the
enricher, download provenance) are preserved through merge. Existing
assets ingested before this change retain their current metadata — no
automatic backfill in this PR.
Tests cover image emission, non-image no-op, merge preservation, and the
from-hash sibling back-fill (including the no-sibling and non-image-sibling
cases).
📝 WalkthroughHidden review stack artifact:WalkthroughThis PR adds image-dimension extraction and storage to assets. A new utility, extract_image_dimensions, reads image headers (via Pillow when available) and returns {kind, width, height}. Ingest integrates extraction for image/* files and backfills missing dimensions from sibling references when creating references from hash. The scanner also extracts and merges dimensions during metadata enrichment. Tests cover utility behavior, ingest enrichment, and sibling backfill cases. 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests-unit/assets_test/services/test_ingest.py (1)
456-488: ⚡ Quick winMake this test exercise the actual backfill-merge path.
This test currently mutates
ref.system_metadataafter backfill already happened, so it doesn't validate merge behavior during backfill execution.Possible adjustment
- # Seed a sentinel key and re-run back-fill via a second register call - # to exercise the merge path with pre-existing data. + # Seed a sentinel key, then re-run back-fill helper to exercise + # merge behavior with pre-existing system metadata. ref.system_metadata = {**(ref.system_metadata or {}), "source_url": "https://example/p"} session.commit() + from app.assets.services.ingest import _backfill_image_dimensions_from_siblings + _backfill_image_dimensions_from_siblings( + session, + asset_id=asset.id, + new_reference_id=ref.id, + current_system_metadata=ref.system_metadata, + ) + session.commit() + session.refresh(ref)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests-unit/assets_test/services/test_ingest.py` around lines 456 - 488, The test mutates ref.system_metadata after backfill finished instead of exercising the merge path; change it to run the backfill twice and seed the pre-existing metadata before the second backfill so the merge code runs: call _register_existing_asset once to create the initial reference, query and update that AssetReference.row's system_metadata in the DB (via session.add/commit) to include the sentinel "source_url", then call _register_existing_asset again (using the same asset_hash/name/owner_id) to trigger the backfill-merge path, and finally re-query the AssetReference to assert that "source_url" was preserved and kind/width/height were merged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@app/assets/services/ingest.py`:
- Around line 412-424: The backfill loop currently accepts any sibling with kind
== "image", copies partial keys and returns early, which can persist incomplete
metadata; change the logic in the loop that inspects meta/current so that you
first validate the sibling has both required image dimensions (check that all
keys in _IMAGE_DIMENSION_KEYS, e.g. "width" and "height", exist and are non-null
in meta) and skip (continue) if they are missing or invalid, then build merged =
dict(current) and copy keys from meta, and only call
set_reference_system_metadata(reference_id=new_reference_id,
system_metadata=merged, session=session) if merged != current; do not return
early—allow the loop to continue scanning other siblings.
---
Nitpick comments:
In `@tests-unit/assets_test/services/test_ingest.py`:
- Around line 456-488: The test mutates ref.system_metadata after backfill
finished instead of exercising the merge path; change it to run the backfill
twice and seed the pre-existing metadata before the second backfill so the merge
code runs: call _register_existing_asset once to create the initial reference,
query and update that AssetReference.row's system_metadata in the DB (via
session.add/commit) to include the sentinel "source_url", then call
_register_existing_asset again (using the same asset_hash/name/owner_id) to
trigger the backfill-merge path, and finally re-query the AssetReference to
assert that "source_url" was preserved and kind/width/height were merged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e33a12a2-24f8-4d51-bce5-8996ec70a603
📒 Files selected for processing (5)
app/assets/scanner.pyapp/assets/services/image_dimensions.pyapp/assets/services/ingest.pytests-unit/assets_test/services/test_image_dimensions.pytests-unit/assets_test/services/test_ingest.py
Per CodeRabbit review on #13991: the previous loop accepted any sibling with `kind == "image"` and copied whichever dimension keys happened to be present, then returned. A partial sibling (kind set but missing or invalid width/height) could persist incomplete metadata onto the new reference even when a later sibling had valid dimensions. Now we validate that the sibling has both width and height as positive integers before adopting its dimensions, and continue scanning to the next sibling otherwise. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@app/assets/services/ingest.py`:
- Around line 416-421: The dimension validation in the backfill path currently
uses isinstance(width, int) and isinstance(height, int) which will accept
booleans; update the checks in the backfill function (the block referencing
width and height and the conditional that also checks width <= 0 or height <= 0)
to use strict type checks (type(width) is int and type(height) is int) so
True/False are rejected; keep the rest of the logic (non-positive checks)
unchanged and note that extract_image_dimensions remains the upstream filter.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 333aeddb-0e02-418f-9a23-ca4b36ba5cd1
📒 Files selected for processing (1)
app/assets/services/ingest.py
…e-is) Per CodeRabbit follow-up on #13991: bool is a subclass of int in Python, so isinstance(True, int) is True. The previous strict-int gate would have accepted width=True (truthy + > 0) as a valid dimension. Realistic occurrence is low (extract_image_dimensions returns proper ints, JSON doesn't serialize bools as numbers), but the validation gate exists for defense-in-depth so it should be actually strict. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Per CodeRabbit review on #13991: the previous loop accepted any sibling with `kind == "image"` and copied whichever dimension keys happened to be present, then returned. A partial sibling (kind set but missing or invalid width/height) could persist incomplete metadata onto the new reference even when a later sibling had valid dimensions. Now we validate that the sibling has both width and height as positive integers before adopting its dimensions, and continue scanning to the next sibling otherwise.
…e-is) Per CodeRabbit follow-up on #13991: bool is a subclass of int in Python, so isinstance(True, int) is True. The previous strict-int gate would have accepted width=True (truthy + > 0) as a valid dimension. Realistic occurrence is low (extract_image_dimensions returns proper ints, JSON doesn't serialize bools as numbers), but the validation gate exists for defense-in-depth so it should be actually strict.
a3409f7 to
93f671d
Compare
| except ImportError: | ||
| logger.debug( | ||
| "Pillow not available; skipping image dimension extraction for %s", | ||
| file_path, | ||
| ) | ||
| return None |
There was a problem hiding this comment.
Do we actually get anything from this level of defensiveness? PIL is included in our dependencies and there are plenty of files unconditionally importing it already, right?
guill
left a comment
There was a problem hiding this comment.
Added one non-blocking comment, but we can simplify it in a follow-up commit -- it shouldn't actually hurt anything.
Summary
Image assets now carry width/height under the existing
metadatafield on asset responses, shaped as:Consumers that render server-side thumbnails (where
naturalWidth/naturalHeightaren't available on the actual image data) can read original dimensions directly off the asset response instead of making a second request.The nested-
metadatashape is forward-compatible — future media kinds (video duration/fps, audio sample rate, etc.) can extend this shape without further schema churn or top-level field additions.What changes
Dimensions are written to
AssetReference.system_metadataacross three ingest paths:upload_from_temp_path,register_file_in_place): Pillow reads the image header right after hashing, while the file is still in OS page cache. Non-image MIME types short-circuit without touching the file.POST /api/assets/from-hash, and the hash-match branch of upload): this path never reads the file bytes, so dimensions are best-effort copied from any prior sibling reference of the same asset that already carrieskind=imagemetadata. Missing siblings, non-image siblings, or absent dimension keys leave the new reference's metadata unchanged — no correctness regression vs. pre-PR behavior.enrich_asset): extends the existingsystem_metadatawrite so scanner-registered images get the same treatment as uploaded ones.Existing
system_metadatakeys (e.g. safetensors fields written by the enricher, download provenance) are preserved through merge.What doesn't change
metadataas nil (or whatever pre-existing system_metadata they had) — unchanged.AssetOpenAPI schema is unchanged — themetadatafield already existed; this PR just populates it for images.Test plan
extract_image_dimensions— PNG, JPEG, missing file, corrupt file, empty file, non-image MIME short-circuit.tests-unit/assets_test/suite passes (336 passed, 10 pre-existing skipped).Smoke test
Beyond the unit + integration suites, ran a CLI smoke test against real PNG and JPEG files generated on the fly. Verifies the extracted shape matches what surfaces on
Asset.metadata, and walks the post-CodeRabbit backfill validation (type(x) is intstrictness + bool rejection fromf2c73308and93f671d7).Confirms:
extract_image_dimensionsreturns the documented{"kind": "image", "width": W, "height": H}shape for real images andNonefor everything else (corrupt files, non-image MIME, missing files — all the failure modes route through the existing debug-level logger).system_metadata— no pre-existing keys exist on image refs today (the scanner only populatessystem_metadatafor files with extractable embedded metadata like safetensors). The merge logic iningest.pystill usesdict.update()so any future image-side writer can land additional keys without losing the dims.isinstance(x, int)→type(x) is intfix that closes the bool-subclass-of-int gap).