Skip to content

feat(assets): extract image dimensions at ingest and emit on asset responses#13991

Open
mattmillerai wants to merge 7 commits into
masterfrom
matt/asset-image-dimensions-metadata
Open

feat(assets): extract image dimensions at ingest and emit on asset responses#13991
mattmillerai wants to merge 7 commits into
masterfrom
matt/asset-image-dimensions-metadata

Conversation

@mattmillerai
Copy link
Copy Markdown
Contributor

@mattmillerai mattmillerai commented May 19, 2026

Summary

Image assets now carry width/height under the existing metadata field on asset responses, shaped as:

"metadata": {
  "kind": "image",
  "width": 1920,
  "height": 1080
}

Consumers that render server-side thumbnails (where naturalWidth/naturalHeight aren't available on the actual image data) can read original dimensions directly off the asset response instead of making a second request.

The nested-metadata shape is forward-compatible — future media kinds (video duration/fps, audio sample rate, etc.) can extend this shape without further schema churn or top-level field additions.

What changes

Dimensions are written to AssetReference.system_metadata across three ingest paths:

  1. Direct file ingest (upload_from_temp_path, register_file_in_place): Pillow reads the image header right after hashing, while the file is still in OS page cache. Non-image MIME types short-circuit without touching the file.
  2. From-hash registration (POST /api/assets/from-hash, and the hash-match branch of upload): this path never reads the file bytes, so dimensions are best-effort copied from any prior sibling reference of the same asset that already carries kind=image metadata. Missing siblings, non-image siblings, or absent dimension keys leave the new reference's metadata unchanged — no correctness regression vs. pre-PR behavior.
  3. Scanner enrichment (enrich_asset): extends the existing system_metadata write so scanner-registered images get the same treatment as uploaded ones.

Existing system_metadata keys (e.g. safetensors fields written by the enricher, download provenance) are preserved through merge.

What doesn't change

  • Existing assets ingested before this change retain their current metadata — no automatic backfill in this PR.
  • Non-image assets emit metadata as nil (or whatever pre-existing system_metadata they had) — unchanged.
  • The Asset OpenAPI schema is unchanged — the metadata field already existed; this PR just populates it for images.

Test plan

  • Unit tests for extract_image_dimensions — PNG, JPEG, missing file, corrupt file, empty file, non-image MIME short-circuit.
  • Integration tests for direct ingest — image asset emits dims, non-image asset leaves metadata empty, pre-existing system_metadata keys are preserved through merge.
  • Integration tests for the from-hash back-fill — sibling with image metadata back-fills correctly, non-image sibling does not back-fill, no sibling leaves metadata empty, caller-supplied keys survive back-fill.
  • Full tests-unit/assets_test/ suite passes (336 passed, 10 pre-existing skipped).

Smoke test

Beyond the unit + integration suites, ran a CLI smoke test against real PNG and JPEG files generated on the fly. Verifies the extracted shape matches what surfaces on Asset.metadata, and walks the post-CodeRabbit backfill validation (type(x) is int strictness + bool rejection from f2c73308 and 93f671d7).

$ .venv/bin/python smoke_test_pr_13991.py

--- Scenario 1: PNG (1920x1080) with mime hint 'image/png' ---
Returned dict: {"kind": "image", "width": 1920, "height": 1080}
Verdict: PASS

--- Scenario 2: JPEG (800x600) with no MIME hint ---
Returned dict: {"kind": "image", "width": 800, "height": 600}
Verdict: PASS

--- Scenario 3: text file claiming to be image/png (corrupt/lying) ---
Returned: None
Verdict: PASS (None — Pillow rejects, function logs debug + returns)

--- Scenario 4: non-image MIME ('application/octet-stream') ---
Returned: None
Verdict: PASS (short-circuits without opening the file)

--- Scenario 5: Wire shape — how the result lands on Asset.metadata ---
Pre-existing system_metadata: {}  (empty for plain image ingest)
After ingest merge:           {"kind": "image", "width": 3840, "height": 2160}
This is what surfaces on:     GET /api/assets/<id> -> response.metadata

--- Scenario 6: Backfill sibling-dimension validation ---
  [PASS] valid: positive ints                (w=1920       h=1080      ) → accept
  [PASS] missing width                       (w=None       h=1080      ) → reject
  [PASS] missing height                      (w=1920       h=None      ) → reject
  [PASS] zero width                          (w=0          h=1080      ) → reject
  [PASS] negative height                     (w=1920       h=-1        ) → reject
  [PASS] string width                        (w='1920'     h=1080      ) → reject
  [PASS] float dims                          (w=1920.0     h=1080.0    ) → reject
  [PASS] BOOL width (post-93f671d7 fix)      (w=True       h=1080      ) → reject
  [PASS] BOOL both (post-93f671d7 fix)       (w=True       h=True      ) → reject

Confirms:

  • extract_image_dimensions returns the documented {"kind": "image", "width": W, "height": H} shape for real images and None for everything else (corrupt files, non-image MIME, missing files — all the failure modes route through the existing debug-level logger).
  • For plain image ingest, the dict is the only entry on system_metadata — no pre-existing keys exist on image refs today (the scanner only populates system_metadata for files with extractable embedded metadata like safetensors). The merge logic in ingest.py still uses dict.update() so any future image-side writer can land additional keys without losing the dims.
  • The backfill sibling validation (post-CodeRabbit fixes) correctly rejects every malformed input: missing keys, non-positive values, wrong types (string/float), and booleans (the isinstance(x, int)type(x) is int fix that closes the bool-subclass-of-int gap).

…sponses

Image assets now carry width/height under the existing `metadata` field on
asset responses, shaped as `{"kind": "image", "width": W, "height": H}`.
This lets consumers get original dimensions (e.g. for clients that render
server-side thumbnails and can't recover them from naturalWidth/Height)
without an extra round-trip.

Dimensions are written to AssetReference.system_metadata across three
ingest paths:

- Direct file ingest (upload, in-place registration): Pillow reads the
  image header right after hashing, while the file is still in OS page
  cache. Non-image MIME types are skipped without touching the file.
- From-hash registration: this path never reads the file bytes, so
  dimensions are best-effort copied from any prior sibling reference of
  the same asset that already carries kind=image metadata. Missing
  siblings, non-image siblings, or absent dimension keys leave the new
  reference's metadata unchanged.
- Scanner enrichment: extends the existing system_metadata write in
  enrich_asset so scanner-registered images get the same treatment as
  uploaded ones.

Existing system_metadata keys (e.g. safetensors fields written by the
enricher, download provenance) are preserved through merge. Existing
assets ingested before this change retain their current metadata — no
automatic backfill in this PR.

Tests cover image emission, non-image no-op, merge preservation, and the
from-hash sibling back-fill (including the no-sibling and non-image-sibling
cases).
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 19, 2026

Review Change Stack

📝 Walkthrough

Hidden review stack artifact:

Walkthrough

This PR adds image-dimension extraction and storage to assets. A new utility, extract_image_dimensions, reads image headers (via Pillow when available) and returns {kind, width, height}. Ingest integrates extraction for image/* files and backfills missing dimensions from sibling references when creating references from hash. The scanner also extracts and merges dimensions during metadata enrichment. Tests cover utility behavior, ingest enrichment, and sibling backfill cases.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 17.24% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and accurately summarizes the main change: extracting image dimensions during asset ingest and emitting them on asset responses.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The PR description clearly explains the feature (image dimension extraction and metadata emission), implementation across three ingest paths, and includes comprehensive testing with smoke test validation.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
tests-unit/assets_test/services/test_ingest.py (1)

456-488: ⚡ Quick win

Make this test exercise the actual backfill-merge path.

This test currently mutates ref.system_metadata after backfill already happened, so it doesn't validate merge behavior during backfill execution.

Possible adjustment
-        # Seed a sentinel key and re-run back-fill via a second register call
-        # to exercise the merge path with pre-existing data.
+        # Seed a sentinel key, then re-run back-fill helper to exercise
+        # merge behavior with pre-existing system metadata.
         ref.system_metadata = {**(ref.system_metadata or {}), "source_url": "https://example/p"}
         session.commit()
+        from app.assets.services.ingest import _backfill_image_dimensions_from_siblings
+        _backfill_image_dimensions_from_siblings(
+            session,
+            asset_id=asset.id,
+            new_reference_id=ref.id,
+            current_system_metadata=ref.system_metadata,
+        )
+        session.commit()
+        session.refresh(ref)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests-unit/assets_test/services/test_ingest.py` around lines 456 - 488, The
test mutates ref.system_metadata after backfill finished instead of exercising
the merge path; change it to run the backfill twice and seed the pre-existing
metadata before the second backfill so the merge code runs: call
_register_existing_asset once to create the initial reference, query and update
that AssetReference.row's system_metadata in the DB (via session.add/commit) to
include the sentinel "source_url", then call _register_existing_asset again
(using the same asset_hash/name/owner_id) to trigger the backfill-merge path,
and finally re-query the AssetReference to assert that "source_url" was
preserved and kind/width/height were merged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@app/assets/services/ingest.py`:
- Around line 412-424: The backfill loop currently accepts any sibling with kind
== "image", copies partial keys and returns early, which can persist incomplete
metadata; change the logic in the loop that inspects meta/current so that you
first validate the sibling has both required image dimensions (check that all
keys in _IMAGE_DIMENSION_KEYS, e.g. "width" and "height", exist and are non-null
in meta) and skip (continue) if they are missing or invalid, then build merged =
dict(current) and copy keys from meta, and only call
set_reference_system_metadata(reference_id=new_reference_id,
system_metadata=merged, session=session) if merged != current; do not return
early—allow the loop to continue scanning other siblings.

---

Nitpick comments:
In `@tests-unit/assets_test/services/test_ingest.py`:
- Around line 456-488: The test mutates ref.system_metadata after backfill
finished instead of exercising the merge path; change it to run the backfill
twice and seed the pre-existing metadata before the second backfill so the merge
code runs: call _register_existing_asset once to create the initial reference,
query and update that AssetReference.row's system_metadata in the DB (via
session.add/commit) to include the sentinel "source_url", then call
_register_existing_asset again (using the same asset_hash/name/owner_id) to
trigger the backfill-merge path, and finally re-query the AssetReference to
assert that "source_url" was preserved and kind/width/height were merged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e33a12a2-24f8-4d51-bce5-8996ec70a603

📥 Commits

Reviewing files that changed from the base of the PR and between d0328b4 and c5b55ba.

📒 Files selected for processing (5)
  • app/assets/scanner.py
  • app/assets/services/image_dimensions.py
  • app/assets/services/ingest.py
  • tests-unit/assets_test/services/test_image_dimensions.py
  • tests-unit/assets_test/services/test_ingest.py

Comment thread app/assets/services/ingest.py
mattmillerai added a commit that referenced this pull request May 19, 2026
Per CodeRabbit review on #13991: the previous loop accepted any sibling
with `kind == "image"` and copied whichever dimension keys happened to
be present, then returned. A partial sibling (kind set but missing or
invalid width/height) could persist incomplete metadata onto the new
reference even when a later sibling had valid dimensions.

Now we validate that the sibling has both width and height as positive
integers before adopting its dimensions, and continue scanning to the
next sibling otherwise.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@app/assets/services/ingest.py`:
- Around line 416-421: The dimension validation in the backfill path currently
uses isinstance(width, int) and isinstance(height, int) which will accept
booleans; update the checks in the backfill function (the block referencing
width and height and the conditional that also checks width <= 0 or height <= 0)
to use strict type checks (type(width) is int and type(height) is int) so
True/False are rejected; keep the rest of the logic (non-positive checks)
unchanged and note that extract_image_dimensions remains the upstream filter.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 333aeddb-0e02-418f-9a23-ca4b36ba5cd1

📥 Commits

Reviewing files that changed from the base of the PR and between c5b55ba and 5d2a33e.

📒 Files selected for processing (1)
  • app/assets/services/ingest.py

Comment thread app/assets/services/ingest.py
mattmillerai added a commit that referenced this pull request May 19, 2026
…e-is)

Per CodeRabbit follow-up on #13991: bool is a subclass of int in Python,
so isinstance(True, int) is True. The previous strict-int gate would
have accepted width=True (truthy + > 0) as a valid dimension.
Realistic occurrence is low (extract_image_dimensions returns proper
ints, JSON doesn't serialize bools as numbers), but the validation gate
exists for defense-in-depth so it should be actually strict.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Per CodeRabbit review on #13991: the previous loop accepted any sibling
with `kind == "image"` and copied whichever dimension keys happened to
be present, then returned. A partial sibling (kind set but missing or
invalid width/height) could persist incomplete metadata onto the new
reference even when a later sibling had valid dimensions.

Now we validate that the sibling has both width and height as positive
integers before adopting its dimensions, and continue scanning to the
next sibling otherwise.
…e-is)

Per CodeRabbit follow-up on #13991: bool is a subclass of int in Python,
so isinstance(True, int) is True. The previous strict-int gate would
have accepted width=True (truthy + > 0) as a valid dimension.
Realistic occurrence is low (extract_image_dimensions returns proper
ints, JSON doesn't serialize bools as numbers), but the validation gate
exists for defense-in-depth so it should be actually strict.
@mattmillerai mattmillerai force-pushed the matt/asset-image-dimensions-metadata branch from a3409f7 to 93f671d Compare May 19, 2026 20:48
@mattmillerai mattmillerai added the Core Core team dependency label May 22, 2026
Comment on lines +39 to +44
except ImportError:
logger.debug(
"Pillow not available; skipping image dimension extraction for %s",
file_path,
)
return None
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we actually get anything from this level of defensiveness? PIL is included in our dependencies and there are plenty of files unconditionally importing it already, right?

Copy link
Copy Markdown
Member

@guill guill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added one non-blocking comment, but we can simplify it in a follow-up commit -- it shouldn't actually hurt anything.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Core Core team dependency

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants