Skip to content

fix(scraper): tighten gamelist metadata scraping#858

Open
wizzomafizzo wants to merge 9 commits into
mainfrom
fix/scraper-companion-semantics
Open

fix(scraper): tighten gamelist metadata scraping#858
wizzomafizzo wants to merge 9 commits into
mainfrom
fix/scraper-companion-semantics

Conversation

@wizzomafizzo
Copy link
Copy Markdown
Member

@wizzomafizzo wizzomafizzo commented May 28, 2026

Summary

  • update scraper subsystem docs to match current platform scraper architecture
  • tighten ZaparooCompanion handling with atomic scrape writes, sentinels, safer matching, and aggregate progress
  • infer content types for path-backed media metadata/images
  • import regular gamelist region/lang as per-media tags using media-row matching

Validation

  • task lint-fix
  • task test

Summary by CodeRabbit

  • Documentation

    • Major rewrite of scraper docs: clarified scanner vs scraper roles, gamelist.xml handling, companion-entry processing, tag/property rules, sentinel semantics, lifecycle, and media-level invariants.
  • New Features

    • Documented scrape controls: status, cancel, resume; expanded progress fields (totalScraped, scraping, done, paused).
  • Bug Fixes / Improvements

    • Image handling: prefer thumbnails, infer content-type/extension from paths, expanded image mapping and fallback rules; ensure region/lang media tags are emitted.
  • Tests

    • Added/updated tests covering image inference, companion artwork selection, and scraper matching/processing.

Review Change Stack

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 28, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3c7c4b66-9bb2-49d5-9e24-f041f31776c6

📥 Commits

Reviewing files that changed from the base of the PR and between ee7515a and dde6bcc.

📒 Files selected for processing (5)
  • docs/scraper.md
  • pkg/api/methods/media_image.go
  • pkg/api/methods/media_image_test.go
  • pkg/database/scraper/gamelistxml/scraper.go
  • pkg/database/scraper/gamelistxml/scraper_test.go
✅ Files skipped from review due to trivial changes (1)
  • docs/scraper.md
🚧 Files skipped from review as they are similar to previous changes (3)
  • pkg/api/methods/media_image.go
  • pkg/api/methods/media_image_test.go
  • pkg/database/scraper/gamelistxml/scraper_test.go

📝 Walkthrough

Walkthrough

This PR refactors the gamelist.xml scraper to use normalized path-folded ROM matching, adds ROM-level region/lang MediaTags, rewrites companion parent/child processing to atomic ApplyScrapeResult writes with companion stats, and introduces content-type inference for path-backed media properties; tests and documentation updated accordingly.

Changes

Content Type Inference for Media Properties

Layer / File(s) Summary
Content type helper and API integration
pkg/api/methods/media_content.go, pkg/api/methods/media_image.go, pkg/api/methods/media_meta.go
Adds mediaContentType to derive MIME types from file extensions when no explicit ContentType is provided; integrates it into image loading and property mapping so responses include derived ContentType and Extension.
Test coverage
pkg/api/methods/media_image_test.go, pkg/api/methods/media_meta_test.go
Tests verify image TypeTag selection/fallback, inference of ContentType/Extension from file paths, and that returned base64 data decodes to the original file bytes.

Gamelist XML Scraper Path-Based Refactoring

Layer / File(s) Summary
Scraper subsystem documentation
docs/scraper.md
Docs rewritten to describe platform-based scraper registration, run-loop semantics, path-folded game matching, ROM vs title tag scope, expanded gamelist.xml field mappings (image variants and filesystem fallback), ZaparooCompanion parent/child processing, and new media.scrape status/cancel/resume endpoints.
LoadRecords path-folded matching
pkg/database/scraper/gamelistxml/scraper.go
LoadRecords signature changed to accept a mediaByPathFold map and matches <game> entries by folded/normalized resolved ROM paths, setting MatchedMediaDBID/MatchedTitleDBID and removing matched entries to prevent duplicates (with .zip prefix support).
Scrape loop and progress updates
pkg/database/scraper/gamelistxml/scraper.go
scrapeLoop runs companion processing (per-system stats), builds filtered mediaByPathFold (respecting Force/already-scraped), skips systems with no eligible media, invokes updated LoadRecords, aggregates deltas for ScrapeUpdate, and includes mapped.MediaTags on ApplyScrapeResult writes.
MapToDB ROM-level region/language tags & artwork fallback
pkg/database/scraper/gamelistxml/scraper.go
MapToDB normalizes Region/Lang, parses CSV values into ROM-level MediaTags, computes artworkFallbackNames for ROM-relative filesystem probing, threads fallback candidates into image lookup, and returns MediaTags in MapResult.
Path folding & artwork probing helpers
pkg/database/scraper/gamelistxml/scraper.go
Adds/updates pathFoldKey, matchMediaByResolvedPath, artworkFallbackNames, and refactors artwork probing to accept fallback candidates with ambiguity handling for .zip directory-like prefixes.
Companion entry processing
pkg/database/scraper/gamelistxml/scraper.go
processCompanionEntries rewritten to return companionStats, build media-by-title indices, match children by folded path/slug/filename-suffix (with ambiguity handling), optionally skip already-scraped media unless Force, generate child MediaTags, and write a single ApplyScrapeResult per matched child including sentinel and parent TitleTags/TitleProps.
Test suite refactoring
pkg/database/scraper/gamelistxml/scraper_test.go
Tests updated to use mediaByPath indexed lookup, remove mediascanner dependency, expand LoadRecords path-matching cases, validate MapToDB nested/alias artwork fallbacks, add companion write matchers, and mock GetMediaBySystemID/GetScrapedMediaIDs/FindMediaTitleBySystemAndSlug to assert writes, companionStats, and scrape-loop counters across success, force, and error scenarios.

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly Related PRs

Suggested reviewers

  • BossRighteous

🐰 Paths fold with care, and tags now bloom anew,
Companions match by their nature true,
Content types whisper from filenames deep,
The scraper now keeps atomic writes to keep,
While docs sing of the journey through! 📚✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 19.12% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly relates to the main changes: tightening gamelist metadata scraping through improved content-type inference, ZaparooCompanion handling, and media-row matching.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/scraper-companion-semantics

Comment @coderabbitai help to get the list of available commands and usage tips.

@sentry
Copy link
Copy Markdown

sentry Bot commented May 28, 2026

Codecov Report

❌ Patch coverage is 73.68421% with 60 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
pkg/database/scraper/gamelistxml/scraper.go 72.35% 43 Missing and 17 partials ⚠️

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
pkg/database/scraper/gamelistxml/scraper_test.go (1)

1381-1389: ⚡ Quick win

Strengthen companionWriteMatcher to validate parent title metadata too.

The matcher currently ignores TitleTags and TitleProps, so atomic-write regressions on companion parent metadata would still pass.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/database/scraper/gamelistxml/scraper_test.go` around lines 1381 - 1389,
The matcher companionWriteMatcher currently only checks Sentinel and MediaTags;
modify it to also validate the parent title metadata on the captured
*database.ScrapeWrite*. Change the signature of companionWriteMatcher to accept
expected title metadata (e.g., expectedTitleTags []database.TagInfo and
expectedTitleProps map[string]string), and inside the mock.MatchedBy closure
verify that w.TitleTags equals expectedTitleTags and w.TitleProps equals
expectedTitleProps (use assert.ObjectsAreEqual or reflect.DeepEqual for the
comparisons). Ensure you still check w != nil and the existing Sentinel and
MediaTags checks so the matcher rejects writes missing the parent title
metadata.
pkg/database/scraper/gamelistxml/scraper.go (1)

934-939: ⚡ Quick win

Move companionStats to the top-level type/const section.

This new type is declared mid-file after many functions. Please move it near other type/const declarations for guideline compliance.

As per coding guidelines, **/*.go: "Define Go types and consts near the top of the file, before functions and methods".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/database/scraper/gamelistxml/scraper.go` around lines 934 - 939, Move the
companionStats type declaration out of the middle of the file and place it with
the other top-level types/consts near the top of the file (before any functions
or methods); specifically locate the companionStats type and cut/paste it into
the file’s existing type/const block so it sits alongside other package-level
type definitions, ensure no references need adjusting, then run gofmt/govet to
verify formatting and imports.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/api/methods/media_content.go`:
- Around line 49-52: The function mediaContentType currently checks
strings.TrimSpace(contentType) but returns the original untrimmed contentType;
update mediaContentType to return the trimmed value (use
strings.TrimSpace(contentType)) when the check passes, and similarly ensure any
fallback return (the text parameter) is also strings.TrimSpace(text) so no
whitespace-padded MIME values are exposed.

In `@pkg/database/scraper/gamelistxml/scraper.go`:
- Around line 1223-1231: The companionChildTags function currently appends raw
Region and Lang values which can create inconsistent tags (e.g., "USA, EUR");
update companionChildTags to normalize these fields the same way MapToDB does:
split CSV values, trim whitespace, lowercase (or apply the same normalization
function MapToDB uses), and append each resulting token as a separate
database.TagInfo with Type string(tags.TagTypeRegion) or
string(tags.TagTypeLang) respectively; locate companionChildTags and reuse or
mirror the normalization logic used elsewhere (e.g., the MapToDB helper or
tokenization utility) so region/lang produce consistent individual tags.

---

Nitpick comments:
In `@pkg/database/scraper/gamelistxml/scraper_test.go`:
- Around line 1381-1389: The matcher companionWriteMatcher currently only checks
Sentinel and MediaTags; modify it to also validate the parent title metadata on
the captured *database.ScrapeWrite*. Change the signature of
companionWriteMatcher to accept expected title metadata (e.g., expectedTitleTags
[]database.TagInfo and expectedTitleProps map[string]string), and inside the
mock.MatchedBy closure verify that w.TitleTags equals expectedTitleTags and
w.TitleProps equals expectedTitleProps (use assert.ObjectsAreEqual or
reflect.DeepEqual for the comparisons). Ensure you still check w != nil and the
existing Sentinel and MediaTags checks so the matcher rejects writes missing the
parent title metadata.

In `@pkg/database/scraper/gamelistxml/scraper.go`:
- Around line 934-939: Move the companionStats type declaration out of the
middle of the file and place it with the other top-level types/consts near the
top of the file (before any functions or methods); specifically locate the
companionStats type and cut/paste it into the file’s existing type/const block
so it sits alongside other package-level type definitions, ensure no references
need adjusting, then run gofmt/govet to verify formatting and imports.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 024e2c7a-f4ed-475e-ab56-937c7ab997f2

📥 Commits

Reviewing files that changed from the base of the PR and between 7570989 and 64454a9.

📒 Files selected for processing (8)
  • docs/scraper.md
  • pkg/api/methods/media_content.go
  • pkg/api/methods/media_image.go
  • pkg/api/methods/media_image_test.go
  • pkg/api/methods/media_meta.go
  • pkg/api/methods/media_meta_test.go
  • pkg/database/scraper/gamelistxml/scraper.go
  • pkg/database/scraper/gamelistxml/scraper_test.go

Comment on lines +49 to +52
func mediaContentType(contentType, text string) string {
if strings.TrimSpace(contentType) != "" {
return contentType
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Trim the returned content type before exposing it.

Line 50 checks a trimmed value, but Line 51 returns the untrimmed string. This can leak whitespace-padded MIME values into responses.

Suggested fix
 func mediaContentType(contentType, text string) string {
-	if strings.TrimSpace(contentType) != "" {
-		return contentType
+	if ct := strings.TrimSpace(contentType); ct != "" {
+		return ct
 	}
 	ext := extensionFromTextPath(text)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func mediaContentType(contentType, text string) string {
if strings.TrimSpace(contentType) != "" {
return contentType
}
func mediaContentType(contentType, text string) string {
if ct := strings.TrimSpace(contentType); ct != "" {
return ct
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/api/methods/media_content.go` around lines 49 - 52, The function
mediaContentType currently checks strings.TrimSpace(contentType) but returns the
original untrimmed contentType; update mediaContentType to return the trimmed
value (use strings.TrimSpace(contentType)) when the check passes, and similarly
ensure any fallback return (the text parameter) is also strings.TrimSpace(text)
so no whitespace-padded MIME values are exposed.

Comment on lines +1223 to +1231
func companionChildTags(c companionChild) []database.TagInfo {
var childTags []database.TagInfo
if c.Region != "" {
childTags = append(childTags, database.TagInfo{Type: string(tags.TagTypeRegion), Tag: c.Region})
}
if c.Lang != "" {
childTags = append(childTags, database.TagInfo{Type: string(tags.TagTypeLang), Tag: c.Lang})
}
return childTags
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Normalize companion child region/lang tags the same way as normal records.

Line 1226 and Line 1229 write raw values (no CSV split / lowercase), which can produce inconsistent tags versus MapToDB behavior (e.g., "USA, EUR" as one tag).

💡 Suggested fix
 func companionChildTags(c companionChild) []database.TagInfo {
 	var childTags []database.TagInfo
-	if c.Region != "" {
-		childTags = append(childTags, database.TagInfo{Type: string(tags.TagTypeRegion), Tag: c.Region})
-	}
-	if c.Lang != "" {
-		childTags = append(childTags, database.TagInfo{Type: string(tags.TagTypeLang), Tag: c.Lang})
-	}
+	childTags = appendCSVTags(childTags, string(tags.TagTypeRegion), c.Region)
+	childTags = appendCSVTags(childTags, string(tags.TagTypeLang), c.Lang)
 	return childTags
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func companionChildTags(c companionChild) []database.TagInfo {
var childTags []database.TagInfo
if c.Region != "" {
childTags = append(childTags, database.TagInfo{Type: string(tags.TagTypeRegion), Tag: c.Region})
}
if c.Lang != "" {
childTags = append(childTags, database.TagInfo{Type: string(tags.TagTypeLang), Tag: c.Lang})
}
return childTags
func companionChildTags(c companionChild) []database.TagInfo {
var childTags []database.TagInfo
childTags = appendCSVTags(childTags, string(tags.TagTypeRegion), c.Region)
childTags = appendCSVTags(childTags, string(tags.TagTypeLang), c.Lang)
return childTags
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/database/scraper/gamelistxml/scraper.go` around lines 1223 - 1231, The
companionChildTags function currently appends raw Region and Lang values which
can create inconsistent tags (e.g., "USA, EUR"); update companionChildTags to
normalize these fields the same way MapToDB does: split CSV values, trim
whitespace, lowercase (or apply the same normalization function MapToDB uses),
and append each resulting token as a separate database.TagInfo with Type
string(tags.TagTypeRegion) or string(tags.TagTypeLang) respectively; locate
companionChildTags and reuse or mirror the normalization logic used elsewhere
(e.g., the MapToDB helper or tokenization utility) so region/lang produce
consistent individual tags.

@wizzomafizzo
Copy link
Copy Markdown
Member Author

Added core-only regression coverage for frontend issue ZaparooProject/zaparoo-frontend#161:

  • Companion XML with screenshot, titlescreen, boxart2d, boxart3d, and logo now asserts all five title image properties are written.
  • media.image with no imageTypes now has explicit default-auto coverage for path-backed Companion artwork.
  • fallback coverage proves Core returns the next available default artwork when boxart is missing.

Validation rerun:

  • go test ./pkg/database/scraper/...
  • go test ./pkg/api/methods/ -run 'MediaImage|Scrape'
  • task lint-fix
  • task test

@wizzomafizzo
Copy link
Copy Markdown
Member Author

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0

@wizzomafizzo
Copy link
Copy Markdown
Member Author

Added a follow-up fix for #794: regular gamelist.xml matching still uses concrete media paths, but now handles MiSTer zip-as-dir indexing by falling back from an unmatched .zip entry to exactly one indexed child path under that archive. Ambiguous multi-child archives are skipped rather than falling back to slug matching.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0

@wizzomafizzo
Copy link
Copy Markdown
Member Author

Added a follow-up for #808. Coverage now distinguishes nested explicit artwork (already supported), nested filesystem fallback artwork, ignored nested gamelist.xml files, and thumbnail-only artwork. Fixes include mirrored subfolder media fallback (e.g. media/images/Japan/Game.png), box2dfront aliases, and media.image thumbnail support/default fallback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant