feat(extractor): extract default HTML metadata + cache XPath expressions #164

Open: marevol wants to merge 5 commits into master from fix/extractor-html-metadata

marevol (Contributor) commented May 4, 2026

Summary

  • Extract standard HTML metadata by default: title, description, OpenGraph (og:title, og:description, og:image, og:type, og:url), Twitter Card, canonical URL, keywords, author.
  • Parse <script type="application/ld+json"> blocks; expose jsonld.type and jsonld.raw (multivalue). Malformed JSON is skipped with a warn log.
  • Cache compiled XPathExpression objects per thread (via ThreadLocal<Map> and ThreadLocal<XPath>) to eliminate per-call recompilation under high crawl rates.
  • Add setDefaultFieldRules(Map), setExtractDefaultMetadata(boolean), setExtractJsonLd(boolean) for full opt-out / customization. New clearXPathCache() for dynamic rule changes.
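
The per-thread cache described above could look roughly like the following. This is a minimal sketch using only the JDK's javax.xml.xpath API, not the PR's actual code: the class name XPathCacheSketch and the methods compiled and clearCurrentThreadCache are hypothetical, and wrapping compile errors in IllegalArgumentException is an assumption made here to keep the example free of checked exceptions.

```java
import java.util.HashMap;
import java.util.Map;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

// Hypothetical sketch; names are illustrative and not taken from HtmlExtractor.
final class XPathCacheSketch {
    // XPath objects are not thread-safe, so each thread owns one instance.
    private static final ThreadLocal<XPath> XPATH =
            ThreadLocal.withInitial(() -> XPathFactory.newInstance().newXPath());

    // Per-thread cache keyed by the expression string.
    private static final ThreadLocal<Map<String, XPathExpression>> CACHE =
            ThreadLocal.withInitial(HashMap::new);

    // Returns the compiled expression, compiling it only on the first request
    // made by the calling thread.
    static XPathExpression compiled(String expression) {
        Map<String, XPathExpression> cache = CACHE.get();
        XPathExpression cached = cache.get(expression);
        if (cached == null) {
            try {
                cached = XPATH.get().compile(expression);
            } catch (XPathExpressionException e) {
                // Assumption: surface compile errors as an unchecked exception.
                throw new IllegalArgumentException("Invalid XPath: " + expression, e);
            }
            cache.put(expression, cached);
        }
        return cached;
    }

    // Clears only the calling thread's cache; other threads keep their entries.
    static void clearCurrentThreadCache() {
        CACHE.get().clear();
    }

    static int currentThreadCacheSize() {
        return CACHE.get().size();
    }

    public static void main(String[] args) {
        XPathExpression first = compiled("//TITLE");
        XPathExpression second = compiled("//TITLE");
        System.out.println(first == second); // true: the second call hits the cache
        clearCurrentThreadCache();
        System.out.println(currentThreadCacheSize()); // 0
    }
}
```

One consequence of this design is that a clear call issued from one thread cannot empty the caches held by other threads, which is worth keeping in mind when rules change at runtime.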

Why

Most search use cases want title and description for snippet rendering; users currently have to wire each XPath manually. JSON-LD provides high-quality structured data signals. XPath compilation in a hot loop was wasted work.

Threat model

HTML content is untrusted. JSON-LD parsing uses Jackson with default settings; malformed input is caught and logged, never fails extraction. XPath cache key is the expression string (admin-configured), bounded by configured rules — no unbounded growth from untrusted input.

Tests

  • 12 new tests added (total 18 passing); existing 6 still pass.
  • Default metadata extraction: title, description, OpenGraph, canonical, keywords, author.
  • JSON-LD: single block, multiple blocks with array @type, malformed JSON resilience.
  • XPath cache: same compiled instance reused across calls; clearXPathCache() empties cache.
  • Opt-out flags actually disable each subsystem.
  • User-provided rule map overrides defaults.

Verification

  • mvn -pl fess-crawler test -Dtest=HtmlExtractorTest → 18/18 pass.
  • mvn -pl fess-crawler test → 1706 run, 0 failures, 55 pre-existing env-dependent errors (Docker/LibreOffice).
  • mvn formatter:format && mvn license:format → clean.

Test plan

  • CI green
  • Manual review of XPath cache thread-safety (per-thread cache + ThreadLocal XPath)
  • Verify no regression on existing fixture tests

marevol added 5 commits May 5, 2026 07:46
Populate ExtractData with standard HTML metadata by default (title,
description, OpenGraph, Twitter Card, canonical, keywords, author),
parse <script type="application/ld+json"> blocks into jsonld.type and
jsonld.raw, and cache compiled XPathExpression objects per thread to
eliminate per-call recompilation under high crawl rates.

The default-field rule map is fully overridable via setDefaultFieldRules
and both subsystems can be disabled independently with
setExtractDefaultMetadata / setExtractJsonLd. Malformed JSON-LD blocks
are logged and skipped without aborting extraction.

Three regressions / gaps were uncovered in the HtmlExtractor PR #164 review:

1. Malformed XPath expressions (in contentXpath or metadataXpathMap) used to
   log a warning and yield empty values — XPathAPI.eval threw XPathException
   for both compile and evaluate failures and the catch handled them
   uniformly. The compile cache split that into a separate getXPathExpression
   path that throws CrawlerSystemException, which was not caught downstream
   and therefore propagated out of createExtractData, aborting the whole
   extraction. Catch CrawlerSystemException in getStringsByXPath (and in
   extractJsonLd, for symmetry) and restore the warn+empty contract.

2. extractJsonLd unconditionally called putValues for jsonld.raw / jsonld.type,
   silently overwriting any value that an operator-supplied
   addMetadata("jsonld.raw"/"jsonld.type", ...) rule had already populated.
   Mirror the precedence rule used by applyDefaultFieldRules: only auto-
   populate when the key is absent.

3. collectTypeNodes only inspected @type on the immediate object (or array
   elements). Schema.org markup commonly nests typed entities under @graph,
   mainEntity, author, publisher, etc.; those @type values were therefore
   never exposed via jsonld.type. Walk every object child recursively
   (skipping @type / @context to avoid double-collection and vocabulary
   leakage). Recursion is bounded by the parser's existing
   JSONLD_MAX_NESTING_DEPTH guard.
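
The recursive walk from item 3 can be illustrated with plain Java collections. This is a sketch under stated assumptions, not the PR's collectTypeNodes: it uses Map and List values in place of a Jackson tree, the class name JsonLdTypeWalkSketch is hypothetical, and the depth limit of 32 is an invented stand-in for JSONLD_MAX_NESTING_DEPTH, whose real value is not given here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; Map/List stand in for a parsed JSON-LD tree.
final class JsonLdTypeWalkSketch {
    // Assumed limit for illustration only.
    static final int MAX_DEPTH = 32;

    static List<String> collectTypes(Object root) {
        List<String> out = new ArrayList<>();
        walk(root, 0, out);
        return out;
    }

    private static void walk(Object node, int depth, List<String> out) {
        if (depth > MAX_DEPTH) {
            return; // stop descending on pathologically deep input
        }
        if (node instanceof Map<?, ?>) {
            Map<?, ?> map = (Map<?, ?>) node;
            addTypeValues(map.get("@type"), out);
            for (Map.Entry<?, ?> entry : map.entrySet()) {
                Object key = entry.getKey();
                // @type was handled above; @context holds vocabulary, not entities.
                if ("@type".equals(key) || "@context".equals(key)) {
                    continue;
                }
                walk(entry.getValue(), depth + 1, out);
            }
        } else if (node instanceof List<?>) {
            for (Object item : (List<?>) node) {
                walk(item, depth + 1, out);
            }
        }
    }

    // @type may be a single string or an array of strings; other shapes are ignored.
    private static void addTypeValues(Object type, List<String> out) {
        if (type instanceof String) {
            out.add((String) type);
        } else if (type instanceof List<?>) {
            for (Object t : (List<?>) type) {
                if (t instanceof String) {
                    out.add((String) t);
                }
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Object> article = new LinkedHashMap<>();
        article.put("@type", "Article");
        article.put("author", Map.of("@type", "Person"));
        Map<String, Object> org = Map.of("@type", List.of("Organization", "Brand"));

        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("@context", Map.of("@type", "ShouldNotLeak"));
        doc.put("@graph", List.of(article, org));

        System.out.println(collectTypes(doc)); // [Article, Person, Organization, Brand]
    }
}
```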

Five regression tests added: malformed metadata XPath, malformed
contentXpath, custom jsonld metadata key precedence, @graph type
collection, and the @context-object negative case.

…s blank, match JSON-LD type case-insensitively

Two regressions surfaced in code review of PR #164:

1. extractor.xml in fess-crawler-lasta registers
   addMetadata("title", "//TITLE"), so the metadataXpathMap loop
   unconditionally calls putValues("title", []) on pages without a
   <title>. The default-rule existence check (getValues != null) then
   sees the empty array and skips the og:title fallback, silently
   disabling the PR's "extract default HTML metadata" intent in real
   deployments. Switch the predicate to "has a non-blank value" so
   default rules backfill when the custom rule produced nothing.

2. JSONLD_XPATH matched only the literal lowercase
   'application/ld+json'. Per RFC 6838 / HTML5 the type attribute is
   case-insensitive and may carry surrounding whitespace; NekoHTML
   uppercases element names but preserves attribute values verbatim,
   so 'Application/LD+JSON' or '  application/ld+json  ' was missed.
   Use translate(normalize-space(@type), ...) so common real-world
   variants are picked up.
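
The translate(normalize-space(@type), ...) approach from item 2 can be tried out with the JDK's XPath engine. The expression below is an illustrative guess at the shape of such a predicate, not the PR's actual JSONLD_XPATH constant; the class name JsonLdXPathSketch and the helper countJsonLdScripts are hypothetical, and the input is well-formed XML with uppercase element names to imitate NekoHTML output.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch; the real JSONLD_XPATH may differ in detail.
final class JsonLdXPathSketch {
    // XPath 1.0 has no lower-case(), so translate() maps A-Z to a-z;
    // normalize-space() strips leading and trailing whitespace first.
    static final String JSONLD_XPATH_SKETCH =
            "//SCRIPT[translate(normalize-space(@type),"
                    + " 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz')"
                    + " = 'application/ld+json']";

    // Counts the SCRIPT elements whose type matches, ignoring case and padding.
    static int countJsonLdScripts(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(JSONLD_XPATH_SKETCH, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String page = "<HTML><HEAD>"
                + "<SCRIPT type=\"Application/LD+JSON\">{}</SCRIPT>"
                + "<SCRIPT type=\"  application/ld+json  \">{}</SCRIPT>"
                + "<SCRIPT type=\"text/javascript\">x</SCRIPT>"
                + "</HEAD></HTML>";
        System.out.println(countJsonLdScripts(page)); // 2
    }
}
```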

…N-LD, and XPath cache

CRITICAL fixes:
- Remove warnOnMetadataKeyCollisions: the warning fired on every default
  Fess deployment because the lasta extractor.xml registers
  addMetadata("title","//TITLE") via XML <postConstruct> before @PostConstruct
  init() runs. The precedence is already documented in applyDefaultFieldRules.
- Catch RuntimeException (DOMException, etc.) per JSON-LD node and at the
  outer extractJsonLd boundary so a single broken script node does not abort
  the entire extraction, honouring the documented "log and skip" contract.
- evaluateNonNodeSet now adds previousFailure as a suppressed exception to
  the fallback failure, preserving full diagnostic context that was being
  silently discarded.
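
The suppressed-exception change in the last item relies on the standard Throwable.addSuppressed mechanism. A minimal sketch of that pattern follows; the class SuppressedFailureSketch and the method evaluateWithFallback are hypothetical stand-ins and do not reproduce evaluateNonNodeSet itself.

```java
import java.util.function.Supplier;

// Hypothetical sketch of "keep the first failure when the fallback also fails".
final class SuppressedFailureSketch {
    static <T> T evaluateWithFallback(Supplier<T> primary, Supplier<T> fallback) {
        try {
            return primary.get();
        } catch (RuntimeException previousFailure) {
            try {
                return fallback.get();
            } catch (RuntimeException fallbackFailure) {
                // Attach the earlier failure so both stack traces reach the log.
                fallbackFailure.addSuppressed(previousFailure);
                throw fallbackFailure;
            }
        }
    }

    public static void main(String[] args) {
        try {
            SuppressedFailureSketch.<String>evaluateWithFallback(
                    () -> { throw new IllegalStateException("NODESET evaluation failed"); },
                    () -> { throw new IllegalStateException("STRING evaluation failed"); });
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());                    // STRING evaluation failed
            System.out.println(e.getSuppressed()[0].getMessage()); // NODESET evaluation failed
        }
    }
}
```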

MAJOR fixes:
- title / description default rules are now single-source XPaths; ordered
  fallback to og:title / og:description is expressed via a new
  defaultFieldFallbackRules map (target -> source key) applied after primary
  rules. XPath '|' union cannot express ordered preference.
- Add JSONLD_MAX_BLOCK_COUNT (64), JSONLD_MAX_RAW_TOTAL_BYTES (1 MiB), and
  JSONLD_MAX_TYPES_PER_BLOCK (256) bounds; short-circuit extractJsonLd loop
  and collectTypeNodes recursion to prevent unbounded memory growth on
  adversarial pages.
- Add twitter:title, twitter:description, twitter:image, twitter:site to
  default rules.
- metadataXpathMap is now LinkedHashMap so values[0] order is deterministic.
- getStringsByXPath NODESET branch now catches DOMException per node.
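
The ordered fallback (target -> source key) combined with the "has a non-blank value" predicate might be sketched as below. This is an assumption-laden illustration: a plain Map<String, String[]> stands in for ExtractData, and the names FallbackRulesSketch, hasNonBlankValue, and applyFallbacks are hypothetical rather than the PR's methods.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; a Map<String, String[]> stands in for ExtractData.
final class FallbackRulesSketch {
    // True when at least one value is neither null nor whitespace-only.
    static boolean hasNonBlankValue(String[] values) {
        if (values == null) {
            return false;
        }
        for (String value : values) {
            if (value != null && !value.trim().isEmpty()) {
                return true;
            }
        }
        return false;
    }

    // Runs after the primary rules: copies source values into a target that
    // is still missing or blank. An existing non-blank target always wins.
    static void applyFallbacks(Map<String, String[]> data, Map<String, String> fallbackRules) {
        for (Map.Entry<String, String> rule : fallbackRules.entrySet()) {
            String[] target = data.get(rule.getKey());
            String[] source = data.get(rule.getValue());
            if (!hasNonBlankValue(target) && hasNonBlankValue(source)) {
                data.put(rule.getKey(), source);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String[]> data = new HashMap<>();
        data.put("title", new String[0]); // custom rule matched nothing
        data.put("og:title", new String[] {"OG title"});

        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("title", "og:title");
        rules.put("description", "og:description");

        applyFallbacks(data, rules);
        System.out.println(data.get("title")[0]); // OG title
    }
}
```

Keeping the fallback as a separate pass makes the preference order explicit, which a single XPath union cannot do because a union returns nodes in document order.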

MINOR fixes:
- setDefaultFieldRules(null) now immediately restores built-in defaults
  via the new createDefaultFieldRules() factory; matching behaviour for
  setDefaultFieldFallbackRules(null).
- Remove dead null check in collectJsonLdTypes (getObjectMapper never
  returns null).
- Sanitise \r\n\t in Jackson exception messages to prevent log injection.
- evaluateNonNodeSet logs DEBUG on entry so silent NODESET->non-NODESET
  coercion is observable.
- Add DEBUG logs when JSON-LD auto-fill is skipped due to a pre-existing
  custom value.
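
For the log-injection item, one way to neutralise control characters before logging is shown below. The commit does not say how \r, \n, and \t are sanitised, so replacing each with a single space is an assumption, and LogSanitizerSketch.sanitize is a hypothetical helper.

```java
// Hypothetical sketch; the replacement strategy (single space) is an assumption.
final class LogSanitizerSketch {
    // Replaces CR, LF and TAB so untrusted text cannot start a forged log line.
    static String sanitize(String message) {
        if (message == null) {
            return "";
        }
        return message.replace('\r', ' ').replace('\n', ' ').replace('\t', ' ');
    }

    public static void main(String[] args) {
        String hostile = "Unexpected token\r\nINFO forged\tentry";
        System.out.println(sanitize(hostile)); // Unexpected token  INFO forged entry
    }
}
```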

Tests: 19 new tests covering concurrent extraction, per-thread cache
isolation, cross-thread destroy/clearXPathCache limits, JSON-LD memory
bounds, top-level JSON-LD array, empty/numeric/Japanese @type, multiple
og:image, Twitter Card defaults, fallback rules customisation, and
corruption resilience. Total HtmlExtractorTest: 48 tests, all passing.
Full fess-crawler suite: 1736 tests, 0 failures.