feat(extractor): extract default HTML metadata + cache XPath expressions #164
Open
marevol wants to merge 5 commits into
Populate ExtractData with standard HTML metadata by default (title, description, OpenGraph, Twitter Card, canonical, keywords, author), parse <script type="application/ld+json"> blocks into jsonld.type and jsonld.raw, and cache compiled XPathExpression objects per thread to eliminate per-call recompilation under high crawl rates. The default-field rule map is fully overridable via setDefaultFieldRules and both subsystems can be disabled independently with setExtractDefaultMetadata / setExtractJsonLd. Malformed JSON-LD blocks are logged and skipped without aborting extraction.
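The per-thread caching described above could look roughly like the following. This is a minimal sketch using only JDK classes, not the PR's actual code: the class name XPathCacheSketch and its methods are hypothetical, and wrapping compile failures in IllegalArgumentException is an assumption made to keep the example self-contained.

```java
import java.util.HashMap;
import java.util.Map;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

// Hypothetical illustration; not the HtmlExtractor implementation.
final class XPathCacheSketch {
    // XPath and XPathExpression objects are not thread-safe,
    // so every thread keeps its own evaluator and its own cache.
    private static final ThreadLocal<XPath> XPATH =
            ThreadLocal.withInitial(() -> XPathFactory.newInstance().newXPath());
    private static final ThreadLocal<Map<String, XPathExpression>> CACHE =
            ThreadLocal.withInitial(HashMap::new);

    // Returns the cached expression for this thread, compiling it on first use.
    static XPathExpression compile(String expression) {
        Map<String, XPathExpression> cache = CACHE.get();
        XPathExpression compiled = cache.get(expression);
        if (compiled == null) {
            try {
                compiled = XPATH.get().compile(expression);
            } catch (XPathExpressionException e) {
                // Assumption: surface malformed expressions as an unchecked exception.
                throw new IllegalArgumentException("Invalid XPath: " + expression, e);
            }
            cache.put(expression, compiled);
        }
        return compiled;
    }

    // Clears only the calling thread's cache; other threads are unaffected.
    static void clear() {
        CACHE.get().clear();
    }

    static int size() {
        return CACHE.get().size();
    }
}
```

Because the cache key is the expression string, the cache in this sketch grows only with the number of distinct configured expressions.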
…anup, warn on metadata collisions
Three regressions / gaps were uncovered in the HtmlExtractor PR #164 review:
1. Malformed XPath expressions (in contentXpath or metadataXpathMap) used to log a warning and yield empty values: XPathAPI.eval threw XPathException for both compile and evaluate failures and the catch handled them uniformly. The compile cache split that into a separate getXPathExpression path that throws CrawlerSystemException, which was not caught downstream and therefore propagated out of createExtractData, aborting the whole extraction. Catch CrawlerSystemException in getStringsByXPath (and in extractJsonLd, for symmetry) and restore the warn+empty contract.
2. extractJsonLd unconditionally putValues for jsonld.raw / jsonld.type, silently overwriting any value that an operator-supplied addMetadata("jsonld.raw"/"jsonld.type", ...) rule had already populated. Mirror the precedence rule used by applyDefaultFieldRules: only auto-populate when the key is absent.
3. collectTypeNodes only inspected @type on the immediate object (or array elements). Schema.org markup commonly nests typed entities under @graph, mainEntity, author, publisher, etc.; those @type values were therefore never exposed via jsonld.type. Walk every object child recursively (skipping @type / @context to avoid double-collection and vocabulary leakage). Recursion is bounded by the parser's existing JSONLD_MAX_NESTING_DEPTH guard.
Five regression tests added: malformed metadata XPath, malformed contentXpath, custom jsonld metadata key precedence, @graph type collection, and the @context-object negative case.
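The recursive @type walk from item 3 can be sketched as follows. This is an illustration over plain Map/List values (the shape a JSON parser typically produces), not the PR's collectTypeNodes: the class name, method names, and the MAX_DEPTH value are assumptions, and the real code relies on the parser's JSONLD_MAX_NESTING_DEPTH guard instead of its own counter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of nested @type collection.
final class JsonLdTypeSketch {
    static final int MAX_DEPTH = 32; // assumed bound for this sketch only

    static List<String> collectTypes(Object node) {
        List<String> types = new ArrayList<>();
        walk(node, 0, types);
        return types;
    }

    private static void walk(Object node, int depth, List<String> out) {
        if (depth > MAX_DEPTH) {
            return; // stop descending on overly deep input
        }
        if (node instanceof Map) {
            Map<?, ?> map = (Map<?, ?>) node;
            addTypeValues(map.get("@type"), out);
            for (Map.Entry<?, ?> entry : map.entrySet()) {
                Object key = entry.getKey();
                // Skip @type (already collected) and @context (vocabulary, not data).
                if ("@type".equals(key) || "@context".equals(key)) {
                    continue;
                }
                walk(entry.getValue(), depth + 1, out);
            }
        } else if (node instanceof List) {
            for (Object item : (List<?>) node) {
                walk(item, depth + 1, out);
            }
        }
    }

    // @type may be a single string or an array of strings; other shapes are ignored.
    private static void addTypeValues(Object value, List<String> out) {
        if (value instanceof String) {
            out.add((String) value);
        } else if (value instanceof List) {
            for (Object v : (List<?>) value) {
                if (v instanceof String) {
                    out.add((String) v);
                }
            }
        }
    }
}
```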
…s blank, match JSON-LD type case-insensitively
Two regressions surfaced in code review of PR #164:
1. extractor.xml in fess-crawler-lasta registers addMetadata("title", "//TITLE"), so the metadataXpathMap loop unconditionally calls putValues("title", []) on pages without a <title>. The default-rule existence check (getValues != null) then sees the empty array and skips the og:title fallback, silently disabling the PR's "extract default HTML metadata" intent in real deployments. Switch the predicate to "has a non-blank value" so default rules backfill when the custom rule produced nothing.
2. JSONLD_XPATH matched only the literal lowercase 'application/ld+json'. Per RFC 6838 / HTML5 the type attribute is case-insensitive and may carry surrounding whitespace; NekoHTML uppercases element names but preserves attribute values verbatim, so 'Application/LD+JSON' or ' application/ld+json ' was missed. Use translate(normalize-space(@type), ...) so common real-world variants are picked up.
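A case-insensitive, whitespace-tolerant match of the kind described in item 2 can be written in XPath 1.0 with translate and normalize-space. The sketch below is an assumption-laden illustration, not the PR's JSONLD_XPATH constant: it uses the JDK's XML parser on well-formed input and a lowercase script element name, whereas the real extractor works on NekoHTML output with uppercased element names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration; the expression is a guess at the shape, not the PR's constant.
final class JsonLdXPathSketch {
    // normalize-space trims surrounding whitespace; translate lowercases ASCII letters.
    static final String JSONLD_XPATH = "//script[translate(normalize-space(@type), "
            + "'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz') "
            + "= 'application/ld+json']";

    // Counts script elements whose type matches, ignoring case and outer whitespace.
    static int countJsonLdScripts(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(JSONLD_XPATH, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate sketch XPath", e);
        }
    }
}
```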
…N-LD, and XPath cache
CRITICAL fixes:
- Remove warnOnMetadataKeyCollisions: the warning fired on every default
Fess deployment because the lasta extractor.xml registers
addMetadata("title","//TITLE") via XML <postConstruct> before @PostConstruct
init() runs. The precedence is already documented in applyDefaultFieldRules.
- Catch RuntimeException (DOMException, etc.) per JSON-LD node and at the
outer extractJsonLd boundary so a single broken script node does not abort
the entire extraction, honouring the documented "log and skip" contract.
- evaluateNonNodeSet now adds previousFailure as a suppressed exception to
the fallback failure, preserving full diagnostic context that was being
silently discarded.
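The suppressed-exception fix for evaluateNonNodeSet follows a standard JDK pattern, sketched here under assumptions: the class and method names are hypothetical, and the two Supplier arguments stand in for the primary and fallback evaluation paths.

```java
import java.util.function.Supplier;

// Hypothetical illustration of keeping the first failure when a fallback also fails.
final class SuppressedSketch {
    static String evaluateWithFallback(Supplier<String> primary, Supplier<String> fallback) {
        try {
            return primary.get();
        } catch (RuntimeException previousFailure) {
            try {
                return fallback.get();
            } catch (RuntimeException fallbackFailure) {
                // Attach the earlier failure so both stack traces reach the log.
                fallbackFailure.addSuppressed(previousFailure);
                throw fallbackFailure;
            }
        }
    }
}
```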
MAJOR fixes:
- title / description default rules are now single-source XPaths; ordered
fallback to og:title / og:description is expressed via a new
defaultFieldFallbackRules map (target -> source key) applied after primary
rules. XPath '|' union cannot express ordered preference.
- Add JSONLD_MAX_BLOCK_COUNT (64), JSONLD_MAX_RAW_TOTAL_BYTES (1 MiB), and
JSONLD_MAX_TYPES_PER_BLOCK (256) bounds; short-circuit extractJsonLd loop
and collectTypeNodes recursion to prevent unbounded memory growth on
adversarial pages.
- Add twitter:title, twitter:description, twitter:image, twitter:site to
default rules.
- metadataXpathMap is now LinkedHashMap so values[0] order is deterministic.
- getStringsByXPath NODESET branch now catches DOMException per node.
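The ordered fallback (target -> source key) applied after the primary rules might look like the sketch below. It is an illustration only: the metadata map type, the class name, and the method names are assumptions, and the "has a non-blank value" predicate is modelled on the earlier commit's description rather than copied from the code.

```java
import java.util.List;
import java.util.Map;

// Hypothetical illustration of applying fallback rules after primary extraction.
final class FallbackRulesSketch {
    // For each target key without a usable value, copy the source key's values.
    static void applyFallbacks(Map<String, List<String>> metadata,
                               Map<String, String> fallbackRules) {
        for (Map.Entry<String, String> rule : fallbackRules.entrySet()) {
            String target = rule.getKey();
            if (hasNonBlankValue(metadata.get(target))) {
                continue; // the primary rule already produced something
            }
            List<String> source = metadata.get(rule.getValue());
            if (hasNonBlankValue(source)) {
                metadata.put(target, source);
            }
        }
    }

    // An empty list or a list of blank strings counts as "no value".
    static boolean hasNonBlankValue(List<String> values) {
        if (values == null) {
            return false;
        }
        for (String value : values) {
            if (value != null && !value.trim().isEmpty()) {
                return true;
            }
        }
        return false;
    }
}
```

With a rule map such as title -> og:title, a page without a title element but with an og:title meta tag would end up with the OpenGraph value under title, while a page with both keeps the primary value.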
MINOR fixes:
- setDefaultFieldRules(null) now immediately restores built-in defaults
via the new createDefaultFieldRules() factory; matching behaviour for
setDefaultFieldFallbackRules(null).
- Remove dead null check in collectJsonLdTypes (getObjectMapper never
returns null).
- Sanitise \r\n\t in Jackson exception messages to prevent log injection.
- evaluateNonNodeSet logs DEBUG on entry so silent NODESET->non-NODESET
coercion is observable.
- Add DEBUG logs when JSON-LD auto-fill is skipped due to a pre-existing
custom value.
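The log-injection fix can be illustrated with a small helper. This is a sketch, not the PR's code: the class and method names are hypothetical, and replacing each control character with a single space is an assumption, since the commit message does not say what the characters are replaced with.

```java
// Hypothetical illustration of neutralising line breaks and tabs before logging.
final class LogSanitizerSketch {
    static String sanitize(String message) {
        if (message == null) {
            return "";
        }
        // Assumption: each of \r, \n, \t becomes a space so forged log lines stay on one line.
        return message.replace('\r', ' ').replace('\n', ' ').replace('\t', ' ');
    }
}
```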
Tests: 19 new tests covering concurrent extraction, per-thread cache
isolation, cross-thread destroy/clearXPathCache limits, JSON-LD memory
bounds, top-level JSON-LD array, empty/numeric/Japanese @type, multiple
og:image, Twitter Card defaults, fallback rules customisation, and
corruption resilience. Total HtmlExtractorTest: 48 tests, all passing.
Full fess-crawler suite: 1736 tests, 0 failures.
Summary
- Parse <script type="application/ld+json"> blocks; expose jsonld.type and jsonld.raw (multivalue). Malformed JSON is skipped with a warn log.
- Cache compiled XPathExpression per thread (via ThreadLocal<Map> and ThreadLocal<XPath>) to eliminate per-call recompilation under high crawl rate.
- setDefaultFieldRules(Map), setExtractDefaultMetadata(boolean), setExtractJsonLd(boolean) for full opt-out / customization. New clearXPathCache() for dynamic rule changes.
Why
Most search use cases want title and description for snippet rendering; users currently have to wire each XPath manually. JSON-LD provides high-quality structured data signals. XPath compilation in a hot loop was wasted work.
Threat model
HTML content is untrusted. JSON-LD parsing uses Jackson with default settings; malformed input is caught and logged, never fails extraction. XPath cache key is the expression string (admin-configured), bounded by configured rules — no unbounded growth from untrusted input.
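A later commit in this PR adds JSONLD_MAX_BLOCK_COUNT (64) and JSONLD_MAX_RAW_TOTAL_BYTES (1 MiB) so that untrusted pages cannot drive unbounded memory growth. The sketch below shows one way such limits could short-circuit a loop over script blocks. It is an assumption, not the PR's extractJsonLd: the class and method names are hypothetical, and stopping at the first block that would exceed the byte budget (rather than skipping it) is a guess.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of bounding JSON-LD input from an untrusted page.
final class JsonLdBoundsSketch {
    static final int MAX_BLOCK_COUNT = 64;              // value taken from the commit message
    static final int MAX_RAW_TOTAL_BYTES = 1024 * 1024; // 1 MiB, from the commit message

    // Returns the leading blocks that fit within both limits.
    static List<String> acceptBlocks(List<String> blocks) {
        List<String> accepted = new ArrayList<>();
        long totalBytes = 0;
        for (String block : blocks) {
            if (accepted.size() >= MAX_BLOCK_COUNT) {
                break; // block-count limit reached
            }
            long size = block.getBytes(StandardCharsets.UTF_8).length;
            if (totalBytes + size > MAX_RAW_TOTAL_BYTES) {
                break; // assumption: stop entirely once the byte budget would be exceeded
            }
            totalBytes += size;
            accepted.add(block);
        }
        return accepted;
    }
}
```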
Tests
- @type, malformed JSON resilience.
- clearXPathCache() empties cache.
Verification
- mvn -pl fess-crawler test -Dtest=HtmlExtractorTest → 18/18 pass.
- mvn -pl fess-crawler test → 1706 run, 0 failures, 55 pre-existing env-dependent errors (Docker/LibreOffice).
- mvn formatter:format && mvn license:format clean.
Test plan