
NUTCH-3164 Catch specific exceptions in CrawlDbFilter so plugin errors no longer silently drop URLs #918

Open
prakharchaube wants to merge 1 commit into apache:master from prakharchaube:NUTCH-3164

@prakharchaube
Contributor

Summary

Both catch blocks in CrawlDbFilter.map() caught generic Exception and set url = null. That silently dropped URLs in two different situations: legitimate filtering, and plugin programming errors (NPE, etc.). As a result, plugin bugs were indistinguishable from ordinary filtering decisions.

Changes

Normalizer block

  • MalformedURLException: the only legitimate reason to drop a URL in this block. Tracked via ErrorTracker (ErrorType.URL); it no longer increments urlsFilteredCounter,
    because doing so conflated filtering with malformed input.
  • RuntimeException: logged at ERROR and tracked. The URL is not dropped, so plugin bugs do not silently delete data.
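
The normalizer handling described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this PR: `Normalizer`, `normalizeOrKeep`, `ERROR_COUNTS`, and the `"PLUGIN"` category are hypothetical stand-ins for the Nutch normalizer chain and ErrorTracker, whose real signatures are not shown here.

```java
import java.net.MalformedURLException;
import java.util.HashMap;
import java.util.Map;

final class NormalizerHandlingSketch {

  /** Hypothetical stand-in for a normalizer plugin chain (not the Nutch API). */
  interface Normalizer {
    String normalize(String url) throws MalformedURLException;
  }

  /** Hypothetical stand-in for categorized error counters. */
  static final Map<String, Integer> ERROR_COUNTS = new HashMap<>();

  /**
   * Returns the normalized URL, null if the URL is malformed, or the
   * original URL unchanged if the normalizer itself failed.
   */
  static String normalizeOrKeep(Normalizer normalizer, String url) {
    try {
      return normalizer.normalize(url);
    } catch (MalformedURLException e) {
      // Malformed input: the one case where dropping the URL is intended.
      ERROR_COUNTS.merge("URL", 1, Integer::sum);
      return null;
    } catch (RuntimeException e) {
      // Plugin bug (NPE, etc.): report loudly, but keep the URL.
      System.err.println("ERROR normalizer failed for " + url + ": " + e);
      ERROR_COUNTS.merge("PLUGIN", 1, Integer::sum);
      return url;
    }
  }

  public static void main(String[] args) {
    Normalizer buggy = u -> { throw new NullPointerException("plugin bug"); };
    Normalizer strict = u -> { throw new MalformedURLException("no protocol: " + u); };
    System.out.println(normalizeOrKeep(buggy, "http://example.com/"));
    System.out.println(normalizeOrKeep(strict, "not a url"));
    System.out.println(ERROR_COUNTS);
  }
}
```

The point of the separate catch clauses is that only the malformed-input path returns null; a failing plugin leaves the URL in place and shows up in its own counter.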

Filter block

  • URLFilterException: per the URLFilter contract, this is reserved for internal filter failures (rejection is signaled by returning null). Logged at ERROR and tracked.
    The URL is not dropped.
  • RuntimeException: same handling as in the normalizer block (logged at ERROR and tracked, URL not dropped).
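
To illustrate the difference between a filter rejecting a URL and a filter failing, here is a small sketch under stated assumptions. `Filter`, `FilterFailure`, `filterOrKeep`, and `ERRORS` are hypothetical names standing in for the Nutch filter chain, URLFilterException, and ErrorTracker; the PR may well use separate catch clauses rather than the multi-catch shown here.

```java
import java.util.ArrayList;
import java.util.List;

final class FilterHandlingSketch {

  /** Hypothetical stand-in for a checked "filter failed internally" exception. */
  static class FilterFailure extends Exception {
    FilterFailure(String message) {
      super(message);
    }
  }

  /** Hypothetical stand-in for a filter chain: returning null means "rejected". */
  interface Filter {
    String filter(String url) throws FilterFailure;
  }

  /** Hypothetical error log so callers can inspect what went wrong. */
  static final List<String> ERRORS = new ArrayList<>();

  /** Returns null only when the filter rejected the URL by returning null. */
  static String filterOrKeep(Filter filter, String url) {
    try {
      return filter.filter(url);
    } catch (FilterFailure | RuntimeException e) {
      // A failure is not a rejection: record it and keep the URL.
      ERRORS.add("ERROR filter failed for " + url + ": " + e);
      return url;
    }
  }

  public static void main(String[] args) {
    Filter rejectAll = u -> null;
    Filter broken = u -> { throw new FilterFailure("regex file missing"); };
    System.out.println(filterOrKeep(rejectAll, "http://example.com/a"));
    System.out.println(filterOrKeep(broken, "http://example.com/b"));
    System.out.println(ERRORS);
  }
}
```

With this shape, a null result always means a deliberate filtering decision, so a filtered-URL counter incremented on null stays free of plugin failures.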

All error paths now use ErrorTracker for categorized counters and log at ERROR rather than WARN, per @lewismc's recommendation in the JIRA discussion.

JIRA

NUTCH-3164

Test plan

  • ant compile passes locally
  • No new unit tests in this PR; happy to add one if reviewers want a mock plugin throwing each exception type (flag as follow-up otherwise)

Out of scope (separate tickets if desired)

  • The same catch (Exception) pattern exists in Injector.filterNormalize and ~20 other call sites of normalizers.normalize / filters.filter. Per @lewismc's comment, those
    should be addressed in separate changes.
  • Whether to track normalized URLs as a metric system-wide.
