NUTCH-3150 Expand Caching Hadoop Counter References #892

Merged
lewismc merged 1 commit into apache:master from lewismc:NUTCH-3150 on Feb 10, 2026

@lewismc lewismc commented Feb 8, 2026

PR for NUTCH-3150, which implements comprehensive counter caching across all MapReduce jobs to eliminate repeated context.getCounter() lookups in hot paths.

Breaking this PR down...

  • Counter caching is now implemented in 16 MapReduce classes using a standardized initCounters(Context context) pattern, which I think makes the code easier to read and allows for more intuitive future development around metrics. I saw @igiguere evolving metrics counters in NUTCH-1732: allow deleting non-parsable documents #891, which is excellent :)
  • Migrated DomainStatistics.java from custom enum to NutchMetrics constants with cached counters.
  • Refactored inline counter initialization to dedicated initCounters() methods for consistency across:
    • Core crawl jobs: Fetcher, Generator, Injector, CrawlDbFilter, CrawlDbReducer
    • Post-processing: DeduplicationJob, CleaningJob, ParseSegment
    • Analytics: DomainStatistics, WebGraph, SitemapProcessor
    • HostDB: UpdateHostDbMapper, UpdateHostDbReducer, ResolverThread
    • Export: WARCExporter
    • Indexing: IndexerMapReduce
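
To illustrate the initCounters(Context context) pattern described above, here is a minimal sketch rather than the code from this PR. So that it runs without Hadoop on the classpath, `Counter` and `Context` are simplified stand-ins for Hadoop's counter and task context types; `SketchMapper`, the group name, and the counter names are hypothetical and are not taken from NutchMetrics. The stand-in context counts calls to getCounter() to show that lookups happen once in setup and not once per record.

```java
import java.util.HashMap;
import java.util.Map;

public class CounterCachingSketch {

  /** Simplified stand-in for a Hadoop counter (assumption, not the real class). */
  static final class Counter {
    private long value;

    void increment(long incr) {
      value += incr;
    }

    long getValue() {
      return value;
    }
  }

  /** Simplified stand-in for the task context; records how often getCounter() is called. */
  static final class Context {
    private final Map<String, Counter> counters = new HashMap<>();
    int lookups = 0;

    Counter getCounter(String group, String name) {
      lookups++; // each call models a (comparatively expensive) counter lookup
      return counters.computeIfAbsent(group + ":" + name, k -> new Counter());
    }
  }

  /** Hypothetical mapper-like class that caches its counter references. */
  static final class SketchMapper {
    // Hypothetical names for illustration only.
    static final String GROUP = "sketch_group";
    static final String PROCESSED = "records_processed";
    static final String SKIPPED = "records_skipped";

    private Counter processedCounter;
    private Counter skippedCounter;

    /** Called once per task, before any records are processed. */
    void setup(Context context) {
      initCounters(context);
    }

    /** Resolve every counter once and keep the references in fields. */
    private void initCounters(Context context) {
      processedCounter = context.getCounter(GROUP, PROCESSED);
      skippedCounter = context.getCounter(GROUP, SKIPPED);
    }

    /** Hot path: only increments cached references, never calls getCounter(). */
    void map(String record) {
      if (record.isEmpty()) {
        skippedCounter.increment(1);
        return;
      }
      processedCounter.increment(1);
    }
  }

  public static void main(String[] args) {
    Context context = new Context();
    SketchMapper mapper = new SketchMapper();
    mapper.setup(context);
    for (String record : new String[] {"a", "", "b", "c"}) {
      mapper.map(record);
    }
    // Two lookups in total, regardless of how many records were mapped.
    System.out.println("lookups=" + context.lookups);
  }
}
```

In a real job the fields would hold Hadoop counter references obtained from the task context in setup(), and the per-record methods would only call increment() on them.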

... the metrics journey continues.

EDIT: I'll add that I'm absolutely fine with this NOT being included in 1.22. I plan to continue evolving Nutch metrics during the 1.23 development cycle.

@sebastian-nagel sebastian-nagel left a comment

Looks good to me.

Ran several Nutch tools in local and pseudo-distributed mode to check for potential issues.

@lewismc lewismc merged commit f7c7e1a into apache:master Feb 10, 2026
6 checks passed
@lewismc lewismc deleted the NUTCH-3150 branch February 10, 2026 19:09

lewismc commented Feb 10, 2026

Thanks @sebastian-nagel for the review.

sebastian-nagel pushed a commit to commoncrawl/nutch that referenced this pull request Feb 25, 2026