
NUTCH-1732: allow deleting non-parsable documents#891

Open
igiguere wants to merge 8 commits into apache:master from igiguere:NUTCH-1732-delete-not-parsable-documents

Conversation

@igiguere

@igiguere igiguere commented Feb 5, 2026

NUTCH-1732

Add CrawlDatum status "parse_failed"/"db_parse_failed", set status in CrawlDbReducer.

Handle deletion in IndexerMapReduce.

Add unit tests.

Adjust documentation in nutch-default.xml: parser.delete.failed is configured separately from delete.gone, so the user can decide whether or not to delete parse failures.

ant clean runtime test passed.

I'm not sure how to run a functional test... how do I make parsing fail the 2nd time a page is fetched?

Add CrawlDatum status "parse_failed"/"db_parse_failed", set status in
CrawlDbReducer.

Handle deletion in IndexerMapReduce.

Add unit tests.
@igiguere igiguere marked this pull request as ready for review February 11, 2026 21:41
Member

@lewismc lewismc left a comment


Hi @igiguere I had a think about this over the weekend and will share my thoughts on two issues

Functional test

The new private CrawlTestParserFailure inner class always fails parsing. To test the realistic scenario of "succeed first, fail second", you could introduce a flip-flopping variant, similar to how CrawlTestSignatureReset alternates between STATUS_FETCH_SUCCESS and STATUS_FETCH_GONE. Something like:

private class CrawlTestParserFailureAlternating extends ContinuousCrawlTestUtil {
  int counter = 0;

  CrawlTestParserFailureAlternating(Context context) {
    super(context);
  }

  @Override
  protected CrawlDatum fetch(CrawlDatum datum, long currentTime) {
    counter++;
    datum.setStatus(STATUS_FETCH_SUCCESS); // always fetched OK
    datum.setFetchTime(currentTime);
    return datum;
  }

  @Override
  protected List<CrawlDatum> parse(CrawlDatum fetchDatum) {
    List<CrawlDatum> parseDatums = new ArrayList<>(0);
    if (counter % 2 == 0) {
      // Even fetches: parsing fails
      parseDatums.add(new CrawlDatum(STATUS_PARSE_FAILED, 0));
    } else {
      // Odd fetches: parsing succeeds (emit signature)
      CrawlDatum sig = new CrawlDatum(STATUS_SIGNATURE, 0);
      sig.setSignature(getSignature());
      parseDatums.add(sig);
    }
    return parseDatums;
  }

  @Override
  protected boolean check(CrawlDatum result) {
    if (counter % 2 == 0) {
      return result.getStatus() == STATUS_DB_PARSE_FAILED;
    } else {
      return result.getStatus() == STATUS_DB_FETCHED
          || result.getStatus() == STATUS_DB_NOTMODIFIED;
    }
  }
}

This seems fairly practical and aligns with existing test patterns.

There is another solution! We already have CrawlDBTestUtil.getServer() for spinning up an embedded Jetty server. You could replace the static ResourceHandler with a custom handler that tracks the request count per URL and serves different content on subsequent requests (same principle as above, and maybe more representative of real-life fetching). This is also a good test... your code could be something like:

import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.eclipse.jetty.server.Request;
import org.eclipse.jetty.server.handler.AbstractHandler;

public class FlipFlopHandler extends AbstractHandler {
  private final AtomicInteger requestCount = new AtomicInteger(0);

  @Override
  public void handle(String target, Request baseRequest,
      HttpServletRequest request, HttpServletResponse response)
      throws IOException, ServletException {
    int count = requestCount.incrementAndGet();
    response.setStatus(HttpServletResponse.SC_OK);
    if (count % 2 == 1) {
      // 1st fetch: valid HTML
      response.setContentType("text/html");
      response.getWriter().write("<html><head><title>Test</title></head>"
          + "<body>Hello World</body></html>");
    } else {
      // 2nd fetch: serve binary garbage with text/html MIME type
      // This will cause the HTML parser to fail
      response.setContentType("text/html");
      byte[] garbage = new byte[1024];
      new java.util.Random(42).nextBytes(garbage);
      response.getOutputStream().write(garbage);
    }
    baseRequest.setHandled(true);
  }
}

It would be good to implement some kind of test though. I agree.

Detailed Trace of testDeleteParseFailure

Design Intent

Your intent is that parser.delete.failed is configured separately from indexer.delete (which controls deletion of gone/redirect/duplicate documents), so the user can independently decide whether to delete parse failures. Does this mean the parse-failure deletion path in IndexerMapReduce should be gated on parser.delete.failed, not on indexer.delete?

If that is true, I think there are three issues in the current implementation and test.

Issue 1: The reducer guards parse-failure deletion on indexer.delete, not parser.delete.failed

The test correctly sets:

configuration.setBoolean(ParseSegment.DELETE_FAILED_PARSE, true);

However, the reducer's setup() method never reads parser.delete.failed. Instead, the parse-failure block at line 361 of IndexerMapReduce.java is guarded by the delete variable, which is wired to indexer.delete:

// in setup():
delete = conf.getBoolean(INDEXER_DELETE, false);  // "indexer.delete", default false
// in reduce():
// Whether to delete pages where parsing failed
if (delete && fetchDatum != null) {
  if (fetchDatum.getStatus() == CrawlDatum.STATUS_PARSE_FAILED
      || dbDatum != null && dbDatum.getStatus() == CrawlDatum.STATUS_DB_PARSE_FAILED) {
    deletedFailedParseCounter.increment(1);
    context.write(key, DELETE_ACTION);
    return;
  }
}

Since indexer.delete defaults to false and the test doesn't set it, the condition if (delete && ...) is always false. The parse-failure deletion code is never executed.

To match the design intent, setup() should read parser.delete.failed into its own boolean, and the guard should use that instead of delete:

// in setup():
deleteFailedParse = conf.getBoolean(ParseSegment.DELETE_FAILED_PARSE, false);

// in reduce():
if (deleteFailedParse && fetchDatum != null) {
  if (fetchDatum.getStatus() == CrawlDatum.STATUS_PARSE_FAILED
      || dbDatum != null && dbDatum.getStatus() == CrawlDatum.STATUS_DB_PARSE_FAILED) {
    deletedFailedParseCounter.increment(1);
    context.write(key, DELETE_ACTION);
    return;
  }
}

This way a user can set parser.delete.failed=true without needing indexer.delete=true, and conversely can have indexer.delete=true (for gone/redirects/duplicates) without automatically also deleting parse failures.

Issue 2: STATUS_PARSE_FAILED (0x45) would throw a RuntimeException in the classification loop

In the reducer's value classification loop:

} else if (value instanceof CrawlDatum) {
  final CrawlDatum datum = (CrawlDatum) value;
  if (CrawlDatum.hasDbStatus(datum)) {
    dbDatum = datum;
  } else if (CrawlDatum.hasFetchStatus(datum)) {
    if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED) {
      fetchDatum = datum;
    }
  } else if (CrawlDatum.STATUS_LINKED == datum.getStatus()
      || CrawlDatum.STATUS_SIGNATURE == datum.getStatus()
      || CrawlDatum.STATUS_PARSE_META == datum.getStatus()) {
    continue;
  } else {
    throw new RuntimeException("Unexpected status: " + datum.getStatus());
  }

STATUS_PARSE_FAILED = 0x45. The range checks:

  • hasDbStatus(): status <= 0x1f → false (0x45 > 0x1f)
  • hasFetchStatus(): status > 0x1f && status <= 0x3f → false (0x45 > 0x3f)
  • Explicit matches: STATUS_LINKED (0x43), STATUS_SIGNATURE (0x41),
    STATUS_PARSE_META (0x44) → none match 0x45
  • Falls through to else → RuntimeException("Unexpected status: 69")

So if a CrawlDatum with STATUS_PARSE_FAILED were passed into the indexer reducer's values, it would crash. The test avoids this by passing crawlDatumFetchSuccess (STATUS_FETCH_SUCCESS = 0x21) and crawlDatumDbFetched (STATUS_DB_FETCHED = 0x02), but this means the fetchDatum.getStatus() == CrawlDatum.STATUS_PARSE_FAILED branch of the condition at line 362 is untestable without first adding STATUS_PARSE_FAILED to the classification logic (either as a recognized skip status or as a fetch-like status).
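For quick reference, the range logic can be reduced to a standalone sketch. The hex constants mirror the CrawlDatum values quoted in this thread; the class name and the String return values are mine, purely for illustration:

```java
// Standalone sketch of the reducer's classification ranges.
// Constants mirror the CrawlDatum values discussed above.
public class StatusClassification {
  static final byte STATUS_SIGNATURE = 0x41;
  static final byte STATUS_LINKED = 0x43;
  static final byte STATUS_PARSE_META = 0x44;
  static final byte STATUS_PARSE_FAILED = 0x45;

  // db statuses occupy 0x00..0x1f, fetch statuses 0x20..0x3f
  static boolean hasDbStatus(byte status)    { return status <= 0x1f; }
  static boolean hasFetchStatus(byte status) { return status > 0x1f && status <= 0x3f; }

  static String classify(byte status) {
    if (hasDbStatus(status)) return "db";
    if (hasFetchStatus(status)) return "fetch";
    if (status == STATUS_LINKED || status == STATUS_SIGNATURE
        || status == STATUS_PARSE_META) return "skip";
    return "unexpected"; // the real reducer throws RuntimeException here
  }

  public static void main(String[] args) {
    System.out.println(classify((byte) 0x02));         // db (STATUS_DB_FETCHED)
    System.out.println(classify((byte) 0x21));         // fetch (STATUS_FETCH_SUCCESS)
    System.out.println(classify(STATUS_PARSE_FAILED)); // unexpected
  }
}
```

Running this makes the gap obvious: 0x45 falls into neither range and matches no explicit skip status.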

Issue 3: The test passes for the wrong reason

When failParse is true (i=0, 2, 4), the test asserts assertNull(doc). Here is what actually happens:

  1. The reducer classifies values normally: crawlDatumDbFetched -> dbDatum,
    crawlDatumFetchSuccess -> fetchDatum, failedParseData -> parseData,
    parseText -> parseText, content -> content.

  2. The parse-failure deletion block is skipped (Issue 1).

  3. Execution reaches line 388:

    if (fetchDatum == null || parseText == null || parseData == null) {
      return; // only have inlinks
    }

    All three are non-null, so execution continues.

  4. Execution reaches line 405:

    if (!parseData.getStatus().isSuccess()
        || fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_SUCCESS) {
      return;
    }

    When failedParseData is used (which has ParseStatus.STATUS_FAILURE),
    parseData.getStatus().isSuccess() returns false, so !isSuccess() is true, and the
    reducer returns early without writing anything to the context.

  5. Since nothing is written to reduceResult, runIndexer returns null, and
    assertNull(doc) passes.

Summary

The test passes, but for the wrong reason.

What the test intends to verify vs. what actually happens:

  • Intended: parser.delete.failed=true triggers deletion of failed-parse docs.
    Actual: parser.delete.failed is never read by the reducer; the deletion block is gated on indexer.delete, which defaults to false.
  • Intended: the new STATUS_PARSE_FAILED / STATUS_DB_PARSE_FAILED check at lines 360–368 fires.
    Actual: that block is never entered because delete == false.
  • Intended: assertNull(doc) confirms a DELETE action was emitted.
    Actual: doc is null because the pre-existing guard at line 405 silently returns when ParseStatus is not isSuccess(); no DELETE is emitted, the document is simply skipped.

The document is not indexed on failParse iterations, but it is silently skipped, not explicitly deleted. These are very different outcomes: a silent skip means the document stays in the index from a previous successful crawl (which is the exact problem NUTCH-1732 is trying to solve).

What Needs to Be Fixed

  1. The reducer must read parser.delete.failed in setup() and use it as an independent
    guard for the parse-failure deletion block, instead of reusing the delete variable
    (which is tied to indexer.delete). This is the core fix needed to match the design intent:

    // in setup():
    deleteFailedParse = conf.getBoolean(ParseSegment.DELETE_FAILED_PARSE, false);
    
    // in reduce():
    if (deleteFailedParse && fetchDatum != null) {
      if (fetchDatum.getStatus() == CrawlDatum.STATUS_PARSE_FAILED
          || dbDatum != null && dbDatum.getStatus() == CrawlDatum.STATUS_DB_PARSE_FAILED) {
        deletedFailedParseCounter.increment(1);
        context.write(key, DELETE_ACTION);
        return;
      }
    }
  2. STATUS_PARSE_FAILED (0x45) needs to be recognized in the CrawlDatum classification
    loop (lines 318–333). Currently it would throw
    RuntimeException("Unexpected status: 69"). It should either be added as a skip
    (continue) alongside STATUS_LINKED / STATUS_SIGNATURE / STATUS_PARSE_META, or
    handled in a way that allows the delete check to use it.

  3. The test should pass a CrawlDatum with STATUS_PARSE_FAILED or
    STATUS_DB_PARSE_FAILED instead of (or in addition to) relying on ParseData with
    STATUS_FAILURE. The NUTCH-1732 logic at lines 360–368 checks the CrawlDatum status,
    not the ParseData status.

  4. The test should verify that a DELETE action was actually emitted, not just that doc is
    null. Currently the runIndexer helper filters out DELETE actions (line 217:
    if (e.getValue().action != NutchIndexAction.DELETE)) and returns null for them, which is
    indistinguishable from a silent skip. A more robust assertion would check that
    reduceResult contains a NutchIndexAction.DELETE entry.
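To make the ambiguity of assertNull(doc) concrete, here is a self-contained illustration. IndexAction is a stand-in for NutchIndexAction, and docFromResult mimics the filtering behaviour of the current runIndexer helper; all names are mine, not Nutch's:

```java
import java.util.HashMap;
import java.util.Map;

// Illustration: why assertNull(doc) cannot distinguish a silent skip
// from an explicit DELETE. Names are stand-ins, not Nutch APIs.
public class DeleteAssertionSketch {
  enum IndexAction { ADD, DELETE }

  // Mimics runIndexer's current behaviour: DELETE entries are filtered
  // out, so the caller sees null either way.
  static Object docFromResult(Map<String, IndexAction> reduceResult, String key) {
    IndexAction action = reduceResult.get(key);
    if (action == null || action == IndexAction.DELETE) {
      return null; // indistinguishable: skipped vs. explicitly deleted
    }
    return new Object(); // stands in for the indexed NutchDocument
  }

  // The stronger assertion: inspect the raw reduce output for a DELETE entry.
  static boolean deleteWasEmitted(Map<String, IndexAction> reduceResult, String key) {
    return reduceResult.get(key) == IndexAction.DELETE;
  }

  public static void main(String[] args) {
    Map<String, IndexAction> silentSkip = new HashMap<>();     // nothing written
    Map<String, IndexAction> explicitDelete = new HashMap<>();
    explicitDelete.put("http://example.com/", IndexAction.DELETE);

    // assertNull(doc) passes for BOTH maps...
    System.out.println(docFromResult(silentSkip, "http://example.com/"));      // null
    System.out.println(docFromResult(explicitDelete, "http://example.com/"));  // null
    // ...but only the explicit delete satisfies the stronger check.
    System.out.println(deleteWasEmitted(silentSkip, "http://example.com/"));     // false
    System.out.println(deleteWasEmitted(explicitDelete, "http://example.com/")); // true
  }
}
```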

Additional comments

  1. I think the new configuration property should be renamed from parser.delete.failed to parser.delete.failed.parse, but this is minor in comparison to the above.
  2. A total side observation (unrelated to your PR): I noticed that the indexer.delete configuration property is absent from nutch-default.xml... is this intentional?
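If indexer.delete were to be documented, a minimal nutch-default.xml entry could look like the sketch below. The default of false matches the setup() call quoted earlier; the description wording is mine, not official:

```xml
<!-- Sketch only; description wording is not authoritative. -->
<property>
  <name>indexer.delete</name>
  <value>false</value>
  <description>Whether the indexing job deletes documents that are
  gone, redirected, or marked as duplicates, instead of skipping
  them.</description>
</property>
```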

@igiguere
Author

igiguere commented Feb 18, 2026

Hi @igiguere I had a think about this over the weekend and will share my thoughts on two issues
(...)

What Needs to Be Fixed

1. The reducer must read `parser.delete.failed` in `setup()` and use it as an independent
   guard for the parse-failure deletion block, instead of reusing the `delete` variable
   (which is tied to `indexer.delete`). 

(...)

2. `STATUS_PARSE_FAILED` (0x45) needs to be recognized in the `CrawlDatum` classification
   loop (lines 318–333).

(...)

3. The test should pass a `CrawlDatum` with `STATUS_PARSE_FAILED` or
   `STATUS_DB_PARSE_FAILED` instead of (or in addition to) relying on `ParseData` with
   `STATUS_FAILURE`. 

(...)

4. The test should verify that a DELETE action was actually emitted, not just that `doc` is
   null. 

(...)

Additional comments

1. I think the new configuration property should be changed from `parser.delete.failed` to `parser.delete.failed.parse` but this is minor in comparison to the above.

2. Total side observation (unrelated to your PR) I noticed that the `indexer.delete` configuration property is absent from `nutch-default.xml`... is this intentional?

Thanks for the review, @lewismc
And sorry.
Really. I feel like an idiot. This PR should not have been published in its current state.
I have to think back 10 years to remember any review on this scale. I don't know what happened. I should have seen at least 90% of all that myself.
I'm fixing this. I hope my brain activates this time.
Thanks for the tip about CrawlDBTestUtil.getServer(). I'll use that.

Note: the indexer.delete configuration property pre-dates this PR. So that, at least, is not on me. I'll add it to nutch-default

@lewismc
Member

lewismc commented Feb 18, 2026

This was a really good refresher for me as well. It touches parts of the codebase I hadn't looked at for a while... but I must admit, it took me a while. I was scratching my head in the debugger for ages.
@igiguere also, apologies my comment above was a bit short. No need to apologize at all. Solid peer review is the beauty of open source and definitely the part that's had the most impact on me personally. Please ping me if and when you want more peer review. Thank you.

Isabelle Giguere and others added 2 commits February 24, 2026 16:35
Fix IndexerMapReduce.

Improve test for failed parsing in TestCrawlDbStates and
TestIndexerMapReduce.

TestFetchWithParseFailure "should" test the fetcher, but it strangely
does not fully run most of the time.  Disabled while thinking of a
solution.
@igiguere
Author

igiguere commented Feb 24, 2026

This was a really good refresher for me as well. It touches parts of the codebase I hadn't looked at for a while... but I must admit, it took me a while. I was scratching my head in the debugger for ages. @igiguere also, apologies my comment above was a bit short. No need to apologize at all. Solid peer review is the beauty of open source and definitely the part thats had the most impact on me personally. Please ping me if and when you want more peer review. Thank you.

@lewismc

I fixed IndexerMapReduce, so it is now properly configured.

I improved test for failed parsing in TestCrawlDbStates and TestIndexerMapReduce.

The new test TestFetchWithParseFailure "should" test the fetcher, but it strangely does not fully run most of the time. It's disabled while thinking of a solution. Feel free to pitch in with ideas.

UPDATE: Silly me. The full log is logs/hadoop.log, not the output file under build/test/... That file looks like a half-run test. The content of logs/hadoop.log shows the full test run.

UPDATE 2: The default HTML parser (Neko) is a lot more tolerant than expected. Even random bytes (garbled enough that the log file won't open if they are written to it) do not produce a parsing error.

UPDATE 3 : Also tried the "Tagsoup" HTML parser, but that one does not produce a failed parse either. I also tried to see if I could extend HtmlParser, but plugin classes are not visible on the unit test classpath, so it does not even compile. Otherwise, there's no hook to easily insert a fake or mock plugin to override the behavior of HtmlParser.
I'm thinking the fetcher part is just not testable for this feature.

UPDATE 4: I found a way to test! Just set mime.type.magic=false, and let the ParserFactory throw an exception. At least it tests the fetcher logic with property parser.delete.failed.parse=true. However, the case where the HTML parser returns a non-null ParseResult with failed status is not tested. Ideally, I would have liked to inject a mock, if not a mock parser, as in my previous update, at least a mock ParseUtil. But that is constructed in FetcherThread.
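The mime.type.magic workaround described above could be expressed as test configuration entries along these lines. This is only an illustration (the description wording is mine); parser.delete.failed.parse is the renamed property discussed earlier in the review:

```xml
<!-- Sketch of the test configuration described above (not authoritative). -->
<property>
  <name>mime.type.magic</name>
  <value>false</value>
  <description>Disable content sniffing, so the ParserFactory must trust
  the served Content-Type and throws on unparsable input.</description>
</property>
<property>
  <name>parser.delete.failed.parse</name>
  <value>true</value>
  <description>Delete documents whose parse failed (the feature under
  test).</description>
</property>
```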

Isabelle Giguere and others added 2 commits February 24, 2026 17:27
Change the type of ExecutorService, to no avail.

Fix test conf to have just 1 fetcher thread.
@lewismc
Member

lewismc commented Mar 21, 2026

Hi @igiguere I'll look into this shortly. Thank you.
