NUTCH-1732: allow deleting non-parsable documents #891
igiguere wants to merge 8 commits into apache:master
Conversation
Add CrawlDatum status "parse_failed"/"db_parse_failed", set status in CrawlDbReducer. Handle deletion in IndexerMapReduce. Add unit tests.
Hi @igiguere I had a think about this over the weekend and will share my thoughts on two issues
Functional test
The new private `CrawlTestParserFailure` inner class always fails parsing. To test the realistic scenario of "succeed first, fail second", you could introduce a flip-flopping variant, similar to how `CrawlTestSignatureReset` alternates between `STATUS_FETCH_SUCCESS` and `STATUS_FETCH_GONE`. Something like:
```java
private class CrawlTestParserFailureAlternating extends ContinuousCrawlTestUtil {
  int counter = 0;

  CrawlTestParserFailureAlternating(Context context) {
    super(context);
  }

  @Override
  protected CrawlDatum fetch(CrawlDatum datum, long currentTime) {
    counter++;
    datum.setStatus(STATUS_FETCH_SUCCESS); // always fetched OK
    datum.setFetchTime(currentTime);
    return datum;
  }

  @Override
  protected List<CrawlDatum> parse(CrawlDatum fetchDatum) {
    List<CrawlDatum> parseDatums = new ArrayList<>(0);
    if (counter % 2 == 0) {
      // Even fetches: parsing fails
      parseDatums.add(new CrawlDatum(STATUS_PARSE_FAILED, 0));
    } else {
      // Odd fetches: parsing succeeds (emit signature)
      CrawlDatum sig = new CrawlDatum(STATUS_SIGNATURE, 0);
      sig.setSignature(getSignature());
      parseDatums.add(sig);
    }
    return parseDatums;
  }

  @Override
  protected boolean check(CrawlDatum result) {
    if (counter % 2 == 0) {
      return result.getStatus() == STATUS_DB_PARSE_FAILED;
    } else {
      return result.getStatus() == STATUS_DB_FETCHED
          || result.getStatus() == STATUS_DB_NOTMODIFIED;
    }
  }
}
```

This seems fairly practical and aligns with existing test patterns.
There is another solution! We already have `CrawlDBTestUtil.getServer()` for spinning up an embedded Jetty server. You could replace the static `ResourceHandler` with a custom handler that tracks the request count per URL and serves different content on subsequent requests (the same principle as above, and perhaps more representative of real-life fetching). This is also a good test... your code could be something like:
```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.eclipse.jetty.server.Request;
import org.eclipse.jetty.server.handler.AbstractHandler;

public class FlipFlopHandler extends AbstractHandler {
  private final AtomicInteger requestCount = new AtomicInteger(0);

  @Override
  public void handle(String target, Request baseRequest,
      HttpServletRequest request, HttpServletResponse response)
      throws IOException, ServletException {
    int count = requestCount.incrementAndGet();
    response.setStatus(HttpServletResponse.SC_OK);
    if (count % 2 == 1) {
      // 1st fetch: valid HTML
      response.setContentType("text/html");
      response.getWriter().write("<html><head><title>Test</title></head>"
          + "<body>Hello World</body></html>");
    } else {
      // 2nd fetch: serve binary garbage with a text/html MIME type,
      // which should cause the HTML parser to fail
      response.setContentType("text/html");
      byte[] garbage = new byte[1024];
      new java.util.Random(42).nextBytes(garbage);
      response.getOutputStream().write(garbage);
    }
    baseRequest.setHandled(true);
  }
}
```

It would be good to implement some kind of test, though. I agree.
Detailed Trace of testDeleteParseFailure
Design Intent
Your intent is that `parser.delete.failed` is configured separately from `indexer.delete` (which controls deletion of gone/redirect/duplicate documents), so the user can independently decide whether to delete parse failures. Does this mean the parse-failure deletion path in `IndexerMapReduce` should be gated on `parser.delete.failed`, not on `indexer.delete`?
If that is true, I think there are three issues in the current implementation and test.
Issue 1: The reducer guards parse-failure deletion on indexer.delete, not parser.delete.failed
The test correctly sets:
```java
configuration.setBoolean(ParseSegment.DELETE_FAILED_PARSE, true);
```

However, the reducer's `setup()` method never reads `parser.delete.failed`. Instead, the parse-failure block at line 361 of `IndexerMapReduce.java` is guarded by the `delete` variable, which is wired to `indexer.delete`:
```java
// in setup():
delete = conf.getBoolean(INDEXER_DELETE, false); // "indexer.delete", default false

// in reduce():
// Whether to delete pages where parsing failed
if (delete && fetchDatum != null) {
  if (fetchDatum.getStatus() == CrawlDatum.STATUS_PARSE_FAILED
      || dbDatum != null && dbDatum.getStatus() == CrawlDatum.STATUS_DB_PARSE_FAILED) {
    deletedFailedParseCounter.increment(1);
    context.write(key, DELETE_ACTION);
    return;
  }
}
```

Since `indexer.delete` defaults to `false` and the test doesn't set it, the condition `if (delete && ...)` is always false. The parse-failure deletion code is never executed.
To match the design intent, `setup()` should read `parser.delete.failed` into its own boolean, and the guard should use that instead of `delete`:

```java
// in setup():
deleteFailedParse = conf.getBoolean(ParseSegment.DELETE_FAILED_PARSE, false);

// in reduce():
if (deleteFailedParse && fetchDatum != null) {
  if (fetchDatum.getStatus() == CrawlDatum.STATUS_PARSE_FAILED
      || dbDatum != null && dbDatum.getStatus() == CrawlDatum.STATUS_DB_PARSE_FAILED) {
    deletedFailedParseCounter.increment(1);
    context.write(key, DELETE_ACTION);
    return;
  }
}
```

This way a user can set `parser.delete.failed=true` without needing `indexer.delete=true`, and conversely can have `indexer.delete=true` (for gone/redirects/duplicates) without automatically also deleting parse failures.
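To make the intended independence of the two flags concrete, here is a tiny standalone model of the decision. Only the 0x45 status value comes from the patch; the `shouldDeleteParseFailure` helper and its signature are hypothetical, for illustration only:

```java
// Toy model of the intended gating. Not Nutch code: shouldDeleteParseFailure()
// is a hypothetical helper that captures the design intent described above.
public class DeletionGates {
  static final int STATUS_PARSE_FAILED = 0x45; // from the patch

  /**
   * Intended behaviour: parse-failure deletion is gated only on
   * parser.delete.failed; indexer.delete plays no role here (the
   * parameter is kept to emphasise that it is deliberately ignored).
   */
  static boolean shouldDeleteParseFailure(boolean indexerDelete,
                                          boolean deleteFailedParse,
                                          int fetchStatus) {
    return deleteFailedParse && fetchStatus == STATUS_PARSE_FAILED;
  }

  public static void main(String[] args) {
    // parser.delete.failed=true suffices, even with indexer.delete=false:
    System.out.println(shouldDeleteParseFailure(false, true, 0x45));  // true
    // indexer.delete=true alone must NOT delete parse failures:
    System.out.println(shouldDeleteParseFailure(true, false, 0x45));  // false
  }
}
```

The two `main` cases are exactly the two directions of independence described above.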
Issue 2: STATUS_PARSE_FAILED (0x45) would throw a RuntimeException in the classification loop
In the reducer's value classification loop:
```java
} else if (value instanceof CrawlDatum) {
  final CrawlDatum datum = (CrawlDatum) value;
  if (CrawlDatum.hasDbStatus(datum)) {
    dbDatum = datum;
  } else if (CrawlDatum.hasFetchStatus(datum)) {
    if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED) {
      fetchDatum = datum;
    }
  } else if (CrawlDatum.STATUS_LINKED == datum.getStatus()
      || CrawlDatum.STATUS_SIGNATURE == datum.getStatus()
      || CrawlDatum.STATUS_PARSE_META == datum.getStatus()) {
    continue;
  } else {
    throw new RuntimeException("Unexpected status: " + datum.getStatus());
  }
```

`STATUS_PARSE_FAILED = 0x45`. The range checks:

- `hasDbStatus()`: `status <= 0x1f` is false (0x45 > 0x1f)
- `hasFetchStatus()`: `status > 0x1f && status <= 0x3f` is false (0x45 > 0x3f)
- Explicit matches: `STATUS_LINKED` (0x43), `STATUS_SIGNATURE` (0x41), `STATUS_PARSE_META` (0x44): none match 0x45
- Falls to `else`: `RuntimeException("Unexpected status: 69")`
So if a `CrawlDatum` with `STATUS_PARSE_FAILED` were passed into the indexer reducer's values, it would crash. The test avoids this by passing `crawlDatumFetchSuccess` (`STATUS_FETCH_SUCCESS` = 0x21) and `crawlDatumDbFetched` (`STATUS_DB_FETCHED` = 0x02), but this means the `fetchDatum.getStatus() == CrawlDatum.STATUS_PARSE_FAILED` branch of the condition at line 362 is untestable without first adding `STATUS_PARSE_FAILED` to the classification logic (either as a recognized skip status or as a fetch-like status).
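To see the fall-through concretely, here is a self-contained sketch of the range checks. The status constants are copied from `CrawlDatum`; the `classify()` helper is hypothetical and only mirrors the reducer's branching, it is not Nutch code:

```java
// Standalone demonstration: 0x45 matches none of the reducer's branches.
public class StatusClassifier {
  static final int STATUS_SIGNATURE = 0x41;
  static final int STATUS_LINKED = 0x43;
  static final int STATUS_PARSE_META = 0x44;
  static final int STATUS_PARSE_FAILED = 0x45; // added by the patch

  // Range checks as implemented in CrawlDatum
  static boolean hasDbStatus(int status) {
    return status <= 0x1f;
  }

  static boolean hasFetchStatus(int status) {
    return status > 0x1f && status <= 0x3f;
  }

  /** Hypothetical helper mirroring the reducer's classification loop. */
  static String classify(int status) {
    if (hasDbStatus(status)) {
      return "db";
    } else if (hasFetchStatus(status)) {
      return "fetch";
    } else if (status == STATUS_LINKED || status == STATUS_SIGNATURE
        || status == STATUS_PARSE_META) {
      return "skip";
    }
    throw new RuntimeException("Unexpected status: " + status);
  }

  public static void main(String[] args) {
    System.out.println(classify(0x02));   // db    (STATUS_DB_FETCHED)
    System.out.println(classify(0x21));   // fetch (STATUS_FETCH_SUCCESS)
    try {
      classify(STATUS_PARSE_FAILED);
    } catch (RuntimeException e) {
      System.out.println(e.getMessage()); // Unexpected status: 69
    }
  }
}
```

Running this makes the crash path obvious: 0x45 lands in neither range, matches no explicit skip status, and hits the `else`.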
Issue 3: The test passes for the wrong reason
When `failParse` is true (i = 0, 2, 4), the test asserts `assertNull(doc)`. Here is what actually happens:

1. The reducer classifies values normally: `crawlDatumDbFetched` -> `dbDatum`, `crawlDatumFetchSuccess` -> `fetchDatum`, `failedParseData` -> `parseData`, `parseText` -> `parseText`, `content` -> `content`.
2. The parse-failure deletion block is skipped (Issue 1).
3. Execution reaches line 388: `if (fetchDatum == null || parseText == null || parseData == null) { return; // only have inlinks }`. All three are non-null, so execution continues.
4. Execution reaches line 405: `if (!parseData.getStatus().isSuccess() || fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_SUCCESS) { return; }`. When `failedParseData` is used (which has `ParseStatus.STATUS_FAILURE`), `parseData.getStatus().isSuccess()` returns `false`, so `!isSuccess()` is `true`, and the reducer returns early without writing anything to the context.
5. Since nothing is written to `reduceResult`, `runIndexer` returns `null`, and `assertNull(doc)` passes.
Summary
The test passes, but for the wrong reason.
| What the test intends to verify | What actually happens |
|---|---|
| `parser.delete.failed=true` triggers deletion of failed-parse docs | `parser.delete.failed` is never read by the reducer; the deletion block is gated on `indexer.delete`, which defaults to `false` |
| The new `STATUS_PARSE_FAILED` / `STATUS_DB_PARSE_FAILED` check at lines 360–368 fires | That block is never entered because `delete == false` |
| `assertNull(doc)` confirms a DELETE action was emitted | `doc` is null because the pre-existing guard at line 405 silently returns when `ParseStatus` is not `isSuccess()`; no DELETE is emitted, the document is simply skipped |
The document is not indexed on failParse iterations, but it is silently skipped, not explicitly deleted. These are very different outcomes: a silent skip means the document stays in the index from a previous successful crawl (which is the exact problem NUTCH-1732 is trying to solve).
What Needs to Be Fixed
1. The reducer must read `parser.delete.failed` in `setup()` and use it as an independent guard for the parse-failure deletion block, instead of reusing the `delete` variable (which is tied to `indexer.delete`). This is the core fix needed to match the design intent:

   ```java
   // in setup():
   deleteFailedParse = conf.getBoolean(ParseSegment.DELETE_FAILED_PARSE, false);

   // in reduce():
   if (deleteFailedParse && fetchDatum != null) {
     if (fetchDatum.getStatus() == CrawlDatum.STATUS_PARSE_FAILED
         || dbDatum != null && dbDatum.getStatus() == CrawlDatum.STATUS_DB_PARSE_FAILED) {
       deletedFailedParseCounter.increment(1);
       context.write(key, DELETE_ACTION);
       return;
     }
   }
   ```

2. `STATUS_PARSE_FAILED` (0x45) needs to be recognized in the `CrawlDatum` classification loop (lines 318–333). Currently it would throw `RuntimeException("Unexpected status: 69")`. It should either be added as a skip (`continue`) alongside `STATUS_LINKED`/`STATUS_SIGNATURE`/`STATUS_PARSE_META`, or handled in a way that allows the delete check to use it.

3. The test should pass a `CrawlDatum` with `STATUS_PARSE_FAILED` or `STATUS_DB_PARSE_FAILED` instead of (or in addition to) relying on `ParseData` with `STATUS_FAILURE`. The NUTCH-1732 logic at lines 360–368 checks the `CrawlDatum` status, not the `ParseData` status.

4. The test should verify that a DELETE action was actually emitted, not just that `doc` is `null`. Currently the `runIndexer` helper filters out DELETE actions (line 217: `if (e.getValue().action != NutchIndexAction.DELETE)`) and returns null for them, which is indistinguishable from a silent skip. A more robust assertion would check that `reduceResult` contains a `NutchIndexAction.DELETE` entry.
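The assertion in point 4 could look roughly like this self-contained sketch. The `Action` enum and the `reduceResult` map are simplified stand-ins for `NutchIndexAction` and the test's collector; all names here are hypothetical:

```java
// Illustration of asserting on the emitted action instead of on a null doc.
// Action and reduceResult are stand-ins, not the real Nutch test types.
public class DeleteAssertionSketch {
  enum Action { ADD, DELETE }

  /** True if any reduced entry carries an explicit DELETE action. */
  static boolean containsDelete(java.util.Map<String, Action> reduceResult) {
    return reduceResult.containsValue(Action.DELETE);
  }

  public static void main(String[] args) {
    java.util.Map<String, Action> reduceResult = new java.util.HashMap<>();

    // A silent skip writes nothing at all, so the map stays empty:
    System.out.println(containsDelete(reduceResult));  // false

    // An explicit deletion writes a DELETE entry for the key:
    reduceResult.put("http://example.org/doc", Action.DELETE);
    System.out.println(containsDelete(reduceResult));  // true
  }
}
```

The point is that an empty result (silent skip) and a DELETE entry both make `doc` null, so only an assertion like `containsDelete` can tell the two outcomes apart.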
Additional comments
- I think the new configuration property should be changed from `parser.delete.failed` to `parser.delete.failed.parse`, but this is minor in comparison to the above.
- Total side observation (unrelated to your PR): I noticed that the `indexer.delete` configuration property is absent from `nutch-default.xml`... is this intentional?
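Whichever name is chosen, the new property should also be documented in `nutch-default.xml`. A sketch of such an entry, where the description text is only a suggestion:

```xml
<property>
  <name>parser.delete.failed</name>
  <value>false</value>
  <description>If true, documents whose parsing failed are deleted from the
  index at indexing time, even if a previous successful crawl had indexed
  them. This is independent of indexer.delete, which controls deletion of
  gone/redirected/duplicate documents.</description>
</property>
```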
Thanks for the review, @lewismc.
This was a really good refresher for me as well. It touches parts of the codebase I hadn't looked at in a while... but I must admit, it took me quite some time; I was scratching my head in the debugger for ages.
Fix IndexerMapReduce. Improve tests for failed parsing in TestCrawlDbStates and TestIndexerMapReduce. TestFetchWithParseFailure "should" test the fetcher, but it strangely does not fully run most of the time. Disabled while thinking of a solution.
I fixed `IndexerMapReduce`. I improved the tests for failed parsing in `TestCrawlDbStates` and `TestIndexerMapReduce`. The new test `TestFetchWithParseFailure` "should" test the fetcher, but it strangely does not fully run most of the time.

UPDATE: Silly me. The full log is logs/hadoop.log, not the output file under build/test/... That file looks like a half-run test. The content of logs/hadoop.log shows the full test run.

UPDATE 2: The default HTML parser (Neko) is a lot more tolerant than expected. Even random bytes (weird enough to prevent opening the log file if output) do not produce a parsing error.

UPDATE 3: Also tried the "Tagsoup" HTML parser, but that one does not produce a failed parse either. I also tried to see if I could extend HtmlParser, but plugin classes are not visible on the unit test classpath, so it does not even compile. Otherwise, there's no hook to easily insert a fake or mock plugin to override the behavior of HtmlParser.

UPDATE 4: I found a way to test! Just set
Change the type of ExecutorService, to no avail. Fix test conf to have just 1 fetcher thread.
Hi @igiguere I'll look into this shortly. Thank you. |
NUTCH-1732
Add CrawlDatum status "parse_failed"/"db_parse_failed", set status in CrawlDbReducer.
Handle deletion in IndexerMapReduce.
Add unit tests.
Adjust documentation in nutch-default.xml:
`parser.delete.failed` is configured separately from `delete.gone`, so the user can decide whether or not to delete the parse failures. `ant clean runtime test` passed. I'm not sure how to run a functional test... how do I make parsing fail the 2nd time a page is fetched?