fix(extractor): stream text/markdown/xml extraction with BOM and length cap by marevol · Pull Request #165 · codelibs/fess-crawler

marevol · 2026-05-04T22:57:36Z

Summary

Detect UTF-8/UTF-16/UTF-32 BOMs via BOMInputStream and decode accordingly; fall back to configured encoding when absent.
Replace whole-file IOUtils.toString / new String(getBytes()) with BufferedReader streaming bounded by maxTextLength (default unlimited; configurable).
AbstractXmlExtractor now strips BOM bytes from the actual content stream before decoding so the XML parser does not see a leading U+FEFF character.
MarkdownExtractor reuses the same Reader/BOM pipeline before handing the source to commonmark; YAML front-matter extraction is unaffected.
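
For reviewers who want to reason about the BOM + reader composition without opening the diff, here is a minimal JDK-only sketch of the idea. It is not the code in this PR: the PR uses Commons IO's BOMInputStream, while this stand-in sniffs the BOM by hand with a PushbackInputStream. The class name `BomAwareTextReader`, the method `read`, and the parameter `maxChars` are hypothetical names chosen for illustration; the "values <= 0 mean unlimited" rule is taken from the review follow-up commit.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not a class from fess-crawler.
final class BomAwareTextReader {

    // Known BOMs, longest first so UTF-32LE (FF FE 00 00) wins over UTF-16LE (FF FE).
    private static final String[] NAMES = {"UTF-32BE", "UTF-32LE", "UTF-8", "UTF-16BE", "UTF-16LE"};
    private static final int[][] BOMS = {
        {0x00, 0x00, 0xFE, 0xFF}, {0xFF, 0xFE, 0x00, 0x00},
        {0xEF, 0xBB, 0xBF}, {0xFE, 0xFF}, {0xFF, 0xFE}};

    /** Decodes the stream, reading at most maxChars UTF-16 code units (<= 0 means no limit). */
    static String read(InputStream in, Charset fallback, long maxChars) {
        long limit = maxChars <= 0 ? Long.MAX_VALUE : maxChars;
        try (PushbackInputStream pin = new PushbackInputStream(in, 4)) {
            // Peek at up to four bytes to look for a BOM.
            byte[] head = new byte[4];
            int n = pin.readNBytes(head, 0, 4);
            Charset charset = fallback;
            int bomLength = 0;
            for (int i = 0; i < BOMS.length && bomLength == 0; i++) {
                if (startsWith(head, n, BOMS[i])) {
                    charset = Charset.forName(NAMES[i]);
                    bomLength = BOMS[i].length;
                }
            }
            // Push back everything after the BOM so the decoder never sees it.
            pin.unread(head, bomLength, n - bomLength);

            // Stream through a reader and stop as soon as the cap is reached.
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[8192];
            Reader reader = new BufferedReader(new InputStreamReader(pin, charset));
            while (sb.length() < limit) {
                int want = (int) Math.min(buf.length, limit - sb.length());
                int got = reader.read(buf, 0, want);
                if (got < 0) {
                    break;
                }
                sb.append(buf, 0, got);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static boolean startsWith(byte[] head, int n, int[] bom) {
        if (n < bom.length) {
            return false;
        }
        for (int i = 0; i < bom.length; i++) {
            if ((head[i] & 0xFF) != bom[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf16le = {(byte) 0xFF, (byte) 0xFE, 'h', 0, 'i', 0};
        // The BOM selects UTF-16LE even though the fallback is UTF-8.
        System.out.println(read(new ByteArrayInputStream(utf16le), StandardCharsets.UTF_8, 0));
    }
}
```

The sketch omits the surrogate handling and the truncated metadata that the review follow-up adds; it only shows the detect, strip, then bounded-read order.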

Why

Large text/Markdown/XML files caused unnecessary heap pressure because the entire byte buffer was materialized as a String before processing. Inputs carrying a non-UTF-8 BOM were silently misdecoded with the configured encoding, and the leading BOM character sometimes leaked into the extracted content.
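
To make the two decoding failures concrete, the following small sketch reproduces them with plain JDK calls. It assumes the configured encoding is UTF-8, which is an assumption made for the example; `NaiveDecodeDemo` and `decodeAsUtf8` are hypothetical names and do not appear in the codebase.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: whole-buffer decoding with a fixed charset and no BOM handling.
final class NaiveDecodeDemo {

    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // UTF-8 BOM followed by "hi": the JDK decoder keeps the BOM as U+FEFF.
        byte[] utf8Bom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        String leaked = decodeAsUtf8(utf8Bom);
        System.out.println(leaked.charAt(0) == '\uFEFF');

        // UTF-16LE BOM followed by "hi": the BOM bytes are not valid UTF-8,
        // so they decode to replacement characters and the text is garbled.
        byte[] utf16le = {(byte) 0xFF, (byte) 0xFE, 'h', 0, 'i', 0};
        String garbled = decodeAsUtf8(utf16le);
        System.out.println(garbled.equals("hi"));
    }
}
```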

Tests

BOM variants for UTF-8 / UTF-16 LE / UTF-16 BE on TextExtractor, MarkdownExtractor, XmlExtractor.
Shift_JIS without BOM (configured encoding wins).
Truncation at maxTextLength.
Large-file (10 MiB) streaming verifies exact length and head/tail bytes.
Markdown body / YAML front matter / no-front-matter paths.
XML extractor BOM tests using existing extractor/xml/test_utf8bom.xml, test_utf16lebom.xml, test_utf16bebom.xml fixtures.
All existing extractor tests still pass.

Test plan

CI green
Manual review of BOM + reader composition

…th cap

Replaces full byte-buffering (IOUtils.toString / new String(getBytes())) in TextExtractor, MarkdownExtractor and AbstractXmlExtractor with Reader-based streaming through a BufferedReader.

Detects UTF-8 / UTF-16 LE/BE / UTF-32 LE/BE BOMs via BOMInputStream and decodes accordingly, falling back to the configured encoding when no BOM is present. AbstractXmlExtractor also strips BOM bytes from the actual content stream so the parser sees pure XML.

Adds a configurable maxTextLength on each extractor (default Long.MAX_VALUE, i.e. unlimited) to bound heap usage on very large inputs and stop reading early when reached.

Tests cover UTF-8/UTF-16 LE/UTF-16 BE BOM stripping, Shift_JIS without BOM, truncation at maxTextLength, large-file (10 MiB) streaming, Markdown body + YAML front matter + BOM-prefixed Markdown, and XML BOM extraction. All existing extractor tests still pass.

@SuppressWarnings

Address review feedback for #165.

- Emit truncated=true and maxTextLength metadata on ExtractData when maxTextLength clips the input, plus a WARN log; keep the partial body instead of throwing (which would mismatch the intent of a soft cap).
- Move BOMInputStream into try-with-resources in TextExtractor and MarkdownExtractor; replace the @SuppressWarnings("resource") block in AbstractXmlExtractor.getText so the BOM-stripping stream is closed through the same chain.
- Drop a trailing unpaired high surrogate at the truncation boundary so the returned string is always a valid Java UTF-16 sequence.
- Clarify JavaDoc for maxTextLength: char (UTF-16 code unit) basis, Long.MAX_VALUE described as "effectively unlimited", values <= 0 explicitly disable the limit. Document that getText closes the stream and that truncation may break YAML front-matter recovery.
- Migrate AbstractXmlExtractor.getEncoding off the deprecated BOMInputStream constructor to the builder API for consistency.
- Strengthen existing XML BOM tests (assert real attribute values survive extractString) and add 19 tests covering UTF-32 BOMs, Shift_JIS Markdown without BOM, surrogate-pair boundary, mid-stream BOM passthrough, maxTextLength=0/-1 unlimited, exact-length and one-char caps, across-buffer-boundary truncation, the truncated metadata flag, and large XML streaming.
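
The soft-cap behaviour described in this commit (keep the partial body, flag it as truncated, and never end on a lone high surrogate) can be sketched roughly as follows. This is an illustration under assumptions, not the committed code: it works on an in-memory CharSequence rather than a stream, and `SoftCap`, `apply`, `text` and `truncated` are hypothetical names standing in for whatever ExtractData metadata the PR actually sets.

```java
// Hypothetical value holder for illustration; the PR records this on ExtractData instead.
final class SoftCap {
    final String text;
    final boolean truncated;

    private SoftCap(String text, boolean truncated) {
        this.text = text;
        this.truncated = truncated;
    }

    /** Clips input to maxTextLength UTF-16 code units; values <= 0 disable the limit. */
    static SoftCap apply(CharSequence input, long maxTextLength) {
        if (maxTextLength <= 0 || input.length() <= maxTextLength) {
            return new SoftCap(input.toString(), false);
        }
        // Safe cast: maxTextLength is smaller than input.length() here.
        int cut = (int) maxTextLength;
        // If the cut falls between a high and a low surrogate, drop the high one
        // so the result stays a valid UTF-16 sequence.
        if (Character.isHighSurrogate(input.charAt(cut - 1))) {
            cut--;
        }
        return new SoftCap(input.subSequence(0, cut).toString(), true);
    }

    public static void main(String[] args) {
        // "ab" + U+1F600 (two code units) + "c", capped at 3 code units.
        SoftCap capped = apply("ab\uD83D\uDE00c", 3);
        System.out.println(capped.text + " truncated=" + capped.truncated);
    }
}
```

One consequence of the surrogate rule is that the returned text can be one code unit shorter than maxTextLength, which matters for tests that assert an exact length.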

marevol added 2 commits May 5, 2026 07:57

marevol wants to merge 2 commits into master from fix/extractor-text-streaming
