fix(extractor): stream text/markdown/xml extraction with BOM and length cap#165

Open
marevol wants to merge 2 commits into master from fix/extractor-text-streaming

marevol commented May 4, 2026

Summary

  • Detect UTF-8/UTF-16/UTF-32 BOMs via BOMInputStream and decode accordingly; fall back to configured encoding when absent.
  • Replace whole-file IOUtils.toString / new String(getBytes()) with BufferedReader streaming bounded by maxTextLength (default unlimited; configurable).
  • AbstractXmlExtractor now strips BOM bytes from the actual content stream before decoding so the XML parser does not see a leading U+FEFF character.
  • MarkdownExtractor reuses the same Reader/BOM pipeline before handing the source to commonmark; YAML front-matter extraction is unaffected.
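
To make the BOM handling concrete, here is a minimal, JDK-only sketch of the detect-then-fall-back step. It is not the code in this PR: the PR uses Commons IO's BOMInputStream, whereas this stand-in peeks at the first bytes with a PushbackInputStream. The class name `BomSketch` and the methods `openReader` and `decode` are hypothetical, chosen for illustration only.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical stand-in for the BOMInputStream-based pipeline; not the PR's implementation.
final class BomSketch {
    // Longest signatures first, so UTF-32LE (FF FE 00 00) is tested before UTF-16LE (FF FE).
    private static final byte[][] SIGNATURES = {
        {0x00, 0x00, (byte) 0xFE, (byte) 0xFF},  // UTF-32BE
        {(byte) 0xFF, (byte) 0xFE, 0x00, 0x00},  // UTF-32LE
        {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF}, // UTF-8
        {(byte) 0xFE, (byte) 0xFF},              // UTF-16BE
        {(byte) 0xFF, (byte) 0xFE},              // UTF-16LE
    };
    private static final String[] CHARSETS = {
        "UTF-32BE", "UTF-32LE", "UTF-8", "UTF-16BE", "UTF-16LE"
    };

    /** Returns a Reader positioned after any BOM; uses the fallback charset when no BOM is found. */
    static Reader openReader(InputStream in, Charset fallback) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 4);
        byte[] head = new byte[4];
        int n = pin.readNBytes(head, 0, 4); // may be < 4 for very short inputs
        for (int i = 0; i < SIGNATURES.length; i++) {
            byte[] sig = SIGNATURES[i];
            if (n >= sig.length && Arrays.equals(head, 0, sig.length, sig, 0, sig.length)) {
                // Push back only the bytes that follow the BOM, so the BOM itself is consumed.
                pin.unread(head, sig.length, n - sig.length);
                return new InputStreamReader(pin, Charset.forName(CHARSETS[i]));
            }
        }
        pin.unread(head, 0, n); // no BOM: restore everything and use the configured encoding
        return new InputStreamReader(pin, fallback);
    }

    /** Convenience wrapper for small inputs; real extractors would keep streaming. */
    static String decode(byte[] bytes, Charset fallback) {
        try (Reader reader = openReader(new ByteArrayInputStream(bytes), fallback)) {
            StringWriter out = new StringWriter();
            reader.transferTo(out);
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this sketch, a UTF-16 LE input that starts with FF FE decodes correctly even when the fallback is UTF-8, and an input without a BOM is decoded with the fallback unchanged. One known ambiguity: UTF-16 LE content whose first character is U+0000 is indistinguishable from a UTF-32 LE BOM.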

Why

Large text/Markdown/XML files caused unnecessary heap pressure because the entire byte buffer was materialized as a String before processing. Files with a non-UTF-8 BOM were silently misdecoded using the configured encoding, and the leading BOM character sometimes leaked into the extracted content.

Tests

  • BOM variants for UTF-8 / UTF-16 LE / UTF-16 BE on TextExtractor, MarkdownExtractor, XmlExtractor.
  • Shift_JIS without BOM (configured encoding wins).
  • Truncation at maxTextLength.
  • Large-file (10 MiB) streaming verifies exact length and head/tail bytes.
  • Markdown body / YAML front matter / no-front-matter paths.
  • XML extractor BOM tests using existing extractor/xml/test_utf8bom.xml, test_utf16lebom.xml, test_utf16bebom.xml fixtures.
  • All existing extractor tests still pass.

Test plan

  • CI green
  • Manual review of BOM + reader composition

marevol added 2 commits May 5, 2026 07:57
…th cap

Replaces full byte-buffering (IOUtils.toString / new String(getBytes())) in
TextExtractor, MarkdownExtractor and AbstractXmlExtractor with Reader-based
streaming through a BufferedReader. Detects UTF-8 / UTF-16 LE/BE / UTF-32
LE/BE BOMs via BOMInputStream and decodes accordingly, falling back to the
configured encoding when no BOM is present. AbstractXmlExtractor also strips
BOM bytes from the actual content stream so the parser sees pure XML.

Adds a configurable maxTextLength on each extractor (default Long.MAX_VALUE,
i.e. unlimited) to bound heap usage on very large inputs and stop reading
early when reached.
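
The early-stop behavior described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the extractor code itself: `BoundedReadSketch` and `readUpTo` are hypothetical names, the 8192-char buffer size is arbitrary, and treating values <= 0 as "no limit" follows the JavaDoc wording in the follow-up commit.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical helper illustrating a char-based soft cap; not the PR's implementation.
final class BoundedReadSketch {
    /**
     * Reads at most maxTextLength chars (UTF-16 code units) and then stops reading.
     * Values <= 0 are treated as "no limit" (assumption based on the follow-up commit).
     * Closes the supplied reader.
     */
    static String readUpTo(Reader source, long maxTextLength) {
        long remaining = maxTextLength <= 0 ? Long.MAX_VALUE : maxTextLength;
        StringBuilder out = new StringBuilder();
        char[] buf = new char[8192];
        try (BufferedReader reader = new BufferedReader(source)) {
            while (remaining > 0) {
                // Never request more than the remaining budget, so the cap is exact.
                int want = (int) Math.min(buf.length, remaining);
                int n = reader.read(buf, 0, want);
                if (n < 0) {
                    break; // end of input before the cap was reached
                }
                out.append(buf, 0, n);
                remaining -= n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

Because the loop asks for at most `remaining` chars per read, the rest of the input is never pulled into the output buffer once the cap is hit. Note that a single StringBuilder still cannot hold more than Integer.MAX_VALUE chars, so "unlimited" is bounded in practice.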

Tests cover UTF-8/UTF-16 LE/UTF-16 BE BOM stripping, Shift_JIS without BOM,
truncation at maxTextLength, large-file (10 MiB) streaming, Markdown body
+ YAML front matter + BOM-prefixed Markdown, and XML BOM extraction. All
existing extractor tests still pass.

Address review feedback for #165.

- Emit truncated=true and maxTextLength metadata on ExtractData when
  maxTextLength clips the input, plus a WARN log; keep the partial body
  instead of throwing (which would mismatch the intent of a soft cap).
- Move BOMInputStream into try-with-resources in TextExtractor and
  MarkdownExtractor; replace the @SuppressWarnings("resource") block in
  AbstractXmlExtractor.getText so the BOM-stripping stream is closed
  through the same chain.
- Drop a trailing unpaired high surrogate at the truncation boundary so
  the returned string is always a valid Java UTF-16 sequence.
- Clarify JavaDoc for maxTextLength: char (UTF-16 code unit) basis,
  Long.MAX_VALUE described as "effectively unlimited", values <= 0
  explicitly disable the limit. Document that getText closes the stream
  and that truncation may break YAML front-matter recovery.
- Migrate AbstractXmlExtractor.getEncoding off the deprecated
  BOMInputStream constructor to the builder API for consistency.
- Strengthen existing XML BOM tests (assert real attribute values
  survive extractString) and add 19 tests covering UTF-32 BOMs,
  Shift_JIS Markdown without BOM, surrogate-pair boundary, mid-stream
  BOM passthrough, maxTextLength=0/-1 unlimited, exact-length and
  one-char caps, across-buffer-boundary truncation, the truncated
  metadata flag, and large XML streaming.
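
As a rough illustration of the truncation rules listed above (soft cap, truncated flag, no dangling high surrogate), the sketch below clips an in-memory string. It is a simplification: the real extractors stream and record the flag on ExtractData, while `TruncationSketch`, `Result`, and `clip` are hypothetical names, and the metadata keys and string values used here are assumptions about how the flag might be represented.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the truncation boundary rules; not the PR's implementation.
final class TruncationSketch {
    /** Stand-in for the extracted text plus its metadata. */
    static final class Result {
        final String text;
        final Map<String, String> metadata;

        Result(String text, Map<String, String> metadata) {
            this.text = text;
            this.metadata = metadata;
        }
    }

    /** Clips text to maxTextLength UTF-16 code units; values <= 0 disable the limit. */
    static Result clip(String text, long maxTextLength) {
        Map<String, String> metadata = new LinkedHashMap<>();
        if (maxTextLength <= 0 || text.length() <= maxTextLength) {
            return new Result(text, metadata); // nothing clipped, no flag set
        }
        int end = (int) maxTextLength; // safe cast: maxTextLength < text.length()
        if (Character.isHighSurrogate(text.charAt(end - 1))) {
            end--; // the cut would split a surrogate pair, so drop the dangling half
        }
        // Soft cap: keep the partial body and flag it instead of throwing.
        metadata.put("truncated", "true");
        metadata.put("maxTextLength", Long.toString(maxTextLength));
        return new Result(text.substring(0, end), metadata);
    }
}
```

For example, clipping "ab" followed by one emoji (a surrogate pair) at a cap of 3 yields "ab" rather than "ab" plus half a pair, so the result can be one char shorter than the cap.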