fix(extractor): stream text/markdown/xml extraction with BOM and length cap #165
Open
marevol wants to merge 2 commits into
…th cap

Replaces full byte-buffering (IOUtils.toString / new String(getBytes())) in TextExtractor, MarkdownExtractor and AbstractXmlExtractor with Reader-based streaming through a BufferedReader.

Detects UTF-8 / UTF-16 LE/BE / UTF-32 LE/BE BOMs via BOMInputStream and decodes accordingly, falling back to the configured encoding when no BOM is present. AbstractXmlExtractor also strips BOM bytes from the actual content stream so the parser sees pure XML.

Adds a configurable maxTextLength on each extractor (default Long.MAX_VALUE, i.e. unlimited) to bound heap usage on very large inputs and stop reading early when the limit is reached.

Tests cover UTF-8 / UTF-16 LE / UTF-16 BE BOM stripping, Shift_JIS without BOM, truncation at maxTextLength, large-file (10 MiB) streaming, Markdown body + YAML front matter + BOM-prefixed Markdown, and XML BOM extraction. All existing extractor tests still pass.
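The BOM-aware streaming read described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the class and method names (BomStreamingSketch, readText) are hypothetical, and BOM sniffing is done by hand with a PushbackInputStream so the example depends only on the JDK, whereas the PR itself uses Commons IO's BOMInputStream. It assumes Java 9+ for InputStream.readNBytes.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomStreamingSketch {

    /**
     * Hypothetical helper: decodes the stream using the charset implied by a
     * leading BOM (or the fallback when there is none) and stops after
     * maxTextLength chars. Values <= 0 mean "no limit".
     */
    static String readText(InputStream in, Charset fallback, long maxTextLength) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 4);
        byte[] head = new byte[4];
        int n = pin.readNBytes(head, 0, 4); // up to 4 bytes, fewer at EOF

        Charset charset = fallback;
        int bomLen = 0;
        // UTF-32 LE must be tested before UTF-16 LE: both start with FF FE.
        if (startsWith(head, n, 0x00, 0x00, 0xFE, 0xFF)) {
            charset = Charset.forName("UTF-32BE");
            bomLen = 4;
        } else if (startsWith(head, n, 0xFF, 0xFE, 0x00, 0x00)) {
            charset = Charset.forName("UTF-32LE");
            bomLen = 4;
        } else if (startsWith(head, n, 0xEF, 0xBB, 0xBF)) {
            charset = StandardCharsets.UTF_8;
            bomLen = 3;
        } else if (startsWith(head, n, 0xFE, 0xFF)) {
            charset = StandardCharsets.UTF_16BE;
            bomLen = 2;
        } else if (startsWith(head, n, 0xFF, 0xFE)) {
            charset = StandardCharsets.UTF_16LE;
            bomLen = 2;
        }
        // Push back everything after the BOM so the decoder sees only content.
        pin.unread(head, bomLen, n - bomLen);

        long limit = maxTextLength <= 0 ? Long.MAX_VALUE : maxTextLength;
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[8192];
        // Closing the reader also closes the wrapped input stream.
        try (Reader reader = new BufferedReader(new InputStreamReader(pin, charset))) {
            while (sb.length() < limit) {
                int want = (int) Math.min(buf.length, limit - sb.length());
                int read = reader.read(buf, 0, want);
                if (read < 0) {
                    break; // end of stream
                }
                sb.append(buf, 0, read);
            }
        }
        return sb.toString();
    }

    private static boolean startsWith(byte[] head, int n, int... bom) {
        if (n < bom.length) {
            return false;
        }
        for (int i = 0; i < bom.length; i++) {
            if ((head[i] & 0xFF) != bom[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Because the loop never asks the reader for more than the remaining budget, the StringBuilder cannot grow past the cap, which is the heap-bounding property the commit is after.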
Address review feedback for #165.

- Emit truncated=true and maxTextLength metadata on ExtractData when maxTextLength clips the input, plus a WARN log; keep the partial body instead of throwing (which would mismatch the intent of a soft cap).
- Move BOMInputStream into try-with-resources in TextExtractor and MarkdownExtractor; replace the @SuppressWarnings("resource") block in AbstractXmlExtractor.getText so the BOM-stripping stream is closed through the same chain.
- Drop a trailing unpaired high surrogate at the truncation boundary so the returned string is always a valid Java UTF-16 sequence.
- Clarify JavaDoc for maxTextLength: char (UTF-16 code unit) basis, Long.MAX_VALUE described as "effectively unlimited", values <= 0 explicitly disable the limit. Document that getText closes the stream and that truncation may break YAML front-matter recovery.
- Migrate AbstractXmlExtractor.getEncoding off the deprecated BOMInputStream constructor to the builder API for consistency.
- Strengthen existing XML BOM tests (assert real attribute values survive extractString) and add 19 tests covering UTF-32 BOMs, Shift_JIS Markdown without BOM, surrogate-pair boundary, mid-stream BOM passthrough, maxTextLength=0/-1 unlimited, exact-length and one-char caps, across-buffer-boundary truncation, the truncated metadata flag, and large XML streaming.
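The soft-cap behaviour from the first and third bullets (keep the partial body, flag it as truncated, and never end on half of a surrogate pair) might look something like the sketch below. The names TruncationSketch, Result and truncate are hypothetical; the PR records the flag as metadata on ExtractData and applies the cap while streaming, whereas this example works on an in-memory CharSequence to keep it short.

```java
final class TruncationSketch {

    /** Hypothetical result holder standing in for the metadata set on ExtractData. */
    static final class Result {
        final String text;
        final boolean truncated;

        Result(String text, boolean truncated) {
            this.text = text;
            this.truncated = truncated;
        }
    }

    /**
     * Clips text to at most maxTextLength UTF-16 code units.
     * Values <= 0 disable the limit, matching the JavaDoc clarification above.
     */
    static Result truncate(CharSequence text, long maxTextLength) {
        if (maxTextLength <= 0 || text.length() <= maxTextLength) {
            return new Result(text.toString(), false);
        }
        // Safe cast: maxTextLength is smaller than text.length() here.
        int end = (int) maxTextLength;
        // A high surrogate as the last kept char would lose its low half,
        // so drop it to keep the result a valid UTF-16 sequence.
        if (Character.isHighSurrogate(text.charAt(end - 1))) {
            end--;
        }
        return new Result(text.subSequence(0, end).toString(), true);
    }
}
```

With the input "a\uD83D\uDE00b" (a, one emoji encoded as a surrogate pair, b), a cap of 2 yields "a" rather than "a\uD83D", and a cap of 3 keeps the whole pair.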
Summary
- Detect UTF-8 / UTF-16 LE/BE / UTF-32 LE/BE BOMs via BOMInputStream and decode accordingly; fall back to configured encoding when absent.
- Replace IOUtils.toString / new String(getBytes()) with BufferedReader streaming bounded by maxTextLength (default unlimited; configurable).
- AbstractXmlExtractor now strips BOM bytes from the actual content stream before decoding so the XML parser does not see a leading U+FEFF (BOM).
- MarkdownExtractor reuses the same Reader/BOM pipeline before handing the source to commonmark; YAML front-matter extraction is unaffected.

Why
Large text/Markdown/XML files caused unnecessary heap pressure because the entire byte buffer was materialized as a String before processing. Non-UTF-8 BOMs were silently misdecoded with the configured encoding, and the leading BOM character sometimes leaked into the extracted content.
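The two failure modes named above are easy to reproduce with plain JDK decoding. The snippet below is only an illustration of the problem under the assumption that the configured encoding is UTF-8; BomLeakDemo and decodeNaively are hypothetical names, not code from the extractors.

```java
import java.nio.charset.StandardCharsets;

final class BomLeakDemo {

    /** Decodes the whole buffer with a fixed charset, ignoring any BOM. */
    static String decodeNaively(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // UTF-8 BOM followed by "ok": Java's UTF-8 decoder keeps the BOM as U+FEFF.
        byte[] utf8WithBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'o', 'k'};
        String leaked = decodeNaively(utf8WithBom);
        System.out.println(leaked.length());                 // 3, not 2
        System.out.println(leaked.charAt(0) == '\uFEFF');    // true

        // UTF-16 LE BOM followed by "ok": decoding as UTF-8 garbles the text.
        byte[] utf16le = {(byte) 0xFF, (byte) 0xFE, 'o', 0, 'k', 0};
        String garbled = decodeNaively(utf16le);
        System.out.println(garbled.equals("ok"));            // false
        System.out.println(garbled.indexOf('\uFFFD') >= 0);  // true: replacement chars
    }
}
```

Neither case raises an exception, which is why the misdecoding went unnoticed until the extracted content was inspected.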
Tests
- Tests for TextExtractor, MarkdownExtractor, XmlExtractor.
- Truncation at maxTextLength.
- extractor/xml/test_utf8bom.xml, test_utf16lebom.xml, test_utf16bebom.xml fixtures.

Test plan