Skip to content

fix(extractor): decode RFC 2047 headers and bound EML recursion/bytes#166

Open
marevol wants to merge 5 commits into
masterfrom
fix/extractor-eml-headers
Open

fix(extractor): decode RFC 2047 headers and bound EML recursion/bytes#166
marevol wants to merge 5 commits into
masterfrom
fix/extractor-eml-headers

Conversation

@marevol
Copy link
Copy Markdown
Contributor

@marevol marevol commented May 4, 2026

Summary

  • Decode RFC 2047 encoded-word headers (Subject, From, To, Cc, Bcc, Reply-To) via MimeUtility.decodeText.
  • Add maxRecursionDepth (default 10) for nested message/rfc822 parts.
  • Add maxParts (default 1000) and maxBodyBytes (default 50 MiB) DoS guards.
  • Expose attachmentNames (multivalue metadata) without extracting binary content.
  • Set common metadata: subject, from, to, cc, sentDate, receivedDate, messageId.

Threat model

EML content is untrusted. The new bounds defend against deeply nested or massively multi-part messages that could exhaust memory.

Tests

  • RFC 2047 decoding for Subject and From.
  • Attachment filename extraction.
  • Recursion bomb rejected.
  • maxParts / maxBodyBytes enforced.
  • Existing happy-path tests still pass.

Test plan

  • CI green
  • Manual review of recursion-depth bookkeeping

marevol added 5 commits May 5, 2026 07:57
EML content is untrusted. The new bounds defend against deeply nested
or massively multi-part messages that could exhaust memory.

- Decode RFC 2047 encoded-word headers (Subject, From, To, Cc, Bcc,
  Reply-To) via MimeUtility.decodeText for the new normalized
  metadata keys (subject, from, to, cc, bcc, replyTo).
- Add maxRecursionDepth (default 10) for nested message/rfc822 and
  multipart parts; throw MaxLengthExceededException when exceeded.
- Add maxParts (default 1000) and maxBodyBytes (default 50 MiB) DoS
  guards.
- Expose attachmentNames (multivalue metadata) without extracting
  binary content.
- Set common metadata: subject, from, to, cc, bcc, replyTo, sentDate,
  receivedDate, messageId.
- Preserve previous behavior: text alternatives prefer text/plain,
  legacy headers (Subject, From, To, ...) remain available.

Adds tests for body extraction, RFC 2047 decoding (Subject and From
display name), attachment filename collection, recursion bomb, max
parts, body byte truncation, and multipart/alternative preference.
…boundary walk

The previous CharsetEncoder approach allocated a ByteBuffer sized to the
entire remaining maxBodyBytes budget (50 MiB by default) on every appendBody
call — even for small text parts. Under concurrent multipart EML processing
this multiplied to gigabytes of throwaway allocations. Encode the text once
with String.getBytes(UTF_8) (memory proportional to input, not budget) and
walk back over UTF-8 continuation bytes to land on a code-point boundary
when truncation is needed.

Adds test_maxBodyBytes_truncatesAtUtf8CodePointBoundary verifying the
boundary walk-back never produces a U+FFFD replacement char.
Three audit findings on PR #166:

- multipart/alternative previously charged only the chosen child to
  ctx.partCount, so an attacker could bypass maxParts by stuffing
  thousands of unused alternatives. Now charges count - 1 for skipped
  alternatives (the chosen one is counted via its own extractBody call),
  re-checking the cap before recursion.
- text/* parts were fully decoded into a String via Part.getContent()
  before any maxBodyBytes check, peaking heap at multiples of the part
  size. Replaced with a streaming read from Part.getInputStream() capped
  at 4 * remaining-UTF-8-budget + 16 bytes (enough to fill any UTF-8 cap
  regardless of source charset, but bounded relative to maxBodyBytes
  rather than to the part size).
- appendBody appended a trailing space even when the encoded text
  exactly filled the remaining budget, exceeding maxBodyBytes by 1.
  Reserve the separator byte before taking the fit branch and guard
  cutoff < bytes.length when walking back continuation bytes.

Adds regression tests for the alternative-bypass and the strict cap.
Address multi-agent code review findings on PR #166. Tightens defensive
bounds, error handling, and input validation in EmlExtractor without
weakening any existing happy-path behavior.

Security / DoS:
- Add maxMessageBytes (default 100 MiB) enforced via LimitedInputStream
  before MimeMessage parses, closing the parser-stage memory hole.
- Wrap attachment streams in LimitedInputStream sized to the remaining
  UTF-8 budget so nested extractors cannot bypass maxBodyBytes.
- Re-throw MaxLengthExceededException from appendAttachment instead of
  swallowing it, preserving the security signal up the call chain.
- Replace ByteArrayOutputStream-based text part decoding with a streaming
  InputStreamReader capped at remaining chars, eliminating amplification
  on wide source charsets (UTF-32 etc.).
- Cap Received header iteration at 100 entries and hoist MailDateFormat
  out of the loop.

Correctness:
- getDateString now anchors on the rightmost ';' per RFC 5322 §3.6.7,
  falling back to the day-of-week scan only when no ';' is present.
  Avoids false-positive matches inside Received-header comments.
- getDecodeText returns the raw value (not empty string) on
  UnsupportedEncodingException, so malformed headers stay non-empty.
- Cache getReceivedDate(message) so legacy and normalized keys reuse the
  same result.
- Replace Session.getDefaultInstance with Session.getInstance so
  per-call mail properties are honored.

Error handling:
- Surface text-part read failures at WARN (matching HtmlExtractor /
  PdfExtractor / TikaExtractor conventions); previously hidden at DEBUG.
- Narrow charset-resolution catch to IllegalCharsetNameException /
  UnsupportedCharsetException and log fallback to UTF-8 at WARN.
- Narrow putValue catch from Exception to RuntimeException and log at
  WARN with key=value format.
- Replace silent '// ignore' on ParseException with a DEBUG log.

API:
- Reject non-positive values in setMaxRecursionDepth (negative only),
  setMaxParts, setMaxBodyBytes, setMaxMessageBytes via
  IllegalArgumentException.
- @deprecated appendAttachment(StringBuilder, BodyPart) and document
  that it does not honor maxBodyBytes cumulatively.

Tests:
- Add 17 new tests covering maxMessageBytes enforcement, attachment
  budget propagation, MaxLengthExceededException re-throw, recursion
  boundary, RFC 2047 decoding for to/cc/bcc/replyTo, normalized date
  and messageId metadata, ISO-2022-JP / unknown / missing charsets,
  multiple and inline attachments, cross-part body budget, setter
  validation, Received-header parsing edge cases, and getDecodeText
  fallback.
- Tighten existing tests: test_decodesRfc2047Subject now asserts equality
  instead of permitting either form; remove flaky timing assertion from
  test_maxBodyBytes_largeInputIsBounded; tighten threshold in
  test_maxBodyBytes_truncates.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant