Skip to content

fix(extractor): defend against archive bomb, recursion, and Zip Slip#161

Open
marevol wants to merge 6 commits into
masterfrom
fix/extractor-archive-bomb-defense
Open

fix(extractor): defend against archive bomb, recursion, and Zip Slip#161
marevol wants to merge 6 commits into
masterfrom
fix/extractor-archive-bomb-defense

Conversation

@marevol
Copy link
Copy Markdown
Contributor

@marevol marevol commented May 4, 2026

Summary

Hardens ZipExtractor, TarExtractor and LhaExtractor against
content-driven attacks. The input stream is treated as untrusted; the
extractor params map remains admin-configured / trusted.

Protections added

  • Recursion-depth limit — new extractorDepth param tracked through
    AbstractExtractor (EXTRACTOR_DEPTH_KEY, getCurrentDepth,
    incrementDepth, checkDepth). Each archive extractor calls
    checkDepth on entry and incrementDepth when delegating to a nested
    extractor. Default maxArchiveDepth = 10.
  • Total uncompressed-byte cap (maxBytes, default 2 GiB) measured
    from the bytes actually read out of each entry stream — not from the
    declared entry.getSize() which can be -1 or fabricated.
  • Maximum entry count (maxEntries, default 100,000) to defend
    against many-zero-byte-entries DoS.
  • Compression-ratio threshold (maxCompressionRatio, default 100:1
    for entries > 1 MiB) using the central-directory compressed size where
    present and the bytes consumed from a CountingInputStream when the
    local header carries a data descriptor.
  • Zip Slip path-traversal rejection — entry names are normalised via
    Path.normalize() and rejected when they escape the conceptual
    extraction root, are absolute, or contain .. segments.
  • Tar symlink / hardlink skipping — entries with LF_SYMLINK /
    LF_LINK are skipped with a debug log so they cannot leak content
    from outside the archive.
  • Configurable filename encoding (filenameEncoding, default
    UTF-8; supports CP932 / MS932 for Japanese archives).
  • Early ZIP signature validation so a clearly non-ZIP blob is
    reported as ExtractException instead of returning empty content
    silently.

How to override

Field setters via DI (these are admin-configured, not params):

Setter Default
setMaxArchiveDepth(int) (AbstractExtractor) 10
setMaxBytes(long) 2 GiB
setMaxEntries(int) 100_000
setMaxCompressionRatio(long) (Zip) 100
setFilenameEncoding(String) (Zip) UTF-8

Param key (caller-controlled, used for recursive calls):

  • extractorDepth — current depth value, parsed as integer.

Threat model

  • Untrusted: archive content stream
  • Trusted: extractor params map (admin-configured / internal)

Test plan

  • AbstractExtractorTest: depth helper coverage (null/blank/garbage,
    incrementDepth returns NEW map, checkDepth at/above limit).
  • ArchiveExtractorSecurityTest (new): 15 synthetic-archive tests:
    • test_zipBomb_byteLimit
    • test_zipBomb_entryLimit
    • test_zipSlip_pathTraversal, test_zipSlip_absolutePath
    • test_recursionDepth_exceeded,
      test_recursionDepth_belowLimit_succeeds
    • test_cp932Filename
    • test_tar_symlinkSkipped, test_tar_hardlinkSkipped,
      test_tar_pathTraversal
    • test_compressionRatioExceeded
    • test_tarBomb_byteLimit, test_tarBomb_entryLimit,
      test_tar_recursionDepth_exceeded
    • test_lha_recursionDepth_exceeded
  • All previously-passing archive tests still pass
    (ZipExtractorTest, TarExtractorTest, LhaExtractorTest,
    ArchiveExtractorErrorHandlingTest).
  • No regressions in the wider test suite — only env-dependent
    failures remain (Docker for SMB / S3 / GCS / Storage clients,
    LibreOffice for JodExtractor).

marevol and others added 6 commits May 5, 2026 07:31
Adds defensive limits to ZipExtractor, TarExtractor and LhaExtractor so
that a malicious content stream can no longer cause unbounded resource
consumption or write outside its conceptual extraction root.

Threat model assumption: the InputStream is untrusted; the params Map is
admin-configured / trusted.

Protections:
- AbstractExtractor: extractorDepth param + checkDepth/getCurrentDepth/
  incrementDepth helpers, plus configurable maxArchiveDepth (default 10).
  Recursive archive-in-archive expansions now propagate an incremented
  depth and abort with MaxLengthExceededException at the threshold.
- ZipExtractor: per-archive caps (maxBytes 2 GiB, maxEntries 100k,
  maxCompressionRatio 100:1 for entries > 1 MiB), Zip Slip path-traversal
  rejection via Path.normalize(), early ZIP signature validation, and
  configurable filenameEncoding (UTF-8 / CP932 / MS932) for non-UTF8
  archive names. Compression ratio uses the central directory size when
  available and falls back to bytes consumed from a CountingInputStream.
- TarExtractor: maxBytes / maxEntries caps, depth check, path-traversal
  rejection, and explicit skipping of symbolic-link and hard-link entries
  to avoid leaking content from outside the archive sandbox.
- LhaExtractor: maxBytes / maxEntries caps, depth check, path-traversal
  rejection.

Tests:
- AbstractExtractorTest: new coverage for getCurrentDepth /
  incrementDepth (returns NEW map, leaves input untouched) / checkDepth
  pass and fail paths.
- ArchiveExtractorSecurityTest: 15 synthetic-archive tests covering each
  threat (zip byte / entry / ratio bombs, Zip Slip, absolute paths,
  recursion-depth, CP932 filenames, tar symlink/hardlink/path-traversal,
  tar byte/entry caps and depth, lha depth).

All previously-passing extractor tests continue to pass; failures in the
suite are limited to environment-dependent tests (Docker for SMB / S3 /
GCS / Storage clients, LibreOffice for JodExtractor) which are unchanged
by this commit.
- LhaExtractor now stages the input archive through copyBounded with a
  configurable maxInputBytes cap (default 1 GiB) so a hostile producer
  cannot fill local storage before LhaFile is opened.
- The per-entry size cap is enforced against bytes actually decompressed
  (mirroring Zip/Tar) instead of the attacker-controlled
  LhaHeader#getOriginalSize. Entries are now buffered into a bounded
  ByteArrayOutputStream and forwarded to downstream extractors via
  ByteArrayInputStream, dropping the unbounded IgnoreCloseInputStream
  hand-off.
- Down-scale test_perEntryCapEnforced (300 MiB -> 2 MiB payload with a
  1 MiB cap) so it stays cheap on parallel/low-memory CI, and add
  test_lha_maxInputBytes_capsStaging covering the new input staging
  cap.
… into read budget

Addresses PR #161 review feedback:

1) Unsupported entries (no registered Extractor) are now skipped without
   buffering, mirroring the pre-defence behaviour. A large irrelevant
   entry (e.g. a video alongside a small .txt) no longer consumes the
   per-entry / total caps that should be reserved for entries the
   crawler actually wants to extract.

2) The legacy maxContentSize field is folded into the per-entry read
   budget (alongside maxBytes and maxBytesPerEntry). A small legacy cap
   (e.g. 10 MiB) now stops the buffer growing past it, instead of being
   checked only after up to 256 MiB+1 had been buffered.

For Zip, the compressed-bytes anchor is also advanced for skipped /
rejected entries so the next supported entry's compression-ratio is
computed against its own compressed bytes, not also those of the
skipped ones.

Tests: updates test_perEntryCapEnforced to use a supported (.txt) entry
and adds three regressions:
- test_zip_unsupportedEntryDoesNotConsumeCaps
- test_tar_unsupportedEntryDoesNotConsumeCaps
- test_zip_maxContentSize_capsBufferBeforePerEntryCap
Addresses correctness, security, and observability issues raised in the
review of the archive-bomb / Zip-Slip / recursion-depth defenses.

Correctness
- AbstractExtractor: new addOneSaturating(long) clamps the read-limit
  "+1" margin at Long.MAX_VALUE so an admin who sets a cap near
  Long.MAX_VALUE no longer wraps to Long.MIN_VALUE and silently buffers
  zero bytes. All "+1L" sites in Zip/Tar/Lha now go through it.
- AbstractExtractor.copyBounded: a negative limit now throws
  IllegalArgumentException instead of silently returning 0, so a
  misconfiguration upstream surfaces immediately.
- AbstractExtractor.isPathTraversal: normalise backslash to forward
  slash BEFORE calling Paths.get so single-segment names like "a\\.."
  are recognised as traversal on Linux/macOS JVMs (where backslash is
  a literal filename character).

Security
- ZipExtractor: tighten the early signature check to PK\x03\x04 only;
  short-circuit valid empty archives that start with PK\x05\x06; reject
  PK\x07\x08 (data-descriptor) and other non-opening signatures.
- ZipExtractor: flip allowStoredEntriesWithDataDescriptor back to false
  (the pre-PR factory default) to avoid widening the attack surface.
- ZipExtractor: compression-ratio check now uses
  min(header_compressed_size, measured_compressed_size) so a header
  that lies about the compressed size cannot suppress the check.
- TarExtractor: skip PAX extended/global header pseudo-entries before
  incrementing entryCount so they do not consume the maxEntries budget.
- TarExtractor: symbolic/hard-link skip log upgraded from DEBUG to WARN
  for parity with path-traversal rejection.

Observability
- ZipExtractor / TarExtractor / LhaExtractor: emit a summary WARN when
  the entry loop completes normally with failedEntries > 0. Operators
  no longer have to enable DEBUG to notice partial extraction.
- LhaExtractor: widen the per-entry catch to (IOException |
  RuntimeException) and guard against a null InputStream from
  LhaFile#getInputStream — the dangan library does not declare checked
  exceptions for corrupt/unknown compression methods and can return
  null when a header is missing. Failed entries are now logged at WARN
  with the filename and counted in failedEntries.
- LhaExtractor: replace the empty catch on LhaFile#close() with a WARN
  log so a failing close (e.g. EBADF, disk full) is no longer silently
  hidden.
- LhaExtractor: use validateInputStream(in) instead of the bespoke null
  check so the message and exception type match Zip/Tar.

Tests
- AbstractExtractorTest: 9 new isPathTraversal cases (null, empty,
  drive letter, leading slash, leading backslash, lone "..", classic
  traversal, safe relative, NUL char) plus the new single-segment
  backslash regression that exercises the Paths.get fix; 2 new
  addOneSaturating cases.
- ArchiveExtractorSecurityTest: 15 new tests covering the ZIP
  signature variants, the saturating-add at Long.MAX_VALUE, the
  min(header,measured) ratio path, nested-archive recursion-depth
  increment, per-entry/total cap matrix with each cap disabled, the
  CP932 test via Assumptions.assumeTrue, a tightened ratio assertion,
  the Tar PAX-header skip, the Tar per-entry cap, the Tar symlink
  WARN behaviour, and the setMaxArchiveDepth setter.

All 83 tests in the archive-extractor test classes pass. The single
flaky failure in the wider suite (Hc5HttpClientTest#test_doHead_
accessTimeoutTarget) is pre-existing and unrelated.
…adoc

Javadoc tag @throws CrawlerSystemException on getText() referred to a
class not on the file's import list, breaking maven-javadoc-plugin with
'reference not found' on CI.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant