fix(extractor): defend against archive bomb, recursion, and Zip Slip by marevol · Pull Request #161 · codelibs/fess-crawler

marevol · 2026-05-04T22:32:30Z

Summary

Hardens ZipExtractor, TarExtractor and LhaExtractor against
content-driven attacks. The input stream is treated as untrusted; the
extractor params map remains admin-configured / trusted.

Protections added

Recursion-depth limit — new extractorDepth param tracked through
AbstractExtractor (EXTRACTOR_DEPTH_KEY, getCurrentDepth,
incrementDepth, checkDepth). Each archive extractor calls
checkDepth on entry and incrementDepth when delegating to a nested
extractor. Default maxArchiveDepth = 10.
Total uncompressed-byte cap (maxBytes, default 2 GiB) measured
from the bytes actually read out of each entry stream — not from the
declared entry.getSize() which can be -1 or fabricated.
Maximum entry count (maxEntries, default 100,000) to defend
against many-zero-byte-entries DoS.
Compression-ratio threshold (maxCompressionRatio, default 100:1
for entries > 1 MiB) using the central-directory compressed size where
present and the bytes consumed from a CountingInputStream when the
local header carries a data descriptor.
Zip Slip path-traversal rejection — entry names are normalised via
Path.normalize() and rejected when they escape the conceptual
extraction root, are absolute, or contain .. segments.
Tar symlink / hardlink skipping — entries with LF_SYMLINK /
LF_LINK are skipped with a debug log so they cannot leak content
from outside the archive.
Configurable filename encoding (filenameEncoding, default
UTF-8; supports CP932 / MS932 for Japanese archives).
Early ZIP signature validation so a clearly non-ZIP blob is
reported as ExtractException instead of returning empty content
silently.

How to override

Field setters via DI (these are admin-configured, not params):

Setter	Default
`setMaxArchiveDepth(int)` (AbstractExtractor)	`10`
`setMaxBytes(long)`	`2 GiB`
`setMaxEntries(int)`	`100_000`
`setMaxCompressionRatio(long)` (Zip)	`100`
`setFilenameEncoding(String)` (Zip)	`UTF-8`

Param key (caller-controlled, used for recursive calls):

extractorDepth — current depth value, parsed as integer.

Threat model

Untrusted: archive content stream
Trusted: extractor params map (admin-configured / internal)

Test plan

AbstractExtractorTest: depth helper coverage (null/blank/garbage,
incrementDepth returns NEW map, checkDepth at/above limit).
ArchiveExtractorSecurityTest (new): 15 synthetic-archive tests:
- test_zipBomb_byteLimit
- test_zipBomb_entryLimit
- test_zipSlip_pathTraversal, test_zipSlip_absolutePath
- test_recursionDepth_exceeded,
  test_recursionDepth_belowLimit_succeeds
- test_cp932Filename
- test_tar_symlinkSkipped, test_tar_hardlinkSkipped,
  test_tar_pathTraversal
- test_compressionRatioExceeded
- test_tarBomb_byteLimit, test_tarBomb_entryLimit,
  test_tar_recursionDepth_exceeded
- test_lha_recursionDepth_exceeded
All previously-passing archive tests still pass
(ZipExtractorTest, TarExtractorTest, LhaExtractorTest,
ArchiveExtractorErrorHandlingTest).
No regressions in the wider test suite — only env-dependent
failures remain (Docker for SMB / S3 / GCS / Storage clients,
LibreOffice for JodExtractor).

Adds defensive limits to ZipExtractor, TarExtractor and LhaExtractor so that a malicious content stream can no longer cause unbounded resource consumption or write outside its conceptual extraction root. Threat model assumption: the InputStream is untrusted; the params Map is admin-configured / trusted. Protections: - AbstractExtractor: extractorDepth param + checkDepth/getCurrentDepth/ incrementDepth helpers, plus configurable maxArchiveDepth (default 10). Recursive archive-in-archive expansions now propagate an incremented depth and abort with MaxLengthExceededException at the threshold. - ZipExtractor: per-archive caps (maxBytes 2 GiB, maxEntries 100k, maxCompressionRatio 100:1 for entries > 1 MiB), Zip Slip path-traversal rejection via Path.normalize(), early ZIP signature validation, and configurable filenameEncoding (UTF-8 / CP932 / MS932) for non-UTF8 archive names. Compression ratio uses the central directory size when available and falls back to bytes consumed from a CountingInputStream. - TarExtractor: maxBytes / maxEntries caps, depth check, path-traversal rejection, and explicit skipping of symbolic-link and hard-link entries to avoid leaking content from outside the archive sandbox. - LhaExtractor: maxBytes / maxEntries caps, depth check, path-traversal rejection. Tests: - AbstractExtractorTest: new coverage for getCurrentDepth / incrementDepth (returns NEW map, leaves input untouched) / checkDepth pass and fail paths. - ArchiveExtractorSecurityTest: 15 synthetic-archive tests covering each threat (zip byte / entry / ratio bombs, Zip Slip, absolute paths, recursion-depth, CP932 filenames, tar symlink/hardlink/path-traversal, tar byte/entry caps and depth, lha depth). All previously-passing extractor tests continue to pass; failures in the suite are limited to environment-dependent tests (Docker for SMB / S3 / GCS / Storage clients, LibreOffice for JodExtractor) which are unchanged by this commit.

…nded helpers

- LhaExtractor now stages the input archive through copyBounded with a configurable maxInputBytes cap (default 1 GiB) so a hostile producer cannot fill local storage before LhaFile is opened. - The per-entry size cap is enforced against bytes actually decompressed (mirroring Zip/Tar) instead of the attacker-controlled LhaHeader#getOriginalSize. Entries are now buffered into a bounded ByteArrayOutputStream and forwarded to downstream extractors via ByteArrayInputStream, dropping the unbounded IgnoreCloseInputStream hand-off. - Down-scale test_perEntryCapEnforced (300 MiB -> 2 MiB payload with a 1 MiB cap) so it stays cheap on parallel/low-memory CI, and add test_lha_maxInputBytes_capsStaging covering the new input staging cap.

… into read budget Addresses PR #161 review feedback: 1) Unsupported entries (no registered Extractor) are now skipped without buffering, mirroring the pre-defence behaviour. A large irrelevant entry (e.g. a video alongside a small .txt) no longer consumes the per-entry / total caps that should be reserved for entries the crawler actually wants to extract. 2) The legacy maxContentSize field is folded into the per-entry read budget (alongside maxBytes and maxBytesPerEntry). A small legacy cap (e.g. 10 MiB) now stops the buffer growing past it, instead of being checked only after up to 256 MiB+1 had been buffered. For Zip, the compressed-bytes anchor is also advanced for skipped / rejected entries so the next supported entry's compression-ratio is computed against its own compressed bytes, not also those of the skipped ones. Tests: updates test_perEntryCapEnforced to use a supported (.txt) entry and adds three regressions: - test_zip_unsupportedEntryDoesNotConsumeCaps - test_tar_unsupportedEntryDoesNotConsumeCaps - test_zip_maxContentSize_capsBufferBeforePerEntryCap

Addresses correctness, security, and observability issues raised in the review of the archive-bomb / Zip-Slip / recursion-depth defenses. Correctness - AbstractExtractor: new addOneSaturating(long) clamps the read-limit "+1" margin at Long.MAX_VALUE so an admin who sets a cap near Long.MAX_VALUE no longer wraps to Long.MIN_VALUE and silently buffers zero bytes. All "+1L" sites in Zip/Tar/Lha now go through it. - AbstractExtractor.copyBounded: a negative limit now throws IllegalArgumentException instead of silently returning 0, so a misconfiguration upstream surfaces immediately. - AbstractExtractor.isPathTraversal: normalise backslash to forward slash BEFORE calling Paths.get so single-segment names like "a\\.." are recognised as traversal on Linux/macOS JVMs (where backslash is a literal filename character). Security - ZipExtractor: tighten the early signature check to PK\x03\x04 only; short-circuit valid empty archives that start with PK\x05\x06; reject PK\x07\x08 (data-descriptor) and other non-opening signatures. - ZipExtractor: flip allowStoredEntriesWithDataDescriptor back to false (the pre-PR factory default) to avoid widening the attack surface. - ZipExtractor: compression-ratio check now uses min(header_compressed_size, measured_compressed_size) so a header that lies about the compressed size cannot suppress the check. - TarExtractor: skip PAX extended/global header pseudo-entries before incrementing entryCount so they do not consume the maxEntries budget. - TarExtractor: symbolic/hard-link skip log upgraded from DEBUG to WARN for parity with path-traversal rejection. Observability - ZipExtractor / TarExtractor / LhaExtractor: emit a summary WARN when the entry loop completes normally with failedEntries > 0. Operators no longer have to enable DEBUG to notice partial extraction. - LhaExtractor: widen the per-entry catch to (IOException | RuntimeException) and guard against a null InputStream from LhaFile#getInputStream — the dangan library does not declare checked exceptions for corrupt/unknown compression methods and can return null when a header is missing. Failed entries are now logged at WARN with the filename and counted in failedEntries. - LhaExtractor: replace the empty catch on LhaFile#close() with a WARN log so a failing close (e.g. EBADF, disk full) is no longer silently hidden. - LhaExtractor: use validateInputStream(in) instead of the bespoke null check so the message and exception type match Zip/Tar. Tests - AbstractExtractorTest: 9 new isPathTraversal cases (null, empty, drive letter, leading slash, leading backslash, lone "..", classic traversal, safe relative, NUL char) plus the new single-segment backslash regression that exercises the Paths.get fix; 2 new addOneSaturating cases. - ArchiveExtractorSecurityTest: 15 new tests covering the ZIP signature variants, the saturating-add at Long.MAX_VALUE, the min(header,measured) ratio path, nested-archive recursion-depth increment, per-entry/total cap matrix with each cap disabled, the CP932 test via Assumptions.assumeTrue, a tightened ratio assertion, the Tar PAX-header skip, the Tar per-entry cap, the Tar symlink WARN behaviour, and the setMaxArchiveDepth setter. All 83 tests in the archive-extractor test classes pass. The single flaky failure in the wider suite (Hc5HttpClientTest#test_doHead_ accessTimeoutTarget) is pre-existing and unrelated.

@throws

…adoc Javadoc tag @throws CrawlerSystemException on getText() referred to a class not on the file's import list, breaking maven-javadoc-plugin with 'reference not found' on CI.

marevol and others added 6 commits May 5, 2026 07:31

fix(extractor): add per-entry size cap and dedupe Zip Slip / copy-bou…

102b22f

…nded helpers

fix(extractor): import CrawlerSystemException in LhaExtractor for jav…

bc2ab79

…adoc Javadoc tag @throws CrawlerSystemException on getText() referred to a class not on the file's import list, breaking maven-javadoc-plugin with 'reference not found' on CI.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(extractor): defend against archive bomb, recursion, and Zip Slip#161

fix(extractor): defend against archive bomb, recursion, and Zip Slip#161
marevol wants to merge 6 commits into
masterfrom
fix/extractor-archive-bomb-defense

marevol commented May 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

marevol commented May 4, 2026

Summary

Protections added

How to override

Threat model

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant