fix(extractor): defend against archive bomb, recursion, and Zip Slip#161
Open
marevol wants to merge 6 commits into
Open
fix(extractor): defend against archive bomb, recursion, and Zip Slip#161marevol wants to merge 6 commits into
marevol wants to merge 6 commits into
Conversation
Adds defensive limits to ZipExtractor, TarExtractor and LhaExtractor so that a malicious content stream can no longer cause unbounded resource consumption or write outside its conceptual extraction root. Threat model assumption: the InputStream is untrusted; the params Map is admin-configured / trusted. Protections: - AbstractExtractor: extractorDepth param + checkDepth/getCurrentDepth/ incrementDepth helpers, plus configurable maxArchiveDepth (default 10). Recursive archive-in-archive expansions now propagate an incremented depth and abort with MaxLengthExceededException at the threshold. - ZipExtractor: per-archive caps (maxBytes 2 GiB, maxEntries 100k, maxCompressionRatio 100:1 for entries > 1 MiB), Zip Slip path-traversal rejection via Path.normalize(), early ZIP signature validation, and configurable filenameEncoding (UTF-8 / CP932 / MS932) for non-UTF8 archive names. Compression ratio uses the central directory size when available and falls back to bytes consumed from a CountingInputStream. - TarExtractor: maxBytes / maxEntries caps, depth check, path-traversal rejection, and explicit skipping of symbolic-link and hard-link entries to avoid leaking content from outside the archive sandbox. - LhaExtractor: maxBytes / maxEntries caps, depth check, path-traversal rejection. Tests: - AbstractExtractorTest: new coverage for getCurrentDepth / incrementDepth (returns NEW map, leaves input untouched) / checkDepth pass and fail paths. - ArchiveExtractorSecurityTest: 15 synthetic-archive tests covering each threat (zip byte / entry / ratio bombs, Zip Slip, absolute paths, recursion-depth, CP932 filenames, tar symlink/hardlink/path-traversal, tar byte/entry caps and depth, lha depth). All previously-passing extractor tests continue to pass; failures in the suite are limited to environment-dependent tests (Docker for SMB / S3 / GCS / Storage clients, LibreOffice for JodExtractor) which are unchanged by this commit.
- LhaExtractor now stages the input archive through copyBounded with a configurable maxInputBytes cap (default 1 GiB) so a hostile producer cannot fill local storage before LhaFile is opened. - The per-entry size cap is enforced against bytes actually decompressed (mirroring Zip/Tar) instead of the attacker-controlled LhaHeader#getOriginalSize. Entries are now buffered into a bounded ByteArrayOutputStream and forwarded to downstream extractors via ByteArrayInputStream, dropping the unbounded IgnoreCloseInputStream hand-off. - Down-scale test_perEntryCapEnforced (300 MiB -> 2 MiB payload with a 1 MiB cap) so it stays cheap on parallel/low-memory CI, and add test_lha_maxInputBytes_capsStaging covering the new input staging cap.
… into read budget Addresses PR #161 review feedback: 1) Unsupported entries (no registered Extractor) are now skipped without buffering, mirroring the pre-defence behaviour. A large irrelevant entry (e.g. a video alongside a small .txt) no longer consumes the per-entry / total caps that should be reserved for entries the crawler actually wants to extract. 2) The legacy maxContentSize field is folded into the per-entry read budget (alongside maxBytes and maxBytesPerEntry). A small legacy cap (e.g. 10 MiB) now stops the buffer growing past it, instead of being checked only after up to 256 MiB+1 had been buffered. For Zip, the compressed-bytes anchor is also advanced for skipped / rejected entries so the next supported entry's compression-ratio is computed against its own compressed bytes, not also those of the skipped ones. Tests: updates test_perEntryCapEnforced to use a supported (.txt) entry and adds three regressions: - test_zip_unsupportedEntryDoesNotConsumeCaps - test_tar_unsupportedEntryDoesNotConsumeCaps - test_zip_maxContentSize_capsBufferBeforePerEntryCap
Addresses correctness, security, and observability issues raised in the review of the archive-bomb / Zip-Slip / recursion-depth defenses. Correctness - AbstractExtractor: new addOneSaturating(long) clamps the read-limit "+1" margin at Long.MAX_VALUE so an admin who sets a cap near Long.MAX_VALUE no longer wraps to Long.MIN_VALUE and silently buffers zero bytes. All "+1L" sites in Zip/Tar/Lha now go through it. - AbstractExtractor.copyBounded: a negative limit now throws IllegalArgumentException instead of silently returning 0, so a misconfiguration upstream surfaces immediately. - AbstractExtractor.isPathTraversal: normalise backslash to forward slash BEFORE calling Paths.get so single-segment names like "a\\.." are recognised as traversal on Linux/macOS JVMs (where backslash is a literal filename character). Security - ZipExtractor: tighten the early signature check to PK\x03\x04 only; short-circuit valid empty archives that start with PK\x05\x06; reject PK\x07\x08 (data-descriptor) and other non-opening signatures. - ZipExtractor: flip allowStoredEntriesWithDataDescriptor back to false (the pre-PR factory default) to avoid widening the attack surface. - ZipExtractor: compression-ratio check now uses min(header_compressed_size, measured_compressed_size) so a header that lies about the compressed size cannot suppress the check. - TarExtractor: skip PAX extended/global header pseudo-entries before incrementing entryCount so they do not consume the maxEntries budget. - TarExtractor: symbolic/hard-link skip log upgraded from DEBUG to WARN for parity with path-traversal rejection. Observability - ZipExtractor / TarExtractor / LhaExtractor: emit a summary WARN when the entry loop completes normally with failedEntries > 0. Operators no longer have to enable DEBUG to notice partial extraction. - LhaExtractor: widen the per-entry catch to (IOException | RuntimeException) and guard against a null InputStream from LhaFile#getInputStream — the dangan library does not declare checked exceptions for corrupt/unknown compression methods and can return null when a header is missing. Failed entries are now logged at WARN with the filename and counted in failedEntries. - LhaExtractor: replace the empty catch on LhaFile#close() with a WARN log so a failing close (e.g. EBADF, disk full) is no longer silently hidden. - LhaExtractor: use validateInputStream(in) instead of the bespoke null check so the message and exception type match Zip/Tar. Tests - AbstractExtractorTest: 9 new isPathTraversal cases (null, empty, drive letter, leading slash, leading backslash, lone "..", classic traversal, safe relative, NUL char) plus the new single-segment backslash regression that exercises the Paths.get fix; 2 new addOneSaturating cases. - ArchiveExtractorSecurityTest: 15 new tests covering the ZIP signature variants, the saturating-add at Long.MAX_VALUE, the min(header,measured) ratio path, nested-archive recursion-depth increment, per-entry/total cap matrix with each cap disabled, the CP932 test via Assumptions.assumeTrue, a tightened ratio assertion, the Tar PAX-header skip, the Tar per-entry cap, the Tar symlink WARN behaviour, and the setMaxArchiveDepth setter. All 83 tests in the archive-extractor test classes pass. The single flaky failure in the wider suite (Hc5HttpClientTest#test_doHead_ accessTimeoutTarget) is pre-existing and unrelated.
…adoc Javadoc tag @throws CrawlerSystemException on getText() referred to a class not on the file's import list, breaking maven-javadoc-plugin with 'reference not found' on CI.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Hardens
ZipExtractor,TarExtractorandLhaExtractoragainstcontent-driven attacks. The input stream is treated as untrusted; the
extractor
paramsmap remains admin-configured / trusted.Protections added
extractorDepthparam tracked throughAbstractExtractor(EXTRACTOR_DEPTH_KEY,getCurrentDepth,incrementDepth,checkDepth). Each archive extractor callscheckDepthon entry andincrementDepthwhen delegating to a nestedextractor. Default
maxArchiveDepth = 10.maxBytes, default 2 GiB) measuredfrom the bytes actually read out of each entry stream — not from the
declared
entry.getSize()which can be-1or fabricated.maxEntries, default 100,000) to defendagainst many-zero-byte-entries DoS.
maxCompressionRatio, default 100:1for entries > 1 MiB) using the central-directory compressed size where
present and the bytes consumed from a
CountingInputStreamwhen thelocal header carries a data descriptor.
Path.normalize()and rejected when they escape the conceptualextraction root, are absolute, or contain
..segments.LF_SYMLINK/LF_LINKare skipped with a debug log so they cannot leak contentfrom outside the archive.
filenameEncoding, defaultUTF-8; supportsCP932/MS932for Japanese archives).reported as
ExtractExceptioninstead of returning empty contentsilently.
How to override
Field setters via DI (these are admin-configured, not params):
setMaxArchiveDepth(int)(AbstractExtractor)10setMaxBytes(long)2 GiBsetMaxEntries(int)100_000setMaxCompressionRatio(long)(Zip)100setFilenameEncoding(String)(Zip)UTF-8Param key (caller-controlled, used for recursive calls):
extractorDepth— current depth value, parsed as integer.Threat model
paramsmap (admin-configured / internal)Test plan
AbstractExtractorTest: depth helper coverage (null/blank/garbage,incrementDepthreturns NEW map,checkDepthat/above limit).ArchiveExtractorSecurityTest(new): 15 synthetic-archive tests:test_zipBomb_byteLimittest_zipBomb_entryLimittest_zipSlip_pathTraversal,test_zipSlip_absolutePathtest_recursionDepth_exceeded,test_recursionDepth_belowLimit_succeedstest_cp932Filenametest_tar_symlinkSkipped,test_tar_hardlinkSkipped,test_tar_pathTraversaltest_compressionRatioExceededtest_tarBomb_byteLimit,test_tarBomb_entryLimit,test_tar_recursionDepth_exceededtest_lha_recursionDepth_exceeded(
ZipExtractorTest,TarExtractorTest,LhaExtractorTest,ArchiveExtractorErrorHandlingTest).failures remain (Docker for SMB / S3 / GCS / Storage clients,
LibreOffice for
JodExtractor).