fix(extractor): harden CommandExtractor with timeout, output cap, descendant kill#159
Open
marevol wants to merge 5 commits into
Open
fix(extractor): harden CommandExtractor with timeout, output cap, descendant kill#159marevol wants to merge 5 commits into
marevol wants to merge 5 commits into
Conversation
…t, and orphan processes - Replace polling MonitorThread with Process.waitFor(timeout, TimeUnit) - Drain both stdout and stderr concurrently via daemon executor to avoid pipe-buffer deadlocks and to prevent stderr from filling indefinitely - Cap subprocess output (stdout/stderr) at maxOutputSize (default 10 MiB); on overflow, kill the process tree and throw ExtractException - Cap input copied to the temp file at maxInputSize (default 100 MiB); fail fast before invoking the command - On timeout or overflow, destroy descendants via ProcessHandle followed by destroyForcibly() so orphaned children are not left behind - Use a daemon thread factory for stream-drain workers so they do not block JVM shutdown - Remove dead MonitorThread/InputStreamThread inner classes and the isTeminated typo - Add tests for timeout kill, output-size cap, stderr drain (no deadlock), input-size cap, and descendant-kill best-effort
… stderr-in-output for backward compat
…dant snapshot kill, overflow responsibility, setMaxOutputLine compat Fix 1 (High): Add pre-read length check on outputFile before FileUtil.readBytes to enforce maxOutputSize for both standardOutput=true (via BoundedFileWriter) and standardOutput=false (post-exec length guard). Replace pb.redirectOutput(outputFile) with piped BoundedFileWriter so stdout is bounded even when standardOutput=true. Fix 2 (High): Snapshot process.descendants() before calling process.destroy() so children reparented to PID 1 after parent exits on SIGTERM are still reachable for destroyForcibly. Fix 3 (Medium): Remove eager process.destroyForcibly() from BoundedStreamReader on overflow; delegate tree-kill entirely to executeCommand's destroyProcessTree call, which now carries the descendant snapshot. Fix 4 (Medium): Make setMaxOutputLine(n) conservatively shrink maxOutputSize to max(64KiB, n*1024) when setMaxOutputSize has not been called explicitly; add maxOutputSizeExplicit guard and getMaxOutputSize() accessor. Tests added: test_outputFileExceedsMaxOutputSize_throws, test_outputFileExceedsMaxOutputSize_standardOutputFalse_throws, test_processDescendants_killed_orphan_gone, test_overflow_killsDescendants, test_setMaxOutputLine_shrinksMaxOutputSize, test_setMaxOutputLine_doesNotOverrideExplicitMaxOutputSize.
… overflow - Default includeStderrInOutput to false to match pre-refactor behavior; original redirectErrorStream(true) only routed stderr to logs, never to extracted text. - Poll overflowFlag while waiting on the process so an OutputSizeExceededException from BoundedStreamReader kills the process tree immediately instead of blocking the main thread for the full executionTimeout.
…ibility, sanitization
Critical fixes:
- BoundedFileWriter now signals overflowFlag so standardOutput=true overflow
fails fast instead of waiting executionTimeout
- 5s future-collection TimeoutException cancels futures, kills process tree,
and throws ExtractException (previously returned partial/empty data as success)
- Introduce shared shuttingDown flag so genuine pipe IOExceptions are surfaced
instead of silently treated as EOF
- Narrow Charset.forName fallback to IllegalCharsetNameException/
UnsupportedCharsetException with warn log (previously silent UTF-8 fallback)
Major fixes:
- Restore interrupt flag in overflow-collection loop (was dropped)
- Replace raw Future[] iteration with List.of for type safety
- Standardize info logs on key=value format per project convention
- Warn-log non-zero exit code (lenient contract preserved)
- Debug-log every kill failure in destroyProcessTree with pid context
- isAlive() check before destroyForcibly + 3-iteration rescan loop to close
the fork-after-snapshot window
- Wrap unexpected ExecutionException causes in CrawlerSystemException with
cause chain instead of dropping them
- Sanitize filePrefix/extention via allow-list and handle Windows backslash
to prevent argument injection through crafted resourceName
- Warn when setMaxOutputSize exceeds 512 MiB (heap multiplier risk)
- awaitTermination(1s) after streamPool.shutdownNow
- Clarify catch-all message ("Failed to execute command" instead of
"Process terminated")
- Add @deprecated to internal methods of deprecated stub classes
Tests added:
- standardOutput=true overflow fast-fail timing
- invalid commandOutputEncoding UTF-8 fallback
- blank command and nonexistent binary CrawlerSystemException
- non-zero exit lenient-contract assertion
- exact 64 KiB floor for setMaxOutputLine
- resourceName metacharacter sanitization
- getFileName handles backslash
- thread interrupt propagation
- invalid workingDirectory
- deprecated MonitorThread / InputStreamThread NOP stubs
- tightened fast-fail elapsed-time assertion on existing output-size test
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Hardens
CommandExtractoragainst five subprocess threats: hangs, output blow-ups, zombie/orphan children, stderr-pipe deadlocks, and oversized inputs.Process.waitFor(timeout, TimeUnit)instead of a pollingMonitorThread(deleted as dead code along withInputStreamThread).standardOutput=truemode stderr was completely undrained.maxOutputSize, default 10 MiB) per stream. On overflow the process tree is destroyed and anExtractExceptionis thrown.maxInputSize, default 100 MiB) when copying the input stream into the temp file; fails fast before the command is invoked.process.descendants().forEach(ProcessHandle::destroyForcibly)followed byprocess.destroyForcibly(), so child processes started by the extractor command do not survive the parent.isTeminatedtypo along with the obsoleteMonitorThread/InputStreamThreadinner classes.Test plan
mvn -pl fess-crawler test -Dtest=CommandExtractorTest— all 12 tests pass (7 existing + 5 new)test_timeout_killsProcess—sh -c "sleep 30"is killed within 5s when timeout=500mstest_outputSizeExceeded_throwsExtractException— 5 MiB stderr output capped at 1 MiB throwstest_stderrDrained_doesNotDeadlock— script writes 1 MiB to stderr without deadlocktest_inputSizeExceeded_throwsExtractException— 64 KiB input with cap=1 KiB throwstest_processDescendants_killed—sh -c "sleep 30 & wait"returns within 10smvn -pl fess-crawler test— no new failures vs master (pre-existing failures: JodExtractor needs LibreOffice, S3/SMB/Storage/Gcs need Docker)mvn formatter:format && mvn license:formatapplied