Skip to content

fix(extractor): harden CommandExtractor with timeout, output cap, descendant kill#159

Open
marevol wants to merge 5 commits into
masterfrom
fix/extractor-command-robustness
Open

fix(extractor): harden CommandExtractor with timeout, output cap, descendant kill#159
marevol wants to merge 5 commits into
masterfrom
fix/extractor-command-robustness

Conversation

@marevol
Copy link
Copy Markdown
Contributor

@marevol marevol commented May 4, 2026

Summary

Hardens CommandExtractor against five subprocess threats: hangs, output blow-ups, zombie/orphan children, stderr-pipe deadlocks, and oversized inputs.

  • Timeout via Process.waitFor(timeout, TimeUnit) instead of a polling MonitorThread (deleted as dead code along with InputStreamThread).
  • Both stdout and stderr are drained concurrently by daemon threads so a full pipe buffer cannot block the child. Previously stderr was merged into stdout in capture mode, but in standardOutput=true mode stderr was completely undrained.
  • Output cap (maxOutputSize, default 10 MiB) per stream. On overflow the process tree is destroyed and an ExtractException is thrown.
  • Input cap (maxInputSize, default 100 MiB) when copying the input stream into the temp file; fails fast before the command is invoked.
  • Descendant kill on timeout/overflow: process.descendants().forEach(ProcessHandle::destroyForcibly) followed by process.destroyForcibly(), so child processes started by the extractor command do not survive the parent.
  • Daemon thread factory for stream-drain workers so they cannot block JVM shutdown.
  • Removed isTeminated typo along with the obsolete MonitorThread/InputStreamThread inner classes.

Test plan

  • mvn -pl fess-crawler test -Dtest=CommandExtractorTest — all 12 tests pass (7 existing + 5 new)
  • New tests:
    • test_timeout_killsProcesssh -c "sleep 30" is killed within 5s when timeout=500ms
    • test_outputSizeExceeded_throwsExtractException — 5 MiB stderr output capped at 1 MiB throws
    • test_stderrDrained_doesNotDeadlock — script writes 1 MiB to stderr without deadlock
    • test_inputSizeExceeded_throwsExtractException — 64 KiB input with cap=1 KiB throws
    • test_processDescendants_killedsh -c "sleep 30 & wait" returns within 10s
  • mvn -pl fess-crawler test — no new failures vs master (pre-existing failures: JodExtractor needs LibreOffice, S3/SMB/Storage/Gcs need Docker)
  • mvn formatter:format && mvn license:format applied

marevol added 5 commits May 5, 2026 07:27
…t, and orphan processes

- Replace polling MonitorThread with Process.waitFor(timeout, TimeUnit)
- Drain both stdout and stderr concurrently via daemon executor to avoid
  pipe-buffer deadlocks and to prevent stderr from filling indefinitely
- Cap subprocess output (stdout/stderr) at maxOutputSize (default 10 MiB);
  on overflow, kill the process tree and throw ExtractException
- Cap input copied to the temp file at maxInputSize (default 100 MiB);
  fail fast before invoking the command
- On timeout or overflow, destroy descendants via ProcessHandle followed
  by destroyForcibly() so orphaned children are not left behind
- Use a daemon thread factory for stream-drain workers so they do not
  block JVM shutdown
- Remove dead MonitorThread/InputStreamThread inner classes and the
  isTeminated typo
- Add tests for timeout kill, output-size cap, stderr drain (no
  deadlock), input-size cap, and descendant-kill best-effort
…dant snapshot kill, overflow responsibility, setMaxOutputLine compat

Fix 1 (High): Add pre-read length check on outputFile before FileUtil.readBytes to enforce
maxOutputSize for both standardOutput=true (via BoundedFileWriter) and standardOutput=false
(post-exec length guard). Replace pb.redirectOutput(outputFile) with piped BoundedFileWriter
so stdout is bounded even when standardOutput=true.

Fix 2 (High): Snapshot process.descendants() before calling process.destroy() so children
reparented to PID 1 after parent exits on SIGTERM are still reachable for destroyForcibly.

Fix 3 (Medium): Remove eager process.destroyForcibly() from BoundedStreamReader on overflow;
delegate tree-kill entirely to executeCommand's destroyProcessTree call, which now carries
the descendant snapshot.

Fix 4 (Medium): Make setMaxOutputLine(n) conservatively shrink maxOutputSize to
max(64KiB, n*1024) when setMaxOutputSize has not been called explicitly; add
maxOutputSizeExplicit guard and getMaxOutputSize() accessor.

Tests added: test_outputFileExceedsMaxOutputSize_throws,
test_outputFileExceedsMaxOutputSize_standardOutputFalse_throws,
test_processDescendants_killed_orphan_gone, test_overflow_killsDescendants,
test_setMaxOutputLine_shrinksMaxOutputSize,
test_setMaxOutputLine_doesNotOverrideExplicitMaxOutputSize.
… overflow

- Default includeStderrInOutput to false to match pre-refactor
  behavior; original redirectErrorStream(true) only routed stderr
  to logs, never to extracted text.
- Poll overflowFlag while waiting on the process so an
  OutputSizeExceededException from BoundedStreamReader kills the
  process tree immediately instead of blocking the main thread
  for the full executionTimeout.
…ibility, sanitization

Critical fixes:
- BoundedFileWriter now signals overflowFlag so standardOutput=true overflow
  fails fast instead of waiting executionTimeout
- 5s future-collection TimeoutException cancels futures, kills process tree,
  and throws ExtractException (previously returned partial/empty data as success)
- Introduce shared shuttingDown flag so genuine pipe IOExceptions are surfaced
  instead of silently treated as EOF
- Narrow Charset.forName fallback to IllegalCharsetNameException/
  UnsupportedCharsetException with warn log (previously silent UTF-8 fallback)

Major fixes:
- Restore interrupt flag in overflow-collection loop (was dropped)
- Replace raw Future[] iteration with List.of for type safety
- Standardize info logs on key=value format per project convention
- Warn-log non-zero exit code (lenient contract preserved)
- Debug-log every kill failure in destroyProcessTree with pid context
- isAlive() check before destroyForcibly + 3-iteration rescan loop to close
  the fork-after-snapshot window
- Wrap unexpected ExecutionException causes in CrawlerSystemException with
  cause chain instead of dropping them
- Sanitize filePrefix/extention via allow-list and handle Windows backslash
  to prevent argument injection through crafted resourceName
- Warn when setMaxOutputSize exceeds 512 MiB (heap multiplier risk)
- awaitTermination(1s) after streamPool.shutdownNow
- Clarify catch-all message ("Failed to execute command" instead of
  "Process terminated")
- Add @deprecated to internal methods of deprecated stub classes

Tests added:
- standardOutput=true overflow fast-fail timing
- invalid commandOutputEncoding UTF-8 fallback
- blank command and nonexistent binary CrawlerSystemException
- non-zero exit lenient-contract assertion
- exact 64 KiB floor for setMaxOutputLine
- resourceName metacharacter sanitization
- getFileName handles backslash
- thread interrupt propagation
- invalid workingDirectory
- deprecated MonitorThread / InputStreamThread NOP stubs
- tightened fast-fail elapsed-time assertion on existing output-size test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant