Skip to content

fix(extractor): bound ApiExtractor response, pool connections, and retry transients#162

Open
marevol wants to merge 7 commits into
masterfrom
fix/extractor-api-robustness
Open

fix(extractor): bound ApiExtractor response, pool connections, and retry transients#162
marevol wants to merge 7 commits into
masterfrom
fix/extractor-api-robustness

Conversation

@marevol
Copy link
Copy Markdown
Contributor

@marevol marevol commented May 4, 2026

Summary

  • Bound API response body via BoundedInputStream against a configurable maxResponseSize (default 100 MiB).
  • Reuse HTTP client with a PoolingHttpClientConnectionManager (default 50 total / 10 per route, configurable).
  • Add bounded retry on IOException, 5xx, 408, 429 (honoring Retry-After); fail fast on other 4xx.
  • Configurable connect/response timeouts with sane defaults.

Threat model

Admin-configured endpoint (no SSRF concern). The fix targets misbehaving / hostile responses (oversize, slow), socket exhaustion under high crawl rate, and transient network errors.

Tests

  • New tests for response-size cap, retry behavior, timeout, and connection reuse.
  • All existing ApiExtractorTest cases pass.

Test plan

  • CI green
  • Manual review of pooling lifecycle and retry classification

marevol added 7 commits May 5, 2026 07:44
…try transients

Threat model: admin-configured endpoint, no SSRF concern. The fix targets misbehaving
or hostile responses (oversize, slow), socket exhaustion under high crawl rate, and
transient network errors.

- Bound API response body via BoundedInputStream against a configurable
  maxResponseSize (default 100 MiB); oversize responses throw ExtractException.
- Reuse the HTTP client with a PoolingHttpClientConnectionManager (default 50 total
  / 10 per route, configurable) so successive crawls share connections.
- Add bounded retry on IOException, 5xx, 408, and 429 (honoring Retry-After delta-
  seconds); fail fast on other 4xx and return null as before.
- Default connect/socket timeouts of 5s/30s when none configured to mitigate slow
  loris servers.
- Allow per-call URL override via the new params key extractorUrl.

Tests
- New tests: response size cap, 5xx retry recovery, no-retry on 4xx, 429 with
  Retry-After, slow-server timeout, and 10x sequential calls for connection reuse.
- All existing ApiExtractorTest cases pass.
sleepQuietly previously suppressed InterruptedException and only restored the
interrupt flag, allowing the for-loop in executeWithRetries to proceed to
another HTTP attempt. With AccessTimeoutTarget interrupting the worker thread
once per second after accessTimeout fires, that meant a request could
continue retrying past the configured timeout window.

sleepQuietly now throws ExtractException on interrupt (interrupt flag still
preserved for upstream observers), aborting the retry loop immediately.
Adds unit tests covering both the helper itself and the loop-level bail-out.
The extractorUrl param introduced in this PR opens a small SSRF surface
if a downstream caller ever lets attacker-influenced data into the
params Map. Make the override opt-in:

- allowExtractorUrlOverride defaults to false; the param is silently
  ignored unless the operator explicitly enables it
- when enabled, the override URL must be http(s); other schemes are
  logged at WARN and ignored
- log applied overrides at INFO so usage is auditable

Also add the @test annotations missing from the two interrupt
regression tests so the JUnit Platform actually runs them
(test count goes 11 -> 15).
…tion

HTTP-date has 1-second resolution per RFC 7231, so formatting `now + 2s`
loses up to ~999ms when the server happens to format late within a second,
causing the elapsed wait to fall below the 1.5s threshold (CI saw 1313ms).
Round the future timestamp up to the next whole second and bump the target
delta to ~3s so the assertion stays above its lower bound.
… Retry-After

Three follow-up gaps in the ApiExtractor robustness PR:

- HC4 RequestConfig#connectionRequestTimeout was not set, leaving the
  default of -1 (wait forever). With maxConnectionsPerRoute=10, slow
  upstream responses pinning the pool would silently block all later
  callers. Defaults to 5000ms, matching the connect timeout, with a
  setter for tuning.

- The previous fix buffered the entire request body in a byte[] sized
  by maxRequestSize (default 100 MiB) before each call. Concurrent
  extractors could pin many such arrays. Switch to
  DeferredFileOutputStream so small bodies stay in memory while larger
  ones spill to a temp file, and clean the file up via
  FileUtil.deleteInBackground after the retry loop.

- A hostile or misbehaving server could return Retry-After: 86400 to
  stall the extractor thread for 24 hours. Cap the effective Retry-After
  at maxRetryAfterMs (default 60 seconds) inside computeBackoff so the
  parsed value remains accurate but the actual sleep is bounded.

Tests:
- test_connectionRequestTimeout_failsFastWhenPoolExhausted: a slow
  background caller occupies the only pooled connection; a second
  caller must surface ExtractException in well under the slow-handler
  duration.
- test_requestSpoolThreshold_spillsToFileAndUploadsSuccessfully: forces
  every body to spill by setting threshold below the payload size, then
  verifies the server received the exact uploaded bytes.
- test_retryAfterClampedToMaxRetryAfter: server responds with
  Retry-After: 86400; with maxRetryAfterMs=150ms, the retry must
  complete promptly instead of stalling.
…n, and lifecycle

- Tighten isAllowedOverrideScheme: reject control characters, userinfo,
  opaque URIs, and parse with new URI() so URISyntaxException is caught.
- Collapse three near-duplicate IOException catches into one; do not retry
  on non-transient errors (UnknownHostException, SSLHandshakeException,
  SSLPeerUnverifiedException).
- Preserve cause on retry-exhausted ExtractException so the originating
  RetryableStatusException is recoverable from logs.
- Snapshot httpClient in executeOnce and surface a clear ExtractException
  when getText is called after destroy().
- Set ConnectionManagerShared(true) so destroy() owns the manager's
  lifecycle; narrow the manager close catch from Exception to RuntimeException.
- Null-defensive finally guards for accessTimeoutTarget / accessTimeoutTask.
- Validate numeric setters (max retries / sizes / connections / backoffs)
  with IllegalArgumentException.
- Log silent 200-with-empty-entity at debug; key=value formatting for new
  log statements; remove dead consumeQuietly method.
- Strengthen test_retryOn429_respectsRetryAfter and the disabled-override
  test to actually prove the behavior under test (override server hit count).
- Add tests for: control-character/userinfo/opaque URI rejection, HTTPS
  scheme acceptance, retry exhaustion with cause, UnknownHost non-retry,
  post-destroy guard, destroy idempotency, setter validation, empty 200
  entity, Retry-After garbage/negative fallback, and Shift_JIS round-trip.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant