fix(extractor): bound ApiExtractor response, pool connections, and retry transients#162
Open
marevol wants to merge 7 commits into
Open
fix(extractor): bound ApiExtractor response, pool connections, and retry transients#162marevol wants to merge 7 commits into
marevol wants to merge 7 commits into
Conversation
…try transients Threat model: admin-configured endpoint, no SSRF concern. The fix targets misbehaving or hostile responses (oversize, slow), socket exhaustion under high crawl rate, and transient network errors. - Bound API response body via BoundedInputStream against a configurable maxResponseSize (default 100 MiB); oversize responses throw ExtractException. - Reuse the HTTP client with a PoolingHttpClientConnectionManager (default 50 total / 10 per route, configurable) so successive crawls share connections. - Add bounded retry on IOException, 5xx, 408, and 429 (honoring Retry-After delta- seconds); fail fast on other 4xx and return null as before. - Default connect/socket timeouts of 5s/30s when none configured to mitigate slow loris servers. - Allow per-call URL override via the new params key extractorUrl. Tests - New tests: response size cap, 5xx retry recovery, no-retry on 4xx, 429 with Retry-After, slow-server timeout, and 10x sequential calls for connection reuse. - All existing ApiExtractorTest cases pass.
…After, add jitter, bound 5xx drain
sleepQuietly previously suppressed InterruptedException and only restored the interrupt flag, allowing the for-loop in executeWithRetries to proceed to another HTTP attempt. With AccessTimeoutTarget interrupting the worker thread once per second after accessTimeout fires, that meant a request could continue retrying past the configured timeout window. sleepQuietly now throws ExtractException on interrupt (interrupt flag still preserved for upstream observers), aborting the retry loop immediately. Adds unit tests covering both the helper itself and the loop-level bail-out.
The extractorUrl param introduced in this PR opens a small SSRF surface if a downstream caller ever lets attacker-influenced data into the params Map. Make the override opt-in: - allowExtractorUrlOverride defaults to false; the param is silently ignored unless the operator explicitly enables it - when enabled, the override URL must be http(s); other schemes are logged at WARN and ignored - log applied overrides at INFO so usage is auditable Also add the @test annotations missing from the two interrupt regression tests so the JUnit Platform actually runs them (test count goes 11 -> 15).
…tion HTTP-date has 1-second resolution per RFC 7231, so formatting `now + 2s` loses up to ~999ms when the server happens to format late within a second, causing the elapsed wait to fall below the 1.5s threshold (CI saw 1313ms). Round the future timestamp up to the next whole second and bump the target delta to ~3s so the assertion stays above its lower bound.
… Retry-After Three follow-up gaps in the ApiExtractor robustness PR: - HC4 RequestConfig#connectionRequestTimeout was not set, leaving the default of -1 (wait forever). With maxConnectionsPerRoute=10, slow upstream responses pinning the pool would silently block all later callers. Defaults to 5000ms, matching the connect timeout, with a setter for tuning. - The previous fix buffered the entire request body in a byte[] sized by maxRequestSize (default 100 MiB) before each call. Concurrent extractors could pin many such arrays. Switch to DeferredFileOutputStream so small bodies stay in memory while larger ones spill to a temp file, and clean the file up via FileUtil.deleteInBackground after the retry loop. - A hostile or misbehaving server could return Retry-After: 86400 to stall the extractor thread for 24 hours. Cap the effective Retry-After at maxRetryAfterMs (default 60 seconds) inside computeBackoff so the parsed value remains accurate but the actual sleep is bounded. Tests: - test_connectionRequestTimeout_failsFastWhenPoolExhausted: a slow background caller occupies the only pooled connection; a second caller must surface ExtractException in well under the slow-handler duration. - test_requestSpoolThreshold_spillsToFileAndUploadsSuccessfully: forces every body to spill by setting threshold below the payload size, then verifies the server received the exact uploaded bytes. - test_retryAfterClampedToMaxRetryAfter: server responds with Retry-After: 86400; with maxRetryAfterMs=150ms, the retry must complete promptly instead of stalling.
…n, and lifecycle - Tighten isAllowedOverrideScheme: reject control characters, userinfo, opaque URIs, and parse with new URI() so URISyntaxException is caught. - Collapse three near-duplicate IOException catches into one; do not retry on non-transient errors (UnknownHostException, SSLHandshakeException, SSLPeerUnverifiedException). - Preserve cause on retry-exhausted ExtractException so the originating RetryableStatusException is recoverable from logs. - Snapshot httpClient in executeOnce and surface a clear ExtractException when getText is called after destroy(). - Set ConnectionManagerShared(true) so destroy() owns the manager's lifecycle; narrow the manager close catch from Exception to RuntimeException. - Null-defensive finally guards for accessTimeoutTarget / accessTimeoutTask. - Validate numeric setters (max retries / sizes / connections / backoffs) with IllegalArgumentException. - Log silent 200-with-empty-entity at debug; key=value formatting for new log statements; remove dead consumeQuietly method. - Strengthen test_retryOn429_respectsRetryAfter and the disabled-override test to actually prove the behavior under test (override server hit count). - Add tests for: control-character/userinfo/opaque URI rejection, HTTPS scheme acceptance, retry exhaustion with cause, UnknownHost non-retry, post-destroy guard, destroy idempotency, setter validation, empty 200 entity, Retry-After garbage/negative fallback, and Shift_JIS round-trip.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
BoundedInputStreamagainst a configurablemaxResponseSize(default 100 MiB).PoolingHttpClientConnectionManager(default 50 total / 10 per route, configurable).IOException, 5xx, 408, 429 (honoringRetry-After); fail fast on other 4xx.Threat model
Admin-configured endpoint (no SSRF concern). The fix targets misbehaving / hostile responses (oversize, slow), socket exhaustion under high crawl rate, and transient network errors.
Tests
ApiExtractorTestcases pass.Test plan