Skip to content

test(2275): 401-not-retried regression test [skip-runtime-e2e]#178

Merged
saurabhjain1592 merged 1 commit into
mainfrom
fix-2275-401-no-retry-test
May 20, 2026
Merged

test(2275): 401-not-retried regression test [skip-runtime-e2e]#178
saurabhjain1592 merged 1 commit into
mainfrom
fix-2275-401-no-retry-test

Conversation

@saurabhjain1592
Copy link
Copy Markdown
Member

Summary

Adds an explicit regression test asserting HTTP 401 responses on /api/v1/audit/tool-call are terminal — the SDK never retries a 401 auth failure.

Backstops issue getaxonflow/axonflow-enterprise#2275, where a customer's misconfigured deployment caused a tight 401 retry loop against community-saas — 716 × 401 in 24 hours from a single source IP (~30/hour, not human pacing).

Audit finding (no code change required)

Java SDK was already safe:

  • AxonFlow.handleErrorResponse maps HTTP 401 to AuthenticationException
  • RetryExecutor.isRetryable returns false for AuthenticationException (src/main/java/com/getaxonflow/sdk/util/RetryExecutor.java:117-146)

The existing RetryExecutorTest.shouldNotRetryOnAuthenticationException tests this at the exception-class level. This PR adds an end-to-end HTTP-status-code-level test against the specific endpoint named in #2275.

Mutation test (proves the assertion isn't tautological)

Per feedback_mutation_test_to_prove_assertion_not_tautological.md: removed AuthenticationException from the non-retryable list AND broadened the AxonFlowException retry check from >= 500 to >= 400:

-    if (e instanceof AuthenticationException
+    if (e instanceof PolicyViolationException
         || e instanceof PolicyViolationException
...
-      return statusCode >= 500 && statusCode < 600;
+      return statusCode >= 400 && statusCode < 600;

Test result with mutation: FAILED

AuditToolCallTest.shouldNotRetryOn401Issue2275:274 Expected exactly 1 requests matching the following pattern but received 3

→ confirms the verify(1, ...) assertion distinguishes "401 is terminal" vs "401 retries 3 times."

Companion work in #2275

The other 4 SDKs (Python, Go, Rust, TypeScript) get the same regression test in their respective repos. Plugins (claude/cursor/codex/openclaw) get a similar audit in follow-on PRs after user clarified storm was primarily seen on plugins.

Skip-runtime-e2e justification

Test-only addition; no new SDK feature surface and no wire-protocol change. The existing runtime-e2e/ runners stay intact.

Test plan

  • mvn test -Dtest=AuditToolCallTest#shouldNotRetryOn401Issue2275 passes locally
  • Mutation test: making AuthenticationException retryable AND broadening status-code retry to 400+ fails the test
  • CI green on this PR

Locks in that HTTP 401 responses on /api/v1/audit/tool-call are terminal
end-to-end through the WireMock HTTP path. Backstops
getaxonflow/axonflow-enterprise#2275 where a customer's misconfigured
deployment caused a tight 401 retry loop against community-saas (~30
401/hour from a single source IP).

Java SDK is already safe: AxonFlow.handleErrorResponse maps 401 to
AuthenticationException, and RetryExecutor.isRetryable returns false
for AuthenticationException
(src/main/java/com/getaxonflow/sdk/util/RetryExecutor.java:117-146).
The existing RetryExecutorTest.shouldNotRetryOnAuthenticationException
tests this at the exception-class level. This new test makes the
contract explicit at the HTTP-status-code level for the specific
endpoint named in #2275.

Mutation-tested: removing AuthenticationException from the
non-retryable list AND broadening the AxonFlowException retry check
from `>= 500` to `>= 400` makes the test fail with
`Expected exactly 1 requests matching the following pattern but
received 3`, confirming the assertion isn't tautological.

No code change; test-only.

Signed-off-by: Saurabh Jain <saurabh.jain@getaxonflow.com>
@saurabhjain1592 saurabhjain1592 merged commit 6778902 into main May 20, 2026
16 checks passed
@saurabhjain1592 saurabhjain1592 deleted the fix-2275-401-no-retry-test branch May 20, 2026 11:23
saurabhjain1592 added a commit that referenced this pull request May 20, 2026
…ME [skip-runtime-e2e] (#180)

The README's 'Retry Configuration' section advertised three RetryConfig.Builder
methods that do not exist in the public API:

  - initialDelayMs(int)            — actual Builder has initialDelay(Duration)
  - maxDelayMs(int)                — actual Builder has maxDelay(Duration)
  - retryableStatusCodes(Set<Integer>) — does not exist at all; the retry
                                         policy is hard-coded in
                                         RetryExecutor.isRetryable
                                         (5xx + 429 + connect/timeout, with
                                         401/403/PolicyViolation/Config
                                         exceptions always terminal)

The third method was the most load-bearing in the #2275 context. A user
reading the README would expect to be able to configure which statuses retry.
If they looked for the method and didn't find it, they might write a
workaround that bypassed retryExecutor entirely, reintroducing the storm
that PR #178 added regression coverage for.

Decision: delete the misleading example, not implement the methods.
Implementing retryableStatusCodes would be a feature addition (a new
configuration surface + serialized semantics + tests), not a doc fix.

This PR replaces the offending block with:
  - one paragraph documenting the ACTUAL retry contract from
    RetryExecutor.isRetryable
  - a citation back to #2275 (the regression test from PR #178 locks in
    that 401 stays terminal)
  - a list of every Builder method that ACTUALLY exists (enabled,
    maxAttempts, initialDelay, maxDelay, multiplier) with default values
  - a working example using initialDelay(Duration.ofMillis(100)) /
    maxDelay(Duration.ofSeconds(5)) so the code-block compiles against the
    real API

CHANGELOG entry under [Unreleased] notes this is a documentation-only
fix — no public-API or behavior change.

Lineage:
  - Hostile review of PR #178 (regression test for issue #2275) flagged
    this as HIGH doc-debt. PR #178 author punted.
  - This follow-up PR closes the finding.

Out of scope:
  - Adding retryableStatusCodes or any per-status configuration to the
    Builder (that's a feature).
  - Changing RetryExecutor.isRetryable behavior.
  - Version bump.

Signed-off-by: Saurabh Jain <saurabh.jain@getaxonflow.com>
saurabhjain1592 added a commit that referenced this pull request May 21, 2026
…e-e2e] (#182)

## Skip-runtime-e2e justification

CHANGELOG-only edit — pure release-notes hygiene, no code change, no
runtime contract touched.

## Summary

Cleans the v8.1.0 entry per the release-notes audit criteria:

- Folded the [Unreleased] section (org_id telemetry + retry-doc
  fix) into the v8.1.0 line so the cross-repo release line matches
  Go/Python (same version, same surface).
- Dropped "Epic #2230", "(v9.1 preflight, #2277)", and the inline
  PR-#178 / enterprise#2275 refs from the prose; moved issue
  tracking to a Tracking trailer at the bottom.
- Removed internal repo-path references (`ee/platform/...`,
  `axonflow-landing/content/privacy.html`) and replaced the privacy
  reference with the public URL.
- Dropped internal function-name implementation details
  (`apiAuthMiddleware`, `addAuthHeaders (AxonFlow.java)`,
  `RetryExecutor.isRetryable`).

No SDK code or behavior changes.

Signed-off-by: Saurabh Jain <saurabh.jain@getaxonflow.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant