OAuth refresh race condition when multiple eca server processes coexist (Anthropic Max, likely others) #462

@agzam

Describe the bug

When multiple eca server processes run on the same machine (e.g. one per workspace in a multi-project workflow), they race during OAuth access-token refresh against https://console.anthropic.com/v1/oauth/token. The losing process receives:

Anthropic refresh token failed: "{\"error\": \"invalid_grant\", \"error_description\": \"Refresh token not found or invalid\"}"
Error: Auth token renew failed

After this fires, the affected session cannot recover automatically. The user has to /login again, which destroys the active chat from the user's perspective.

To Reproduce

  1. Authenticate ECA with Anthropic Max OAuth (/login -> max).
  2. Start a second eca server process for a different workspace (e.g. open ECA in another project, or use emacs --daemon and spawn ECA from a second workspace).
  3. Let both processes idle ~1 hour until access tokens approach expiry.
  4. Send a prompt in both near-simultaneously.
  5. Observe: one prompt succeeds and rotates the refresh token; the other fails with Anthropic refresh token failed: invalid_grant.

The more processes are running, the more likely a collision becomes. With three or more processes I see failures daily.

Expected behavior

Token refresh should be safe under concurrent processes. The losers of the race should detect the rotation that already happened, adopt the new tokens from disk, and continue without surfacing an error.

Additional context

Root cause

All eca server processes share ~/.cache/eca/db.transit.json for OAuth state (refresh-token, access-token, expires-at). On startup each process reads this file into its in-memory db* atom. There is no file lock around the refresh flow:

  • src/eca/llm_providers/anthropic.clj oauth-refresh (around line 548) calls POST /v1/oauth/token with the in-memory refresh-token.
  • On success it swaps new tokens into db* and writes to disk via db/update-global-cache!.

When access tokens expire (~1 hour TTL), multiple processes detect expiry near-simultaneously and each call oauth-refresh with the same refresh-token they loaded at startup. Anthropic rotates the refresh token on every successful refresh; the first call invalidates the old token server-side and subsequent calls within the same window receive invalid_grant. The losers' in-memory and on-disk refresh-token becomes permanently invalid until manual /login.
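The rotation behaviour described above can be reproduced without any network access. The following is a minimal sketch, assuming a purely in-memory stand-in for the token endpoint; `make-server` and `refresh!` are hypothetical names for illustration and are not part of ECA or of Anthropic's API.

```clojure
;; Hypothetical stand-in for the token endpoint: it accepts exactly one
;; refresh token at a time and rotates it on every successful refresh.
(defn make-server [initial-token]
  (atom {:valid initial-token :n 0}))

(defn refresh!
  "Returns {:refresh-token new-token} on success,
   {:error \"invalid_grant\"} when the presented token is stale."
  [server presented-token]
  (let [[old new] (swap-vals! server
                              (fn [{:keys [valid n] :as state}]
                                (if (= valid presented-token)
                                  ;; Rotate: the presented token dies here.
                                  {:valid (str "rt-" (inc n)) :n (inc n)}
                                  state)))]
    (if (= (:n old) (:n new))
      {:error "invalid_grant"}
      {:refresh-token (:valid new)})))

;; Two processes that both loaded "rt-0" at startup:
(let [server (make-server "rt-0")
      a      (refresh! server "rt-0")   ; process A wins and rotates
      b      (refresh! server "rt-0")]  ; process B presents the stale token
  [a b])
;; => [{:refresh-token "rt-1"} {:error "invalid_grant"}]
```

Process B has no way to learn about "rt-1" unless it re-reads the shared cache file, which is the gap the proposed fix addresses.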

Likely the same root cause as an earlier report

editor-code-assistant/eca-emacs#177 (closed) included a user report from @snoopier: "Get a lot of 401 today and it's absolutely random". The repo owner replied "I noticed anthropic and other models are throwing this randomly. I believe we can have a way to configure in ECA a match for status-code and body to consider as retry". The fix added retryRules config.

retryRules helps for transient 401s (the access token is stale; retry triggers a fresh refresh). It does not recover from invalid_grant on the refresh endpoint itself, because the refresh-token is permanently invalid. The "absolutely random" pattern matches what you would expect from a refresh race.

Diagnostic data from my setup

$ ps -eo pid,etime,command | grep "[e]ca server"
99996  1-23:55  /opt/homebrew/bin/eca server
97576  2-00:35  /opt/homebrew/bin/eca server
10027    04:35  /opt/homebrew/bin/eca server
29094       59  /opt/homebrew/bin/eca server
29434       55  /opt/homebrew/bin/eca server

5 concurrent eca server processes against a single shared ~/.cache/eca/db.transit.json. Failure observed daily. Error message verbatim:

Anthropic refresh token failed: "{\"error\": \"invalid_grant\", \"error_description\": \"Refresh token not found or invalid\"}"

Anthropic auth in db.transit.json:

{"anthropic" {:step :login/done
              :mode :max
              :type :auth/oauth
              :refresh-token "sk-ant-ort01-..."
              :api-key "sk-ant-oat01-..."
              :expires-at <epoch>}}

Proposed fix

Wrap oauth-refresh (and the equivalent in openai.clj and oauth.clj's refresh-token!) with a file-lock plus re-read pattern:

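(require '[clojure.java.io :as io])

;; FileLock is held per JVM, not per thread, so threads inside one process
;; are serialized on this monitor before taking the cross-process file lock.
(defonce ^:private refresh-monitor (Object.))

(defn ^:private with-token-refresh-lock [cache-file f]
  (let [lock-file (io/file (str cache-file ".lock"))]
    (io/make-parents lock-file)
    (locking refresh-monitor
      (with-open [raf (java.io.RandomAccessFile. lock-file "rw")
                  ch  (.getChannel raf)
                  _lk (.lock ch)] ; blocks until the exclusive lock is granted
        (f)))))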

;; Inside the refresh path, after acquiring the lock:
;;   1. Re-read db.transit.json from disk
;;   2. If on-disk refresh-token differs from in-memory: another process
;;      refreshed; adopt disk values, skip HTTP, return success
;;   3. Otherwise: call HTTP refresh, write new tokens, return

java.nio.channels.FileLock works on the JVM, and GraalVM native image also supports it. The lock is exclusive and held only for the duration of the refresh call (typically sub-second), so contention should be negligible.
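Putting the lock and the three re-read steps together might look like the sketch below. It is an illustration under stated assumptions, not ECA's actual code: `with-file-lock` and `refresh-under-lock!` are hypothetical names, and `read-disk`, `http-refresh!` and `write-disk!` are caller-supplied stand-ins for reading db.transit.json, calling the token endpoint, and persisting the result.

```clojure
(require '[clojure.java.io :as io])

(defn with-file-lock
  "Runs f while holding an exclusive OS-level lock on lock-path."
  [lock-path f]
  (let [lock-file (io/file lock-path)]
    (io/make-parents lock-file)
    (with-open [raf (java.io.RandomAccessFile. lock-file "rw")
                ch  (.getChannel raf)
                _lk (.lock ch)] ; blocks until the lock is granted
      (f))))

(defn refresh-under-lock!
  "in-memory-token is the refresh token this process currently holds.
   deps holds three hypothetical functions (assumptions, not ECA APIs):
     :read-disk     (fn [])       -> tokens map from the shared cache
     :http-refresh! (fn [token])  -> tokens map from the token endpoint
     :write-disk!   (fn [tokens]) -> persists the tokens map"
  [lock-path in-memory-token {:keys [read-disk http-refresh! write-disk!]}]
  (with-file-lock lock-path
    (fn []
      (let [on-disk    (read-disk)
            disk-token (:refresh-token on-disk)]
        (if (and disk-token (not= disk-token in-memory-token))
          ;; Another process already rotated the token: adopt its result
          ;; and skip the HTTP call entirely.
          on-disk
          ;; We hold the current token: refresh and publish the new one
          ;; before releasing the lock.
          (let [fresh (http-refresh! in-memory-token)]
            (write-disk! fresh)
            fresh))))))
```

A caller would pass its cached refresh-token and swap the returned map into db*. A fuller version would probably also check the adopted tokens' expires-at, since an on-disk token that differs is not guaranteed to be unexpired.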

Simpler alternative (lower bar): on invalid_grant response, re-read db.transit.json once and retry with the disk-fresh refresh token before surfacing the error. This does not prevent the race but auto-recovers from it.
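A rough sketch of that fallback, assuming the refresh call returns a map with an :error key on failure. The function name `refresh-with-reread` and the `read-disk` / `http-refresh!` arguments are hypothetical stand-ins, not existing ECA functions.

```clojure
(defn refresh-with-reread
  "Tries a refresh with the in-memory token. On invalid_grant, re-reads the
   shared cache once and retries with the on-disk token if it is newer.
   deps holds two hypothetical functions (assumptions, not ECA APIs):
     :read-disk     (fn [])      -> tokens map from the shared cache
     :http-refresh! (fn [token]) -> tokens map, or {:error \"invalid_grant\"}"
  [in-memory-token {:keys [read-disk http-refresh!]}]
  (let [result (http-refresh! in-memory-token)]
    (if (not= "invalid_grant" (:error result))
      result
      (let [disk-token (:refresh-token (read-disk))]
        (if (and disk-token (not= disk-token in-memory-token))
          ;; Disk holds a newer token than ours: retry exactly once with it.
          (http-refresh! disk-token)
          ;; Nothing newer on disk: the failure is genuine, surface it.
          result)))))
```

The caller still has to persist whatever the retry returns. Note that the retry rotates the token again, so the process that originally won the race will hit the same fallback on its next refresh. Adopting the on-disk access token when it has not yet expired would avoid that extra rotation.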

Affected files

  • src/eca/llm_providers/anthropic.clj (oauth-refresh, :login/renew-token step)
  • src/eca/llm_providers/openai.clj (parallel oauth-refresh)
  • src/eca/oauth.clj (refresh-token! for MCP server OAuth, same race possible)
  • src/eca/db.clj (good place for the lock wrapper)

Workaround until fixed

  • Run only one eca server process at a time, or
  • Use ANTHROPIC_API_KEY (loses Max subscription billing).

Severity

For users with multi-workspace or emacs --daemon workflows: daily auth failures, repeated browser-based re-logins, lost chat context. The issue scales with concurrent process count and is invisible to single-session users, which is likely why it has gone undiagnosed despite the symptom appearing in eca-emacs#177.
