Skip to content

[Codex][Codex Auth] Handle stale managed auth refresh races before relogin#15127

Closed
ccy-oai wants to merge 3 commits intomainfrom
ccy/codex-auth-refresh-resilience-phase1
Closed

[Codex][Codex Auth] Handle stale managed auth refresh races before relogin#15127
ccy-oai wants to merge 3 commits intomainfrom
ccy/codex-auth-refresh-resilience-phase1

Conversation

@ccy-oai
Copy link
Copy Markdown
Collaborator

@ccy-oai ccy-oai commented Mar 19, 2026

CODEX-5156

Summary
  • Managed ChatGPT auth now reloads compatible local auth before spending a stale cached refresh token on the normal auth() path.
  • A refresh_token_reused failure now does one final guarded reload before forcing relogin.
  • Guarded reload stays pinned to compatible account_id and chatgpt_user_id when an account boundary is known.
  • Legacy auth without account_id keeps the old compatibility path instead of being stranded on stale credentials.
Rationale (from spec findings)
  • The 401 incident spec captured a concrete stale-refresh split-brain cohort: one local client advanced auth while another local session failed with refresh_token_reused.
  • The old proactive stale path went straight to the token authority and could spend an already-rotated refresh lineage.
  • Same-store local auth advancement should not force relogin when newer compatible auth is already available on disk.
  • Recovery still has to reject cross-user adoption when another local login changed the user identity inside the same workspace/account.
Scope
  • Changes codex-rs/login/src/auth/manager.rs, codex-rs/login/src/auth/auth_tests.rs, and the two cloud-requirements expectations in codex-rs/cloud-requirements/src/lib.rs.
  • Leaves storage/schema changes, cross-process locking, app-hosted chatgptAuthTokens, and cross-device auth divergence unchanged.
Trade-offs
  • Recovery is limited to same-store local auth changes; it does not solve cases where the newer auth exists only on another machine or another store.
  • When cached auth already has account_id, reload adoption now requires compatible account_id and chatgpt_user_id instead of accepting the account boundary alone.
  • Unauthorized recovery now evaluates guarded reloads against the auth snapshot that triggered the 401, not mutable cache state.
Client follow-up
  • Add observability for retained refresh_token_reused volume after the same-store race fix.
  • Tackle login/bootstrap failures such as token_exchange_failed separately.
  • Re-evaluate whether any later local refresh serialization is still needed after this resilience pass.
Testing
  • cargo test -p codex-login stale_proactive_refresh_uses_newer_local_auth_before_spending_cached_refresh_token -- --nocapture
  • cargo test -p codex-login stale_proactive_refresh_does_not_adopt_a_different_user_in_same_account -- --nocapture
  • cargo test -p codex-login stale_proactive_refresh_without_account_id_adopts_auth_with_account_id -- --nocapture
  • cargo test -p codex-login stale_proactive_refresh_without_identity_claims_still_attempts_refresh -- --nocapture
  • cargo test -p codex-login unauthorized_recovery_reload_uses_original_auth_snapshot_for_same_account_user_guard -- --nocapture
  • cargo test -p codex-login refresh_token_reused_recovers_if_guarded_reload_finds_newer_auth -- --nocapture
  • cargo test -p codex-login refresh_token_reused_without_account_id_adopts_auth_when_disk_auth_gains_account_id -- --nocapture
  • cargo test -p codex-login --no-run --message-format short
  • cargo test -p codex-cloud-requirements fetch_cloud_requirements_uses_reloaded_auth_before_unauthorized_retry -- --nocapture
  • cargo test -p codex-cloud-requirements fetch_cloud_requirements_rejects_different_user_in_same_account -- --nocapture
  • cargo test -p codex-cloud-requirements --no-run --message-format short
  • just fmt
  • Manual built-CLI validation in isolated CODEX_HOME=/tmp/codex-auth-resilience-smoke using /Users/ccy/code/codex/codex-rs/target/debug/codex.
  • Manual case 1: started a long-lived session, rewrote auth.json stale, restored newer same-user auth on disk, then prompted again in the same session; it succeeded without relogin.
  • Manual case 2: repeated the flow with cached auth missing tokens.account_id, then restored current same-user auth on disk; the same session still succeeded without relogin.

Comment thread codex-rs/core/src/auth.rs Outdated
Comment thread codex-rs/core/src/auth.rs Outdated
Comment thread codex-rs/core/src/auth.rs Outdated
Comment thread codex-rs/login/src/auth/manager.rs
Comment thread codex-rs/login/src/auth/manager.rs
Comment thread codex-rs/login/src/auth/manager.rs
Copy link
Copy Markdown
Contributor

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

let cached_before_reload = self.auth_cached();
let auth_changed =
!Self::auths_equal_for_refresh(cached_before_reload.as_ref(), new_auth.as_ref());

P1 Badge Base reload change detection on caller's cached auth

reload_if_account_id_matches computes auth_changed against self.auth_cached() instead of the caller snapshot passed into reload_for_refresh. If another task already updated the cache, stale callers can get ReloadedNoChange; then refresh_if_stale uses the old refresh token and may hit avoidable refresh_token_reused relogin failures.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@ccy-oai ccy-oai force-pushed the ccy/codex-auth-refresh-resilience-phase1 branch from 2cdf1e9 to d3a092a Compare March 19, 2026 23:59
Port the stale managed auth reload and unauthorized recovery guards into
codex-login after auth ownership moved out of codex-core.

Keep guarded reload pinned to compatible account and user identity, preserve
legacy no-account-id compatibility, and update cloud requirements expectations
for the new recovery contract.

Co-authored-by: Codex <noreply@openai.com>
@ccy-oai ccy-oai force-pushed the ccy/codex-auth-refresh-resilience-phase1 branch from d3a092a to ab557f6 Compare March 20, 2026 18:34
Deduplicate the repeated never-policy requirements fixture and auth recovery
error string in cloud requirements tests without changing the exercised
contract.

Co-authored-by: Codex <noreply@openai.com>
@ccy-oai ccy-oai requested a review from aibrahim-oai March 20, 2026 19:11
Factor the repeated proactive auth fixture setup into a small helper and
keep the remaining auth manager tests focused on the linear contract
surface.

Co-authored-by: Codex <noreply@openai.com>
@ccy-oai ccy-oai changed the title Handle stale managed auth refresh races before relogin [Codex][Codex Auth] Handle stale managed auth refresh races before relogin Mar 20, 2026
Copy link
Copy Markdown
Collaborator Author

ccy-oai commented Mar 23, 2026

Closing this out in favor of #15357, which landed the main proactive stale-auth fix on current main.

This branch still contains additional auth-recovery/policy logic, but I don’t think we should stack that broader surface on top of the merged fix. If any of the remaining non-overlap still looks worthwhile, I’ll cut a much smaller follow-up from current main.

@ccy-oai ccy-oai closed this Mar 23, 2026
@ccy-oai ccy-oai deleted the ccy/codex-auth-refresh-resilience-phase1 branch March 23, 2026 21:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants