feat: Support multiple keys per environment#675
Draft
keelerm84 wants to merge 25 commits into
Draft
Conversation
Member
keelerm84
commented
May 28, 2026
- feat: Add wire format types for concurrent SDK and mobile keys
- feat: Add manual config for additional SDK and mobile keys
- feat: Add concurrent SDK key support to credential rotator
- feat: Add concurrent mobile key support to credential rotator
- feat: Wire env context to additional SDK and mobile keys
- feat: Wire autoconfig and filedata to additional and mobile-grace keys
- fix: Keep upstream client alive for rotation-predecessor SDK keys
- feat: Surface additional SDK and mobile keys on the status endpoint
- test: Cover autoconfig flow for additional SDK keys
- docs: Document additional SDK and mobile keys in configuration
- fix: Handle promoting an additional key to primary
- fix: Trim whitespace around additional keys in manual config
- feat: Validate that additional keys require a primary key
Extend SDKKeyRep with an optional Additional list of concurrent SDK keys, each with optional per-key expiry. Introduce MobileKeyRep parallel to SDKKeyRep (value, expiring, additional) and add it as a new top-level field on EnvironmentRep; the legacy MobKey string is retained so older relays keep functioning. EnvironmentParams gains parallel fields: AdditionalSDKKeys plus ExpiringAdditionalSDKKeys for SDK keys, AdditionalMobileKeys plus ExpiringAdditionalMobileKeys and an ExpiringMobileKey grace slot for mobile keys. ToParams splits the wire format's Additional slice into active vs expiring sets based on whether ExpiresAt is present, and prefers the new MobileKey shape over MobKey when both are sent. No behavior change yet; downstream consumers continue to read only the existing fields.
Extend EnvConfig with AdditionalSDKKeys and AdditionalMobileKeys, both ct.OptStringList -- consistent with the existing AllowedOrigin and AllowedHeader fields, supporting INI list syntax (repeated lines), comma-separated env vars (LD_ADDITIONAL_SDK_KEYS_<env>, LD_ADDITIONAL_MOBILE_KEYS_<env>), and programmatic configuration. Validation rejects empty entries, duplicates of the corresponding primary key, and intra-list duplicates. Error messages avoid including the key value to keep credentials out of logs. Wire MakeEnvironmentConfig to copy the active additional-key sets from EnvironmentParams into the new EnvConfig fields. Expiring additional keys are not copied into EnvConfig; they flow directly from EnvironmentParams to the rotator, the same pattern used for ExpiringSDKKey today.
Introduce SetAdditionalSDKKeys for synchronizing the set of concurrent SDK keys against an environment. Keys without ExpiresAt are tracked as active in additionalSdkKeys; keys with ExpiresAt go into a new expiringAdditionalSdkKeys map that is separate from the existing deprecatedSdkKeys (which is reserved for the rotation predecessor of primarySdkKey). The diff logic compares the previous and incoming sets: brand-new entries are queued as additions; transitions between active and expiring leave the key mapped without re-queuing; updated ExpiresAt timestamps are accepted as authoritative; omitted keys are queued as expirations immediately, with no grace period. StepTime now also walks expiringAdditionalSdkKeys to fire timed expirations, and DeprecatedCredentials reports keys from both maps. ActiveSDKKeys returns the primary plus the additional set, and primaryCredentials includes the additional set so existing EnvironmentLookup registration paths pick them up automatically. The primary SDK key is filtered out of the additional set defensively. Mobile-key parity follows in a later commit.
Mirror the SDK concurrent-key work for mobile keys. The rotator now tracks additionalMobileKeys, expiringAdditionalMobileKeys, and deprecatedMobileKeys -- parallel to the corresponding SDK maps -- with SetAdditionalMobileKeys following the same diff semantics as SetAdditionalSDKKeys. Introduce MobileGracePeriod and RotateMobileKeyWithGrace to give the primary mobile key a rotation grace period, which the rotator did not previously support. The existing updateMobileKey path (immediate revoke of the predecessor) is preserved via shared swapPrimaryMobileKey helper. StepTime, primaryCredentials, and deprecatedCredentials are extended to walk the new mobile maps so existing EnvironmentLookup registration paths and credential reporting continue to surface everything uniformly across credential kinds.
Add SetAdditionalSDKKeys and SetAdditionalMobileKeys to the EnvContext interface and implementation. Each forwards to the rotator's diff logic and then drains the resulting additions / expirations through triggerCredentialChanges, which dispatches addCredential / removeCredential the same way UpdateCredential does today. Gate the SDK-client and event-forwarder lifecycle in addCredential on identity with the primary key. Only the primary SDK key opens an upstream LD stream and drives the metrics / event dispatcher swaps; only the primary mobile key drives the event dispatcher. Additional keys are registered for inbound authentication (via envStreams, handlers, and connectionMapper) but never spawn an upstream client. Introduce Rotator.SeedAdditionalSDKKeys and SeedAdditionalMobileKeys for initial construction. These populate the active sets without queuing additions, so the existing AllCredentials-driven stream and lookup initialization in NewEnvContext picks them up as part of the initial wave rather than as runtime additions. NewEnvContext seeds the active additional sets from EnvConfig.AdditionalSDKKeys / AdditionalMobileKeys. Expiring additional keys do not flow through EnvConfig; the caller passes them later via the new setter methods (wired in a follow-up commit).
In relay/autoconfig_actions.go and relay/filedata_actions.go, after handling the primary SDK key rotation, also handle the primary mobile key's optional grace period and synchronize the additional SDK and mobile key sets. Generalize CredentialUpdate.deprecated from config.SDKKey to credential.SDKCredential so the same update API can carry a mobile-key grace period. The envContextImpl.UpdateCredential implementation dispatches on the primary credential's concrete type, routing SDK keys through Rotator.RotateWithGrace and mobile keys through the new RotateMobileKeyWithGrace. A small applyExpiringPrimaries helper factors out the add-mid-rotation pattern shared between autoconfig and filedata AddEnvironment paths so they both honor an inbound rotation grace for either credential kind on the very first patch.
The credential gating introduced when wiring concurrent keys into the env context inadvertently dropped the upstream SDK client for the rotation-grace predecessor. The original code started a client unconditionally for every SDK key passed through addCredential, which is what kept the predecessor's upstream stream alive during the grace window. Restore that behavior by querying the rotator's new IsSDKKeyRotationPredecessor helper: the SDK-client lifecycle now fires for the primary OR a rotation predecessor, while additional SDK keys remain auth-only. The mobile-key analog uses the symmetric IsMobileKeyRotationPredecessor helper for the event dispatcher, keeping behavior consistent for the new mobile-key grace flow without disturbing the additional-mobile-key path.
Extend api.EnvironmentStatusRep with AdditionalSDKKeys and AdditionalMobileKeys slices (both omitempty) alongside the existing scalar SDKKey and MobileKey fields. Existing consumers that parse the scalar primary fields keep working; new consumers can read the additional sets to see the full set of credentials accepted for an environment. To distinguish the primary credentials from concurrent additionals at the status endpoint, add PrimarySDKKey and PrimaryMobileKey accessors to the EnvContext interface that delegate to the rotator. The status handler now partitions GetCredentials into primary and additional buckets and masks each entry with the existing ObscureKey helper.
Add two focused autoconfig action tests: - Initial put with additional SDK keys: all keys authenticate to the same env, only one SDK client is spawned for the primary. - Patch that omits a previously-listed additional key: revokes it immediately; primary continues to work. The testAutoConfEnv helper grows a mobileKey field so future mobile-key tests can populate the new MobileKeyRep wire shape without reaching past the helper.
Add rows for additionalSdkKeys / additionalMobileKeys and the matching LD_ADDITIONAL_SDK_KEYS_<env> / LD_ADDITIONAL_MOBILE_KEYS_<env> environment variables to the per-environment configuration table. Include a worked example showing how to express the list in both the INI file (repeated lines) and env-var (comma-separated) forms, and call out the validation rules and the primary-only upstream behavior.
When an additional SDK or mobile key is rotated into the primary slot, the rotator now drops it from the additional / expiring- additional sets as part of the swap, so the same key is never tracked in two places. The promoted key still flows through the additions queue so the env context can start its upstream SDK client (additional keys never had one); downstream registration operations on an already-known credential -- envStreams, handlers, lookup mapping -- are idempotent. To make that idempotency hold, EnvStreams.AddCredential now checks for an existing entry before appending a new streamInfo. An existing TestAddCredential typo was masking this: the test declared two SDK keys with the same string value, which happened to pass before because both assertions checked the same value; updating the test to use a distinct second key and a duplicate re-add explicitly exercises the new idempotent path. Two rotator tests cover the promote scenario directly: promoting an active additional and promoting an expiring additional. Both verify the additional sets are cleared, the new primary is reported correctly, and the old primary still flows through the normal rotation-grace expiration path.
ct.OptStringList splits comma-separated values verbatim, so a value like "k1, k2" produces ["k1", " k2"] with a leading space on the second entry. SDK and mobile keys are matched byte-for- byte at auth time, so an untrimmed entry never authenticates. ValidateConfig now canonicalizes EnvConfig.AdditionalSDKKeys and AdditionalMobileKeys by trimming each entry up front, so all downstream consumers (validation + runtime conversion) see the same values. The toSDKKeys / toMobileKeys helpers in envContextImpl also trim defensively in case the env config arrives constructed programmatically without going through ValidateConfig. Test fixture for additional keys now includes whitespace in the input env vars and expects trimmed values in the parsed config.
Reject configurations that declare additional SDK or mobile keys without a corresponding primary key. The previous validation only flagged the missing primary SDK key (via the existing required- field check) and let the missing primary mobile key through silently -- the relay would start with auth-only mobile keys and no primary mobile key, which is logically inconsistent. A new errAdditionalKeysWithoutPrimary error fires for both credential kinds when the additional list is non-empty and the primary slot is undefined. For the SDK-key case the user will see both the existing "SDK key is required" message and the new one; they reinforce each other and the new one points to the specific reason.
…ey updates
SetAdditional{SDK,Mobile}Keys previously only filtered out the
current primary. If the platform sent the same key in both the
Expiring slot (a rotation predecessor) and the Additional list,
the rotator would track the key in two places -- once in
deprecatedSdkKeys with its rotation expiry and again in
expiringAdditionalSdkKeys or additionalSdkKeys. StepTime would
walk both maps, so the key's effective expiry became the minimum
of the two timestamps, and a subsequent Additional-list patch
omitting the key would queue an immediate revocation that fights
the rotation grace timer.
The diff logic now skips any key already present in
deprecatedSdkKeys / deprecatedMobileKeys and logs a warning, so
the rotation flow stays in sole charge of those keys' lifecycle.
Silent suppression of the primary key in the additional list is
also now a warning. Honoring the primary's expiry via the
additional path would leave the env with no upstream connection,
so the right place to express primary-key expiry is still the
Expiring slot of SDKKeyRep / MobileKeyRep; the warning surfaces
the malformed payload instead of dropping it on the floor.
addCredential's mobile-key branch gated dispatcher.ReplaceCredential on isPrimary || IsMobileKeyRotationPredecessor. Because ReplaceCredential is replace-semantics, draining the additions queue in order [newPrimary, predecessor] left the dispatcher holding the predecessor's mobile key as its outbound auth. Once the predecessor expired upstream, all mobile-event forwarding for the env silently returned 401 -- with no recovery short of an env restart. Mirror the SDK-key branch and gate the mobile dispatcher's ReplaceCredential on isPrimary only. The predecessor still flows through the rest of addCredential (envStreams, handlers, connectionMapper) so inbound auth for the deprecated key keeps working until the grace expiry, but it no longer overwrites the dispatcher. The regression test TestEventDispatcherMobileAuthIsPrimaryAfterRotationGrace exercises the stale-deprecation flow that the autoconfig applyExpiringPrimaries call produces at env construction and confirms the outbound Authorization header is the new primary.
updateSDKKey and RotateMobileKeyWithGrace wrote grace.key into the deprecated map without removing it from additionalSdkKeys / expiringAdditionalSdkKeys (and the mobile equivalents) when the key was previously tracked as an additional. This left the same key tracked in two maps, causing two PROVEN bugs: 1. A subsequent SetAdditional patch that omitted the key entered the cleanup loop, found the key in additionalSdkKeys, and queued an immediate revocation -- closing the upstream client and unmapping the key while the rotation grace still wanted it alive. Direct contradiction of the rotation-grace contract. 2. After the rotation grace expired and removeCredential ran, the entry in additionalSdkKeys / additionalMobileKeys was never cleaned up. ActiveSDKKeys / ActiveMobileKeys kept reporting the key forever, but the lookup map had unmapped it -- a phantom that subsequent SetAdditional calls would silently skip (treated as already-known). Now both updateSDKKey and RotateMobileKeyWithGrace delete grace.key from the additional maps before writing it to the deprecated map. The new tests TestSetAdditionalOmissionDoesNotRevokeRotationPredecessor, TestRotationPredecessorClearedFromAdditionalSet, and the mobile analogs exercise both reproductions from the adversarial review.
updateSDKKey and RotateMobileKeyWithGrace called swapPrimaryKey / swapPrimaryMobileKey first and then unconditionally wrote deprecatedSdkKeys[grace.key] = grace.expiry. When the grace.key matched the current primary -- e.g., a malformed patch that grants a grace period to the active key -- the primary ended up in the deprecated map AND was queued through addCredential a second time. addCredential's startSDKClient call then overwrote c.clients[P] with a fresh client, leaking the running upstream stream. Add a self-rotation defense in both updateSDKKey and RotateMobileKeyWithGrace: if grace.key == current primary, log a warning and return without modifying the deprecated maps or queueing additions. The mobile branch has the same issue (no upstream client to leak, but the dispatcher and deprecated map would still drift). Two regression tests assert nothing is queued and the primary stays out of the deprecated set.
… /status The deprecation loop in statusHandler only handled config.SDKKey and overwrote the scalar status.ExpiringSDKKey each iteration. That dropped every mobile key returned by GetDeprecatedCredentials silently and collapsed multiple expiring SDK keys (now possible when additionals carry their own ExpiresAt) into a single non-deterministic entry. api.EnvironmentStatusRep gains ExpiringMobileKey (scalar, mirror of ExpiringSDKKey) plus ExpiringSDKKeys / ExpiringMobileKeys arrays for the full set. The deprecation loop now partitions by kind, populates the singular field on first sight, and appends to the array for every entry. All values are masked the same way as the primary credentials.
The duplicate-check loop released the lock before Register, and each subsequent Register held the lock only for the append. Two concurrent callers with the same credential could both pass the check and both Register, leaving duplicate streamInfo entries. Not reachable today because envContextImpl.mu serializes the in-tree callers, but the public docstring advertises idempotency against concurrent callers and a future direct caller would trip the race. Hold es.lock for the whole method via defer. Add a regression test that fires 16 goroutines at the same credential and asserts exactly one stream registration results. Run with -race to confirm no data races introduced.
EnvironmentRep.ToParams switched the mobile primary to r.MobileKey.Value whenever r.MobileKey was non-nil, even if .Value was the empty key. A partially-populated MobileKey struct (e.g., during a platform rollout where the new field is present but not yet filled in) would silently clobber the legacy MobKey field that older serializers still populated. Guard the precedence with MobileKey.Value.Defined() so the fallback to MobKey still takes effect when the new field arrives empty. Expiring and Additional fields on MobileKeyRep remain processed even when Value is empty -- they're semantically independent.
statusHandler called PrimarySDKKey, PrimaryMobileKey, GetCredentials, and GetDeprecatedCredentials in sequence, each acquiring r.mu separately. A rotation interleaved between calls let a key briefly appear as additional when it had just been promoted to primary (or the converse), so /status output was non-deterministic during rotation windows. Add Rotator.Snapshot() returning a CredentialSnapshot struct populated under a single RLock, plus an EnvContext.CredentialSnapshot delegate. The status handler now derives the entire response partitioning from one snapshot, so no concurrent rotation can split a logical credential set across multiple lock acquisitions.
MakeEnvironmentConfig copied only the active additional-key sets into EnvConfig; the expiring map arrived later via SetAdditional after addEnvironment had already called InsertEnvironment. That left a brief window where requests authenticated with an expiring additional key returned 401 because the lookup hadn't mapped it yet. Carry the initial expiring-additional maps through to the rotator at construction: - New Rotator.SeedExpiringAdditionalSDKKeys / Mobile equivalents populate the expiring maps without queuing additions. - EnvContextImplParams gains InitialExpiringAdditionalSDKKeys and InitialExpiringAdditionalMobileKeys; NewEnvContext seeds them alongside the active sets. - addEnvironment accepts an addEnvironmentOptions struct so the autoconfig and filedata actions can pass these through without expanding the positional signature for every existing caller. - InsertEnvironment now also iterates GetDeprecatedCredentials so the expiring entries land in the lookup as part of the initial registration. Same guard handles the rare case of an env that arrives mid-rotation with a primary-key Expiring slot already populated. The new TestAutoConfigInitWithExpiringAdditionalSDKKey exercises the full autoconfig flow: a put event with an expiring extra SDK key, then a lookup assertion that the key authenticates immediately after env construction.
addCredential launched startSDKClient on a goroutine and released c.mu before c.clients[sdkKey] was populated. If removeCredential ran in the gap, it saw c.clients[sdkKey] == nil and skipped the Close() call. The goroutine then wrote the client into c.clients, where nothing cleaned it up until env shutdown -- an orphaned upstream streaming connection per occurrence. Track in-flight removals in a new pendingClientRemoval map: - addCredential clears the flag before launching the spawn so a re-added key still gets its client. - removeCredential sets the flag if c.clients[sdkKey] is nil at removal time (the client hasn't arrived yet). - startSDKClient checks the flag after acquiring c.mu and, if set, immediately closes the produced client and returns rather than publishing it into c.clients. The race is pre-existing but became easier to trigger with the rotation-predecessor SDK client lifecycle the rest of the SDK-2415 work added. Regression test uses a gated factory to deterministically interleave addCredential and removeCredential.
splitAdditionalSDKKeys / splitAdditionalMobileKeys took the wire format at face value: a duplicate entry produced two map writes (non-deterministic last-wins for expiring) or a doubled active slice entry. If the same key showed up once as active and once as expiring, it landed in both the active slice and the expiring map -- the rotator would then catch the conflict in SetAdditionalSDKKeys, but only by accident. Dedupe at the wire-format boundary: expiring wins over active (stronger signal), and within a kind the first occurrence wins. The platform shouldn't send duplicates, but defensive normalization keeps the rotator's invariants intact under malformed payloads.
Both helpers read the current primary under r.mu.RLock, released the lock, then re-acquired r.mu.Lock to write. The read-then-write pattern was prone to TOCTOU drift: another goroutine could mutate the primary between the early-return check and the eventual write. Today's call paths serialize, so the bug isn't reachable, but the implementation was inconsistent with updateSDKKey (which does the no-op check inside swapPrimaryKey under the write lock) and would break the moment a new caller appeared. Move the no-op check inside the write lock for both helpers -- updateMobileKey now delegates entirely to swapPrimaryMobileKey (which already short-circuits identical-primary cases), and updateEnvironmentID compares against r.primaryEnvironmentID directly under the lock instead of going through the RLock'd accessor.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.