Skip to content

feat: Support multiple keys per environment#675

Draft
keelerm84 wants to merge 25 commits into
v8from
mk/sdk-2415/concurrent-keys
Draft

feat: Support multiple keys per environment#675
keelerm84 wants to merge 25 commits into
v8from
mk/sdk-2415/concurrent-keys

Conversation

@keelerm84
Copy link
Copy Markdown
Member

  • feat: Add wire format types for concurrent SDK and mobile keys
  • feat: Add manual config for additional SDK and mobile keys
  • feat: Add concurrent SDK key support to credential rotator
  • feat: Add concurrent mobile key support to credential rotator
  • feat: Wire env context to additional SDK and mobile keys
  • feat: Wire autoconfig and filedata to additional and mobile-grace keys
  • fix: Keep upstream client alive for rotation-predecessor SDK keys
  • feat: Surface additional SDK and mobile keys on the status endpoint
  • test: Cover autoconfig flow for additional SDK keys
  • docs: Document additional SDK and mobile keys in configuration
  • fix: Handle promoting an additional key to primary
  • fix: Trim whitespace around additional keys in manual config
  • feat: Validate that additional keys require a primary key

keelerm84 added 14 commits May 27, 2026 12:49
Extend SDKKeyRep with an optional Additional list of concurrent SDK
keys, each with optional per-key expiry. Introduce MobileKeyRep
parallel to SDKKeyRep (value, expiring, additional) and add it as a
new top-level field on EnvironmentRep; the legacy MobKey string is
retained so older relays keep functioning.

EnvironmentParams gains parallel fields: AdditionalSDKKeys plus
ExpiringAdditionalSDKKeys for SDK keys, AdditionalMobileKeys plus
ExpiringAdditionalMobileKeys and an ExpiringMobileKey grace slot
for mobile keys. ToParams splits the wire format's Additional slice
into active vs expiring sets based on whether ExpiresAt is present,
and prefers the new MobileKey shape over MobKey when both are sent.

No behavior change yet; downstream consumers continue to read only
the existing fields.
Extend EnvConfig with AdditionalSDKKeys and AdditionalMobileKeys,
both ct.OptStringList -- consistent with the existing AllowedOrigin
and AllowedHeader fields, supporting INI list syntax (repeated
lines), comma-separated env vars (LD_ADDITIONAL_SDK_KEYS_<env>,
LD_ADDITIONAL_MOBILE_KEYS_<env>), and programmatic configuration.

Validation rejects empty entries, duplicates of the corresponding
primary key, and intra-list duplicates. Error messages avoid
including the key value to keep credentials out of logs.

Wire MakeEnvironmentConfig to copy the active additional-key sets
from EnvironmentParams into the new EnvConfig fields. Expiring
additional keys are not copied into EnvConfig; they flow directly
from EnvironmentParams to the rotator, the same pattern used for
ExpiringSDKKey today.
Introduce SetAdditionalSDKKeys for synchronizing the set of
concurrent SDK keys against an environment. Keys without
ExpiresAt are tracked as active in additionalSdkKeys; keys with
ExpiresAt go into a new expiringAdditionalSdkKeys map that is
separate from the existing deprecatedSdkKeys (which is reserved
for the rotation predecessor of primarySdkKey).

The diff logic compares the previous and incoming sets:
brand-new entries are queued as additions; transitions between
active and expiring leave the key mapped without re-queuing;
updated ExpiresAt timestamps are accepted as authoritative;
omitted keys are queued as expirations immediately, with no
grace period.

StepTime now also walks expiringAdditionalSdkKeys to fire timed
expirations, and DeprecatedCredentials reports keys from both
maps. ActiveSDKKeys returns the primary plus the additional set,
and primaryCredentials includes the additional set so existing
EnvironmentLookup registration paths pick them up automatically.

The primary SDK key is filtered out of the additional set
defensively. Mobile-key parity follows in a later commit.
Mirror the SDK concurrent-key work for mobile keys. The rotator
now tracks additionalMobileKeys, expiringAdditionalMobileKeys,
and deprecatedMobileKeys -- parallel to the corresponding SDK
maps -- with SetAdditionalMobileKeys following the same diff
semantics as SetAdditionalSDKKeys.

Introduce MobileGracePeriod and RotateMobileKeyWithGrace to give
the primary mobile key a rotation grace period, which the rotator
did not previously support. The existing updateMobileKey path
(immediate revoke of the predecessor) is preserved via shared
swapPrimaryMobileKey helper.

StepTime, primaryCredentials, and deprecatedCredentials are
extended to walk the new mobile maps so existing EnvironmentLookup
registration paths and credential reporting continue to surface
everything uniformly across credential kinds.
Add SetAdditionalSDKKeys and SetAdditionalMobileKeys to the
EnvContext interface and implementation. Each forwards to the
rotator's diff logic and then drains the resulting additions /
expirations through triggerCredentialChanges, which dispatches
addCredential / removeCredential the same way UpdateCredential
does today.

Gate the SDK-client and event-forwarder lifecycle in
addCredential on identity with the primary key. Only the primary
SDK key opens an upstream LD stream and drives the metrics /
event dispatcher swaps; only the primary mobile key drives the
event dispatcher. Additional keys are registered for inbound
authentication (via envStreams, handlers, and connectionMapper)
but never spawn an upstream client.

Introduce Rotator.SeedAdditionalSDKKeys and
SeedAdditionalMobileKeys for initial construction. These
populate the active sets without queuing additions, so the
existing AllCredentials-driven stream and lookup initialization
in NewEnvContext picks them up as part of the initial wave
rather than as runtime additions.

NewEnvContext seeds the active additional sets from
EnvConfig.AdditionalSDKKeys / AdditionalMobileKeys. Expiring
additional keys do not flow through EnvConfig; the caller passes
them later via the new setter methods (wired in a follow-up
commit).
In relay/autoconfig_actions.go and relay/filedata_actions.go,
after handling the primary SDK key rotation, also handle the
primary mobile key's optional grace period and synchronize the
additional SDK and mobile key sets.

Generalize CredentialUpdate.deprecated from config.SDKKey to
credential.SDKCredential so the same update API can carry a
mobile-key grace period. The envContextImpl.UpdateCredential
implementation dispatches on the primary credential's concrete
type, routing SDK keys through Rotator.RotateWithGrace and
mobile keys through the new RotateMobileKeyWithGrace.

A small applyExpiringPrimaries helper factors out the
add-mid-rotation pattern shared between autoconfig and filedata
AddEnvironment paths so they both honor an inbound rotation
grace for either credential kind on the very first patch.
The credential gating introduced when wiring concurrent keys
into the env context inadvertently dropped the upstream SDK
client for the rotation-grace predecessor. The original code
started a client unconditionally for every SDK key passed
through addCredential, which is what kept the predecessor's
upstream stream alive during the grace window.

Restore that behavior by querying the rotator's new
IsSDKKeyRotationPredecessor helper: the SDK-client lifecycle
now fires for the primary OR a rotation predecessor, while
additional SDK keys remain auth-only.

The mobile-key analog uses the symmetric
IsMobileKeyRotationPredecessor helper for the event dispatcher,
keeping behavior consistent for the new mobile-key grace flow
without disturbing the additional-mobile-key path.
Extend api.EnvironmentStatusRep with AdditionalSDKKeys and
AdditionalMobileKeys slices (both omitempty) alongside the
existing scalar SDKKey and MobileKey fields. Existing consumers
that parse the scalar primary fields keep working; new consumers
can read the additional sets to see the full set of credentials
accepted for an environment.

To distinguish the primary credentials from concurrent
additionals at the status endpoint, add PrimarySDKKey and
PrimaryMobileKey accessors to the EnvContext interface that
delegate to the rotator. The status handler now partitions
GetCredentials into primary and additional buckets and masks
each entry with the existing ObscureKey helper.
Add two focused autoconfig action tests:
- Initial put with additional SDK keys: all keys authenticate
  to the same env, only one SDK client is spawned for the
  primary.
- Patch that omits a previously-listed additional key: revokes
  it immediately; primary continues to work.

The testAutoConfEnv helper grows a mobileKey field so future
mobile-key tests can populate the new MobileKeyRep wire shape
without reaching past the helper.
Add rows for additionalSdkKeys / additionalMobileKeys and the
matching LD_ADDITIONAL_SDK_KEYS_<env> / LD_ADDITIONAL_MOBILE_KEYS_<env>
environment variables to the per-environment configuration table.
Include a worked example showing how to express the list in both
the INI file (repeated lines) and env-var (comma-separated) forms,
and call out the validation rules and the primary-only upstream
behavior.
When an additional SDK or mobile key is rotated into the primary
slot, the rotator now drops it from the additional / expiring-
additional sets as part of the swap, so the same key is never
tracked in two places. The promoted key still flows through the
additions queue so the env context can start its upstream SDK
client (additional keys never had one); downstream registration
operations on an already-known credential -- envStreams,
handlers, lookup mapping -- are idempotent.

To make that idempotency hold, EnvStreams.AddCredential now
checks for an existing entry before appending a new
streamInfo. An existing TestAddCredential typo was masking this:
the test declared two SDK keys with the same string value, which
happened to pass before because both assertions checked the same
value; updating the test to use a distinct second key and a
duplicate re-add explicitly exercises the new idempotent path.

Two rotator tests cover the promote scenario directly: promoting
an active additional and promoting an expiring additional. Both
verify the additional sets are cleared, the new primary is
reported correctly, and the old primary still flows through the
normal rotation-grace expiration path.
ct.OptStringList splits comma-separated values verbatim, so a
value like "k1, k2" produces ["k1", " k2"] with a leading space
on the second entry. SDK and mobile keys are matched byte-for-
byte at auth time, so an untrimmed entry never authenticates.

ValidateConfig now canonicalizes EnvConfig.AdditionalSDKKeys and
AdditionalMobileKeys by trimming each entry up front, so all
downstream consumers (validation + runtime conversion) see the
same values. The toSDKKeys / toMobileKeys helpers in
envContextImpl also trim defensively in case the env config
arrives constructed programmatically without going through
ValidateConfig.

Test fixture for additional keys now includes whitespace in the
input env vars and expects trimmed values in the parsed config.
Reject configurations that declare additional SDK or mobile keys
without a corresponding primary key. The previous validation only
flagged the missing primary SDK key (via the existing required-
field check) and let the missing primary mobile key through
silently -- the relay would start with auth-only mobile keys and
no primary mobile key, which is logically inconsistent.

A new errAdditionalKeysWithoutPrimary error fires for both
credential kinds when the additional list is non-empty and the
primary slot is undefined. For the SDK-key case the user will
see both the existing "SDK key is required" message and the new
one; they reinforce each other and the new one points to the
specific reason.
…ey updates

SetAdditional{SDK,Mobile}Keys previously only filtered out the
current primary. If the platform sent the same key in both the
Expiring slot (a rotation predecessor) and the Additional list,
the rotator would track the key in two places -- once in
deprecatedSdkKeys with its rotation expiry and again in
expiringAdditionalSdkKeys or additionalSdkKeys. StepTime would
walk both maps, so the key's effective expiry became the minimum
of the two timestamps, and a subsequent Additional-list patch
omitting the key would queue an immediate revocation that fights
the rotation grace timer.

The diff logic now skips any key already present in
deprecatedSdkKeys / deprecatedMobileKeys and logs a warning, so
the rotation flow stays in sole charge of those keys' lifecycle.

Silent suppression of the primary key in the additional list is
also now a warning. Honoring the primary's expiry via the
additional path would leave the env with no upstream connection,
so the right place to express primary-key expiry is still the
Expiring slot of SDKKeyRep / MobileKeyRep; the warning surfaces
the malformed payload instead of dropping it on the floor.
@keelerm84 keelerm84 changed the title mk/sdk 2415/concurrent keys feat: Support multiple keys per environment May 28, 2026
keelerm84 added 11 commits May 28, 2026 16:50
addCredential's mobile-key branch gated dispatcher.ReplaceCredential
on isPrimary || IsMobileKeyRotationPredecessor. Because
ReplaceCredential is replace-semantics, draining the additions
queue in order [newPrimary, predecessor] left the dispatcher
holding the predecessor's mobile key as its outbound auth. Once
the predecessor expired upstream, all mobile-event forwarding for
the env silently returned 401 -- with no recovery short of an env
restart.

Mirror the SDK-key branch and gate the mobile dispatcher's
ReplaceCredential on isPrimary only. The predecessor still flows
through the rest of addCredential (envStreams, handlers,
connectionMapper) so inbound auth for the deprecated key keeps
working until the grace expiry, but it no longer overwrites the
dispatcher.

The regression test TestEventDispatcherMobileAuthIsPrimaryAfterRotationGrace
exercises the stale-deprecation flow that the autoconfig
applyExpiringPrimaries call produces at env construction and
confirms the outbound Authorization header is the new primary.
updateSDKKey and RotateMobileKeyWithGrace wrote grace.key into the
deprecated map without removing it from additionalSdkKeys /
expiringAdditionalSdkKeys (and the mobile equivalents) when the
key was previously tracked as an additional. This left the same
key tracked in two maps, causing two PROVEN bugs:

1. A subsequent SetAdditional patch that omitted the key entered
   the cleanup loop, found the key in additionalSdkKeys, and
   queued an immediate revocation -- closing the upstream client
   and unmapping the key while the rotation grace still wanted it
   alive. Direct contradiction of the rotation-grace contract.

2. After the rotation grace expired and removeCredential ran, the
   entry in additionalSdkKeys / additionalMobileKeys was never
   cleaned up. ActiveSDKKeys / ActiveMobileKeys kept reporting
   the key forever, but the lookup map had unmapped it -- a
   phantom that subsequent SetAdditional calls would silently
   skip (treated as already-known).

Now both updateSDKKey and RotateMobileKeyWithGrace delete grace.key
from the additional maps before writing it to the deprecated map.
The new tests TestSetAdditionalOmissionDoesNotRevokeRotationPredecessor,
TestRotationPredecessorClearedFromAdditionalSet, and the mobile
analogs exercise both reproductions from the adversarial review.
updateSDKKey and RotateMobileKeyWithGrace called swapPrimaryKey /
swapPrimaryMobileKey first and then unconditionally wrote
deprecatedSdkKeys[grace.key] = grace.expiry. When the grace.key
matched the current primary -- e.g., a malformed patch that grants
a grace period to the active key -- the primary ended up in the
deprecated map AND was queued through addCredential a second time.
addCredential's startSDKClient call then overwrote c.clients[P]
with a fresh client, leaking the running upstream stream.

Add a self-rotation defense in both updateSDKKey and
RotateMobileKeyWithGrace: if grace.key == current primary, log a
warning and return without modifying the deprecated maps or
queueing additions. The mobile branch has the same issue (no
upstream client to leak, but the dispatcher and deprecated map
would still drift). Two regression tests assert nothing is queued
and the primary stays out of the deprecated set.
… /status

The deprecation loop in statusHandler only handled config.SDKKey
and overwrote the scalar status.ExpiringSDKKey each iteration.
That dropped every mobile key returned by GetDeprecatedCredentials
silently and collapsed multiple expiring SDK keys (now possible
when additionals carry their own ExpiresAt) into a single
non-deterministic entry.

api.EnvironmentStatusRep gains ExpiringMobileKey (scalar, mirror
of ExpiringSDKKey) plus ExpiringSDKKeys / ExpiringMobileKeys
arrays for the full set. The deprecation loop now partitions by
kind, populates the singular field on first sight, and appends
to the array for every entry. All values are masked the same way
as the primary credentials.
The duplicate-check loop released the lock before Register, and
each subsequent Register held the lock only for the append. Two
concurrent callers with the same credential could both pass the
check and both Register, leaving duplicate streamInfo entries.

Not reachable today because envContextImpl.mu serializes the
in-tree callers, but the public docstring advertises idempotency
against concurrent callers and a future direct caller would trip
the race.

Hold es.lock for the whole method via defer. Add a regression
test that fires 16 goroutines at the same credential and asserts
exactly one stream registration results. Run with -race to
confirm no data races introduced.
EnvironmentRep.ToParams switched the mobile primary to
r.MobileKey.Value whenever r.MobileKey was non-nil, even if
.Value was the empty key. A partially-populated MobileKey struct
(e.g., during a platform rollout where the new field is present
but not yet filled in) would silently clobber the legacy MobKey
field that older serializers still populated.

Guard the precedence with MobileKey.Value.Defined() so the
fallback to MobKey still takes effect when the new field arrives
empty. Expiring and Additional fields on MobileKeyRep remain
processed even when Value is empty -- they're semantically
independent.
statusHandler called PrimarySDKKey, PrimaryMobileKey,
GetCredentials, and GetDeprecatedCredentials in sequence, each
acquiring r.mu separately. A rotation interleaved between calls
let a key briefly appear as additional when it had just been
promoted to primary (or the converse), so /status output was
non-deterministic during rotation windows.

Add Rotator.Snapshot() returning a CredentialSnapshot struct
populated under a single RLock, plus an EnvContext.CredentialSnapshot
delegate. The status handler now derives the entire response
partitioning from one snapshot, so no concurrent rotation can
split a logical credential set across multiple lock acquisitions.
MakeEnvironmentConfig copied only the active additional-key sets
into EnvConfig; the expiring map arrived later via SetAdditional
after addEnvironment had already called InsertEnvironment. That
left a brief window where requests authenticated with an expiring
additional key returned 401 because the lookup hadn't mapped it
yet.

Carry the initial expiring-additional maps through to the rotator
at construction:

- New Rotator.SeedExpiringAdditionalSDKKeys / Mobile equivalents
  populate the expiring maps without queuing additions.
- EnvContextImplParams gains InitialExpiringAdditionalSDKKeys and
  InitialExpiringAdditionalMobileKeys; NewEnvContext seeds them
  alongside the active sets.
- addEnvironment accepts an addEnvironmentOptions struct so the
  autoconfig and filedata actions can pass these through without
  expanding the positional signature for every existing caller.
- InsertEnvironment now also iterates GetDeprecatedCredentials so
  the expiring entries land in the lookup as part of the initial
  registration. Same guard handles the rare case of an env that
  arrives mid-rotation with a primary-key Expiring slot already
  populated.

The new TestAutoConfigInitWithExpiringAdditionalSDKKey exercises
the full autoconfig flow: a put event with an expiring extra SDK
key, then a lookup assertion that the key authenticates
immediately after env construction.
addCredential launched startSDKClient on a goroutine and released
c.mu before c.clients[sdkKey] was populated. If removeCredential
ran in the gap, it saw c.clients[sdkKey] == nil and skipped the
Close() call. The goroutine then wrote the client into c.clients,
where nothing cleaned it up until env shutdown -- an orphaned
upstream streaming connection per occurrence.

Track in-flight removals in a new pendingClientRemoval map:
- addCredential clears the flag before launching the spawn so a
  re-added key still gets its client.
- removeCredential sets the flag if c.clients[sdkKey] is nil at
  removal time (the client hasn't arrived yet).
- startSDKClient checks the flag after acquiring c.mu and, if set,
  immediately closes the produced client and returns rather than
  publishing it into c.clients.

The race is pre-existing but became easier to trigger with the
rotation-predecessor SDK client lifecycle the rest of the SDK-2415
work added. Regression test uses a gated factory to deterministically
interleave addCredential and removeCredential.
splitAdditionalSDKKeys / splitAdditionalMobileKeys took the wire
format at face value: a duplicate entry produced two map writes
(non-deterministic last-wins for expiring) or a doubled active
slice entry. If the same key showed up once as active and once as
expiring, it landed in both the active slice and the expiring
map -- the rotator would then catch the conflict in
SetAdditionalSDKKeys, but only by accident.

Dedupe at the wire-format boundary: expiring wins over active
(stronger signal), and within a kind the first occurrence wins.
The platform shouldn't send duplicates, but defensive normalization
keeps the rotator's invariants intact under malformed payloads.
Both helpers read the current primary under r.mu.RLock, released
the lock, then re-acquired r.mu.Lock to write. The read-then-write
pattern was prone to TOCTOU drift: another goroutine could mutate
the primary between the early-return check and the eventual write.
Today's call paths serialize, so the bug isn't reachable, but the
implementation was inconsistent with updateSDKKey (which does the
no-op check inside swapPrimaryKey under the write lock) and would
break the moment a new caller appeared.

Move the no-op check inside the write lock for both helpers --
updateMobileKey now delegates entirely to swapPrimaryMobileKey
(which already short-circuits identical-primary cases), and
updateEnvironmentID compares against r.primaryEnvironmentID
directly under the lock instead of going through the RLock'd
accessor.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant