Skip to content

App Config - Startup retry#47857

Open
mrm9084 wants to merge 8 commits intoAzure:mainfrom
mrm9084:StartupRetry
Open

App Config - Startup retry#47857
mrm9084 wants to merge 8 commits intoAzure:mainfrom
mrm9084:StartupRetry

Conversation

@mrm9084
Copy link
Member

@mrm9084 mrm9084 commented Jan 29, 2026

Description

Adds retry to startup. When all replicas fail there will be an attempt to retry the failed store for a period of time. By default 100s, minimal 30s, maximum 600s.

Also, refactors the load method to be split into a number of helper method to make this readable.

All SDK Contribution checklist:

  • The pull request does not introduce [breaking changes]
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which have an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

Copilot AI review requested due to automatic review settings January 29, 2026 20:04
@github-actions github-actions bot added the azure-spring All azure-spring related issues label Jan 29, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request adds a startup retry mechanism to Azure App Configuration for Java to handle transient failures during application startup. When all replicas fail to load configuration, the provider will automatically retry with exponential backoff until a configurable timeout expires.

Changes:

  • Added startup-timeout configuration property (default: 100s, min: 30s, max: 600s) to control retry duration during startup
  • Refactored AzureAppConfigDataLoader.load() into smaller helper methods (loadConfiguration, attemptLoadFromClients, setupMonitoringState, handleReplicaFailure) for improved readability
  • Implemented retry loop with intelligent backoff that waits until the next client becomes available before retrying

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
README.md Added documentation for new startup-timeout configuration option
CHANGELOG.md Documented the new startup retry feature
AppConfigurationProperties.java Added startupTimeout field with default value and validation (30-600 seconds)
AzureAppConfigDataResource.java Added startupTimeout parameter to constructor and getter method
AzureAppConfigDataLocationResolver.java Passed startupTimeout from properties to resources
AzureAppConfigDataLoader.java Refactored load method and implemented retry logic with backoff for startup failures
ConnectionManager.java Added getMillisUntilNextClientAvailable() to calculate wait time until next replica is available
AppConfigurationReplicaClientFactory.java Added wrapper method to expose getMillisUntilNextClientAvailable
ConfigStore.java Minor code quality improvements (variable naming, isEmpty() usage)
ConnectionManagerTest.java Added comprehensive tests for getMillisUntilNextClientAvailable method
AzureAppConfigDataResourceTest.java Updated test constructor calls to include startupTimeout parameter
AzureAppConfigDataLoaderTest.java Added tests for startup retry behavior and refresh non-retry behavior

mrm9084 and others added 3 commits January 29, 2026 14:18
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 7 comments.

Comment on lines +137 to +142
if (startupTimeout == null) {
throw new IllegalArgumentException("startupTimeout cannot be null.");
}
if (startupTimeout.getSeconds() < 30 || startupTimeout.getSeconds() > 600) {
throw new IllegalArgumentException("startupTimeout must be between 30 and 600 seconds.");
}
Copy link

Copilot AI Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are no tests verifying the startupTimeout validation logic that was added. Consider adding tests to verify that: 1) null startupTimeout throws IllegalArgumentException, 2) values below 30 seconds throw IllegalArgumentException, 3) values above 600 seconds throw IllegalArgumentException, and 4) values within the valid range (30-600) are accepted.

Copilot uses AI. Check for mistakes.
Comment on lines +192 to 215
if (Instant.now().isBefore(deadline)) {
long elapsedSeconds = Instant.now().getEpochSecond() - startTime.getEpochSecond();
Long backoffSeconds = getBackoffDuration(elapsedSeconds);

// If backoff is null, elapsed time exceeds fixed intervals - use exponential backoff
if (backoffSeconds == null) {
postFixedWindowAttempts++;
// Convert nanoseconds to seconds
backoffSeconds = BackoffTimeCalculator.calculateBackoff(postFixedWindowAttempts) / 1_000_000_000L;
}

// Don't wait longer than remaining time until deadline
long remainingSeconds = deadline.getEpochSecond() - Instant.now().getEpochSecond();
long waitSeconds = Math.min(backoffSeconds, remainingSeconds);

if (waitSeconds > 0) {
logger.debug("All replicas in backoff for store: " + resource.getEndpoint()
+ ". Waiting " + waitSeconds + "s before retry (elapsed: " + elapsedSeconds + "s).");
try {
Thread.sleep(waitSeconds * 1000);
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
return lastException;
}
Copy link

Copilot AI Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The condition at line 192 is redundant because it's already guaranteed by the while loop condition at line 181. The code will only reach line 192 if Instant.now().isBefore(deadline) is true. Consider removing this redundant check to simplify the code.

Suggested change
if (Instant.now().isBefore(deadline)) {
long elapsedSeconds = Instant.now().getEpochSecond() - startTime.getEpochSecond();
Long backoffSeconds = getBackoffDuration(elapsedSeconds);
// If backoff is null, elapsed time exceeds fixed intervals - use exponential backoff
if (backoffSeconds == null) {
postFixedWindowAttempts++;
// Convert nanoseconds to seconds
backoffSeconds = BackoffTimeCalculator.calculateBackoff(postFixedWindowAttempts) / 1_000_000_000L;
}
// Don't wait longer than remaining time until deadline
long remainingSeconds = deadline.getEpochSecond() - Instant.now().getEpochSecond();
long waitSeconds = Math.min(backoffSeconds, remainingSeconds);
if (waitSeconds > 0) {
logger.debug("All replicas in backoff for store: " + resource.getEndpoint()
+ ". Waiting " + waitSeconds + "s before retry (elapsed: " + elapsedSeconds + "s).");
try {
Thread.sleep(waitSeconds * 1000);
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
return lastException;
}
long elapsedSeconds = Instant.now().getEpochSecond() - startTime.getEpochSecond();
Long backoffSeconds = getBackoffDuration(elapsedSeconds);
// If backoff is null, elapsed time exceeds fixed intervals - use exponential backoff
if (backoffSeconds == null) {
postFixedWindowAttempts++;
// Convert nanoseconds to seconds
backoffSeconds = BackoffTimeCalculator.calculateBackoff(postFixedWindowAttempts) / 1_000_000_000L;
}
// Don't wait longer than remaining time until deadline
long remainingSeconds = deadline.getEpochSecond() - Instant.now().getEpochSecond();
long waitSeconds = Math.min(backoffSeconds, remainingSeconds);
if (waitSeconds > 0) {
logger.debug("All replicas in backoff for store: " + resource.getEndpoint()
+ ". Waiting " + waitSeconds + "s before retry (elapsed: " + elapsedSeconds + "s).");
try {
Thread.sleep(waitSeconds * 1000);
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
return lastException;

Copilot uses AI. Check for mistakes.
}

// Don't wait longer than remaining time until deadline
long remainingSeconds = deadline.getEpochSecond() - Instant.now().getEpochSecond();
Copy link

Copilot AI Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At line 204, remainingSeconds could potentially be negative if there's a delay between the while condition check at line 181 and reaching line 204. While unlikely in practice, this could result in negative values being passed to Math.min() at line 205, which would then result in a negative waitSeconds. The check at line 207 prevents sleeping with negative values, but it would be clearer to use Math.max(0, deadline.getEpochSecond() - Instant.now().getEpochSecond()) to ensure remainingSeconds is never negative.

Suggested change
long remainingSeconds = deadline.getEpochSecond() - Instant.now().getEpochSecond();
long remainingSeconds = Math.max(0L, deadline.getEpochSecond() - Instant.now().getEpochSecond());

Copilot uses AI. Check for mistakes.
Comment on lines +192 to +195
// Create a second client mock for the successful retry
AppConfigurationReplicaClient secondClientMock = Mockito.mock(AppConfigurationReplicaClient.class);
lenient().when(secondClientMock.getEndpoint()).thenReturn(ENDPOINT);

Copy link

Copilot AI Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable secondClientMock is created but never used. It can be removed to clean up the test.

Suggested change
// Create a second client mock for the successful retry
AppConfigurationReplicaClient secondClientMock = Mockito.mock(AppConfigurationReplicaClient.class);
lenient().when(secondClientMock.getEndpoint()).thenReturn(ENDPOINT);

Copilot uses AI. Check for mistakes.
Comment on lines +251 to +280
long getMillisUntilNextClientAvailable() {
Instant now = Instant.now();
Instant earliestAvailable = Instant.MAX;

// Check configured clients
if (clients != null) {
for (AppConfigurationReplicaClient client : clients) {
Instant backoffEnd = client.getBackoffEndTime();
if (!backoffEnd.isAfter(now)) {
return 0; // Client available now
}
if (backoffEnd.isBefore(earliestAvailable)) {
earliestAvailable = backoffEnd;
}
}
}

// Check auto-failover clients
for (AppConfigurationReplicaClient client : autoFailoverClients.values()) {
Instant backoffEnd = client.getBackoffEndTime();
if (!backoffEnd.isAfter(now)) {
return 0; // Client available now
}
if (backoffEnd.isBefore(earliestAvailable)) {
earliestAvailable = backoffEnd;
}
}

return earliestAvailable.toEpochMilli() - now.toEpochMilli();
}
Copy link

Copilot AI Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method getMillisUntilNextClientAvailable is not used anywhere in the production code. It's only called in tests. This suggests that either the method should be removed as dead code, or the retry logic in AzureAppConfigDataLoader should be using this method to determine optimal wait times instead of fixed backoff intervals. Consider removing this method or integrating it into the actual retry logic.

Copilot uses AI. Check for mistakes.
Comment on lines +111 to 120
* Gets the duration in milliseconds until the next client becomes available for the specified store.
*
* @param originEndpoint the origin configuration store endpoint
* @return duration in milliseconds until next client is available, or 0 if one is available now
*/
long getMillisUntilNextClientAvailable(String originEndpoint) {
return CONNECTIONS.get(originEndpoint).getMillisUntilNextClientAvailable();
}

/**
Copy link

Copilot AI Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method getMillisUntilNextClientAvailable is not used anywhere in the production code. It's only called in tests. This suggests that either the method should be removed as dead code, or it should be integrated into the retry logic. Consider removing this method or using it in the startup retry implementation.

Suggested change
* Gets the duration in milliseconds until the next client becomes available for the specified store.
*
* @param originEndpoint the origin configuration store endpoint
* @return duration in milliseconds until next client is available, or 0 if one is available now
*/
long getMillisUntilNextClientAvailable(String originEndpoint) {
return CONNECTIONS.get(originEndpoint).getMillisUntilNextClientAvailable();
}
/**

Copilot uses AI. Check for mistakes.
Comment on lines +89 to +93
private static final int[][] STARTUP_BACKOFF_INTERVALS = {
{100, 5}, // 0-100 seconds elapsed: 5 second backoff
{200, 10}, // 100-200 seconds elapsed: 10 second backoff
{600, 30} // 200-600 seconds elapsed: 30 second backoff
};
Copy link

Copilot AI Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fixed backoff intervals extend to 600 seconds, but the comment at line 90 says "0-100 seconds elapsed". This threshold value (100) should match the first interval threshold and appears correct. However, the last interval at line 92 covers "200-600 seconds elapsed" which seems inconsistent with the default and maximum timeout of 100-600 seconds. Since the default timeout is 100 seconds and the minimum is 30 seconds, many users will never reach the higher backoff intervals defined here. Consider whether these intervals align with the expected timeout ranges.

Copilot uses AI. Check for mistakes.
@rujche rujche added the azure-spring-app-configuration Spring app configuration related issues. label Feb 3, 2026
@rujche rujche moved this from Todo to In Progress in Spring Cloud Azure Feb 3, 2026
@rujche rujche added this to the 2026-02 milestone Feb 3, 2026
@rujche rujche modified the milestones: 2026-02, 2026-03 Feb 9, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.

Comment on lines +259 to +271
try {
sourceList.addAll(createSettings(currentClient));
List<WatchedConfigurationSettings> featureFlags = createFeatureFlags(currentClient);

AppConfigurationStoreMonitoring monitoring = resource.getMonitoring();

storeState.setStateFeatureFlag(resource.getEndpoint(), featureFlags,
monitoring.getFeatureFlagRefreshInterval());

if (monitoring.isEnabled()) {
setupMonitoringState(currentClient, monitoring);
}

Copy link

Copilot AI Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In attemptLoadFromClients, sourceList.addAll(createSettings(...)) happens before feature flags/monitoring are fully loaded. If a replica fails after settings were added (e.g., exception in createFeatureFlags or setupMonitoringState), the method proceeds to the next replica without rolling back, potentially mixing property sources from multiple replicas in a single load attempt. Consider building into a temporary list and only merging into sourceList once the full load for a replica succeeds (or clearing/rolling back to the previous size on failure).

Copilot uses AI. Check for mistakes.
Comment on lines +192 to +199
// Create a second client mock for the successful retry
AppConfigurationReplicaClient secondClientMock = Mockito.mock(AppConfigurationReplicaClient.class);
lenient().when(secondClientMock.getEndpoint()).thenReturn(ENDPOINT);

// Setup mocks:
// - First getNextActiveClient(true) returns clientMock which will throw
// - First getNextActiveClient(false) returns null (no more replicas in first attempt)
// - Second getNextActiveClient(true) returns null (simulating success path)
Copy link

Copilot AI Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

secondClientMock is created but never used in this test, which adds noise and can confuse the intended scenario. Please remove it or incorporate it into the stubbing/verification to reflect the "retry then succeed" path being tested.

Suggested change
// Create a second client mock for the successful retry
AppConfigurationReplicaClient secondClientMock = Mockito.mock(AppConfigurationReplicaClient.class);
lenient().when(secondClientMock.getEndpoint()).thenReturn(ENDPOINT);
// Setup mocks:
// - First getNextActiveClient(true) returns clientMock which will throw
// - First getNextActiveClient(false) returns null (no more replicas in first attempt)
// - Second getNextActiveClient(true) returns null (simulating success path)
// Setup mocks to simulate a retry after initial client failure:
// - First getNextActiveClient(true) returns clientMock which will throw
// - First getNextActiveClient(false) returns null (no more replicas in first attempt)
// - Second getNextActiveClient(true) returns null (simulating success path with no more failures)

Copilot uses AI. Check for mistakes.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.

Comment on lines +240 to +249
* Attempts to load configuration from available clients.
*
* @param sourceList the list to populate with property sources
* @return the exception if all clients failed, null on success
*/
private Exception attemptLoadFromClients(List<EnumerablePropertySource<?>> sourceList) {
boolean reloadFailed = false;
Exception lastException = null;
AppConfigurationReplicaClient client = replicaClientFactory.getNextActiveClient(resource.getEndpoint(), true);

Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When getNextActiveClient returns null on line 248, attemptLoadFromClients returns null (no exception), which is treated as success in lines 187-188. However, getNextActiveClient returns null in two scenarios: (1) no clients are configured, and (2) clients exist but all are currently in backoff. During startup retry, case (2) should trigger continued retries with backoff until clients become available or timeout expires, but the current implementation treats it as immediate success, breaking out of the retry loop. This means startup might succeed with empty configuration when all replicas are temporarily unavailable. Consider having attemptLoadFromClients return a specific exception (e.g., NoAvailableClientsException) when getNextActiveClient returns null to distinguish "no attempt made" from "attempt succeeded", allowing the retry logic to work as intended.

Suggested change
* Attempts to load configuration from available clients.
*
* @param sourceList the list to populate with property sources
* @return the exception if all clients failed, null on success
*/
private Exception attemptLoadFromClients(List<EnumerablePropertySource<?>> sourceList) {
boolean reloadFailed = false;
Exception lastException = null;
AppConfigurationReplicaClient client = replicaClientFactory.getNextActiveClient(resource.getEndpoint(), true);
* Exception indicating that no App Configuration clients were available to load configuration.
* <p>
* This is used internally to distinguish between a successful configuration load and the case
* where no attempt could be made because there were no active clients (either none configured
* or all currently in backoff).
*/
private static final class NoAvailableClientsException extends Exception {
private static final long serialVersionUID = 1L;
NoAvailableClientsException(String message) {
super(message);
}
}
/**
* Attempts to load configuration from available clients.
*
* @param sourceList the list to populate with property sources
* @return the exception if all clients failed or no clients were available, {@code null} on success
*/
private Exception attemptLoadFromClients(List<EnumerablePropertySource<?>> sourceList) {
boolean reloadFailed = false;
Exception lastException = null;
AppConfigurationReplicaClient client = replicaClientFactory.getNextActiveClient(resource.getEndpoint(), true);
if (client == null) {
// No active clients are currently available (either none configured or all in backoff).
// Signal to the caller that no attempt was made so that startup retry logic can continue.
return new NoAvailableClientsException(
"No active App Configuration clients available for endpoint: " + resource.getEndpoint());
}

Copilot uses AI. Check for mistakes.
Comment on lines +137 to +139
if (startupTimeout == null) {
throw new IllegalArgumentException("startupTimeout cannot be null.");
}
Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The validation throws IllegalArgumentException when startupTimeout is null, but the property has a default value of Duration.ofSeconds(100) defined at line 42. This means the null check will never be true unless someone explicitly calls setStartupTimeout(null). Consider removing this check since it's redundant with the default value, or document why explicit null setting should be prevented.

Copilot uses AI. Check for mistakes.
Comment on lines +137 to +142
if (startupTimeout == null) {
throw new IllegalArgumentException("startupTimeout cannot be null.");
}
if (startupTimeout.getSeconds() < 30 || startupTimeout.getSeconds() > 600) {
throw new IllegalArgumentException("startupTimeout must be between 30 and 600 seconds.");
}
Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing test coverage for the new startupTimeout validation logic added in lines 137-142. Consider adding tests to verify: (1) validation accepts valid timeout values (e.g., 30s, 100s, 600s), (2) validation rejects timeout less than 30 seconds, (3) validation rejects timeout greater than 600 seconds, and (4) validation handles null timeout (if that's an important edge case despite the default value). This follows the existing test pattern seen for refreshInterval validation in minValidWatchTime test at line 105.

Copilot uses AI. Check for mistakes.
logger.debug("All replicas in backoff for store: " + resource.getEndpoint()
+ ". Waiting " + waitSeconds + "s before retry (elapsed: " + elapsedSeconds + "s).");
try {
Thread.sleep(waitSeconds * 1000);
Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The multiplication waitSeconds * 1000 could potentially overflow for very large waitSeconds values (greater than Long.MAX_VALUE / 1000, approximately 9.2 million seconds). While the current logic caps waitSeconds based on the deadline and backoff intervals making overflow unlikely in practice, consider using Math.multiplyExact or adding a guard check to ensure safe multiplication and provide clearer error handling if overflow occurs.

Suggested change
Thread.sleep(waitSeconds * 1000);
long waitMillis;
try {
waitMillis = Math.multiplyExact(waitSeconds, 1000L);
} catch (ArithmeticException ex) {
// In the unlikely event of overflow, cap to the maximum sleep duration.
waitMillis = Long.MAX_VALUE;
}
Thread.sleep(waitMillis);

Copilot uses AI. Check for mistakes.
Comment on lines +181 to +193
while (Instant.now().isBefore(deadline)) {
// Ensure we do not retain partial results from previous failed attempts
sourceList.clear();
replicaClientFactory.findActiveClients(resource.getEndpoint());
lastException = attemptLoadFromClients(sourceList);

if (lastException == null) {
return null; // Success
}

// Reverse in order to add Profile specific properties earlier, and last profile comes first
try {
sourceList.addAll(createSettings(currentClient));
List<WatchedConfigurationSettings> featureFlags = createFeatureFlags(currentClient);

logger.debug("PropertySource context.");
AppConfigurationStoreMonitoring monitoring = resource.getMonitoring();

storeState.setStateFeatureFlag(resource.getEndpoint(), featureFlags,
monitoring.getFeatureFlagRefreshInterval());

if (monitoring.isEnabled()) {
// Check if refreshAll is enabled - if so, use watched configuration settings
if (monitoring.getTriggers().size() == 0) {
// Use watched configuration settings for refresh
List<WatchedConfigurationSettings> watchedConfigurationSettingsList = getWatchedConfigurationSettings(
currentClient);
storeState.setState(resource.getEndpoint(), Collections.emptyList(),
watchedConfigurationSettingsList, monitoring.getRefreshInterval());
} else {
// Use traditional watch key monitoring
List<ConfigurationSetting> watchKeysSettings = monitoring.getTriggers().stream()
.map(trigger -> currentClient.getWatchKey(trigger.getKey(), trigger.getLabel(),
requestContext))
.toList();

storeState.setState(resource.getEndpoint(), watchKeysSettings,
monitoring.getRefreshInterval());
}
// All clients failed, use fixed backoff based on elapsed time
if (Instant.now().isBefore(deadline)) {
long elapsedSeconds = Instant.now().getEpochSecond() - startTime.getEpochSecond();
Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The calculation Instant.now().getEpochSecond() - startTime.getEpochSecond() may not accurately reflect elapsed time during the retry loop because it's called multiple times at different points in the loop (lines 193, 204, and within the while condition on line 181). Each call to Instant.now() captures a slightly different time, which could lead to minor inconsistencies. Consider capturing Instant.now() once at the beginning of each iteration and reusing that value for all time-related calculations within that iteration to ensure consistency.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

azure-spring All azure-spring related issues azure-spring-app-configuration Spring app configuration related issues.

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

2 participants