Conversation
Pull request overview
This pull request adds a startup retry mechanism to Azure App Configuration for Java to handle transient failures during application startup. When all replicas fail to load configuration, the provider will automatically retry with exponential backoff until a configurable timeout expires.
Changes:
- Added `startup-timeout` configuration property (default: 100s, min: 30s, max: 600s) to control retry duration during startup
- Refactored `AzureAppConfigDataLoader.load()` into smaller helper methods (`loadConfiguration`, `attemptLoadFromClients`, `setupMonitoringState`, `handleReplicaFailure`) for improved readability
- Implemented retry loop with intelligent backoff that waits until the next client becomes available before retrying
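The retry-with-deadline flow described in these changes can be sketched as follows. This is an illustrative, self-contained sketch, not the actual `AzureAppConfigDataLoader` code: the `tryLoad` supplier stands in for a full per-replica load attempt, and the backoff constants are placeholders.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

public class StartupRetrySketch {

    // Keep retrying a load attempt with growing backoff until it succeeds
    // or the startup timeout deadline passes (mirrors the PR's retry loop).
    static boolean loadWithRetry(Supplier<Boolean> tryLoad, Duration startupTimeout, long initialBackoffSeconds) {
        Instant deadline = Instant.now().plus(startupTimeout);
        long backoffSeconds = initialBackoffSeconds;
        while (Instant.now().isBefore(deadline)) {
            if (tryLoad.get()) {
                return true; // configuration loaded successfully
            }
            // Never sleep past the deadline, and never sleep a negative amount.
            long remaining = Duration.between(Instant.now(), deadline).getSeconds();
            long wait = Math.max(0, Math.min(backoffSeconds, remaining));
            try {
                Thread.sleep(wait * 1000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false; // give up on interruption, as the PR's loop does
            }
            backoffSeconds = Math.min(backoffSeconds * 2, 30); // exponential backoff, capped
        }
        return false; // timeout expired without a successful load
    }

    public static void main(String[] args) {
        // A supplier that fails twice, then succeeds (zero backoff keeps the demo fast).
        int[] attempts = {0};
        Supplier<Boolean> flaky = () -> ++attempts[0] >= 3;
        System.out.println(loadWithRetry(flaky, Duration.ofSeconds(30), 0)); // true
    }
}
```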
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| README.md | Added documentation for new startup-timeout configuration option |
| CHANGELOG.md | Documented the new startup retry feature |
| AppConfigurationProperties.java | Added startupTimeout field with default value and validation (30-600 seconds) |
| AzureAppConfigDataResource.java | Added startupTimeout parameter to constructor and getter method |
| AzureAppConfigDataLocationResolver.java | Passed startupTimeout from properties to resources |
| AzureAppConfigDataLoader.java | Refactored load method and implemented retry logic with backoff for startup failures |
| ConnectionManager.java | Added getMillisUntilNextClientAvailable() to calculate wait time until next replica is available |
| AppConfigurationReplicaClientFactory.java | Added wrapper method to expose getMillisUntilNextClientAvailable |
| ConfigStore.java | Minor code quality improvements (variable naming, isEmpty() usage) |
| ConnectionManagerTest.java | Added comprehensive tests for getMillisUntilNextClientAvailable method |
| AzureAppConfigDataResourceTest.java | Updated test constructor calls to include startupTimeout parameter |
| AzureAppConfigDataLoaderTest.java | Added tests for startup retry behavior and refresh non-retry behavior |
Resolved review comments:
- ...ring/cloud/appconfiguration/config/implementation/properties/AppConfigurationProperties.java
- .../com/azure/spring/cloud/appconfiguration/config/implementation/AzureAppConfigDataLoader.java (outdated)
- .../com/azure/spring/cloud/appconfiguration/config/implementation/AzureAppConfigDataLoader.java
- .../com/azure/spring/cloud/appconfiguration/config/implementation/AzureAppConfigDataLoader.java (outdated)
- .../azure/spring/cloud/appconfiguration/config/implementation/AzureAppConfigDataLoaderTest.java
- ...in/java/com/azure/spring/cloud/appconfiguration/config/implementation/ConnectionManager.java
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
```java
if (startupTimeout == null) {
    throw new IllegalArgumentException("startupTimeout cannot be null.");
}
if (startupTimeout.getSeconds() < 30 || startupTimeout.getSeconds() > 600) {
    throw new IllegalArgumentException("startupTimeout must be between 30 and 600 seconds.");
}
```
There are no tests verifying the startupTimeout validation logic that was added. Consider adding tests to verify that: 1) null startupTimeout throws IllegalArgumentException, 2) values below 30 seconds throw IllegalArgumentException, 3) values above 600 seconds throw IllegalArgumentException, and 4) values within the valid range (30-600) are accepted.
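Such tests could be sketched as below. Because the real `AppConfigurationProperties` class is not shown here, the validation logic is inlined and the helper names (`validateStartupTimeout`, `rejects`) are illustrative assumptions; in the SDK these cases would be JUnit tests following the existing `minValidWatchTime` pattern.

```java
import java.time.Duration;

public class StartupTimeoutValidationSketch {

    // Mirrors the validation quoted above; the method name is illustrative.
    static void validateStartupTimeout(Duration startupTimeout) {
        if (startupTimeout == null) {
            throw new IllegalArgumentException("startupTimeout cannot be null.");
        }
        if (startupTimeout.getSeconds() < 30 || startupTimeout.getSeconds() > 600) {
            throw new IllegalArgumentException("startupTimeout must be between 30 and 600 seconds.");
        }
    }

    // True if validation rejects the value with IllegalArgumentException.
    static boolean rejects(Duration d) {
        try {
            validateStartupTimeout(d);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Null and boundary cases the review asks to cover.
        System.out.println(rejects(null));                    // true
        System.out.println(rejects(Duration.ofSeconds(29)));  // true
        System.out.println(rejects(Duration.ofSeconds(601))); // true
        System.out.println(rejects(Duration.ofSeconds(30)));  // false
        System.out.println(rejects(Duration.ofSeconds(600))); // false
    }
}
```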
```java
if (Instant.now().isBefore(deadline)) {
    long elapsedSeconds = Instant.now().getEpochSecond() - startTime.getEpochSecond();
    Long backoffSeconds = getBackoffDuration(elapsedSeconds);

    // If backoff is null, elapsed time exceeds fixed intervals - use exponential backoff
    if (backoffSeconds == null) {
        postFixedWindowAttempts++;
        // Convert nanoseconds to seconds
        backoffSeconds = BackoffTimeCalculator.calculateBackoff(postFixedWindowAttempts) / 1_000_000_000L;
    }

    // Don't wait longer than remaining time until deadline
    long remainingSeconds = deadline.getEpochSecond() - Instant.now().getEpochSecond();
    long waitSeconds = Math.min(backoffSeconds, remainingSeconds);

    if (waitSeconds > 0) {
        logger.debug("All replicas in backoff for store: " + resource.getEndpoint()
            + ". Waiting " + waitSeconds + "s before retry (elapsed: " + elapsedSeconds + "s).");
        try {
            Thread.sleep(waitSeconds * 1000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return lastException;
        }
```

The condition at line 192 is redundant because it's already guaranteed by the while loop condition at line 181. The code will only reach line 192 if Instant.now().isBefore(deadline) is true. Consider removing this redundant check to simplify the code.
Suggested change (drop the redundant outer `if` and de-indent the body):

```java
long elapsedSeconds = Instant.now().getEpochSecond() - startTime.getEpochSecond();
Long backoffSeconds = getBackoffDuration(elapsedSeconds);
// If backoff is null, elapsed time exceeds fixed intervals - use exponential backoff
if (backoffSeconds == null) {
    postFixedWindowAttempts++;
    // Convert nanoseconds to seconds
    backoffSeconds = BackoffTimeCalculator.calculateBackoff(postFixedWindowAttempts) / 1_000_000_000L;
}
// Don't wait longer than remaining time until deadline
long remainingSeconds = deadline.getEpochSecond() - Instant.now().getEpochSecond();
long waitSeconds = Math.min(backoffSeconds, remainingSeconds);
if (waitSeconds > 0) {
    logger.debug("All replicas in backoff for store: " + resource.getEndpoint()
        + ". Waiting " + waitSeconds + "s before retry (elapsed: " + elapsedSeconds + "s).");
    try {
        Thread.sleep(waitSeconds * 1000);
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return lastException;
```
```java
}

// Don't wait longer than remaining time until deadline
long remainingSeconds = deadline.getEpochSecond() - Instant.now().getEpochSecond();
```

At line 204, remainingSeconds could potentially be negative if there's a delay between the while condition check at line 181 and reaching line 204. While unlikely in practice, this could result in negative values being passed to Math.min() at line 205, which would then result in a negative waitSeconds. The check at line 207 prevents sleeping with negative values, but it would be clearer to use Math.max(0, deadline.getEpochSecond() - Instant.now().getEpochSecond()) to ensure remainingSeconds is never negative.

Suggested change:

```java
long remainingSeconds = Math.max(0L, deadline.getEpochSecond() - Instant.now().getEpochSecond());
```
```java
// Create a second client mock for the successful retry
AppConfigurationReplicaClient secondClientMock = Mockito.mock(AppConfigurationReplicaClient.class);
lenient().when(secondClientMock.getEndpoint()).thenReturn(ENDPOINT);
```

The variable secondClientMock is created but never used. It can be removed to clean up the test.

Suggested change: delete the three lines above.
```java
long getMillisUntilNextClientAvailable() {
    Instant now = Instant.now();
    Instant earliestAvailable = Instant.MAX;

    // Check configured clients
    if (clients != null) {
        for (AppConfigurationReplicaClient client : clients) {
            Instant backoffEnd = client.getBackoffEndTime();
            if (!backoffEnd.isAfter(now)) {
                return 0; // Client available now
            }
            if (backoffEnd.isBefore(earliestAvailable)) {
                earliestAvailable = backoffEnd;
            }
        }
    }

    // Check auto-failover clients
    for (AppConfigurationReplicaClient client : autoFailoverClients.values()) {
        Instant backoffEnd = client.getBackoffEndTime();
        if (!backoffEnd.isAfter(now)) {
            return 0; // Client available now
        }
        if (backoffEnd.isBefore(earliestAvailable)) {
            earliestAvailable = backoffEnd;
        }
    }

    return earliestAvailable.toEpochMilli() - now.toEpochMilli();
}
```

The method getMillisUntilNextClientAvailable is not used anywhere in the production code. It's only called in tests. This suggests that either the method should be removed as dead code, or the retry logic in AzureAppConfigDataLoader should be using this method to determine optimal wait times instead of fixed backoff intervals. Consider removing this method or integrating it into the actual retry logic.
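If the method were integrated into the retry loop, the wait computation could look roughly like the sketch below. This is an assumption about how the pieces might fit together, not code from the PR; `computeWaitMillis` is a hypothetical helper combining the planned backoff, the replica-availability signal, and the remaining deadline.

```java
public class AvailabilityAwareWaitSketch {

    // Sketch: wait the smaller of the planned backoff and the time until the
    // next replica leaves backoff, capped by the remaining time to deadline.
    // All values are milliseconds; negative inputs are clamped to zero.
    static long computeWaitMillis(long backoffMillis, long millisUntilNextClient, long remainingMillis) {
        long wait = Math.min(backoffMillis, Math.max(0, millisUntilNextClient));
        return Math.max(0, Math.min(wait, Math.max(0, remainingMillis)));
    }

    public static void main(String[] args) {
        // A replica frees up in 2s while the backoff would be 5s: wait only 2s.
        System.out.println(computeWaitMillis(5000, 2000, 60000)); // 2000
        // Deadline closer than both: cap at the remaining time.
        System.out.println(computeWaitMillis(5000, 7000, 3000));  // 3000
        // A client is available right now: no wait.
        System.out.println(computeWaitMillis(5000, 0, 60000));    // 0
    }
}
```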
```java
/**
 * Gets the duration in milliseconds until the next client becomes available for the specified store.
 *
 * @param originEndpoint the origin configuration store endpoint
 * @return duration in milliseconds until next client is available, or 0 if one is available now
 */
long getMillisUntilNextClientAvailable(String originEndpoint) {
    return CONNECTIONS.get(originEndpoint).getMillisUntilNextClientAvailable();
}
```

The method getMillisUntilNextClientAvailable is not used anywhere in the production code. It's only called in tests. This suggests that either the method should be removed as dead code, or it should be integrated into the retry logic. Consider removing this method or using it in the startup retry implementation.

Suggested change: remove the wrapper method and its Javadoc above.
```java
private static final int[][] STARTUP_BACKOFF_INTERVALS = {
    {100, 5},  // 0-100 seconds elapsed: 5 second backoff
    {200, 10}, // 100-200 seconds elapsed: 10 second backoff
    {600, 30}  // 200-600 seconds elapsed: 30 second backoff
};
```

The fixed backoff intervals extend out to 600 seconds of elapsed time, but since the default timeout is 100 seconds and the minimum is 30, many users will never reach the higher backoff intervals defined here. Consider whether these intervals align with the expected timeout ranges.
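The lookup implied by the interval table and the retry code quoted earlier could work as sketched below. The actual `getBackoffDuration` implementation is not shown in this diff, so this is an assumption reconstructed from the comments: return the backoff for the first interval whose threshold the elapsed time has not yet passed, or `null` once elapsed time exceeds all fixed intervals (at which point the exponential backoff path takes over).

```java
public class StartupBackoffSketch {

    // The fixed intervals quoted above: {elapsed-seconds upper bound, backoff seconds}.
    private static final int[][] STARTUP_BACKOFF_INTERVALS = {
        {100, 5},
        {200, 10},
        {600, 30}
    };

    // Return the fixed backoff for the elapsed time, or null past all intervals.
    static Long getBackoffDuration(long elapsedSeconds) {
        for (int[] interval : STARTUP_BACKOFF_INTERVALS) {
            if (elapsedSeconds < interval[0]) {
                return (long) interval[1];
            }
        }
        return null; // caller switches to exponential backoff
    }

    public static void main(String[] args) {
        System.out.println(getBackoffDuration(50));  // 5
        System.out.println(getBackoffDuration(150)); // 10
        System.out.println(getBackoffDuration(300)); // 30
        System.out.println(getBackoffDuration(700)); // null
    }
}
```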
```java
try {
    sourceList.addAll(createSettings(currentClient));
    List<WatchedConfigurationSettings> featureFlags = createFeatureFlags(currentClient);

    AppConfigurationStoreMonitoring monitoring = resource.getMonitoring();

    storeState.setStateFeatureFlag(resource.getEndpoint(), featureFlags,
        monitoring.getFeatureFlagRefreshInterval());

    if (monitoring.isEnabled()) {
        setupMonitoringState(currentClient, monitoring);
    }
```

In attemptLoadFromClients, sourceList.addAll(createSettings(...)) happens before feature flags/monitoring are fully loaded. If a replica fails after settings were added (e.g., exception in createFeatureFlags or setupMonitoringState), the method proceeds to the next replica without rolling back, potentially mixing property sources from multiple replicas in a single load attempt. Consider building into a temporary list and only merging into sourceList once the full load for a replica succeeds (or clearing/rolling back to the previous size on failure).
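The stage-then-merge pattern being suggested can be sketched generically as below. This is an illustration of the design, not the SDK's code: `String` stands in for the property-source type, and the `featureFlagsFail` flag stands in for an exception thrown partway through a replica's load.

```java
import java.util.ArrayList;
import java.util.List;

public class AtomicMergeSketch {

    // Stage a replica's results in a temporary list and merge into the shared
    // list only if the whole load for that replica succeeds.
    static boolean tryLoadReplica(List<String> sourceList, List<String> settings, boolean featureFlagsFail) {
        List<String> staged = new ArrayList<>(settings); // staged, not yet visible
        try {
            if (featureFlagsFail) {
                // Stands in for an exception from createFeatureFlags/setupMonitoringState.
                throw new IllegalStateException("feature flag load failed");
            }
            sourceList.addAll(staged); // merge only after the full load succeeded
            return true;
        } catch (RuntimeException e) {
            return false; // sourceList untouched; caller can try the next replica
        }
    }

    public static void main(String[] args) {
        List<String> sourceList = new ArrayList<>();
        // First replica fails after its settings were created: nothing is merged.
        System.out.println(tryLoadReplica(sourceList, List.of("a", "b"), true)); // false
        System.out.println(sourceList.size());                                   // 0
        // Second replica succeeds: its settings are merged atomically.
        System.out.println(tryLoadReplica(sourceList, List.of("c"), false));     // true
        System.out.println(sourceList.size());                                   // 1
    }
}
```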
```java
// Create a second client mock for the successful retry
AppConfigurationReplicaClient secondClientMock = Mockito.mock(AppConfigurationReplicaClient.class);
lenient().when(secondClientMock.getEndpoint()).thenReturn(ENDPOINT);

// Setup mocks:
// - First getNextActiveClient(true) returns clientMock which will throw
// - First getNextActiveClient(false) returns null (no more replicas in first attempt)
// - Second getNextActiveClient(true) returns null (simulating success path)
```

secondClientMock is created but never used in this test, which adds noise and can confuse the intended scenario. Please remove it or incorporate it into the stubbing/verification to reflect the "retry then succeed" path being tested.

Suggested change (remove the unused mock and reword the setup comments):

```java
// Setup mocks to simulate a retry after initial client failure:
// - First getNextActiveClient(true) returns clientMock which will throw
// - First getNextActiveClient(false) returns null (no more replicas in first attempt)
// - Second getNextActiveClient(true) returns null (simulating success path with no more failures)
```
```java
 * Attempts to load configuration from available clients.
 *
 * @param sourceList the list to populate with property sources
 * @return the exception if all clients failed, null on success
 */
private Exception attemptLoadFromClients(List<EnumerablePropertySource<?>> sourceList) {
    boolean reloadFailed = false;
    Exception lastException = null;
    AppConfigurationReplicaClient client = replicaClientFactory.getNextActiveClient(resource.getEndpoint(), true);
```

When getNextActiveClient returns null on line 248, attemptLoadFromClients returns null (no exception), which is treated as success in lines 187-188. However, getNextActiveClient returns null in two scenarios: (1) no clients are configured, and (2) clients exist but all are currently in backoff. During startup retry, case (2) should trigger continued retries with backoff until clients become available or timeout expires, but the current implementation treats it as immediate success, breaking out of the retry loop. This means startup might succeed with empty configuration when all replicas are temporarily unavailable. Consider having attemptLoadFromClients return a specific exception (e.g., NoAvailableClientsException) when getNextActiveClient returns null to distinguish "no attempt made" from "attempt succeeded", allowing the retry logic to work as intended.
Suggested change:

```java
/**
 * Exception indicating that no App Configuration clients were available to load configuration.
 * <p>
 * This is used internally to distinguish between a successful configuration load and the case
 * where no attempt could be made because there were no active clients (either none configured
 * or all currently in backoff).
 */
private static final class NoAvailableClientsException extends Exception {

    private static final long serialVersionUID = 1L;

    NoAvailableClientsException(String message) {
        super(message);
    }
}

/**
 * Attempts to load configuration from available clients.
 *
 * @param sourceList the list to populate with property sources
 * @return the exception if all clients failed or no clients were available, {@code null} on success
 */
private Exception attemptLoadFromClients(List<EnumerablePropertySource<?>> sourceList) {
    boolean reloadFailed = false;
    Exception lastException = null;
    AppConfigurationReplicaClient client = replicaClientFactory.getNextActiveClient(resource.getEndpoint(), true);
    if (client == null) {
        // No active clients are currently available (either none configured or all in backoff).
        // Signal to the caller that no attempt was made so that startup retry logic can continue.
        return new NoAvailableClientsException(
            "No active App Configuration clients available for endpoint: " + resource.getEndpoint());
    }
```
```java
if (startupTimeout == null) {
    throw new IllegalArgumentException("startupTimeout cannot be null.");
}
```

The validation throws IllegalArgumentException when startupTimeout is null, but the property has a default value of Duration.ofSeconds(100) defined at line 42. This means the null check will never be true unless someone explicitly calls setStartupTimeout(null). Consider removing this check since it's redundant with the default value, or document why explicit null setting should be prevented.
```java
if (startupTimeout == null) {
    throw new IllegalArgumentException("startupTimeout cannot be null.");
}
if (startupTimeout.getSeconds() < 30 || startupTimeout.getSeconds() > 600) {
    throw new IllegalArgumentException("startupTimeout must be between 30 and 600 seconds.");
}
```

Missing test coverage for the new startupTimeout validation logic added in lines 137-142. Consider adding tests to verify: (1) validation accepts valid timeout values (e.g., 30s, 100s, 600s), (2) validation rejects timeout less than 30 seconds, (3) validation rejects timeout greater than 600 seconds, and (4) validation handles null timeout (if that's an important edge case despite the default value). This follows the existing test pattern seen for refreshInterval validation in minValidWatchTime test at line 105.
```java
logger.debug("All replicas in backoff for store: " + resource.getEndpoint()
    + ". Waiting " + waitSeconds + "s before retry (elapsed: " + elapsedSeconds + "s).");
try {
    Thread.sleep(waitSeconds * 1000);
```

The multiplication waitSeconds * 1000 could potentially overflow for very large waitSeconds values (greater than Long.MAX_VALUE / 1000, approximately 9.2 million seconds). While the current logic caps waitSeconds based on the deadline and backoff intervals making overflow unlikely in practice, consider using Math.multiplyExact or adding a guard check to ensure safe multiplication and provide clearer error handling if overflow occurs.

Suggested change:

```java
long waitMillis;
try {
    waitMillis = Math.multiplyExact(waitSeconds, 1000L);
} catch (ArithmeticException ex) {
    // In the unlikely event of overflow, cap to the maximum sleep duration.
    waitMillis = Long.MAX_VALUE;
}
Thread.sleep(waitMillis);
```
```java
while (Instant.now().isBefore(deadline)) {
    // Ensure we do not retain partial results from previous failed attempts
    sourceList.clear();
    replicaClientFactory.findActiveClients(resource.getEndpoint());
    lastException = attemptLoadFromClients(sourceList);

    if (lastException == null) {
        return null; // Success
    }
```

```java
// Reverse in order to add Profile specific properties earlier, and last profile comes first
try {
    sourceList.addAll(createSettings(currentClient));
    List<WatchedConfigurationSettings> featureFlags = createFeatureFlags(currentClient);

    logger.debug("PropertySource context.");
    AppConfigurationStoreMonitoring monitoring = resource.getMonitoring();

    storeState.setStateFeatureFlag(resource.getEndpoint(), featureFlags,
        monitoring.getFeatureFlagRefreshInterval());

    if (monitoring.isEnabled()) {
        // Check if refreshAll is enabled - if so, use watched configuration settings
        if (monitoring.getTriggers().size() == 0) {
            // Use watched configuration settings for refresh
            List<WatchedConfigurationSettings> watchedConfigurationSettingsList = getWatchedConfigurationSettings(
                currentClient);
            storeState.setState(resource.getEndpoint(), Collections.emptyList(),
                watchedConfigurationSettingsList, monitoring.getRefreshInterval());
        } else {
            // Use traditional watch key monitoring
            List<ConfigurationSetting> watchKeysSettings = monitoring.getTriggers().stream()
                .map(trigger -> currentClient.getWatchKey(trigger.getKey(), trigger.getLabel(),
                    requestContext))
                .toList();

            storeState.setState(resource.getEndpoint(), watchKeysSettings,
                monitoring.getRefreshInterval());
        }
```

```java
// All clients failed, use fixed backoff based on elapsed time
if (Instant.now().isBefore(deadline)) {
    long elapsedSeconds = Instant.now().getEpochSecond() - startTime.getEpochSecond();
```

The calculation Instant.now().getEpochSecond() - startTime.getEpochSecond() may not accurately reflect elapsed time during the retry loop because it's called multiple times at different points in the loop (lines 193, 204, and within the while condition on line 181). Each call to Instant.now() captures a slightly different time, which could lead to minor inconsistencies. Consider capturing Instant.now() once at the beginning of each iteration and reusing that value for all time-related calculations within that iteration to ensure consistency.
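The single-snapshot approach being suggested can be sketched as follows. This is an illustrative refactor under the assumption that both elapsed and remaining time are derived inside the loop; the helper name `elapsedAndRemaining` is hypothetical.

```java
import java.time.Instant;

public class ConsistentClockSketch {

    // Capture Instant.now() once per loop iteration and derive every
    // time-based value from that single snapshot, so elapsed and remaining
    // never disagree within one iteration.
    static long[] elapsedAndRemaining(Instant now, Instant startTime, Instant deadline) {
        long elapsedSeconds = now.getEpochSecond() - startTime.getEpochSecond();
        long remainingSeconds = Math.max(0, deadline.getEpochSecond() - now.getEpochSecond());
        return new long[] {elapsedSeconds, remainingSeconds};
    }

    public static void main(String[] args) {
        Instant start = Instant.ofEpochSecond(1_000);
        Instant deadline = Instant.ofEpochSecond(1_100);
        Instant now = Instant.ofEpochSecond(1_040); // the one snapshot for this iteration
        long[] er = elapsedAndRemaining(now, start, deadline);
        System.out.println(er[0]); // 40 seconds elapsed
        System.out.println(er[1]); // 60 seconds remaining
    }
}
```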
Description
Adds retry to startup. When all replicas fail, the provider retries the failed store for a period of time: 100s by default, 30s minimum, 600s maximum.
Also refactors the load method into a number of helper methods to make it more readable.
All SDK Contribution checklist:
General Guidelines and Best Practices
Testing Guidelines