Skip to content

[BUG] DatabaseAccount lazy-initialised fields are unsafely published, sporadic NPE in JsonSerializable.getWithMapping #49256

@lloydmeta

Description

@lloydmeta

Describe the bug

DatabaseAccount.getConsistencyPolicy() (and the three parallel lazy getters getReplicationPolicy(), getSystemReplicationPolicy(), getQueryEngineConfiguration()) caches its result in a non-volatile, non-final field with no synchronisation. The cached object itself extends JsonSerializable, whose propertyBag field is also non-volatile and non-final. The combination unsafely publishes a partially-constructed ConsistencyPolicy to other threads: a concurrent reader can observe consistencyPolicy != null while consistencyPolicy.propertyBag == null. The next call to getDefaultConsistencyLevel() then NPEs at JsonSerializable.getWithMapping dereferencing this.propertyBag.get(propertyName).

The race re-creates every time GlobalEndpointManager replaces latestDatabaseAccount with a freshly-constructed instance during its background refresh, since the four lazy fields are null again on the new instance.

Concretely in 4.80.0:

public ConsistencyPolicy getConsistencyPolicy() {
if (this.consistencyPolicy == null) {
this.consistencyPolicy = super.getObject(Constants.Properties.USER_CONSISTENCY_POLICY,
ConsistencyPolicy.class);
if (this.consistencyPolicy == null) {
this.consistencyPolicy = new ConsistencyPolicy();
}
}
return this.consistencyPolicy;
}

public JsonSerializable(ObjectNode objectNode) {
this.propertyBag = objectNode;
}

private <T> T getWithMapping(String propertyName, Function<JsonNode, T> mapper) {
JsonNode node = this.propertyBag.get(propertyName);
if (node == null || node.isNull()) {
return null;
}
return mapper.apply(node);
}

IIUC, without final on propertyBag or volatile on consistencyPolicy, the JMM doesn't establish a happens-before boundary between the constructor write to propertyBag and a reader that observes the published consistencyPolicy reference. A reader can see the new consistencyPolicy while the constructor write to propertyBag hasn't been propagated, producing the NPE. I think this is covered in the JSR 133 FAQ.

The same broad class of unsafe lazy initialisation was flagged in #9107 and resolved by dropping the cache entirely. IIUC, that specific fix doesn't translate here though... each DatabaseAccount instance's lazy fields are derived from its own propertyBag.

Exception or Stack Trace

java.lang.NullPointerException: Cannot invoke "com.fasterxml.jackson.databind.node.ObjectNode.get(String)" because "this.propertyBag" is null
    at com.azure.cosmos.implementation.JsonSerializable.getWithMapping(JsonSerializable.java:417)
    at com.azure.cosmos.implementation.JsonSerializable.getString(JsonSerializable.java:356)
    at com.azure.cosmos.implementation.ConsistencyPolicy.getDefaultConsistencyLevel(ConsistencyPolicy.java:57)
    at com.azure.cosmos.implementation.directconnectivity.GatewayServiceConfigurationReader.getDefaultConsistencyLevel(GatewayServiceConfigurationReader.java:40)
    at com.azure.cosmos.implementation.directconnectivity.RequestHelper.getReadConsistencyStrategyToUse(RequestHelper.java:43)
    at com.azure.cosmos.implementation.directconnectivity.ConsistencyReader.deduceReadMode(ConsistencyReader.java:344)
    at com.azure.cosmos.implementation.directconnectivity.ConsistencyReader.readAsync(ConsistencyReader.java:207)
    at com.azure.cosmos.implementation.directconnectivity.ReplicatedResourceClient.invokeAsync(ReplicatedResourceClient.java:160)
    at com.azure.cosmos.implementation.directconnectivity.ReplicatedResourceClient.getStoreResponseMono(ReplicatedResourceClient.java:136)
    ... (reactor scaffolding) ...
    at com.azure.cosmos.implementation.query.Paginator.lambda$getPaginatedQueryResultAsObservable$1(Paginator.java:134)
    ... (reactor scaffolding) ...
    at reactor.core.publisher.BlockingIterable.iterator(BlockingIterable.java:86)
    at <caller code consuming queryChangeFeed pages>

To Reproduce

The bug is non-deterministic and requires concurrent reads coinciding with GlobalEndpointManager refreshing latestDatabaseAccount. It manifests reliably in production after enough hours of traffic. I haven't been able to land a tight JCStress harness for it in the time I had: the race window between refreshLocationPrivateAsync publishing a new DatabaseAccount and the first concurrent getConsistencyPolicy() call against that instance is small. The reasoning in Describe the bug above should be enough to verify by inspection.

The pattern that matches the observed cases:

  1. Run any non-trivial concurrent Cosmos workload (multiple in-flight reads/queries).
  2. Wait for GlobalEndpointManager to refresh latestDatabaseAccount (default ~5 min, plus on failover triggers).
  3. With low probability, a concurrent reader hits the NPE above on its first contact with the new DatabaseAccount.

Code Snippet

To be clear, I haven't tried this but in principle....

CosmosClient client = new CosmosClientBuilder()
      .endpoint(endpoint)
      .credential(tokenCredential)
      .buildClient();
  // On N virtual threads, every few seconds per thread:
  for (int i = 0; i < N; i++) {
      Thread.ofVirtual().start(() -> {
          // Any operation routing through ReadConsistencyStrategy resolution.
          container.queryChangeFeed(
              CosmosChangeFeedRequestOptions.createForProcessingFromNow(FeedRange.forFullRange()),
              JsonNode.class
          ).iterableByPage().forEach(page -> { /* consume */ });
      });
  }

Expected behavior

Concurrent reads of getConsistencyPolicy() (and its three siblings) should never observe a partially-constructed ConsistencyPolicy.

Some possible fixes include:

  1. Make propertyBag final in JsonSerializable (it's only assigned from constructors). This fixes a source of raciness for every subclass at once.
  2. Make consistencyPolicy, replicationPolicy, systemReplicationPolicy, queryEngineConfiguration volatile in DatabaseAccount. Maybe worth combining with (1).
  3. Eagerly initialise these four fields in DatabaseAccount(ObjectNode) instead of lazily, and make them final. The values are derived from the constructor argument, and final-field semantics remove the race altogether. Unsure though if this is ok (e.g. is it lazy for performance ? or perhaps to not was cycles initialising stuff that is not used?).
  4. Return new values of these each time.

Gut feeling is (1) feels right.

Screenshots

N/A

Setup (please complete the following information):

  • OS: Linux 6.12 (Amazon Linux 2023, aarch64 / AWS Graviton)
  • IDE: IntelliJ
  • Library/Libraries: com.azure:azure-cosmos:4.80.0, com.azure:azure-identity:1.18.3
  • Java version: JRE 25.0.3
  • App Server/Environment: Quarkus 3.35 on Kubernetes
  • Frameworks: Quarkus

Exact user-agent strings from the failing spans:

  • azsdk-java-cosmos/4.80.0 Linux/6.12.68-92.122.amzn2023.aarch64 JRE/25.0.3
  • azsdk-java-cosmos/4.80.0 Linux/6.12.73-95.123.amzn2023.aarch64 JRE/25.0.3

Additional context

A few signals from 180 days of production traces:

  • 7 NPEs across the fleet over 180 days. We run heavily on aarch64, and all 7 happened on aarch64 hosts. The sample is too small and our hardware mix too skewed to assert x86 is unaffected, but a JMM happens-before bug like this would naturally surface more easily on weaker memory architectures.
  • All 7 land in change-feed paging. That's where the operation count is by far the largest, not because change feed has a unique code path.
  • Spread across 4 different builds of our service over those 180 days, so not a regression in any specific release.

A startup-time eager getConsistencyPolicy() workaround doesn't help, because GlobalEndpointManager constructs a fresh DatabaseAccount on every background refresh and the four lazy fields are null again on the new instance. The only client-side mitigation I can think of is reflectively pre-warming the fields after every refresh, which is worse than the bug 😬.

Information Checklist
Kindly make sure that you have added all the following information above and checkoff the required fields otherwise we will treat the issuer as an incomplete report

  • Bug Description Added
  • Repro Steps Added
  • Setup information Added

Metadata

Metadata

Assignees

No one assigned

    Labels

    ClientThis issue points to a problem in the data-plane of the library.CosmosService AttentionWorkflow: This issue is responsible by Azure service team.customer-reportedIssues that are reported by GitHub users external to the Azure organization.questionThe issue doesn't require a change to the product in order to be resolved. Most issues start as that

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions