[BUG] DatabaseAccount lazy-initialised fields are unsafely published, sporadic NPE in JsonSerializable.getWithMapping

**Describe the bug**

`DatabaseAccount.getConsistencyPolicy()` (and the three parallel lazy getters `getReplicationPolicy()`, `getSystemReplicationPolicy()`, `getQueryEngineConfiguration()`) caches its result in a non-`volatile`, non-`final` field with no synchronisation. The cached object itself extends `JsonSerializable`, whose `propertyBag` field is also non-`volatile` and non-`final`. The combination unsafely publishes a partially-constructed `ConsistencyPolicy` to other threads: a concurrent reader can observe `consistencyPolicy != null` while `consistencyPolicy.propertyBag == null`. The next call to `getDefaultConsistencyLevel()` then NPEs at `JsonSerializable.getWithMapping` dereferencing `this.propertyBag.get(propertyName)`.

The race re-creates every time `GlobalEndpointManager` replaces `latestDatabaseAccount` with a freshly-constructed instance during its background refresh, since the four lazy fields are null again on the new instance. 

Concretely in 4.80.0:

https://github.com/Azure/azure-sdk-for-java/blob/260591ea9430030620d6912a5c7173bd9d753670/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/DatabaseAccount.java#L141-L151

https://github.com/Azure/azure-sdk-for-java/blob/260591ea9430030620d6912a5c7173bd9d753670/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/JsonSerializable.java#L68

https://github.com/Azure/azure-sdk-for-java/blob/260591ea9430030620d6912a5c7173bd9d753670/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/JsonSerializable.java#L101-L103

https://github.com/Azure/azure-sdk-for-java/blob/260591ea9430030620d6912a5c7173bd9d753670/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/JsonSerializable.java#L416-L423

https://github.com/Azure/azure-sdk-for-java/blob/260591ea9430030620d6912a5c7173bd9d753670/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/DatabaseAccount.java#L19

IIUC, without `final` on `propertyBag` or `volatile` on `consistencyPolicy`, the JMM doesn't establish a happens-before boundary between the constructor write to `propertyBag` and a reader that observes the published `consistencyPolicy` reference. A reader can see the new `consistencyPolicy` while the constructor write to `propertyBag` hasn't been propagated, producing the NPE.  I think this is covered in the [JSR 133 FAQ](https://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html#dcl).

The same broad class of unsafe lazy initialisation was flagged in #9107 and resolved by dropping the cache entirely. IIUC, that specific fix doesn't translate here though... each `DatabaseAccount` instance's lazy fields are derived from its own `propertyBag`.

***Exception or Stack Trace***

```
java.lang.NullPointerException: Cannot invoke "com.fasterxml.jackson.databind.node.ObjectNode.get(String)" because "this.propertyBag" is null
    at com.azure.cosmos.implementation.JsonSerializable.getWithMapping(JsonSerializable.java:417)
    at com.azure.cosmos.implementation.JsonSerializable.getString(JsonSerializable.java:356)
    at com.azure.cosmos.implementation.ConsistencyPolicy.getDefaultConsistencyLevel(ConsistencyPolicy.java:57)
    at com.azure.cosmos.implementation.directconnectivity.GatewayServiceConfigurationReader.getDefaultConsistencyLevel(GatewayServiceConfigurationReader.java:40)
    at com.azure.cosmos.implementation.directconnectivity.RequestHelper.getReadConsistencyStrategyToUse(RequestHelper.java:43)
    at com.azure.cosmos.implementation.directconnectivity.ConsistencyReader.deduceReadMode(ConsistencyReader.java:344)
    at com.azure.cosmos.implementation.directconnectivity.ConsistencyReader.readAsync(ConsistencyReader.java:207)
    at com.azure.cosmos.implementation.directconnectivity.ReplicatedResourceClient.invokeAsync(ReplicatedResourceClient.java:160)
    at com.azure.cosmos.implementation.directconnectivity.ReplicatedResourceClient.getStoreResponseMono(ReplicatedResourceClient.java:136)
    ... (reactor scaffolding) ...
    at com.azure.cosmos.implementation.query.Paginator.lambda$getPaginatedQueryResultAsObservable$1(Paginator.java:134)
    ... (reactor scaffolding) ...
    at reactor.core.publisher.BlockingIterable.iterator(BlockingIterable.java:86)
    at <caller code consuming queryChangeFeed pages>
```

**To Reproduce**


The bug is non-deterministic and requires concurrent reads coinciding with `GlobalEndpointManager` refreshing `latestDatabaseAccount`. It manifests reliably in production after enough hours of traffic. I haven't been able to land a tight JCStress harness for it in the time I had: the race window between `refreshLocationPrivateAsync` publishing a new `DatabaseAccount` and the first concurrent `getConsistencyPolicy()` call against that instance is small. The reasoning in **Describe the bug** above should be enough to verify by inspection.

The pattern that matches the observed cases:

1. Run any non-trivial concurrent Cosmos workload (multiple in-flight reads/queries).
2. Wait for `GlobalEndpointManager` to refresh `latestDatabaseAccount` (default ~5 min, plus on failover triggers).
3. With low probability, a concurrent reader hits the NPE above on its first contact with the new `DatabaseAccount`.

***Code Snippet***

To be clear, I haven't tried this but in principle....

```java
CosmosClient client = new CosmosClientBuilder()
      .endpoint(endpoint)
      .credential(tokenCredential)
      .buildClient();
  // On N virtual threads, every few seconds per thread:
  for (int i = 0; i < N; i++) {
      Thread.ofVirtual().start(() -> {
          // Any operation routing through ReadConsistencyStrategy resolution.
          container.queryChangeFeed(
              CosmosChangeFeedRequestOptions.createForProcessingFromNow(FeedRange.forFullRange()),
              JsonNode.class
          ).iterableByPage().forEach(page -> { /* consume */ });
      });
  }
```

**Expected behavior**

Concurrent reads of `getConsistencyPolicy()` (and its three siblings) should never observe a partially-constructed `ConsistencyPolicy`.

Some possible fixes include:
1. Make `propertyBag` `final` in `JsonSerializable` (it's only assigned from constructors). This fixes a source of raciness for every subclass at once.
2. Make `consistencyPolicy`, `replicationPolicy`, `systemReplicationPolicy`, `queryEngineConfiguration` `volatile` in `DatabaseAccount`. Maybe worth combining with (1).
3. Eagerly initialise these four fields in `DatabaseAccount(ObjectNode)` instead of lazily, and make them `final`. The values are derived from the constructor argument, and final-field semantics remove the race altogether.  Unsure though if this is ok (e.g. is it lazy for performance ? or perhaps to not was cycles initialising stuff that is not used?).
4. Return new values of these each time.

Gut feeling is (1) feels right.

**Screenshots**

N/A

**Setup (please complete the following information):**

* OS: Linux 6.12 (Amazon Linux 2023, aarch64 / AWS Graviton)
* IDE: IntelliJ
* Library/Libraries: `com.azure:azure-cosmos:4.80.0`, `com.azure:azure-identity:1.18.3`
* Java version: JRE 25.0.3
* App Server/Environment: Quarkus 3.35 on Kubernetes
* Frameworks: Quarkus

Exact user-agent strings from the failing spans:

* `azsdk-java-cosmos/4.80.0 Linux/6.12.68-92.122.amzn2023.aarch64 JRE/25.0.3`
* `azsdk-java-cosmos/4.80.0 Linux/6.12.73-95.123.amzn2023.aarch64 JRE/25.0.3`

**Additional context**

A few signals from 180 days of production traces:

* 7 NPEs across the fleet over 180 days. We run heavily on aarch64, and all 7 happened on aarch64 hosts. The sample is too small and our hardware mix too skewed to assert x86 is unaffected, but a JMM happens-before bug like this would naturally surface more easily on weaker memory architectures.
* All 7 land in change-feed paging. That's where the operation count is by far the largest, not because change feed has a unique code path.
* Spread across 4 different builds of our service over those 180 days, so not a regression in any specific release.

A startup-time eager `getConsistencyPolicy()` workaround doesn't help, because `GlobalEndpointManager` constructs a fresh `DatabaseAccount` on every background refresh and the four lazy fields are null again on the new instance. The only client-side mitigation I can think of is reflectively pre-warming the fields after every refresh, which is worse than the bug 😬.

**Information Checklist**
Kindly make sure that you have added all the following information above and checkoff the required fields otherwise we will treat the issuer as an incomplete report
- [x] Bug Description Added
- [x] Repro Steps Added
- [x] Setup information Added


	public ConsistencyPolicy getConsistencyPolicy() {
	if (this.consistencyPolicy == null) {
	this.consistencyPolicy = super.getObject(Constants.Properties.USER_CONSISTENCY_POLICY,
	ConsistencyPolicy.class);

	if (this.consistencyPolicy == null) {
	this.consistencyPolicy = new ConsistencyPolicy();
	}
	}
	return this.consistencyPolicy;
	}

	private <T> T getWithMapping(String propertyName, Function<JsonNode, T> mapper) {
	JsonNode node = this.propertyBag.get(propertyName);
	if (node == null \|\| node.isNull()) {
	return null;
	}

	return mapper.apply(node);
	}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] DatabaseAccount lazy-initialised fields are unsafely published, sporadic NPE in JsonSerializable.getWithMapping #49256

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

	public JsonSerializable(ObjectNode objectNode) {
	this.propertyBag = objectNode;
	}

[BUG] DatabaseAccount lazy-initialised fields are unsafely published, sporadic NPE in JsonSerializable.getWithMapping #49256

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions