Skip to content

fix: race condition on get when add received after informer cache read#3237

Open
csviri wants to merge 10 commits intooperator-framework:mainfrom
csviri:rcaw-bug
Open

fix: race condition on get when add received after informer cache read#3237
csviri wants to merge 10 commits intooperator-framework:mainfrom
csviri:rcaw-bug

Conversation

@csviri
Copy link
Collaborator

@csviri csviri commented Mar 17, 2026

This section in the test fails in extremely rare occasion:

context.resourceOperations().serverSideApply(prepareCM(resource));
var cachedCM = context.getSecondaryResource(ConfigMap.class);
if (cachedCM.isEmpty()) {
issueFound.set(true);
throw new IllegalStateException("Error for resource: " + ResourceID.fromResource(resource));
}

The only way I was able to explain this based on the logs attached below, is that it can happen that, in the get method, here:

@Override
public Optional<R> get(ResourceID resourceID) {
Optional<R> resource = temporaryResourceCache.getResourceFromCache(resourceID);
var res = cache.get(resourceID);
if (comparableResourceVersions
&& resource.isPresent()
&& ReconcilerUtilsInternal.compareResourceVersions(
resource.get().getMetadata().getResourceVersion(),
manager().lastSyncResourceVersion(resource.get().getMetadata().getNamespace()))
> 0) {
log.debug("Latest resource found in temporary cache for Resource ID: {}", resourceID);
return resource;
}
log.debug(
"Resource not found, or older, in temporary cache. Found in informer cache {}, for"
+ " Resource ID: {}",
res.isPresent(),
resourceID);
return res;
}

That we found the resource in temp cache but not in informer cache, then we receive an event that bumps the last sync version, so the temp cache resource is not returned. I added in the PR a failsafe for this case.

It might be actually more efficient to work only with the cached resource and not the lastSyncVersion that would basically provide a "snapshot isolation".

cc @shawkins

Signed-off-by: Attila Mészáros a_meszaros@apple.com

Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 17, 2026
@csviri
Copy link
Collaborator Author

csviri commented Mar 17, 2026

Putting here related log messages:

2026-03-17T12:25:22.4566081Z 2026-03-17 12:25:10,223 7122 i.j.o.p.e.s.i.TemporaryResourceCache [DEBUG] Processing event p:[::] e:[ControllerResourceEventSource:ADDED:resource26:cachingfilteringupdateit-testresourceaccessafterupdate-448c5:10364] 
2026-03-17T12:25:22.4567683Z 2026-03-17 12:25:10,223 7122 i.j.o.p.e.s.c.ControllerEventSource [DEBUG] Event received with action: ADDED p:[::] e:[ControllerResourceEventSource:ADDED:resource26:cachingfilteringupdateit-testresourceaccessafterupdate-448c5:10364] 
2026-03-17T12:25:22.4570516Z 2026-03-17 12:25:10,223 7122 i.j.o.p.e.EventProcessor       [DEBUG] Received event: ResourceEvent{action=ADDED, associated resource id=ResourceID{name='resource26', namespace='cachingfilteringupdateit-testresourceaccessafterupdate-448c5'}} p:[resource26:cachingfilteringupdateit-testresourceaccessafterupdate-448c5:10364] e:[ControllerResourceEventSource:ADDED:resource26:cachingfilteringupdateit-testresourceaccessafterupdate-448c5:10364] 
2026-03-17T12:25:22.4574012Z 2026-03-17 12:25:10,223 7122 i.j.o.p.e.ResourceState        [DEBUG] Marking event received for: ResourceID{name='resource26', namespace='cachingfilteringupdateit-testresourceaccessafterupdate-448c5'} p:[resource26:cachingfilteringupdateit-testresourceaccessafterupdate-448c5:10364] e:[ControllerResourceEventSource:ADDED:resource26:cachingfilteringupdateit-testresourceaccessafterupdate-448c5:10364] 
2026-03-17T12:25:22.4578463Z 2026-03-17 12:25:10,223 7122 i.j.o.p.e.s.i.ManagedInformerEventSource [DEBUG] Resource not found, or older, in temporary cache. Found in informer cache true, for Resource ID: ResourceID{name='resource26', namespace='cachingfilteringupdateit-testresourceaccessafterupdate-448c5'} p:[resource26:cachingfilteringupdateit-testresourceaccessafterupdate-448c5:10364] e:[ControllerResourceEventSource:ADDED:resource26:cachingfilteringupdateit-testresourceaccessafterupdate-448c5:10364] 
2026-03-17T12:25:22.4582470Z 2026-03-17 12:25:10,223 7122 i.j.o.p.e.EventProcessor       [DEBUG] Executing events for custom resource. Scope: ExecutionScope{resource=ResourceID{name='resource26', namespace='cachingfilteringupdateit-testresourceaccessafterupdate-448c5'}, retryInfo=null, deleteEvent=false, isDeleteFinalStateUnknown=false} p:[resource26:cachingfilteringupdateit-testresourceaccessafterupdate-448c5:10364] e:[ControllerResourceEventSource:ADDED:resource26:cachingfilteringupdateit-testresourceaccessafterupdate-448c5:10364] 
2026-03-17T12:25:22.4585786Z 2026-03-17 12:25:10,224 7146 i.j.o.p.e.EventProcessor       [DEBUG] Running reconcile executor for: ExecutionScope{resource=ResourceID{name='resource26', namespace='cachingfilteringupdateit-testresourceaccessafterupdate-448c5'}, retryInfo=null, deleteEvent=false, isDeleteFinalStateUnknown=false} p:[resource26:cachingfilteringupdateit-testresourceaccessafterupdate-448c5:] e:[::::] 
2026-03-17T12:25:22.4588553Z 2026-03-17 12:25:10,224 7146 i.j.o.p.e.s.i.ManagedInformerEventSource [DEBUG] Resource not found, or older, in temporary cache. Found in informer cache true, for Resource ID: ResourceID{name='resource26', namespace='cachingfilteringupdateit-testresourceaccessafterupdate-448c5'} p:[resource26:cachingfilteringupdateit-testresourceaccessafterupdate-448c5:] e:[::::] 
2026-03-17T12:25:22.4590685Z 2026-03-17 12:25:10,224 7146 i.j.o.p.e.ReconciliationDispatcher [DEBUG] Handling dispatch p:[resource26:cachingfilteringupdateit-testresourceaccessafterupdate-448c5:10364] e:[::::] 
2026-03-17T12:25:22.4592810Z 2026-03-17 12:25:10,224 7146 i.j.o.p.e.ReconciliationDispatcher [DEBUG] Reconciling resource execution scope: ExecutionScope{resource=ResourceID{name='resource26', namespace='cachingfilteringupdateit-testresourceaccessafterupdate-448c5'}, retryInfo=null, deleteEvent=false, isDeleteFinalStateUnknown=false} p:[resource26:cachingfilteringupdateit-testresourceaccessafterupdate-448c5:10364] e:[::::] 
2026-03-17T12:25:22.4595120Z 2026-03-17 12:25:10,224 7146 i.j.o.p.e.s.i.ManagedInformerEventSource [DEBUG] Starting event filtering and caching update p:[resource26:cachingfilteringupdateit-testresourceaccessafterupdate-448c5:10364] e:[::::]
2026-03-17T12:25:22.4731819Z 2026-03-17 12:25:10,232 7146 i.j.o.p.e.s.i.ManagedInformerEventSource [DEBUG] Resource update successful p:[resource26:cachingfilteringupdateit-testresourceaccessafterupdate-448c5:10364] e:[::::] 
2026-03-17T12:25:22.4733969Z 2026-03-17 12:25:10,232 7146 i.j.o.p.e.s.i.TemporaryResourceCache [DEBUG] Temporarily moving ahead to target version 10370 for resource id: ResourceID{name='resource26', namespace='cachingfilteringupdateit-testresourceaccessafterupdate-448c5'} p:[resource26:cachingfilteringupdateit-testresourceaccessafterupdate-448c5:10364] e:[::::] 
2026-03-17T12:25:22.4736671Z 2026-03-17 12:25:10,232 7146 i.j.o.p.e.s.i.TemporaryResourceCache [DEBUG] Zero active updates for resource id: ResourceID{name='resource26', namespace='cachingfilteringupdateit-testresourceaccessafterupdate-448c5'}; event after update event: false; updated resource version: 10370 p:[resource26:cachingfilteringupdateit-testresourceaccessafterupdate-448c5:10364] e:[::::] 
2026-03-17T12:25:22.4738963Z 2026-03-17 12:25:10,232 7146 i.j.o.p.e.s.i.ManagedInformerEventSource [DEBUG] No new event present after the filtering update p:[resource26:cachingfilteringupdateit-testresourceaccessafterupdate-448c5:10364] e:[::::] 
2026-03-17T12:25:22.4742249Z 2026-03-17 12:25:10,232 7146 i.j.o.p.e.s.i.InformerEventSource [DEBUG] Using PrimaryToSecondaryIndex to find secondary resources for primary: ResourceID{name='resource26', namespace='cachingfilteringupdateit-testresourceaccessafterupdate-448c5'}. Found secondary ids: [ResourceID{name='resource26', namespace='cachingfilteringupdateit-testresourceaccessafterupdate-448c5'}]  p:[resource26:cachingfilteringupdateit-testresourceaccessafterupdate-448c5:10364] e:[::::] 
2026-03-17T12:25:22.4745175Z 2026-03-17 12:25:10,232 7121 i.j.o.p.e.s.i.InformerEventSource [DEBUG] On add event received p:[::] e:[io.javaoperatorsdk.operator.processing.event.source.informer.InformerEventSource@78332334:ADDED:resource26:cachingfilteringupdateit-testresourceaccessafterupdate-448c5:10370] 
2026-03-17T12:25:22.4747433Z 2026-03-17 12:25:10,232 7121 i.j.o.p.e.s.i.TemporaryResourceCache [DEBUG] Processing event p:[::] e:[io.javaoperatorsdk.operator.processing.event.source.informer.InformerEventSource@78332334:ADDED:resource26:cachingfilteringupdateit-testresourceaccessafterupdate-448c5:10370] 
2026-03-17T12:25:22.4749747Z 2026-03-17 12:25:10,232 7121 i.j.o.p.e.s.i.TemporaryResourceCache [DEBUG] Removing resource from temp cache. comparison: 0 unknown state: false p:[::] e:[io.javaoperatorsdk.operator.processing.event.source.informer.InformerEventSource@78332334:ADDED:resource26:cachingfilteringupdateit-testresourceaccessafterupdate-448c5:10370] 
2026-03-17T12:25:22.4752697Z 2026-03-17 12:25:10,232 7121 i.j.o.p.e.s.i.InformerEventSource [DEBUG] Skipping event propagation p:[::] e:[io.javaoperatorsdk.operator.processing.event.source.informer.InformerEventSource@78332334:ADDED:resource26:cachingfilteringupdateit-testresourceaccessafterupdate-448c5:10370] 
2026-03-17T12:25:22.4755709Z 2026-03-17 12:25:10,232 7146 i.j.o.p.e.s.i.ManagedInformerEventSource [DEBUG] Resource not found, or older, in temporary cache. Found in informer cache false, for Resource ID: ResourceID{name='resource26', namespace='cachingfilteringupdateit-testresourceaccessafterupdate-448c5'} p:[resource26:cachingfilteringupdateit-testresourceaccessafterupdate-448c5:10364] e:[::::] 
2026-03-17T12:25:22.4759951Z 2026-03-17 12:25:10,232 7146 i.j.o.p.e.EventProcessor       [DEBUG] Event processing finished. Scope: ExecutionScope{resource=ResourceID{name='resource26', namespace='cachingfilteringupdateit-testresourceaccessafterupdate-448c5'}, retryInfo=null, deleteEvent=false, isDeleteFinalStateUnknown=false}, PostExecutionControl: PostExecutionControl{onlyFinalizerHandled=false, updatedCustomResource=null, runtimeException=java.lang.IllegalStateException: Error for resource: ResourceID{name='resource26', namespace='cachingfilteringupdateit-testresourceaccessafterupdate-448c5'}} p:[resource26:cachingfilteringupdateit-testresourceaccessafterupdate-448c5:10364] e:[::::] 
2026-03-17T12:25:22.4764092Z 2026-03-17 12:25:10,232 7146 i.j.o.p.e.ResourceState        [DEBUG] Marking event received for: ResourceID{name='resource26', namespace='cachingfilteringupdateit-testresourceaccessafterupdate-448c5'} p:[resource26:cachingfilteringupdateit-testresourceaccessafterupdate-448c5:10364] e:[::::] 
2026-03-17T12:25:22.4767426Z 2026-03-17 12:25:10,232 7146 i.j.o.p.e.EventProcessor       [WARN ] Uncaught error during event processing ExecutionScope{resource=ResourceID{name='resource26', namespace='cachingfilteringupdateit-testresourceaccessafterupdate-448c5'}, retryInfo=null, deleteEvent=false, isDeleteFinalStateUnknown=false} - but another reconciliation will be attempted because a superseding event has been received or another retry attempt is pending. p:[resource26:cachingfilteringupdateit-testresourceaccessafterupdate-448c5:10364] e:[::::] 
2026-03-17T12:25:22.4769917Z java.lang.IllegalStateException: Error for resource: ResourceID{name='resource26', namespace='cachingfilteringupdateit-testresourceaccessafterupdate-448c5'}
2026-03-17T12:25:22.4771405Z 	at io.javaoperatorsdk.operator.baseapi.cachingfilteringupdate.CachingFilteringUpdateReconciler.reconcile(CachingFilteringUpdateReconciler.java:50)
2026-03-17T12:25:22.4772706Z 	at io.javaoperatorsdk.operator.baseapi.cachingfilteringupdate.CachingFilteringUpdateReconciler.reconcile(CachingFilteringUpdateReconciler.java:35)


csviri added 2 commits March 17, 2026 16:36
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
@csviri csviri marked this pull request as ready for review March 17, 2026 15:44
Copilot AI review requested due to automatic review settings March 17, 2026 15:44
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 17, 2026
@openshift-ci openshift-ci bot requested review from metacosm and xstefank March 17, 2026 15:44
@csviri csviri changed the title Suspected bug in informer cache access fix: extremely rare race-condition when getting the resource from cache Mar 17, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Addresses a rare race condition when reading resources from the temporary cache vs informer cache, which can lead to unexpected cache misses during reconciliation (and rare test failures).

Changes:

  • Adjusts ManagedInformerEventSource#get(ResourceID) to prefer returning the temporary-cache resource when the informer cache doesn’t contain the resource.
  • Adds additional debug logging around the informer’s last sync resource version and cache hit/miss state.

You can also share your feedback on Copilot code review. Take the survey.

Comment on lines +198 to +200
log.debug(
"Latest sync version: {}",
manager().lastSyncResourceVersion(resourceID.getNamespace().orElse(null)));
Comment on lines +194 to +201
// order of getting those resource from cache matters
Optional<R> resource = temporaryResourceCache.getResourceFromCache(resourceID);
var res = cache.get(resourceID);
if (log.isDebugEnabled()) {
log.debug(
"Latest sync version: {}",
manager().lastSyncResourceVersion(resourceID.getNamespace().orElse(null)));
}
Comment on lines 216 to 220
"Resource found in temp cache: {}, or older, in temporary cache. Found in informer cache"
+ " {}, for Resource ID: {}",
resource.isPresent(),
res.isPresent(),
resourceID);
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 17, 2026
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
@csviri csviri requested a review from shawkins March 17, 2026 16:02
@csviri
Copy link
Collaborator Author

csviri commented Mar 17, 2026

@shawkins at the end I changed to use the cached resource for comparison since that allows also easier debugging

@shawkins
Copy link
Collaborator

It might be actually more efficient to work only with the cached resource and not the lastSyncVersion that would basically provide a "snapshot isolation".

Do you really want snapshot isolation here? Naively I would have thought that all logic should be tolerant to eventual consistencty.

If you want repeatable reads that seems more like a concern that would be per reconciliation.

Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
@csviri csviri changed the title fix: extremely rare race-condition when getting the resource from cache fix: extremely rare race condition when getting the resource from cache Mar 17, 2026
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 17, 2026
@csviri
Copy link
Collaborator Author

csviri commented Mar 17, 2026

It might be actually more efficient to work only with the cached resource and not the lastSyncVersion that would basically provide a "snapshot isolation".

Do you really want snapshot isolation here? Naively I would have thought that all logic should be tolerant to eventual consistencty.

If you want repeatable reads that seems more like a concern that would be per reconciliation.

That is not the goal really, it is just much easier to follow the logic this way and fixes the race condition

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes an extremely rare race condition when reading a resource from ManagedInformerEventSource#get(ResourceID) by preferring the temporary cache entry when it is newer than (or missing from) the informer cache, avoiding reliance on a namespace-level “last sync” version that can change due to unrelated events.

Changes:

  • Adjust cache selection logic in get(ResourceID) to compare the temporary cache entry directly against the informer cache entry’s resourceVersion.
  • Update debug logging around cache hits/misses to include temp-cache presence.
Comments suppressed due to low confidence (1)

operator-framework-core/src/main/java/io/javaoperatorsdk/operator/processing/event/source/informer/ManagedInformerEventSource.java:215

  • The debug log message is hard to parse ("Resource found in temp cache: {}, or older, in temporary cache") and the first placeholder is a boolean, not the resource/age. Consider rewording to something unambiguous like separate flags for temp/informer presence and (if relevant) whether the temp entry was newer.
        resource.isPresent(),
        res.isPresent(),
        resourceID);
    return res;
  }


You can also share your feedback on Copilot code review. Take the survey.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@csviri
Copy link
Collaborator Author

csviri commented Mar 17, 2026

@shawkins feel free to come up with an alternative approach

@shawkins
Copy link
Collaborator

That is not the goal really, it is just much easier to follow the logic this way and fixes the race condition

There will still be race conditions. The change here opens the logic up more to returning stale or ghost entries - it really just depends upon whatever lag exists between what the informer cache has seen and the events processed by the temporary resource cache.

@csviri
Copy link
Collaborator Author

csviri commented Mar 17, 2026

That is not the goal really, it is just much easier to follow the logic this way and fixes the race condition

There will still be race conditions. The change here opens the logic up more to returning stale or ghost entries - it really just depends upon whatever lag exists between what the informer cache has seen and the events processed by the temporary resource cache.

So the primary goal, is to fix the issue that this test fails:

context.resourceOperations().serverSideApply(prepareCM(resource));
var cachedCM = context.getSecondaryResource(ConfigMap.class);
if (cachedCM.isEmpty()) {
issueFound.set(true);
throw new IllegalStateException("Error for resource: " + ResourceID.fromResource(resource));
}

I don't see how ghost resources cleanup interference with this, could you please describe a scenario for race condition? @shawkins

@shawkins
Copy link
Collaborator

shawkins commented Mar 17, 2026

I don't see how ghost resources cleanup interference with this, could you please describe a scenario for race condition? @shawkins

I'm not saying that it interfers, I mean that the existing check prevents ManagedInformerEventSource from return knowingly stale / ghost entries wrt to the current state of the cache.

The change relaxes that.

So the primary goal, is to fix the issue that this test fails:

May need more investigation. If the informer cache has moved ahead, and no entry exists for cachedCM, yet you expect that the api server state should have that ConfigMap, then there's a problem.

csviri added 2 commits March 17, 2026 20:41
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
@csviri
Copy link
Collaborator Author

csviri commented Mar 17, 2026

I'm not saying that it interfers, I mean that the existing check prevents ManagedInformerEventSource from return knowingly stale / ghost entries wrt to the current state of the cache.

Yes, that is true. With this change ghost resources are just an eventual consistency problem, while it fixes a correctness problem. Also note that ghost resources are quite special case, that requires a delete operation on the created resource with timing when informer down and the requires a re-list. With the fabric8 improvement event that is minimized.

I would say, let's give it some time think a bit if we can find a better solution, but if not I would prefer go this way.

@shawkins
Copy link
Collaborator

I would prefer go this way

I wouldn't proceed with this change - that could just be masking another problem.

Do you have a full log including the client debug? If the entry is supposed to be in the informer cache, but isn't we need to know why.

@csviri
Copy link
Collaborator Author

csviri commented Mar 17, 2026

Do you have a full log including the client debug? If the entry is supposed to be in the informer cache, but isn't we need to know why.

No we can turn on additional logs, but I think there is clearly this case when this can happen:

@Override
public Optional<R> get(ResourceID resourceID) {
Optional<R> resource = temporaryResourceCache.getResourceFromCache(resourceID);
var res = cache.get(resourceID);
if (comparableResourceVersions
&& resource.isPresent()
&& ReconcilerUtilsInternal.compareResourceVersions(
resource.get().getMetadata().getResourceVersion(),
manager().lastSyncResourceVersion(resource.get().getMetadata().getNamespace()))
> 0) {
log.debug("Latest resource found in temporary cache for Resource ID: {}", resourceID);
return resource;
}
log.debug(
"Resource not found, or older, in temporary cache. Found in informer cache {}, for"
+ " Resource ID: {}",
res.isPresent(),
resourceID);
return res;
}

On:

  1. On line 194 we see there is a resource in TRC
  2. on 195 we don't see the resource in cache yet
  3. At this point we receive an add event that moves the lastSyncVersion
  4. we return return an empty optional since the resource from TRC has the same resource version than lastSyncVersion
    but the Optional read at 195 is empty.

This scenario corresponds with what I see in the logs. Theoretically we could re-read the cache also on line 210 if the res is empty as a workaround.

@csviri
Copy link
Collaborator Author

csviri commented Mar 17, 2026

added alternative: #3238

@csviri csviri changed the title fix: extremely rare race condition when getting the resource from cache fix: race condition on get when add received after informer cache read Mar 17, 2026
@shawkins
Copy link
Collaborator

Ok, so to do this consistently we need to read first from TRC, then check the last sync version, then read from the informer cache if needed.

Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
@csviri
Copy link
Collaborator Author

csviri commented Mar 18, 2026

Ok, so to do this consistently we need to read first from TRC, then check the last sync version, then read from the informer cache if needed.

Ahh, yes 🤦‍♂️ , that will solve it. thank you. updated this PR.

Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
@csviri
Copy link
Collaborator Author

csviri commented Mar 18, 2026

@shawkins please review and approve if this makes sense for you. And mainly thank you very much!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants