Skip to content

OAK-12128: Fail earlier (and add more diags) when detecting late writes#2792

Open
reschke wants to merge 9 commits intotrunkfrom
OAK-12128
Open

OAK-12128: Fail earlier (and add more diags) when detecting late writes#2792
reschke wants to merge 9 commits intotrunkfrom
OAK-12128

Conversation

@reschke
Copy link
Contributor

@reschke reschke commented Mar 9, 2026

@reschke reschke marked this pull request as draft March 9, 2026 17:33
@reschke reschke requested a review from Joscorbe March 9, 2026 17:34
@reschke reschke self-assigned this Mar 9, 2026
@reschke reschke requested review from mbaedke and rishabhdaim March 9, 2026 17:39
@Joscorbe
Copy link
Member

Joscorbe commented Mar 9, 2026

I wonder if we should add a distinct log message when the lease check fails after the operation has already run (in the finally block). That would make it easier to find and analyze cases where the lease was lost during the call.

try {
operation.run();
} finally {
performLeaseCheck();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't tested this, but isn't it possible that the operation throws an exception and the lease check in finally also fails, so the DocumentStoreException from the lease check would mask the exception thrown by the operation?

Copy link
Contributor Author

@reschke reschke Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think. Note that this case apparently has no test coverage (or it is the reason for the tests complaining about the incorrect exception message). Let me try that.

EDIT: yes, that's the reason. Good catch.

@reschke
Copy link
Contributor Author

reschke commented Mar 10, 2026

I wonder if we should add a distinct log message when the lease check fails after the operation has already run (in the finally block). That would make it easier to find and analyze cases where the lease was lost during the call.

@Joscorbe - https://issues.apache.org/jira/browse/OAK-12128?focusedCommentId=18064356&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-18064356

Well, help in analyzing is the whole point of this PR. The question is whether it actually adds anything, because next a DSException will be thrown anyway.

Adding this requires minor changes in ClusterNode - API to check without throwing; or potentially to performLeaseCheck().

@reschke reschke marked this pull request as ready for review March 10, 2026 08:12
@reschke
Copy link
Contributor Author

reschke commented Mar 10, 2026

Note: I could minimize the diff by re-adding the finals.

@reschke
Copy link
Contributor Author

reschke commented Mar 10, 2026

@reschke reschke requested review from Joscorbe and rishabhdaim March 10, 2026 08:34
@reschke
Copy link
Contributor Author

reschke commented Mar 10, 2026

The test fails for me as well. Occasionally.

@reschke
Copy link
Contributor Author

reschke commented Mar 17, 2026

Changes:

  1. Late writes are now ERROR-logged (code change minimal so that we dnot have to touch ClusterNodeInfo.
  2. Test assertion changed so that 'both' exception messages are tolerated.

clusterNodeInfo.performLeaseCheck();
} catch (DocumentStoreException ex) {
if (after) {
LOG.error("Potential late write operation detected", new Exception("call stack"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wondering, in case of a late write we might be under memory pressure. in which case even new Exception might not work. what about doing 2 logs, first one without the exception, followed by a second one with the exception?

@reschke reschke requested a review from stefan-egli March 17, 2026 17:46
@sonarqubecloud
Copy link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants