Skip to content

Conversation

@onelapahead
Copy link
Contributor

Proposed changes

Original downstream PR: kaleido-io#136

From @peterbroadhurst on why this fix is safe in relation to idempotency behavior: kaleido-io#136 (review)

Types of changes

  • Updates Ethereum blockchain package to:

    • handle 409s from connectors as non-errors for contract deploys and invokes
    • Adds retry optimizations to the HTTP client to avoid retrying on common 4xx errors
  • Updates E2E for Ethereum to account for 409 vs 202 race on idempotent contract invokes

  • Fixes a flaky test in the IPFS downloader and avoids data races highlighted by go test -race

  • Bug fix

  • New feature added

  • Documentation Update

Please make sure to follow these points

  • I have read the contributing guidelines.
  • I have performed a self-review of my own code or work.
  • I have commented my code, particularly in hard-to-understand areas.
  • My changes generates no new warnings.
  • I have added tests that prove my fix is effective or that my feature works.
  • My changes have sufficient code coverage (unit, integration, e2e tests).

Screenshots (If Applicable)

N/A - see logs below

Other Information

Original PR description and logs:

On contract invokes we observed an invoke submission being sent to an EVMConnect, but the connection was closed or otherwise disrupted resulting in a retry on status=0:

[2025-12-30T12:34:34.510Z] DEBUG ==> POST https://an-evmconnect.svc.cluster.local:5000/ breq=HTW53IrH httpreq=cWpubUxN pid=1 req=goWyzk4W
	2025-12-30 07:34:49.511	
[2025-12-30T12:34:49.511Z]  INFO retry 1/5 (min=250ms/max=30000ms) status=0 breq=HTW53IrH httpreq=cWpubUxN pid=1 req=goWyzk4W
	2025-12-30 07:34:49.762	

As a result, the EVMConnect did successful process the invoke operation submitted originally unbeknownst to FF, and so when FF then retried it:

[2025-12-30T12:34:49.762Z] DEBUG ==> POST https://an-evmconnect.svc.cluster.local:5000/ breq=HTW53IrH httpreq=cWpubUxN pid=1 req=goWyzk4W
	2025-12-30 07:34:50.272	
[2025-12-30T12:34:50.272Z] ERROR <== POST https://an-evmconnect.svc.cluster.local:5000/ [409] (15761ms) breq=HTW53IrH httpreq=cWpubUxN pid=1 req=goWyzk4W
	2025-12-30 07:34:50.272	

Once the 5 retries were done, we finally then wrapped the error and returned the 409 back to the requestor:

[2025-12-30T12:34:55.502Z] ERROR Transaction submission failed [submissionRejected=false]: FF10458: Conflict from blockchain connector: FF21065: ID 'a-namespace:cd3de8d2-9bf9-4181-a3b3-64043ccc2a4e' is not unique httpreq=cWpubUxN pid=1 req=goWyzk4W

And in the background, the txn listener sees the operation succeeded asynchronously:

[2025-12-30T12:34:55.033Z]  INFO Received operation update: status=Succeeded request=a-namespace:cd3de8d2-9bf9-4181-a3b3-64043ccc2a4etx=0x4de0bcd91f557f95033b58ac95c1ae1464478daa6f357c7e69ffdb8837205e5f message= namespace=a-namespace pid=1 proto=ethereum role=event-loop

Note submissionReject=false. Really the operation ends up succeeding, but aside from the error message containing the operation ID, there's no clean way for the requestor to detect this, get the operation ID, and check the operation status for themselves. While idempotency keys might be helpful, this PR proposes:

  1. we should not retry on 404s / 409s to avoid unnecessary latency / conns / CPU / etc. on retries which never succeed
  2. for invoke/deploy contracts, we detect 409s and the submissionRejected=false, if so we should not return an error so the operation can be properly marked as pending.
    a. do worry there's a race here, will the operation update indicating it succeeded, beat the retry that results in a 409 to initially mark it as pending ?

… and Mark Operations Pending

Signed-off-by: hfuss <hayden.fuss@kaleido.io>
Signed-off-by: hfuss <hayden.fuss@kaleido.io>
Signed-off-by: hfuss <hayden.fuss@kaleido.io>
Signed-off-by: hfuss <hayden.fuss@kaleido.io>
… race conditions

Signed-off-by: hfuss <hayden.fuss@kaleido.io>
Signed-off-by: hfuss <hayden.fuss@kaleido.io>
Signed-off-by: hfuss <hayden.fuss@kaleido.io>
@onelapahead onelapahead requested a review from a team as a code owner January 8, 2026 14:15
@onelapahead onelapahead changed the title 409s invoke retries [blockchain] [ethereum] Avoid 409 Retries on Contract Deploys/Invokes and Mark Operations Pending Jan 8, 2026
@codecov
Copy link

codecov bot commented Jan 8, 2026

Codecov Report

❌ Patch coverage is 83.33333% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 99.96%. Comparing base (df3fe81) to head (ff2ce27).

Files with missing lines Patch % Lines
internal/blockchain/ethereum/ethereum.go 50.00% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1715      +/-   ##
==========================================
- Coverage   99.96%   99.96%   -0.01%     
==========================================
  Files         342      342              
  Lines       25022    25028       +6     
==========================================
+ Hits        25014    25018       +4     
- Misses          4        5       +1     
- Partials        4        5       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Contributor

@EnriqueL8 EnriqueL8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! A few clean up comments and needs test to match coverage please

The detailed explanation was super helpful, just want to validate that this comment is in fact still true kaleido-io#136 (comment) as should be documented in FFTM incase the behaviour changes

and this worry:
do worry there's a race here, will the operation update indicating it succeeded, beat the retry that results in a 409 to initially mark it as pending ?

This could be a genuine race condition but there is a workaround of fetching the status of the operation to reevaluate it

called <- true
})
mom.On("SubmitOperationUpdate", mock.Anything).Return(nil)
//mom.On("SubmitOperationUpdate", mock.Anything).Return(nil)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this line?

for {
suite.T().Logf("Waiting for invoke operation to succeed: %s", res.ID.String())
op := suite.testState.client1.GetOperation(suite.T(), res.ID.String())
if op.Status == core.OpStatusSucceeded {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems we are stuck here if it fails, until the timeout? Should we just check for failed status and make potential move this to a common test utils, will leave it up to you on this one

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants