Skip to content

[DX-4035] shard go_core_*_tests#22286

Open
Tofel wants to merge 16 commits intodevelopfrom
dx-4035-experiment-with-go-core-test-sharding
Open

[DX-4035] shard go_core_*_tests#22286
Tofel wants to merge 16 commits intodevelopfrom
dx-4035-experiment-with-go-core-test-sharding

Conversation

@Tofel
Copy link
Copy Markdown
Contributor

@Tofel Tofel commented May 4, 2026

Approach

This PR shards go_core_tests, go_core_tests_integration, and go_core_deployment_tests in CI to reduce flakiness caused by running too many test packages in parallel in a single job.

The sharding logic is intentionally simple and deterministic. Each existing test runner first computes the exact package list it already intends to run today, then passes that package list to a small Go helper. The helper assigns each package to a shard by hashing the package path with FNV-1a and taking hash % shard_count. This gives a stable, zero-maintenance partitioning scheme: no manual package lists, no balancing tables, and new test packages are picked up automatically.

Testing

Correctness is guarded in two ways. First, a consistency check verifies that for a given package universe, the union of all shards covers every package exactly once. Second, I compared the executed test sets between unsharded and sharded runs by extracting package + test name from gotestsum --jsonfile output and diffing the sorted results. For go_core_tests, go_core_tests_integration, and go_core_deployment_tests, the executed test sets matched exactly between the non-sharded and sharded runs.

Results:

  • go_core_tests
  • go_core_deployment_tests
  • go_core_integration_tests

Rejected approaches

Sharding based on regex/test names

We considered naive sharding by test-name regex, but rejected it because Go executes tests at package granularity and the suite is not evenly distributed by test-name prefix. In practice, leading-letter buckets were highly imbalanced, and the dominant runtime/flakiness hotspots are concentrated in a small number of heavy packages. Package-path hashing gives a simpler, future-proof, zero-maintenance partitioning mechanism that aligns with go test execution semantics.


Follow up work

Speeding up outlier packages

In order to fully unlock workflow execution speed up some of the deployments tests will need to made faster by respective teams. Once that is done plausible p50 of CI Core might land in the zone of 10-11 minutes.

Current execution swim line is shown below:
image

Slowest packages that need speeding up:

ok  	github.com/smartcontractkit/chainlink/deployment/ccip/changeset/solana_v0_1_1	232.204s
ok  	github.com/smartcontractkit/chainlink/deployment/ccip/changeset/v1_6	383.140s
ok  	github.com/smartcontractkit/chainlink/deployment/ccip/changeset/aptos	683.586s

Sharding fuzz tests

As can be seen on the graph above the next far outlier are fuzz tests. Once we shard them (2 shards should be enough) we should be able to cut down workflow execution time even further.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 4, 2026

✅ No conflicts with other open PRs targeting develop

@trunk-io
Copy link
Copy Markdown

trunk-io Bot commented May 4, 2026

Static BadgeStatic BadgeStatic BadgeStatic Badge

View Full Report ↗︎Docs

Comment thread .github/workflows/ci-core.yml Fixed
Comment thread .github/workflows/ci-core.yml Fixed
Comment thread .github/workflows/ci-core.yml Fixed
Comment thread .github/workflows/ci-core.yml Fixed
Comment thread .github/workflows/ci-core.yml Fixed
Comment thread .github/workflows/ci-core.yml Fixed
Comment thread .github/workflows/ci-core.yml Fixed
Comment thread .github/workflows/ci-core.yml Fixed
@Tofel Tofel force-pushed the dx-4035-experiment-with-go-core-test-sharding branch 3 times, most recently from 72a9f6c to a9749e1 Compare May 4, 2026 16:15
@Tofel Tofel force-pushed the dx-4035-experiment-with-go-core-test-sharding branch from a9749e1 to cc0765d Compare May 4, 2026 17:27
Comment thread .github/workflows/ci-core.yml Fixed
Comment thread .github/workflows/ci-core.yml Fixed
@Tofel Tofel force-pushed the dx-4035-experiment-with-go-core-test-sharding branch from aca57a2 to 9e98a2f Compare May 5, 2026 06:13
@Tofel Tofel marked this pull request as ready for review May 5, 2026 07:20
@Tofel Tofel requested a review from a team as a code owner May 5, 2026 07:20
Copilot AI review requested due to automatic review settings May 5, 2026 07:20
@Tofel Tofel requested review from a team as code owners May 5, 2026 07:20
"run-timeout-minutes": "15",
"touch-web-assets": "true",
"restore-build-cache-only": "true",
"increase-timeouts-on-schedule": "true",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is there an additional qualifier for timeout increases?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's how it worked before, we increase the default timeouts for fuzz/race tests and pass them as env vars to bash scripts that run them.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was not an extra qualifier before. We simply checked if we had a scheduled run, then increased timeouts.

"touch-web-assets": "true",
"restore-build-cache-only": "true",
"increase-timeouts-on-schedule": "true",
"print-races": "true",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need an extra variable to ensure that races are printed?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But that code that you linked does not include an extra parameter to indicate if races should be printed. It checks if we are running races via matrix.type.cmd == 'go_core_race_tests', and if so it prints them. If we need a different means of checking whether we are in race mode now, then so be it, but I don't think we should add extra params for each action that is implied.

Suggested change
"print-races": "true",
"race": "true",

"logs-artifact-name": f"go_core_tests_integration_logs_shard_{i}",
})

def add_core_deployment_rows():
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we need touch-web-assets for this one? Isn't that necessary to compile the core binary?

Copy link
Copy Markdown
Contributor Author

@Tofel Tofel May 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because these tests run in deployments folder, which is a separate Go module that doesn't compile core/web package.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the rationale for including tests from that module here? This is otherwise a consistent set of variations on the core tests.

@Tofel
Copy link
Copy Markdown
Contributor Author

Tofel commented May 5, 2026

What is the overall before vs. after here?

@jmank88 Roughly speaking it is the same, around 17 minutes, because 3 packages mentioned in the PR are resulting in a long tail. If we do go the sharding way then I will go after teams that own them and request they speed them up so that we can reap the benefits of sharding.

Until then the only benefit will be less flakyness.

@jmank88
Copy link
Copy Markdown
Contributor

jmank88 commented May 5, 2026

What is the overall before vs. after here?

@jmank88 Roughly speaking it is the same, around 17 minutes, because 3 packages mentioned in the PR are resulting in a long tail. If we do go the sharding way then I will go after teams that own them and request they speed them up so that we can reap the benefits of sharding.

Until then the only benefit will be less flakyness.

Huh, I only expected speedup. Why would there be less flakyness? I don't think it is true that:

flakiness caused by running too many test packages in parallel in a single job.

Too many packages is not inherently a problem because concurrency is already limited. There has to be more to it than that. Is the database contention the problem?

@Tofel
Copy link
Copy Markdown
Contributor Author

Tofel commented May 5, 2026

Huh, I only expected speedup.

Hard to get a speed up if we shard by package level if have packages that run for 11 minutes. If these are sped up them we will see a big speed up.

Too many packages is not inherently a problem because concurrency is already limited.

You mean limited by GOMAXPROCS?

Is the database contention the problem?

I think it is one of the problems. Since now every job will start its own DB, issues that were related to DB contention should also have lower impact.

@Tofel Tofel dismissed stale reviews from mchain0 and skudasov via b0aae99 May 5, 2026 12:33
@Tofel Tofel force-pushed the dx-4035-experiment-with-go-core-test-sharding branch from b0aae99 to b1e1a14 Compare May 5, 2026 12:36
@jmank88
Copy link
Copy Markdown
Contributor

jmank88 commented May 5, 2026

You mean limited by GOMAXPROCS?

Indirectly, yes. Both package and test level concurrency.

Is the database contention the problem?

I think it is one of the problems. Since now every job will start its own DB, issues that were related to DB contention should also have lower impact.

IMHO it is important to be certain about this. If avoiding DB contention is the main difference, then there are simpler approaches that we can try, and some of those will benefit local development as well rather than addressing only CI.

"pkg/c",
"pkg/d",
"pkg/e",
"pkg/f",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not seeing a duplicate case

mchain0
mchain0 previously approved these changes May 5, 2026
mchain0
mchain0 previously approved these changes May 6, 2026
@Tofel Tofel force-pushed the dx-4035-experiment-with-go-core-test-sharding branch 2 times, most recently from 6ad0e54 to fbf5288 Compare May 7, 2026 07:43
@Tofel Tofel force-pushed the dx-4035-experiment-with-go-core-test-sharding branch from fbf5288 to 6d2a935 Compare May 7, 2026 07:46
@cl-sonarqube-production
Copy link
Copy Markdown

@Tofel Tofel requested review from jmank88, mchain0 and skudasov and removed request for skudasov May 7, 2026 09:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants