roachtest: GCS failover for oversized artifacts.zip #170147

Open
williamchoe3 wants to merge 2 commits into cockroachdb:master from williamchoe3:roachtest-artifact-failover

@williamchoe3
Contributor

@williamchoe3 williamchoe3 commented May 11, 2026

Previously, artifacts.zip was always uploaded via the TeamCity service message "##teamcity[publishArtifacts '%s']" (formatted with t.artifactsSpec). The current size limit for this upload method is 7GB. We have started to observe some tests attempting to upload an artifacts.zip far beyond that limit, i.e. 20+GB.

This change adds a GCS failover mechanism to roachtest that triggers when artifacts.zip is >= 7GB. Instead of being uploaded to TC, the artifact is uploaded to GCS. In the large artifact's place on TC, a small breadcrumb file, artifacts-failover.txt, is added that points to the artifact's URI on GCS. See below for an example. A GitHub issue for a failed test that had its artifacts uploaded to GCS will then link to this breadcrumb file. A timeout of 2 hours is currently used. The failover threshold and bucket default to production values, but can be overridden for smoke tests with ROACHTEST_ARTIFACT_FAILOVER_MAX_BYTES and ROACHTEST_ARTIFACT_FAILOVER_BUCKET. GCS object names are scoped by build and test to avoid collisions.
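For readers who want a concrete picture of the threshold check, the env var overrides, and the scoped object names, here is a minimal sketch. It is not the code in this PR: the type and function names (failoverConfig, loadFailoverConfig, shouldFailover, objectName) are hypothetical, the default of 7 GiB is an assumption about what "7GB" means, and the bucket name and object layout are inferred from the example breadcrumb in this conversation.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// Assumed defaults for illustration only. The PR states a 7GB limit; whether
// that is 7 GiB or 7e9 bytes is not specified. The bucket name is taken from
// the example breadcrumb.
const (
	defaultMaxBytes int64 = 7 << 30
	defaultBucket         = "roachtest-artifact-failover"
)

// failoverConfig (hypothetical) holds the threshold and destination bucket.
type failoverConfig struct {
	maxBytes int64
	bucket   string
}

// loadFailoverConfig applies the two env var overrides named in the PR
// description. getenv is injected so the logic can be exercised without
// touching the process environment.
func loadFailoverConfig(getenv func(string) string) (failoverConfig, error) {
	cfg := failoverConfig{maxBytes: defaultMaxBytes, bucket: defaultBucket}
	if v := getenv("ROACHTEST_ARTIFACT_FAILOVER_MAX_BYTES"); v != "" {
		n, err := strconv.ParseInt(v, 10, 64)
		if err != nil || n <= 0 {
			return cfg, fmt.Errorf("invalid ROACHTEST_ARTIFACT_FAILOVER_MAX_BYTES %q", v)
		}
		cfg.maxBytes = n
	}
	if v := getenv("ROACHTEST_ARTIFACT_FAILOVER_BUCKET"); v != "" {
		cfg.bucket = v
	}
	return cfg, nil
}

// shouldFailover reports whether a zip of the given size goes to GCS.
// The comparison is inclusive, matching the ">= 7GB" wording above.
func (c failoverConfig) shouldFailover(size int64) bool {
	return size >= c.maxBytes
}

// objectName scopes the object by build ID, test name, and run number so that
// concurrent builds and repeated runs do not collide.
func objectName(buildID, testName string, run int) string {
	return fmt.Sprintf("teamcity/%s/%s/run_%d/artifacts.zip", buildID, testName, run)
}

func main() {
	cfg, err := loadFailoverConfig(os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	const size = 27840000000 // size from the example breadcrumb
	if cfg.shouldFailover(size) {
		fmt.Printf("gs://%s/%s\n", cfg.bucket, objectName("12345", "kv/restart/nodes=12", 1))
	}
}
```

A smoke test could then set ROACHTEST_ARTIFACT_FAILOVER_MAX_BYTES to a small value to exercise the failover path without producing a multi-gigabyte zip.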

Alternatively, we could simply increase the TeamCity upload limit, but that is a global setting that would apply to all projects. We also do not know what the maximum artifact size is now or will be in the future, so if we relied on that setting, we would have to guess a value and adjust it as needed.

Epic: none

@trunk-io
Contributor

trunk-io Bot commented May 11, 2026

Merging to master in this repository is managed by Trunk.

  • To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here.

@cockroach-teamcity
Member

This change is Reviewable

@williamchoe3 williamchoe3 force-pushed the roachtest-artifact-failover branch from 06b04f9 to 3b4590b Compare May 11, 2026 19:11
@williamchoe3
Contributor Author

williamchoe3 commented May 11, 2026

This assumes the invoking script process already has GCP credentials available, which are set up in build/teamcity/util/roachtest_util.sh. I'm assuming the right service account is serviceAccount:teamcity-nightly@cockroach-ephemeral.iam.gserviceaccount.com, but I couldn't confirm it.

TODO:

  • As a follow-up commit or PR, move the Datadog upload to happen after the TC upload so that the TC failover logs can be indexed in Datadog and we can create log monitors. (Doing this in a separate commit may make backporting easier.)

@williamchoe3 williamchoe3 force-pushed the roachtest-artifact-failover branch 7 times, most recently from 65f0581 to 33a2d03 Compare May 11, 2026 19:54
@williamchoe3
Contributor Author

williamchoe3 commented May 11, 2026

artifacts-failover.txt

artifacts.zip exceeded TeamCity's per-file artifact limit and was uploaded to GCS.

GCS URI: gs://roachtest-artifact-failover/teamcity/12345/kv/restart/nodes=12/run_1/artifacts.zip
Size: 27840000000 bytes
Uploaded at: 2026-05-11T18:42:15Z
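
The breadcrumb above is plain text, so producing it could be as simple as the sketch below. The function names (breadcrumb, exampleBreadcrumb) are hypothetical and the exact wording and field order are copied from the example rather than from the PR's code, so treat this as an illustration of the format only.

```go
package main

import (
	"fmt"
	"time"
)

// breadcrumb renders the contents of artifacts-failover.txt in the layout
// shown in the example: a one-line explanation, then URI, size, and an
// RFC 3339 UTC timestamp.
func breadcrumb(gcsURI string, size int64, uploadedAt time.Time) string {
	return fmt.Sprintf(
		"artifacts.zip exceeded TeamCity's per-file artifact limit and was uploaded to GCS.\n\n"+
			"GCS URI: %s\nSize: %d bytes\nUploaded at: %s\n",
		gcsURI, size, uploadedAt.UTC().Format(time.RFC3339))
}

// exampleBreadcrumb reproduces the example values from this comment.
func exampleBreadcrumb() string {
	return breadcrumb(
		"gs://roachtest-artifact-failover/teamcity/12345/kv/restart/nodes=12/run_1/artifacts.zip",
		27840000000,
		time.Date(2026, time.May, 11, 18, 42, 15, 0, time.UTC),
	)
}

func main() {
	fmt.Print(exampleBreadcrumb())
}
```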

@williamchoe3 williamchoe3 force-pushed the roachtest-artifact-failover branch from 33a2d03 to d30e323 Compare May 11, 2026 20:29
@williamchoe3 williamchoe3 changed the title roachtest: fail over oversized artifacts to GCS roachtest: GCS failover for oversized artifacts.zip May 11, 2026
@williamchoe3 williamchoe3 force-pushed the roachtest-artifact-failover branch 2 times, most recently from b295625 to e925dee Compare May 11, 2026 20:41
@williamchoe3 williamchoe3 force-pushed the roachtest-artifact-failover branch 2 times, most recently from e8bbdef to ced67a1 Compare May 11, 2026 21:12
@williamchoe3 williamchoe3 marked this pull request as ready for review May 11, 2026 21:44
@williamchoe3 williamchoe3 requested review from a team as code owners May 11, 2026 21:44
@williamchoe3 williamchoe3 requested review from golgeek and shailendra-patel and removed request for a team May 11, 2026 21:44
@williamchoe3 williamchoe3 force-pushed the roachtest-artifact-failover branch from 6e7bb54 to 6160346 Compare May 11, 2026 21:47
Roachtest currently publishes artifacts.zip with TeamCity's publishArtifacts
service message, which has a 7 GB limit. Some runs now produce artifacts.zip
files above 20 GB, causing TeamCity artifact publication to fail.

Add a GCS failover path for artifacts.zip files at or above 7 GB. Instead of
uploading the large artifacts.zip to TeamCity, roachtest uploads it to GCS and
publishes a small artifacts-failover.txt breadcrumb file pointing to the GCS
URI.

Release note: None
@williamchoe3 williamchoe3 force-pushed the roachtest-artifact-failover branch from 6160346 to 79390b3 Compare May 11, 2026 22:00
// has completed. We're using the exact same destination to avoid
// duplication of any of the artifacts.
shout(ctx, l, stdout, "##teamcity[publishArtifacts '%s']", t.artifactsSpec)
publishTeamCityArtifacts := func() {
Contributor Author

I could move this into a helper; runTest was already really long before this change (350 lines). I can go either way.
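
To make the helper option concrete, here is one shape it might take. This is only a sketch under assumptions, not the code in this PR: uploader, publishSpec, and demoSpec are hypothetical names, the GCS client is hidden behind a function type so the example stays stdlib-only, and returning an error on upload failure is an assumed policy. The helper picks the path to hand to the publishArtifacts service message and bounds the upload with the 2 hour timeout mentioned in the description.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// uploader is a hypothetical stand-in for whatever performs the GCS upload.
type uploader func(ctx context.Context, localPath, gcsURI string) error

// publishSpec returns the path to publish to TeamCity. Below the threshold it
// is the zip itself; at or above it, the zip is uploaded to GCS under a
// timeout and the breadcrumb path is returned instead.
func publishSpec(
	ctx context.Context,
	size, maxBytes int64,
	timeout time.Duration,
	zipPath, breadcrumbPath, gcsURI string,
	upload uploader,
) (string, error) {
	if size < maxBytes {
		return zipPath, nil
	}
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	if err := upload(ctx, zipPath, gcsURI); err != nil {
		// Assumed policy: surface the error and let the caller decide.
		return "", fmt.Errorf("GCS failover upload of %s: %w", zipPath, err)
	}
	return breadcrumbPath, nil
}

// demoSpec wires publishSpec to a fake uploader that succeeds unless the
// context is already done. The 7 GiB threshold is an assumption.
func demoSpec(size int64) (string, error) {
	fake := func(ctx context.Context, _, _ string) error { return ctx.Err() }
	return publishSpec(
		context.Background(), size, 7<<30, 2*time.Hour,
		"artifacts.zip", "artifacts-failover.txt",
		"gs://roachtest-artifact-failover/teamcity/12345/kv/restart/nodes=12/run_1/artifacts.zip",
		fake,
	)
}

func main() {
	spec, err := demoSpec(27840000000)
	if err != nil {
		fmt.Println("failover failed:", err)
		return
	}
	// Same service message format as the existing shout call.
	fmt.Printf("##teamcity[publishArtifacts '%s']\n", spec)
}
```

Keeping the decision in a function that returns a path would leave runTest with a single shout call, at the cost of one more indirection.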


3 participants