fix(test): resolve rate-limit failures when downloading datasets #9658

shiva-istari wants to merge 1 commit into main from
Conversation
| "ldbcTypes.schema": "https://media.githubusercontent.com/media/dgraph-io/dgraph-benchmarks/refs/heads/main/ldbc/sf0.3/ldbcTypes.schema", | ||
| } | ||
|
|
||
| func wgetWithRetry(fname, url, dir string) error { |
This function is a near-exact duplicate of downloadFile in dgraphtest/load.go. Is there a reason it needs to be redefined here instead of moving it to a shared location and reusing it?
This is especially concerning here because the function contents have clearly had to evolve over time as different failure causes were found and more guards were added. Every place it's duplicated is one more place to find and fix (or forget to fix) when the next issue arises.
The duplication is compounded by the double-retry approach — both copies pass --tries=3 --waitretry=5 --retry-connrefused to wget AND wrap that in a Go-level 3-attempt loop with exponential backoff, meaning up to 9 wget invocations per file with cumulative delays that could reach several minutes before a genuine failure is reported. If this retry logic needs tuning in the future (and it likely will), having it in exactly one place is essential.
```yaml
uses: actions/cache/restore@v4
with:
  path: dgraphtest/datafiles
  key: dataset-dgraphtest-v1
```
@shiva-istari one thing I'm concerned about here is that what version of the dataset fixtures is used is non-deterministic, and there's no way to know for sure which version of which fixture was used in a particular build if the cache was used. Which build seeded that cache? Which version of which fixture was cached when it did?
There's a hardcoded "-v1" suffix on the key, which AFAIK doesn't refer to any version of the test data being used — just the version of the key name itself.
I think instead either:

- The release tag of the test fixtures to use should be configured in an env var or project file, such that:
  - Only the versions of the fixtures at that tag can be downloaded.
  - The cache key suffix used must match the release tag (e.g. `dataset-dgraphtest-<release-tag>`).
- If possible, make it controllable via a test execution runtime flag as well.
This will ensure builds get the same version of the test data they would on a fresh checkout even when a cache copy is used, that builds are idempotent and test results are reproducible. And it ensures that any version of a given test data asset is downloaded only once, then cached and reused by every subsequent build that uses it.
Additional concern: The file-exists check that guards re-downloading (fi.Size() > 0) won't catch corrupted or truncated files from a previous failed run. Combined with replacing MakeDirEmpty with os.MkdirAll (which preserves existing content), there's no mechanism to detect or recover from a bad cached file. A checksum validation or known-size check would improve reliability here.
mlwelles left a comment
Additional findings beyond the code duplication and cache key concerns already noted:
Note: t/t.go:1133 still has var suffix = "?raw=true" which gets appended to media.githubusercontent.com URLs via baseUrl + name + suffix. The ?raw=true parameter is a GitHub blob-URL convention and is unnecessary for direct CDN URLs. Not harmful, but misleading — should be removed for consistency with the other URL updates in this PR.
```go
cmd := exec.Command("wget", "--tries=3", "--waitretry=5", "--retry-connrefused", "-O", fname, url)
cmd.Dir = datasetFilesPath

if out, err := cmd.CombinedOutput(); err != nil {
```
The Go-level retry loop (3 attempts) wraps a wget invocation that itself retries 3 times (--tries=3 --waitretry=5). This means up to 9 wget requests per file, with cumulative backoff delays that could reach several minutes before a genuine failure is reported.
Consider either:
- Removing the wget-level retries (`--tries=1`) and letting the Go loop handle all retry logic, or
- Removing the Go-level loop and relying solely on wget's built-in retry mechanism.
Having both layers is redundant and makes the failure timeline hard to reason about.
```go
_ = os.RemoveAll(*tmp)
if err != nil {
	os.Exit(1)
}
```
Removing os.RemoveAll(*tmp) is necessary for the caching strategy, but it silently changes behavior for local development — test data will now accumulate in the temp directory across runs and never be cleaned up.
Rather than removing this unconditionally, add a new flag to control it:
```go
keepData = pflag.Bool("keep-data", false,
	"Preserve downloaded test data after run (for CI caching). Default cleans up.")
```

Then in main(), restore the cleanup but make it conditional:

```go
err := run()
if !*keepData {
	_ = os.RemoveAll(*tmp)
}
```

The CI workflows would then pass --keep-data explicitly:

```sh
cd t; ./t --suite=load --tmp=${{ github.workspace }}/test-data --keep-data
```

This fits the existing flag pattern (similar to --keep for clusters, --download for resources), is explicit and discoverable via --help, and avoids coupling behavior to the environment.
```go
		panic(fmt.Sprintf("error downloading %s: %v", fname, err))
	}
	fmt.Printf("Downloaded %s to %s in %s \n", fname, dir, time.Since(start))
}(fname, link, &wg)
```
Calling panic() inside a goroutine will crash the entire program without unwinding other goroutines or running deferred cleanup. The original code had this same issue, but since this PR is improving error handling, it would be worth fixing.
Consider using errgroup.Group instead of sync.WaitGroup — it propagates errors cleanly from goroutines and would allow downloadLDBCFiles to return an error rather than panicking.
CI workflows on forked PRs fail with HTTP 429 (Too Many Requests) when downloading benchmark datasets from GitHub, because shared-IP runners hit rate limits on unauthenticated requests. This fix replaces GitHub web URLs with direct CDN URLs, adds retry logic with exponential backoff, limits concurrent downloads to 5, and cleans up partial files on failure. It also adds GitHub Actions caching to avoid re-downloading data files on repeat runs. Together these changes make benchmark data downloads reliable for both forked and internal CI workflows.