
fix(test): resolve rate-limit failures when downloading datasets #9658

Open
shiva-istari wants to merge 1 commit into main from shiva/ci-downloads

Conversation

@shiva-istari (Contributor)

CI workflows on forked PRs fail with HTTP 429 (Too Many Requests) when downloading benchmark datasets from GitHub, because shared-IP runners hit rate limits on unauthenticated requests. This fix replaces GitHub web URLs with direct CDN URLs, adds retry logic with exponential backoff, limits concurrent downloads to 5, and cleans up partial files on failure. It also adds GitHub Actions caching to avoid re-downloading data files on repeat runs. Together these changes make benchmark data downloads reliable for both forked and internal CI workflows.

@shiva-istari shiva-istari requested a review from a team as a code owner March 16, 2026 10:13
@github-actions bot added labels area/testing (Testing related issues), area/integrations (Related to integrations with other projects), and go (Pull requests that update Go code) on Mar 16, 2026

"ldbcTypes.schema": "https://media.githubusercontent.com/media/dgraph-io/dgraph-benchmarks/refs/heads/main/ldbc/sf0.3/ldbcTypes.schema",
}

func wgetWithRetry(fname, url, dir string) error {
@mwelles-istari Mar 16, 2026

This function is a near-exact duplicate of downloadFile in dgraphtest/load.go. Is there a reason it needs to be redefined here instead of moving it to a shared location and reusing it?

This is especially concerning here because the function contents have clearly had to evolve over time as different failure causes were found and more guards were added. Every place it's duplicated is one more place to find and fix (or forget to fix) when the next issue arises.

The duplication is compounded by the double-retry approach — both copies pass --tries=3 --waitretry=5 --retry-connrefused to wget AND wrap that in a Go-level 3-attempt loop with exponential backoff, meaning up to 9 wget invocations per file with cumulative delays that could reach several minutes before a genuine failure is reported. If this retry logic needs tuning in the future (and it likely will), having it in exactly one place is essential.

uses: actions/cache/restore@v4
with:
path: dgraphtest/datafiles
key: dataset-dgraphtest-v1
@mwelles-istari Mar 16, 2026
@shiva-istari one thing I'm concerned about here is that what version of the dataset fixtures is used is non-deterministic, and there's no way to know for sure which version of which fixture was used in a particular build if the cache was used. Which build seeded that cache? Which version of which fixture was cached when it did?

There's a hardcoded "-v1" suffix on the key, which AFAIK doesn't refer to any version of the test data being used — just the version of the key name itself.

I think instead:

  • The release tag of the test fixtures to use should be configured in an env var or project file, such that:
    • Only the versions of the fixtures at that tag can be downloaded.
    • The cache key suffix used must match the release tag (e.g. dataset-dgraphtest-<release-tag>).
    • If possible, make it controllable via a test execution runtime flag as well.

This will ensure builds get the same version of the test data they would on a fresh checkout even when a cached copy is used, so builds are idempotent and test results are reproducible. It also ensures that any given version of a test data asset is downloaded only once, then cached and reused by every subsequent build that needs it.
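As a rough sketch of the suggested key scheme (the env var name and tag value below are hypothetical, not part of this PR):

```yaml
env:
  DATASET_RELEASE_TAG: v1.2.0   # hypothetical pinned fixture release tag

steps:
  - uses: actions/cache/restore@v4
    with:
      path: dgraphtest/datafiles
      key: dataset-dgraphtest-${{ env.DATASET_RELEASE_TAG }}
```

With the tag in one place, bumping the fixtures automatically invalidates the cache, and the cache key documents exactly which release seeded it.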

Additional concern: The file-exists check that guards re-downloading (fi.Size() > 0) won't catch corrupted or truncated files from a previous failed run. Combined with replacing MakeDirEmpty with os.MkdirAll (which preserves existing content), there's no mechanism to detect or recover from a bad cached file. A checksum validation or known-size check would improve reliability here.

@mlwelles (Contributor) left a comment

Additional findings beyond the code duplication and cache key concerns already noted:

Note: t/t.go:1133 still has var suffix = "?raw=true" which gets appended to media.githubusercontent.com URLs via baseUrl + name + suffix. The ?raw=true parameter is a GitHub blob-URL convention and is unnecessary for direct CDN URLs. Not harmful, but misleading — should be removed for consistency with the other URL updates in this PR.

cmd := exec.Command("wget", "--tries=3", "--waitretry=5", "--retry-connrefused", "-O", fname, url)
cmd.Dir = datasetFilesPath

if out, err := cmd.CombinedOutput(); err != nil {

The Go-level retry loop (3 attempts) wraps a wget invocation that itself retries 3 times (--tries=3 --waitretry=5). This means up to 9 wget requests per file, with cumulative backoff delays that could reach several minutes before a genuine failure is reported.

Consider either:

  • Removing the wget-level retries (--tries=1) and letting the Go loop handle all retry logic, or
  • Removing the Go-level loop and relying solely on wget's built-in retry mechanism.

Having both layers is redundant and makes the failure timeline hard to reason about.
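A sketch of the first option: disable wget's internal retries and keep one Go-level loop as the single retry layer. Helper names and the backoff schedule are illustrative, not the PR's code:

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

const maxAttempts = 3

// fetchOnce shells out to wget with retries disabled (--tries=1), so the
// Go loop in fetchWithRetry is the only place retry policy lives.
func fetchOnce(fname, url, dir string) error {
	cmd := exec.Command("wget", "--tries=1", "-O", fname, url)
	cmd.Dir = dir
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("wget: %v: %s", err, out)
	}
	return nil
}

// fetchWithRetry makes at most maxAttempts requests per file, so the
// worst-case failure timeline is easy to reason about in one place.
func fetchWithRetry(fname, url, dir string) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = fetchOnce(fname, url, dir); err == nil {
			return nil
		}
		time.Sleep(time.Duration(attempt) * 2 * time.Second) // 2s, 4s
	}
	return fmt.Errorf("after %d attempts: %w", maxAttempts, err)
}

func main() {
	fmt.Println("retry policy: single layer,", maxAttempts, "attempts")
}
```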

_ = os.RemoveAll(*tmp)
if err != nil {
os.Exit(1)
}
@mlwelles Mar 16, 2026

Removing os.RemoveAll(*tmp) is necessary for the caching strategy, but it silently changes behavior for local development — test data will now accumulate in the temp directory across runs and never be cleaned up.

Rather than removing this unconditionally, add a new flag to control it:

keepData = pflag.Bool("keep-data", false,
    "Preserve downloaded test data after run (for CI caching). Default cleans up.")

Then in main(), restore the cleanup but make it conditional:

err := run()
if !*keepData {
    _ = os.RemoveAll(*tmp)
}

The CI workflows would then pass --keep-data explicitly:

cd t; ./t --suite=load --tmp=${{ github.workspace }}/test-data --keep-data

This fits the existing flag pattern (similar to --keep for clusters, --download for resources), is explicit and discoverable via --help, and avoids coupling behavior to the environment.

panic(fmt.Sprintf("error downloading %s: %v", fname, err))
}
fmt.Printf("Downloaded %s to %s in %s \n", fname, dir, time.Since(start))
}(fname, link, &wg)

Calling panic() inside a goroutine will crash the entire program without unwinding other goroutines or running deferred cleanup. The original code had this same issue, but since this PR is improving error handling, it would be worth fixing.

Consider using errgroup.Group instead of sync.WaitGroup — it propagates errors cleanly from goroutines and would allow downloadLDBCFiles to return an error rather than panicking.

