fix(test): resolve rate-limit failures when downloading datasets #9658

shiva-istari wants to merge 1 commit into main from
Conversation
| "ldbcTypes.schema": "https://media.githubusercontent.com/media/dgraph-io/dgraph-benchmarks/refs/heads/main/ldbc/sf0.3/ldbcTypes.schema", | ||
| } | ||
|
|
||
| func wgetWithRetry(fname, url, dir string) error { |
This function is a near-exact duplicate of downloadFile in dgraphtest/load.go. Is there a reason it needs to be redefined here instead of moving it to a shared location and reusing it?
This is especially concerning here because the function contents have clearly had to evolve over time as different failure causes were found and more guards were added. Every place it's duplicated is one more place to find and fix (or forget to fix) when the next issue arises.
The duplication is compounded by the double-retry approach — both copies pass --tries=3 --waitretry=5 --retry-connrefused to wget AND wrap that in a Go-level 3-attempt loop with exponential backoff, meaning up to 9 wget invocations per file with cumulative delays that could reach several minutes before a genuine failure is reported. If this retry logic needs tuning in the future (and it likely will), having it in exactly one place is essential.
```yaml
uses: actions/cache/restore@v4
with:
  path: dgraphtest/datafiles
  key: dataset-dgraphtest-v1
```
@shiva-istari one thing I'm concerned about here is that what version of the dataset fixtures is used is non-deterministic, and there's no way to know for sure which version of which fixture was used in a particular build if the cache was used. Which build seeded that cache? Which version of which fixture was cached when it did?
There's a hardcoded "-v1" suffix on the key, which AFAIK doesn't refer to any version of the test data being used — just the version of the key name itself.
I think instead either:

- The release tag of the test fixtures to use should be configured in an env var or project file, such that:
  - Only the versions of the fixtures at that tag can be downloaded.
  - The cache key suffix used must match the release tag (e.g. `dataset-dgraphtest-<release-tag>`).
- If possible, make it controllable via a test execution runtime flag as well.
This will ensure builds get the same version of the test data they would on a fresh checkout even when a cache copy is used, that builds are idempotent and test results are reproducible. And it ensures that any version of a given test data asset is downloaded only once, then cached and reused by every subsequent build that uses it.
Additional concern: The file-exists check that guards re-downloading (fi.Size() > 0) won't catch corrupted or truncated files from a previous failed run. Combined with replacing MakeDirEmpty with os.MkdirAll (which preserves existing content), there's no mechanism to detect or recover from a bad cached file. A checksum validation or known-size check would improve reliability here.
mlwelles left a comment
Additional findings beyond the code duplication and cache key concerns already noted:
Note: t/t.go:1133 still has var suffix = "?raw=true" which gets appended to media.githubusercontent.com URLs via baseUrl + name + suffix. The ?raw=true parameter is a GitHub blob-URL convention and is unnecessary for direct CDN URLs. Not harmful, but misleading — should be removed for consistency with the other URL updates in this PR.
```go
cmd := exec.Command("wget", "--tries=3", "--waitretry=5", "--retry-connrefused", "-O", fname, url)
cmd.Dir = datasetFilesPath

if out, err := cmd.CombinedOutput(); err != nil {
```
The Go-level retry loop (3 attempts) wraps a wget invocation that itself retries 3 times (--tries=3 --waitretry=5). This means up to 9 wget requests per file, with cumulative backoff delays that could reach several minutes before a genuine failure is reported.
Consider either:
- Removing the wget-level retries (`--tries=1`) and letting the Go loop handle all retry logic, or
- Removing the Go-level loop and relying solely on wget's built-in retry mechanism.
Having both layers is redundant and makes the failure timeline hard to reason about.
```go
_ = os.RemoveAll(*tmp)
if err != nil {
	os.Exit(1)
}
```
Removing os.RemoveAll(*tmp) is necessary for the caching strategy, but it silently changes behavior for local development — test data will now accumulate in the temp directory across runs and never be cleaned up.
Rather than removing this unconditionally, add a new flag to control it:
```go
keepData = pflag.Bool("keep-data", false,
	"Preserve downloaded test data after run (for CI caching). Default cleans up.")
```

Then in main(), restore the cleanup but make it conditional:

```go
err := run()
if !*keepData {
	_ = os.RemoveAll(*tmp)
}
```

The CI workflows would then pass --keep-data explicitly:

```sh
cd t; ./t --suite=load --tmp=${{ github.workspace }}/test-data --keep-data
```

This fits the existing flag pattern (similar to --keep for clusters, --download for resources), is explicit and discoverable via --help, and avoids coupling behavior to the environment.
```go
		panic(fmt.Sprintf("error downloading %s: %v", fname, err))
	}
	fmt.Printf("Downloaded %s to %s in %s \n", fname, dir, time.Since(start))
}(fname, link, &wg)
```
Calling panic() inside a goroutine will crash the entire program without unwinding other goroutines or running deferred cleanup. The original code had this same issue, but since this PR is improving error handling, it would be worth fixing.
Consider using errgroup.Group instead of sync.WaitGroup — it propagates errors cleanly from goroutines and would allow downloadLDBCFiles to return an error rather than panicking.
CI workflows on forked PRs fail with HTTP 429 (Too Many Requests) when downloading benchmark datasets from GitHub, because shared-IP runners hit rate limits on unauthenticated requests. This fix replaces GitHub web URLs with direct CDN URLs, adds retry logic with exponential backoff, limits concurrent downloads to 5, and cleans up partial files on failure. It also adds GitHub Actions caching to avoid re-downloading data files on repeat runs. Together these changes make benchmark data downloads reliable for both forked and internal CI workflows.