fix: add exponential backoff retry for initial GTFS data load (#496) #513
Conversation
aaronbrethorst
left a comment
Hey Aditya — the motivation here is spot-on. Having the application crash immediately on a transient network hiccup during startup is fragile, and adding retry logic is the right call. The retry loop structure is clean, the backoff schedule is reasonable, and the independent tracking of staticData vs gtfsDB shows good attention to avoiding redundant work. There are a couple of important issues to address around partial state cleanup and local file behavior before this is ready to merge.
Critical Issues
1. buildGtfsDB leaves a partial SQLite file on disk that poisons subsequent retries
When buildGtfsDB runs, it calls gtfsdb.NewClient() which creates the SQLite file on disk and runs schema migration (creating all tables). If the subsequent DownloadAndStore or ImportFromFile call fails partway through, the partially-populated SQLite file remains on disk.
On retry, buildGtfsDB opens the same file. Schema migration succeeds silently (CREATE TABLE IF NOT EXISTS). But the tables may already contain partial data from the previous attempt, leading to:
- Unique constraint violations on agencies, routes, stops, etc. — masking the original error
- Or, in the unlikely case that import metadata was written before failure, a silent short-circuit where the function thinks the import is already complete
This means the retry logic as written can't actually recover from a buildGtfsDB failure — the second attempt will likely fail with a different (misleading) error.
Fix: Delete the SQLite file before retrying buildGtfsDB:
```go
if gtfsDB == nil {
    if attempt > 1 && config.GTFSDataPath != "" && config.GTFSDataPath != ":memory:" {
        _ = os.Remove(config.GTFSDataPath)
    }
    gtfsDB, err = buildGtfsDB(config, isLocalFile, "")
    // ...
}
```

File: internal/gtfs/gtfs_manager.go:109-120
2. buildGtfsDB leaks an open SQLite connection on failure
Looking at buildGtfsDB in static.go: when DownloadAndStore or ImportFromFile fails, the function returns nil, err without closing the *gtfsdb.Client that was already created by NewClient. Each failed retry leaks an open database connection. These open connections may also hold file locks on the SQLite file, which could cause subsequent retries to fail with "database is locked" errors.
This is a pre-existing bug in buildGtfsDB, but the retry loop amplifies it from a one-time leak to a repeated one (up to 4 leaked connections).
Fix: Close the client on the error path in buildGtfsDB:
```go
if err != nil {
    _ = client.Close()
    return nil, err
}
```

File: internal/gtfs/static.go (the buildGtfsDB function)
Important Issues
3. Retrying local file loads is unnecessary
When isLocalFile is true, loadGTFSData reads from disk with os.ReadFile and buildGtfsDB imports from the same local file. If the file is missing or corrupt, every retry will produce the exact same error. The system will wait through all backoff delays (5 + 15 + 30 + 60 = 110 seconds) before reporting the same failure.
The PR description correctly identifies the motivation as "if the feed URL was unreachable," but the retry logic applies equally to local files.
Fix: Skip the retry loop when isLocalFile is true — just use the original single-attempt logic.
File: internal/gtfs/gtfs_manager.go:93
4. Use structured logging attributes instead of fmt.Sprintf
The codebase consistently uses slog.Attr fields for structured logging. The new retry logging embeds attempt/delay into the message string:
```go
logging.LogError(logger, fmt.Sprintf("Failed to load GTFS static data (attempt %d/%d). Retrying in %v...", attempt, maxAttempts, delay), err)
```

This should use structured attributes:

```go
logging.LogError(logger, "Failed to load GTFS static data, retrying", err,
    slog.Int("attempt", attempt),
    slog.Int("max_attempts", maxAttempts),
    slog.Duration("retry_delay", delay),
)
```

File: internal/gtfs/gtfs_manager.go:100, 114
Suggestions
5. Log success after retry recovery
When a retry succeeds, the loop silently breaks. The operator sees error messages for attempts 1 through N-1, then nothing — no confirmation that the system recovered. Adding a success log after retries helps operators confirm recovery:
```go
if attempt > 1 {
    logger.Info("GTFS data loaded after retry", slog.Int("attempts", attempt))
}
```

6. Consider accepting context.Context for cancellable startup
time.Sleep can't be interrupted. If the operator hits Ctrl+C during a 60-second backoff, the process won't respond until the sleep completes. This is a lower priority since it only affects startup, but accepting a context.Context and using select with time.After would follow Go conventions:
```go
select {
case <-ctx.Done():
    return nil, ctx.Err()
case <-time.After(delay):
}
```

Strengths
- Clear, well-organized PR description with problem/solution/changes/testing sections
- The retry loop logic is structurally correct — nil guards, proper indexing, clean break
- Independent tracking of `staticData` vs `gtfsDB` avoids redundant work when only one step fails
- Backoff schedule of [5s, 15s, 30s, 60s] is sensible for transient network issues
- Existing tests continue to pass without modification (local file loads succeed on first attempt)
Recommended Action
- Clean up partial SQLite files before retrying `buildGtfsDB` (critical — without this, retries can't actually succeed)
- Close the client on error in `buildGtfsDB` to prevent connection leaks
- Skip retries for local files — they'll fail identically every time
- Switch to structured logging for consistency with the codebase
- Optionally add success-after-retry logging and context support
- Re-run `make test && make lint` and resubmit
Force-push: 516f5d6 to ffabdaf
aaronbrethorst
left a comment
Aditya, excellent work addressing all the feedback from round 1. Every item has been properly resolved — the partial SQLite cleanup, the connection leak fix, the local file skip, structured logging, success-after-retry logging, and the full context.Context plumbing with signal-based cancellation. The signal.NotifyContext in main.go is a particularly nice touch that goes beyond what was suggested. Just one small item to fix before merging.
Prior Feedback Status
All 6 items from the first review have been addressed:
| # | Item | Status |
|---|---|---|
| 1 | `buildGtfsDB` leaves partial SQLite file on disk | Fixed — `os.Remove` before retry at gtfs_manager.go:608 |
| 2 | `buildGtfsDB` leaks open SQLite connection on failure | Fixed — `client.Close()` on error path in static.go:103 |
| 3 | Retrying local file loads is unnecessary | Fixed — `maxAttempts = 1` when `isLocalFile` at gtfs_manager.go:573 |
| 4 | Use structured logging attributes | Fixed — `slog.Int`/`slog.Duration` fields at gtfs_manager.go:587-591 |
| 5 | Log success after retry recovery | Fixed — gtfs_manager.go:639-641 |
| 6 | Accept `context.Context` for cancellable startup | Fixed — full context plumbing through `InitGTFSManager`, cancellable `select`/`time.After` sleep, and `signal.NotifyContext` in main.go |
Important Issues (1 found)
1. go fmt not run on two files
Two test files have formatting issues that go fmt would fix:

- `internal/gtfs/shutdown_test.go` (lines 27, 73, 109): missing space after comma:

  ```go
  // Current:
  manager, err := InitGTFSManager(context.Background(),config)
  // Should be:
  manager, err := InitGTFSManager(context.Background(), config)
  ```

- `internal/gtfs/hot_swap_test.go` (line 260): missing spaces around `:=`:

  ```go
  // Current:
  ctx:=context.Background()
  // Should be:
  ctx := context.Background()
  ```
Fix: Run go fmt ./... and commit the changes. This is a project requirement per CLAUDE.md.
Strengths
- Clean, thorough implementation of all round 1 feedback
- The `signal.NotifyContext` in `main.go` elegantly connects Ctrl+C handling to the retry loop's context cancellation
- Independent tracking of `staticData` vs `gtfsDB` avoids redundant work
- The backoff schedule [5s, 15s, 30s, 60s] is sensible
- Context is properly threaded through to `GetImportMetadata`, `buildStopSpatialIndex`, and the initial real-time feed fetch
Checklist
- Tests pass
- Lint clean
- Run `go fmt ./...` (Important #1)
- All round 1 feedback addressed
Recommended Action
- Run `go fmt ./...` and commit
- Ready to merge after that
Force-push: 0bda5e9 to d5d22f6
aaronbrethorst
left a comment
Hey Aditya, you're making great progress on the retry logic for GTFS startup. The core implementation is solid and all 6 items from the first review were cleanly addressed. Just a couple of remaining items to take care of:
Prior Feedback Status
| # | Item | Status |
|---|---|---|
| 1 | `buildGtfsDB` leaves partial SQLite file on disk | Fixed |
| 2 | `buildGtfsDB` leaks open SQLite connection on failure | Fixed |
| 3 | Retrying local file loads is unnecessary | Fixed |
| 4 | Use structured logging attributes | Fixed |
| 5 | Log success after retry recovery | Fixed |
| 6 | Accept `context.Context` for cancellable startup | Fixed |
| 7 | Run `go fmt ./...` | Still needs attention |
Important
1. go fmt still not run on two files
This was flagged in the previous review and still needs to be done. Two test files have formatting issues:
- `internal/gtfs/shutdown_test.go` (lines 27, 73, 109): missing space after comma:

  ```go
  // Current:
  manager, err := InitGTFSManager(context.Background(),config)
  // Should be:
  manager, err := InitGTFSManager(context.Background(), config)
  ```

- `internal/gtfs/hot_swap_test.go` (line 260): missing spaces around `:=`:

  ```go
  // Current:
  ctx:=context.Background()
  // Should be:
  ctx := context.Background()
  ```
Fix: Run go fmt ./... and commit the result. This is a project requirement — the commit checklist requires go fmt ./... to pass before every commit.
2. _ = os.Remove() should log on failure, not silently discard the error
At internal/gtfs/gtfs_manager.go:131-133, the partial SQLite cleanup discards the removal error:

```go
_ = os.Remove(config.GTFSDataPath)
```

If the removal fails (permissions, file locked by a leaked connection, read-only filesystem in a container), the next buildGtfsDB call will open the same partial file and fail with a misleading error (unique constraint violations, "database is locked"). The operator will see the misleading error with no indication that the real problem was a failed cleanup.
Fix: Log the error at warning level so operators can diagnose retry failures:
```go
if removeErr := os.Remove(config.GTFSDataPath); removeErr != nil && !os.IsNotExist(removeErr) {
    logging.LogError(logger, "Failed to clean up partial SQLite file before retry", removeErr,
        slog.String("path", config.GTFSDataPath),
        slog.Int("attempt", attempt),
    )
}
```

Fit and Finish
3. buildGtfsDB uses context.Background() instead of the caller's context
At internal/gtfs/static.go:94, buildGtfsDB creates its own context.Background(), ignoring the ctx that was carefully threaded through InitGTFSManager. This means a Ctrl+C during a long DownloadAndStore call won't interrupt the download — the process will appear unresponsive until the HTTP request completes or times out (up to 5 minutes). The cancellable select in the retry loop only helps between attempts, not during a download.
Consider updating buildGtfsDB to accept the caller's ctx so that the entire download/import pipeline respects cancellation. DownloadAndStore already calls http.NewRequestWithContext(ctx, ...), so it's designed to be cancellable — it just never receives the right context.
4. Consider adding test coverage for the retry logic
The retry mechanism is the core feature of this PR, but it currently has zero direct test coverage. All existing tests use local files, which sets maxAttempts = 1 and skips the retry loop entirely. The hardcoded backoff schedule ([5s, 15s, 30s, 60s]) makes it impractical to write fast tests.
One approach: make the backoff schedule configurable via a field on Config (defaulting to the current values). Tests could then use [1ms, 1ms, 1ms, 1ms] and exercise the retry, cancellation, and cleanup paths in under a second.
Thanks again, and I look forward to merging this change.
…Away#496)

- Wraps loadGTFSData and buildGtfsDB in a retry loop (max 5 attempts)
- Implements exponential backoff schedule
- Cleans up partial SQLite files on subsequent attempts
- Threads context.Context for cancellable startup
Force-push: 15f5eed to 400dffd
aaronbrethorst
left a comment
Hey Aditya — excellent persistence across four review rounds. Every single item from the previous three rounds has been cleanly addressed: the partial SQLite cleanup with error logging, the connection leak fix, the local file skip, structured logging, context plumbing all the way through buildGtfsDB and signal.NotifyContext, and the configurable backoff schedule that makes the retry test fast. The implementation is solid. Just one small thing to clean up.
Prior Feedback Status
All 11 items from three previous rounds are resolved:
| Round | # | Item | Status |
|---|---|---|---|
| 1 | 1 | `buildGtfsDB` leaves partial SQLite file on disk | Fixed |
| 1 | 2 | `buildGtfsDB` leaks open SQLite connection on failure | Fixed |
| 1 | 3 | Retrying local file loads is unnecessary | Fixed |
| 1 | 4 | Use structured logging attributes | Fixed |
| 1 | 5 | Log success after retry recovery | Fixed |
| 1 | 6 | Accept `context.Context` for cancellable startup | Fixed |
| 2 | 1 | Run `go fmt ./...` | Fixed |
| 3 | 1 | `go fmt` still not run | Fixed |
| 3 | 2 | Log `os.Remove` errors | Fixed |
| 3 | 3 | `buildGtfsDB` context plumbing | Fixed |
| 3 | 4 | Test coverage for retry logic | Fixed |
Critical
No critical issues.
Important
No important issues.
Fit and Finish
1. Remove branch-name comments in main.go
cmd/api/main.go:17, 21:

```go
// From fix/496-gtfs-startup-retry: Graceful shutdown context
ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
defer stop()
// From main: Mutex profiling configuration
```

These look like rebase artifacts — they reference branch names, which don't mean anything to future readers. The code is self-explanatory; just remove both comments:

```go
ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
defer stop()
if os.Getenv("MAGLEV_PROFILE_MUTEX") == "1" {
```

Verdict
All prior feedback is addressed, tests pass, lint is clean, formatting is clean. The only remaining item is cosmetic. Merge after removing the branch-name comments.
Description:
Problem:
Previously, during application startup in `InitGTFSManager`, the initial GTFS static data download was attempted exactly once. If the feed URL was unreachable due to a transient network issue, DNS hiccup, or upstream outage, the application would exit immediately. This caused unnecessary container churn and delays in orchestrated environments.

Solution:
This PR introduces a retry mechanism with exponential backoff for the initial GTFS static data load and database build processes.

Changes Made:
- Updated `InitGTFSManager` in `internal/gtfs/gtfs_manager.go` to wrap `loadGTFSData` and `buildGtfsDB` in a retry loop.
- Backoff schedule: `[5s, 15s, 30s, 60s]`.

Testing:
- Ran `make test` and verified all tests pass (`ok maglev.onebusaway.org/internal/gtfs 15.567s`).
- Existing tests in `gtfs_manager_test.go` continue to pass without modification because they rely on local mock files (`raba.zip`). Since local file loading succeeds on the first attempt, the retry loop immediately breaks and avoids the `time.Sleep()` backoff delays, ensuring test execution remains fast.

Closes #496
make testand verified all tests pass (ok maglev.onebusaway.org/internal/gtfs 15.567s).gtfs_manager_test.gocontinue to pass without modification because they rely on local mock files (raba.zip). Since local file loading succeeds on the first attempt, the retry loop immediately breaks and avoids thetime.Sleep()backoff delays, ensuring test execution remains fast.Closes #496