fix: add exponential backoff retry for initial GTFS data load (#496) #513
Conversation
aaronbrethorst
left a comment
Hey Aditya — the motivation here is spot-on. Having the application crash immediately on a transient network hiccup during startup is fragile, and adding retry logic is the right call. The retry loop structure is clean, the backoff schedule is reasonable, and the independent tracking of staticData vs gtfsDB shows good attention to avoiding redundant work. There are a couple of important issues to address around partial state cleanup and local file behavior before this is ready to merge.
Critical Issues
1. buildGtfsDB leaves a partial SQLite file on disk that poisons subsequent retries
When buildGtfsDB runs, it calls gtfsdb.NewClient() which creates the SQLite file on disk and runs schema migration (creating all tables). If the subsequent DownloadAndStore or ImportFromFile call fails partway through, the partially-populated SQLite file remains on disk.
On retry, buildGtfsDB opens the same file. Schema migration succeeds silently (CREATE TABLE IF NOT EXISTS). But the tables may already contain partial data from the previous attempt, leading to:
- Unique constraint violations on agencies, routes, stops, etc. — masking the original error
- Or, in the unlikely case that import metadata was written before failure, a silent short-circuit where the function thinks the import is already complete
This means the retry logic as written can't actually recover from a buildGtfsDB failure — the second attempt will likely fail with a different (misleading) error.
Fix: Delete the SQLite file before retrying buildGtfsDB:
```go
if gtfsDB == nil {
    if attempt > 1 && config.GTFSDataPath != "" && config.GTFSDataPath != ":memory:" {
        _ = os.Remove(config.GTFSDataPath)
    }
    gtfsDB, err = buildGtfsDB(config, isLocalFile, "")
    // ...
}
```

File: internal/gtfs/gtfs_manager.go:109-120
2. buildGtfsDB leaks an open SQLite connection on failure
Looking at buildGtfsDB in static.go: when DownloadAndStore or ImportFromFile fails, the function returns nil, err without closing the *gtfsdb.Client that was already created by NewClient. Each failed retry leaks an open database connection. These open connections may also hold file locks on the SQLite file, which could cause subsequent retries to fail with "database is locked" errors.
This is a pre-existing bug in buildGtfsDB, but the retry loop amplifies it from a one-time leak to a repeated one (up to 4 leaked connections).
Fix: Close the client on the error path in buildGtfsDB:
```go
if err != nil {
    _ = client.Close()
    return nil, err
}
```

File: internal/gtfs/static.go (the buildGtfsDB function)
Important Issues
3. Retrying local file loads is unnecessary
When isLocalFile is true, loadGTFSData reads from disk with os.ReadFile and buildGtfsDB imports from the same local file. If the file is missing or corrupt, every retry will produce the exact same error. The system will wait through all backoff delays (5 + 15 + 30 + 60 = 110 seconds) before reporting the same failure.
The PR description correctly identifies the motivation as "if the feed URL was unreachable," but the retry logic applies equally to local files.
Fix: Skip the retry loop when isLocalFile is true — just use the original single-attempt logic.
File: internal/gtfs/gtfs_manager.go:93
4. Use structured logging attributes instead of fmt.Sprintf
The codebase consistently uses slog.Attr fields for structured logging. The new retry logging embeds attempt/delay into the message string:
```go
logging.LogError(logger, fmt.Sprintf("Failed to load GTFS static data (attempt %d/%d). Retrying in %v...", attempt, maxAttempts, delay), err)
```

This should use structured attributes:

```go
logging.LogError(logger, "Failed to load GTFS static data, retrying", err,
    slog.Int("attempt", attempt),
    slog.Int("max_attempts", maxAttempts),
    slog.Duration("retry_delay", delay),
)
```

File: internal/gtfs/gtfs_manager.go:100, 114
Suggestions
5. Log success after retry recovery
When a retry succeeds, the loop silently breaks. The operator sees error messages for attempts 1 through N-1, then nothing — no confirmation that the system recovered. Adding a success log after retries helps operators confirm recovery:
```go
if attempt > 1 {
    logger.Info("GTFS data loaded after retry", slog.Int("attempts", attempt))
}
```

6. Consider accepting context.Context for cancellable startup
time.Sleep can't be interrupted. If the operator hits Ctrl+C during a 60-second backoff, the process won't respond until the sleep completes. This is a lower priority since it only affects startup, but accepting a context.Context and using select with time.After would follow Go conventions:
```go
select {
case <-ctx.Done():
    return nil, ctx.Err()
case <-time.After(delay):
}
```

Strengths
- Clear, well-organized PR description with problem/solution/changes/testing sections
- The retry loop logic is structurally correct — nil guards, proper indexing, clean break
- Independent tracking of `staticData` vs `gtfsDB` avoids redundant work when only one step fails
- Backoff schedule of [5s, 15s, 30s, 60s] is sensible for transient network issues
- Existing tests continue to pass without modification (local file loads succeed on first attempt)
Recommended Action
- Clean up partial SQLite files before retrying `buildGtfsDB` (critical — without this, retries can't actually succeed)
- Close the client on error in `buildGtfsDB` to prevent connection leaks
- Skip retries for local files — they'll fail identically every time
- Switch to structured logging for consistency with the codebase
- Optionally add success-after-retry logging and context support
- Re-run `make test && make lint` and resubmit
Force-push: 516f5d6 to ffabdaf
aaronbrethorst
left a comment
Aditya, excellent work addressing all the feedback from round 1. Every item has been properly resolved — the partial SQLite cleanup, the connection leak fix, the local file skip, structured logging, success-after-retry logging, and the full context.Context plumbing with signal-based cancellation. The signal.NotifyContext in main.go is a particularly nice touch that goes beyond what was suggested. Just one small item to fix before merging.
Prior Feedback Status
All 6 items from the first review have been addressed:
| # | Item | Status |
|---|---|---|
| 1 | `buildGtfsDB` leaves partial SQLite file on disk | Fixed — `os.Remove` before retry at gtfs_manager.go:608 |
| 2 | `buildGtfsDB` leaks open SQLite connection on failure | Fixed — `client.Close()` on error path in static.go:103 |
| 3 | Retrying local file loads is unnecessary | Fixed — `maxAttempts = 1` when `isLocalFile` at gtfs_manager.go:573 |
| 4 | Use structured logging attributes | Fixed — `slog.Int`/`slog.Duration` fields at gtfs_manager.go:587-591 |
| 5 | Log success after retry recovery | Fixed — gtfs_manager.go:639-641 |
| 6 | Accept `context.Context` for cancellable startup | Fixed — full context plumbing through `InitGTFSManager`, cancellable `select`/`time.After` sleep, and `signal.NotifyContext` in main.go |
Important Issues (1 found)
1. go fmt not run on two files
Two test files have formatting issues that go fmt would fix:

- `internal/gtfs/shutdown_test.go` (lines 27, 73, 109): missing space after comma:

  ```go
  // Current:
  manager, err := InitGTFSManager(context.Background(),config)
  // Should be:
  manager, err := InitGTFSManager(context.Background(), config)
  ```

- `internal/gtfs/hot_swap_test.go` (line 260): missing spaces around `:=`:

  ```go
  // Current:
  ctx:=context.Background()
  // Should be:
  ctx := context.Background()
  ```
Fix: Run go fmt ./... and commit the changes. This is a project requirement per CLAUDE.md.
Strengths
- Clean, thorough implementation of all round 1 feedback
- The `signal.NotifyContext` in `main.go` elegantly connects Ctrl+C handling to the retry loop's context cancellation
- Independent tracking of `staticData` vs `gtfsDB` avoids redundant work
- The backoff schedule [5s, 15s, 30s, 60s] is sensible
- Context is properly threaded through to `GetImportMetadata`, `buildStopSpatialIndex`, and the initial real-time feed fetch
Checklist
- Tests pass
- Lint clean
- Run `go fmt ./...` (Important #1)
- All round 1 feedback addressed
Recommended Action
- Run `go fmt ./...` and commit
- Ready to merge after that
Force-push: 0bda5e9 to d5d22f6
aaronbrethorst
left a comment
Hey Aditya, you're making great progress on the retry logic for GTFS startup. The core implementation is solid and all 6 items from the first review were cleanly addressed. Just a couple of remaining items to take care of:
Prior Feedback Status
| # | Item | Status |
|---|---|---|
| 1 | `buildGtfsDB` leaves partial SQLite file on disk | Fixed |
| 2 | `buildGtfsDB` leaks open SQLite connection on failure | Fixed |
| 3 | Retrying local file loads is unnecessary | Fixed |
| 4 | Use structured logging attributes | Fixed |
| 5 | Log success after retry recovery | Fixed |
| 6 | Accept `context.Context` for cancellable startup | Fixed |
| 7 | Run `go fmt ./...` | Still needs attention |
Important
1. go fmt still not run on two files
This was flagged in the previous review and still needs to be done. Two test files have formatting issues:
- `internal/gtfs/shutdown_test.go` (lines 27, 73, 109): missing space after comma:

  ```go
  // Current:
  manager, err := InitGTFSManager(context.Background(),config)
  // Should be:
  manager, err := InitGTFSManager(context.Background(), config)
  ```

- `internal/gtfs/hot_swap_test.go` (line 260): missing spaces around `:=`:

  ```go
  // Current:
  ctx:=context.Background()
  // Should be:
  ctx := context.Background()
  ```
Fix: Run go fmt ./... and commit the result. This is a project requirement — the commit checklist requires go fmt ./... to pass before every commit.
2. _ = os.Remove() should log on failure, not silently discard the error
At internal/gtfs/gtfs_manager.go:131-133, the partial SQLite cleanup discards the removal error:

```go
_ = os.Remove(config.GTFSDataPath)
```

If the removal fails (permissions, file locked by a leaked connection, read-only filesystem in a container), the next buildGtfsDB call will open the same partial file and fail with a misleading error (unique constraint violations, "database is locked"). The operator will see the misleading error with no indication that the real problem was a failed cleanup.
Fix: Log the error at warning level so operators can diagnose retry failures:
```go
if removeErr := os.Remove(config.GTFSDataPath); removeErr != nil && !os.IsNotExist(removeErr) {
    logging.LogError(logger, "Failed to clean up partial SQLite file before retry", removeErr,
        slog.String("path", config.GTFSDataPath),
        slog.Int("attempt", attempt),
    )
}
```

Fit and Finish
3. buildGtfsDB uses context.Background() instead of the caller's context
At internal/gtfs/static.go:94, buildGtfsDB creates its own context.Background(), ignoring the ctx that was carefully threaded through InitGTFSManager. This means a Ctrl+C during a long DownloadAndStore call won't interrupt the download — the process will appear unresponsive until the HTTP request completes or times out (up to 5 minutes). The cancellable select in the retry loop only helps between attempts, not during a download.
Consider updating buildGtfsDB to accept the caller's ctx so that the entire download/import pipeline respects cancellation. DownloadAndStore already calls http.NewRequestWithContext(ctx, ...), so it's designed to be cancellable — it just never receives the right context.
4. Consider adding test coverage for the retry logic
The retry mechanism is the core feature of this PR, but it currently has zero direct test coverage. All existing tests use local files, which sets maxAttempts = 1 and skips the retry loop entirely. The hardcoded backoff schedule ([5s, 15s, 30s, 60s]) makes it impractical to write fast tests.
One approach: make the backoff schedule configurable via a field on Config (defaulting to the current values). Tests could then use [1ms, 1ms, 1ms, 1ms] and exercise the retry, cancellation, and cleanup paths in under a second.
Thanks again, and I look forward to merging this change.
…Away#496)

- Wraps loadGTFSData and buildGtfsDB in a retry loop (max 5 attempts)
- Implements exponential backoff schedule
- Cleans up partial SQLite files on subsequent attempts
- Threads context.Context for cancellable startup
Force-push: 15f5eed to 400dffd
aaronbrethorst
left a comment
Hey Aditya — excellent persistence across four review rounds. Every single item from the previous three rounds has been cleanly addressed: the partial SQLite cleanup with error logging, the connection leak fix, the local file skip, structured logging, context plumbing all the way through buildGtfsDB and signal.NotifyContext, and the configurable backoff schedule that makes the retry test fast. The implementation is solid. Just one small thing to clean up.
Prior Feedback Status
All 11 items from three previous rounds are resolved:
| Round | # | Item | Status |
|---|---|---|---|
| 1 | 1 | `buildGtfsDB` leaves partial SQLite file on disk | Fixed |
| 1 | 2 | `buildGtfsDB` leaks open SQLite connection on failure | Fixed |
| 1 | 3 | Retrying local file loads is unnecessary | Fixed |
| 1 | 4 | Use structured logging attributes | Fixed |
| 1 | 5 | Log success after retry recovery | Fixed |
| 1 | 6 | Accept `context.Context` for cancellable startup | Fixed |
| 2 | 1 | Run `go fmt ./...` | Fixed |
| 3 | 1 | `go fmt` still not run | Fixed |
| 3 | 2 | Log `os.Remove` errors | Fixed |
| 3 | 3 | `buildGtfsDB` context plumbing | Fixed |
| 3 | 4 | Test coverage for retry logic | Fixed |
Critical
No critical issues.
Important
No important issues.
Fit and Finish
1. Remove branch-name comments in main.go
cmd/api/main.go:17, 21:

```go
// From fix/496-gtfs-startup-retry: Graceful shutdown context
ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
defer stop()
// From main: Mutex profiling configuration
```

These look like rebase artifacts — they reference branch names, which don't mean anything to future readers. The code is self-explanatory; just remove both comments:

```go
ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
defer stop()
if os.Getenv("MAGLEV_PROFILE_MUTEX") == "1" {
```

Verdict
All prior feedback is addressed, tests pass, lint is clean, formatting is clean. The only remaining item is cosmetic. Merge after removing the branch-name comments.
Description:
Problem:
Previously, during application startup in `InitGTFSManager`, the initial GTFS static data download was attempted exactly once. If the feed URL was unreachable due to a transient network issue, DNS hiccup, or upstream outage, the application would exit immediately. This caused unnecessary container churn and delays in orchestrated environments.

Solution:
This PR introduces a retry mechanism with exponential backoff for the initial GTFS static data load and database build processes.

Changes Made:
- Updated `InitGTFSManager` in `internal/gtfs/gtfs_manager.go` to wrap `loadGTFSData` and `buildGtfsDB` in a retry loop.
- Backoff schedule: `[5s, 15s, 30s, 60s]`.

Testing:
- Ran `make test` and verified all tests pass (`ok maglev.onebusaway.org/internal/gtfs 15.567s`).
- Existing tests in `gtfs_manager_test.go` continue to pass without modification because they rely on local mock files (`raba.zip`). Since local file loading succeeds on the first attempt, the retry loop immediately breaks and avoids the `time.Sleep()` backoff delays, ensuring test execution remains fast.

Closes #496
make testand verified all tests pass (ok maglev.onebusaway.org/internal/gtfs 15.567s).gtfs_manager_test.gocontinue to pass without modification because they rely on local mock files (raba.zip). Since local file loading succeeds on the first attempt, the retry loop immediately breaks and avoids thetime.Sleep()backoff delays, ensuring test execution remains fast.Closes #496