
fix: add exponential backoff retry for initial GTFS data load (#496) #513

Merged
aaronbrethorst merged 1 commit into OneBusAway:main from ARCoder181105:fix/496-gtfs-startup-retry
Mar 3, 2026

Conversation

@ARCoder181105
Contributor

Description:

Problem:
Previously, during application startup in InitGTFSManager, the initial GTFS static data download was attempted exactly once. If the feed URL was unreachable due to a transient network issue, DNS hiccup, or upstream outage, the application would exit immediately. This caused unnecessary container churn and delays in orchestrated environments.

Solution:
This PR introduces a retry mechanism with exponential backoff for the initial GTFS static data load and database build processes.

Changes Made:

  • Updated InitGTFSManager in internal/gtfs/gtfs_manager.go to wrap loadGTFSData and buildGtfsDB in a retry loop.
  • Implemented a maximum of 5 attempts (1 initial + 4 retries) with an exponential backoff schedule of [5s, 15s, 30s, 60s].
  • Added structured logging to capture and report each failed attempt before each backoff delay.
  • If all 5 attempts fail, the application gracefully surfaces the accumulated error instead of entering an indeterminate state.

Testing:

  • Ran make test and verified all tests pass (ok maglev.onebusaway.org/internal/gtfs 15.567s).
  • Note on tests: Existing unit tests in gtfs_manager_test.go continue to pass without modification because they rely on local mock files (raba.zip). Since local file loading succeeds on the first attempt, the retry loop immediately breaks and avoids the time.Sleep() backoff delays, ensuring test execution remains fast.

Closes #496

Member

@aaronbrethorst left a comment


Hey Aditya — the motivation here is spot-on. Having the application crash immediately on a transient network hiccup during startup is fragile, and adding retry logic is the right call. The retry loop structure is clean, the backoff schedule is reasonable, and the independent tracking of staticData vs gtfsDB shows good attention to avoiding redundant work. There are a couple of important issues to address around partial state cleanup and local file behavior before this is ready to merge.

Critical Issues

1. buildGtfsDB leaves a partial SQLite file on disk that poisons subsequent retries

When buildGtfsDB runs, it calls gtfsdb.NewClient() which creates the SQLite file on disk and runs schema migration (creating all tables). If the subsequent DownloadAndStore or ImportFromFile call fails partway through, the partially-populated SQLite file remains on disk.

On retry, buildGtfsDB opens the same file. Schema migration succeeds silently (CREATE TABLE IF NOT EXISTS). But the tables may already contain partial data from the previous attempt, leading to:

  • Unique constraint violations on agencies, routes, stops, etc. — masking the original error
  • Or, in the unlikely case that import metadata was written before failure, a silent short-circuit where the function thinks the import is already complete

This means the retry logic as written can't actually recover from a buildGtfsDB failure — the second attempt will likely fail with a different (misleading) error.

Fix: Delete the SQLite file before retrying buildGtfsDB:

if gtfsDB == nil {
    if attempt > 1 && config.GTFSDataPath != "" && config.GTFSDataPath != ":memory:" {
        _ = os.Remove(config.GTFSDataPath)
    }
    gtfsDB, err = buildGtfsDB(config, isLocalFile, "")
    // ...
}

File: internal/gtfs/gtfs_manager.go:109-120

2. buildGtfsDB leaks an open SQLite connection on failure

Looking at buildGtfsDB in static.go: when DownloadAndStore or ImportFromFile fails, the function returns nil, err without closing the *gtfsdb.Client that was already created by NewClient. Each failed retry leaks an open database connection. These open connections may also hold file locks on the SQLite file, which could cause subsequent retries to fail with "database is locked" errors.

This is a pre-existing bug in buildGtfsDB, but the retry loop amplifies it from a one-time leak to a repeated one (up to 4 leaked connections).

Fix: Close the client on the error path in buildGtfsDB:

if err != nil {
    _ = client.Close()
    return nil, err
}

File: internal/gtfs/static.go (the buildGtfsDB function)

Important Issues

3. Retrying local file loads is unnecessary

When isLocalFile is true, loadGTFSData reads from disk with os.ReadFile and buildGtfsDB imports from the same local file. If the file is missing or corrupt, every retry will produce the exact same error. The system will wait through all backoff delays (5 + 15 + 30 + 60 = 110 seconds) before reporting the same failure.

The PR description correctly identifies the motivation as "if the feed URL was unreachable," but the retry logic applies equally to local files.

Fix: Skip the retry loop when isLocalFile is true — just use the original single-attempt logic.

File: internal/gtfs/gtfs_manager.go:93
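Sketched under the assumption that the attempt count is chosen up front (attemptsFor is a hypothetical helper; the actual patch would set maxAttempts inline):

```go
package main

import "fmt"

// attemptsFor returns how many load attempts to make. Local files fail
// deterministically, so retrying them only delays the same error.
func attemptsFor(isLocalFile bool, retries int) int {
	if isLocalFile {
		return 1 // single attempt: no backoff delays for local files
	}
	return 1 + retries // initial attempt plus configured retries
}

func main() {
	fmt.Println(attemptsFor(true, 4), attemptsFor(false, 4)) // 1 5
}
```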

4. Use structured logging attributes instead of fmt.Sprintf

The codebase consistently uses slog.Attr fields for structured logging. The new retry logging embeds attempt/delay into the message string:

logging.LogError(logger, fmt.Sprintf("Failed to load GTFS static data (attempt %d/%d). Retrying in %v...", attempt, maxAttempts, delay), err)

This should use structured attributes:

logging.LogError(logger, "Failed to load GTFS static data, retrying", err,
    slog.Int("attempt", attempt),
    slog.Int("max_attempts", maxAttempts),
    slog.Duration("retry_delay", delay),
)

File: internal/gtfs/gtfs_manager.go:100, 114

Suggestions

5. Log success after retry recovery

When a retry succeeds, the loop silently breaks. The operator sees error messages for attempts 1 through N-1, then nothing — no confirmation that the system recovered. Adding a success log after retries helps operators confirm recovery:

if attempt > 1 {
    logger.Info("GTFS data loaded after retry", slog.Int("attempts", attempt))
}

6. Consider accepting context.Context for cancellable startup

time.Sleep can't be interrupted. If the operator hits Ctrl+C during a 60-second backoff, the process won't respond until the sleep completes. This is a lower priority since it only affects startup, but accepting a context.Context and using select with time.After would follow Go conventions:

select {
case <-ctx.Done():
    return nil, ctx.Err()
case <-time.After(delay):
}

Strengths

  • Clear, well-organized PR description with problem/solution/changes/testing sections
  • The retry loop logic is structurally correct — nil guards, proper indexing, clean break
  • Independent tracking of staticData vs gtfsDB avoids redundant work when only one step fails
  • Backoff schedule of [5s, 15s, 30s, 60s] is sensible for transient network issues
  • Existing tests continue to pass without modification (local file loads succeed on first attempt)

Recommended Action

  1. Clean up partial SQLite files before retrying buildGtfsDB (critical — without this, retries can't actually succeed)
  2. Close the client on error in buildGtfsDB to prevent connection leaks
  3. Skip retries for local files — they'll fail identically every time
  4. Switch to structured logging for consistency with the codebase
  5. Optionally add success-after-retry logging and context support
  6. Re-run make test && make lint and resubmit

@ARCoder181105 force-pushed the fix/496-gtfs-startup-retry branch 2 times, most recently from 516f5d6 to ffabdaf on March 1, 2026 at 09:12
Member

@aaronbrethorst left a comment


Aditya, excellent work addressing all the feedback from round 1. Every item has been properly resolved — the partial SQLite cleanup, the connection leak fix, the local file skip, structured logging, success-after-retry logging, and the full context.Context plumbing with signal-based cancellation. The signal.NotifyContext in main.go is a particularly nice touch that goes beyond what was suggested. Just one small item to fix before merging.

Prior Feedback Status

All 6 items from the first review have been addressed:

# | Item | Status
1 | buildGtfsDB leaves partial SQLite file on disk | Fixed: os.Remove before retry at gtfs_manager.go:608
2 | buildGtfsDB leaks open SQLite connection on failure | Fixed: client.Close() on error path at static.go:103
3 | Retrying local file loads is unnecessary | Fixed: maxAttempts = 1 when isLocalFile at gtfs_manager.go:573
4 | Use structured logging attributes | Fixed: slog.Int/slog.Duration fields at gtfs_manager.go:587-591
5 | Log success after retry recovery | Fixed: gtfs_manager.go:639-641
6 | Accept context.Context for cancellable startup | Fixed: full context plumbing through InitGTFSManager, cancellable select/time.After sleep, and signal.NotifyContext in main.go

Important Issues (1 found)

1. go fmt not run on two files

Two test files have formatting issues that go fmt would fix:

  • internal/gtfs/shutdown_test.go (lines 27, 73, 109): Missing space after comma:

    // Current:
    manager, err := InitGTFSManager(context.Background(),config)
    // Should be:
    manager, err := InitGTFSManager(context.Background(), config)
  • internal/gtfs/hot_swap_test.go (line 260): Missing spaces around :=:

    // Current:
    ctx:=context.Background()
    // Should be:
    ctx := context.Background()

Fix: Run go fmt ./... and commit the changes. This is a project requirement per CLAUDE.md.

Strengths

  • Clean, thorough implementation of all round 1 feedback
  • The signal.NotifyContext in main.go elegantly connects Ctrl+C handling to the retry loop's context cancellation
  • Independent tracking of staticData vs gtfsDB avoids redundant work
  • The backoff schedule [5s, 15s, 30s, 60s] is sensible
  • Context is properly threaded through to GetImportMetadata, buildStopSpatialIndex, and the initial real-time feed fetch

Checklist

  • Tests pass
  • Lint clean
  • Run go fmt ./... (Important #1)
  • All round 1 feedback addressed

Recommended Action

  1. Run go fmt ./... and commit
  2. Ready to merge after that

@ARCoder181105 force-pushed the fix/496-gtfs-startup-retry branch from 0bda5e9 to d5d22f6 on March 1, 2026 at 09:52
Member

@aaronbrethorst left a comment


Hey Aditya, you're making great progress on the retry logic for GTFS startup. The core implementation is solid and all 6 items from the first review were cleanly addressed. Just a couple of remaining items to take care of:

Prior Feedback Status

# | Item | Status
1 | buildGtfsDB leaves partial SQLite file on disk | Fixed
2 | buildGtfsDB leaks open SQLite connection on failure | Fixed
3 | Retrying local file loads is unnecessary | Fixed
4 | Use structured logging attributes | Fixed
5 | Log success after retry recovery | Fixed
6 | Accept context.Context for cancellable startup | Fixed
7 | Run go fmt ./... | Still needs attention

Important

1. go fmt still not run on two files

This was flagged in the previous review and still needs to be done. Two test files have formatting issues:

  • internal/gtfs/shutdown_test.go (lines 27, 73, 109): Missing space after comma:

    // Current:
    manager, err := InitGTFSManager(context.Background(),config)
    // Should be:
    manager, err := InitGTFSManager(context.Background(), config)
  • internal/gtfs/hot_swap_test.go (line 260): Missing spaces around :=:

    // Current:
    ctx:=context.Background()
    // Should be:
    ctx := context.Background()

Fix: Run go fmt ./... and commit the result. This is a project requirement — the commit checklist requires go fmt ./... to pass before every commit.

2. _ = os.Remove() should log on failure, not silently discard the error

At internal/gtfs/gtfs_manager.go:131-133, the partial SQLite cleanup discards the removal error:

_ = os.Remove(config.GTFSDataPath)

If the removal fails (permissions, file locked by a leaked connection, read-only filesystem in a container), the next buildGtfsDB call will open the same partial file and fail with a misleading error (unique constraint violations, "database is locked"). The operator will see the misleading error with no indication that the real problem was a failed cleanup.

Fix: Log the error at warning level so operators can diagnose retry failures:

if removeErr := os.Remove(config.GTFSDataPath); removeErr != nil && !os.IsNotExist(removeErr) {
    logging.LogError(logger, "Failed to clean up partial SQLite file before retry", removeErr,
        slog.String("path", config.GTFSDataPath),
        slog.Int("attempt", attempt),
    )
}

Fit and Finish

3. buildGtfsDB uses context.Background() instead of the caller's context

At internal/gtfs/static.go:94, buildGtfsDB creates its own context.Background(), ignoring the ctx that was carefully threaded through InitGTFSManager. This means a Ctrl+C during a long DownloadAndStore call won't interrupt the download — the process will appear unresponsive until the HTTP request completes or times out (up to 5 minutes). The cancellable select in the retry loop only helps between attempts, not during a download.

Consider updating buildGtfsDB to accept the caller's ctx so that the entire download/import pipeline respects cancellation. DownloadAndStore already calls http.NewRequestWithContext(ctx, ...), so it's designed to be cancellable — it just never receives the right context.

4. Consider adding test coverage for the retry logic

The retry mechanism is the core feature of this PR, but it currently has zero direct test coverage. All existing tests use local files, which sets maxAttempts = 1 and skips the retry loop entirely. The hardcoded backoff schedule ([5s, 15s, 30s, 60s]) makes it impractical to write fast tests.

One approach: make the backoff schedule configurable via a field on Config (defaulting to the current values). Tests could then use [1ms, 1ms, 1ms, 1ms] and exercise the retry, cancellation, and cleanup paths in under a second.

Thanks again, and I look forward to merging this change.

…Away#496)

- Wraps loadGTFSData and buildGtfsDB in a retry loop (max 5 attempts)
- Implements exponential backoff schedule
- Cleans up partial SQLite files on subsequent attempts
- Threads context.Context for cancellable startup
@ARCoder181105 force-pushed the fix/496-gtfs-startup-retry branch from 15f5eed to 400dffd on March 2, 2026 at 06:33
Member

@aaronbrethorst left a comment


Hey Aditya — excellent persistence across four review rounds. Every single item from the previous three rounds has been cleanly addressed: the partial SQLite cleanup with error logging, the connection leak fix, the local file skip, structured logging, context plumbing all the way through buildGtfsDB and signal.NotifyContext, and the configurable backoff schedule that makes the retry test fast. The implementation is solid. Just one small thing to clean up.

Prior Feedback Status

All 11 items from three previous rounds are resolved:

Round | # | Item | Status
1 | 1 | buildGtfsDB leaves partial SQLite file on disk | Fixed
1 | 2 | buildGtfsDB leaks open SQLite connection on failure | Fixed
1 | 3 | Retrying local file loads is unnecessary | Fixed
1 | 4 | Use structured logging attributes | Fixed
1 | 5 | Log success after retry recovery | Fixed
1 | 6 | Accept context.Context for cancellable startup | Fixed
2 | 1 | Run go fmt ./... | Fixed
3 | 1 | go fmt still not run | Fixed
3 | 2 | Log os.Remove errors | Fixed
3 | 3 | buildGtfsDB context plumbing | Fixed
3 | 4 | Test coverage for retry logic | Fixed

Critical

No critical issues.

Important

No important issues.

Fit and Finish

1. Remove branch-name comments in main.go

cmd/api/main.go:17, 21:

// From fix/496-gtfs-startup-retry: Graceful shutdown context
ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
defer stop()

// From main: Mutex profiling configuration

These look like rebase artifacts — they reference branch names, which don't mean anything to future readers. The code is self-explanatory; just remove both comments:

ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
defer stop()

if os.Getenv("MAGLEV_PROFILE_MUTEX") == "1" {

Verdict

All prior feedback is addressed, tests pass, lint is clean, formatting is clean. The only remaining item is cosmetic. Merge after removing the branch-name comments.

@aaronbrethorst aaronbrethorst merged commit c433a1c into OneBusAway:main Mar 3, 2026
4 checks passed


Development

Successfully merging this pull request may close these issues.

GTFS static data load has no retry on startup