Describe the problem
Once the pipeline is live, it captures only new changes. All consecutive dataset pairs already stored in the DB (and in GCS) represent a rich history of feed evolution that remains invisible.
Proposed solution
Add a backfill task to `functions-python/tasks_executor/` that:
- Queries all `GtfsFeed` records
- For each feed, retrieves its `GtfsDataset` history ordered by `downloaded_at`
- For each consecutive pair `(previous, current)` that doesn't already have a `gtfs_dataset_changelog` row (idempotency check): invokes the `gtfs_diff` module
The task should be rate-limited (e.g., N feeds per invocation) and restartable.
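The pair selection, idempotency check, and rate limit above could be sketched roughly as follows. This is a minimal, pure-Python illustration, not the actual task: the model names come from this issue, but `pairs_to_backfill`, `run_backfill`, `feed_histories`, and `max_feeds` are hypothetical, and the real implementation would query the DB and call `gtfs_diff` instead of collecting tuples.

```python
from itertools import islice


def pairs_to_backfill(dataset_ids, existing_pairs):
    """Yield consecutive (previous, current) pairs lacking a changelog row.

    dataset_ids must already be ordered by downloaded_at; existing_pairs is
    the set of pairs that already have a gtfs_dataset_changelog row.
    """
    for prev, curr in zip(dataset_ids, dataset_ids[1:]):
        if (prev, curr) not in existing_pairs:  # idempotency check
            yield prev, curr


def run_backfill(feed_histories, existing_pairs, max_feeds=10):
    """Process at most max_feeds feeds per invocation (rate limit).

    Restartable: a re-run skips pairs already recorded, so an interrupted
    invocation can simply be triggered again.
    """
    processed = []
    for feed_id, history in islice(feed_histories.items(), max_feeds):
        for prev, curr in pairs_to_backfill(history, existing_pairs):
            # The real task would invoke the gtfs_diff module here and
            # insert the resulting gtfs_dataset_changelog row.
            processed.append((feed_id, prev, curr))
    return processed
```

Because already-diffed pairs are skipped, repeated invocations converge on full coverage without duplicating work.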
Alternatives considered
- Skip backfill entirely: acceptable if historical data is not needed at launch. The feature works forward-only. This issue can be deferred without blocking anything.