
[Backend] Backfill changelog records from existing datasets #1639

@cka-y

Description


Describe the problem

Once the pipeline is live, it captures only new changes. The consecutive dataset pairs already stored in the database (and in GCS) represent a rich history of feed evolution that remains invisible to the changelog.

Proposed solution

Add a backfill task to functions-python/tasks_executor/ that:

  1. Queries all GtfsFeed records
  2. For each feed, retrieves its GtfsDataset history ordered by downloaded_at
  3. For each consecutive pair (previous, current) that doesn't already have a gtfs_dataset_changelog row (idempotency check), invokes the gtfs_diff module

The task should be rate-limited (e.g., N feeds per invocation) and restartable.
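The pairing, idempotency, and rate-limit logic above can be sketched as a pure function, independent of the actual ORM models. This is a minimal illustration, not the real implementation: `Dataset`, `pending_pairs`, and the in-memory inputs are hypothetical stand-ins for the GtfsFeed/GtfsDataset records and the gtfs_dataset_changelog lookup; in the real task these would be database queries.

```python
from dataclasses import dataclass
from itertools import islice


@dataclass(frozen=True)
class Dataset:
    """Hypothetical stand-in for a GtfsDataset record."""
    feed_id: str
    dataset_id: str
    downloaded_at: int  # stand-in for the real timestamp


def pending_pairs(feed_ids, datasets_by_feed, existing_changelogs, max_feeds):
    """Yield (previous, current) dataset pairs still missing a changelog row.

    feed_ids: iterable of feed ids (stand-in for querying all GtfsFeed records)
    datasets_by_feed: feed_id -> list of Dataset (the feed's dataset history)
    existing_changelogs: set of (prev_id, curr_id) already recorded
        (the idempotency check against gtfs_dataset_changelog)
    max_feeds: rate limit, N feeds per invocation
    """
    # Rate limit: process at most max_feeds feeds per invocation, so the
    # task can be re-invoked to resume where a previous run left off.
    for feed_id in islice(feed_ids, max_feeds):
        # Order each feed's history by downloaded_at before pairing.
        history = sorted(
            datasets_by_feed.get(feed_id, []), key=lambda d: d.downloaded_at
        )
        for previous, current in zip(history, history[1:]):
            # Skip pairs that already have a changelog row (restartable).
            if (previous.dataset_id, current.dataset_id) not in existing_changelogs:
                yield previous, current
```

Keeping this selection logic separate from the gtfs_diff invocation would make the task easy to unit-test and safe to restart: a rerun simply re-derives the pending pairs and skips whatever was already written.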

Alternatives considered

  • Skip backfill entirely: acceptable if historical data is not needed at launch. The feature works forward-only. This issue can be deferred without blocking anything.
