Feat/cal itp import by ianktc · Pull Request #1670 · MobilityData/mobility-feed-api

ianktc · 2026-04-24T16:12:28Z

Summary:

This pull request adds support for importing feeds from Cal-ITP into the system. It includes an import handler and its tests.

Cal-ITP Import Feature:

functions-python/tasks_executor/src/tasks/data_import/cal_itp/import_cal_itp_feeds.py contains the import handler and the associated import logic
functions-python/tasks_executor/src/tasks/data_import/cal_itp/ckan_query.sql is the SQL query sent to the CKAN API to retrieve feeds from Cal-ITP
functions-python/tasks_executor/tests/tasks/data_import/cal_itp/test_cal_itp_import.py contains the associated unit and e2e tests
infra/functions-python/main.tf adds the Google Cloud Scheduler job that runs the import monthly
functions-python/tasks_executor/src/main.py adds the handler to the task list

Out of scope:
Redirecting MDB feeds to the new Cal-ITP feeds: the redirects and the CSV defining the redirect links will be included in a follow-up PR
Licensing for Cal-ITP feeds (follow-up PR after confirmation with Cal-ITP)

Cal-ITP Import — Execution Flow & Design Doc

Based on PR #1670 (feat/cal-itp-import)

Overview

The Cal-ITP import pipeline fetches GTFS schedule and real-time feeds from the California Integrated Travel Project (Cal-ITP) CKAN API and upserts them into the Mobility Feed API database. It runs as a scheduled HTTP Cloud Function (tasks_executor), triggered monthly by Cloud Scheduler, and fans out to dataset download and web revalidation tasks on completion.

Architecture Diagram

Cloud Scheduler (3 AM UTC, 1st of month)
  │  POST {"task": "cal_itp_import", "payload": {"dry_run": false}}
  ▼
tasks_executor  (Cloud Function — 8 GiB, 1000s timeout, Python 3.11)
  │
  ├─ main.py: tasks_executor()          ← HTTP entry point, routes by "task" key
  │
  └─ import_cal_itp_handler()           ← parses dry_run flag, calls orchestrator
       │
       └─ _import_cal_itp()             ← @with_db_session, core logic
            │
            ├─ _fetch_cal_itp_datasets()           → Cal-ITP CKAN API (SQL query)
            ├─ _filter_cal_itp_records()            → Bay Area 511 + customer-facing filter
            ├─ _process_cal_itp_dataset() × N       → per-dataset upsert (batched)
            ├─ _deprecate_stale_feeds()             → mark unseen cal_itp feeds deprecated
            └─ commit_changes()
                 ├─ db_session.commit()
                 ├─ Pub/Sub → datasets-batch-topic  (trigger dataset download per new feed)
                 └─ Cloud Tasks → web_revalidation_task_queue  (revalidate changed feeds)

Step-by-Step Execution Flow

1. Cloud Scheduler Trigger

Terraform resource: google_cloud_scheduler_job.cal_itp_import_schedule
(infra/functions-python/main.tf ~line 564)

Property	Value
Schedule	`0 0 3 * *` — 3 AM UTC, monthly (1st of each month)
Target	HTTP POST → `tasks_executor` Cloud Function URL
Auth	OIDC token from `functions_service_account`
Body	`{"task": "cal_itp_import", "payload": {"dry_run": false}}`
Attempt deadline	320 seconds
Active in prod	Yes (paused in lower environments)

2. Cloud Function — tasks_executor

Terraform resource: google_cloudfunctions2_function.tasks_executor
(infra/functions-python/main.tf ~line 1090)

Property	Value
Entry point	`tasks_executor` (in `main.py`)
Memory	8 GiB
Timeout	1000 seconds
Max instances	200
Max concurrency per instance	1
VPC	Private ranges only via vpc-connector
Secrets	`FEEDS_DATABASE_URL`, `FEEDS_CREDENTIALS`, `WEB_APP_REVALIDATE_SECRET`

Key env vars set by Terraform:

DATASET_PROCESSING_TOPIC_NAME → datasets-batch-topic-{env} (Pub/Sub)
WEB_REVALIDATION_QUEUE → Cloud Tasks queue name
WEB_APP_REVALIDATE_URL → web app revalidation endpoint
PROJECT_ID, ENVIRONMENT, SERVICE_ACCOUNT_EMAIL

3. HTTP Router — `main.py:tasks_executor()`

The function parses request.get_json() for a "task" key and dispatches to the registered handler:

tasks = {
    "cal_itp_import": {"handler": import_cal_itp_handler, ...},
    # + 12 other tasks (tdg_import, jbda_import, revalidate_feed, ...)
}
handler = tasks[task]["handler"]
result = handler(payload=payload)

For unknown tasks → HTTP 400. For handler exceptions → HTTP 500.
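
The routing rule above can be sketched as follows. This is a minimal illustration, not the actual main.py: it takes an already-parsed dict instead of an HTTP request, and `dispatch`, `TASKS`, and the stand-in handler body are hypothetical names used only for this example.

```python
from typing import Any, Callable, Dict, Tuple


def import_cal_itp_handler(payload: Dict[str, Any]) -> Dict[str, Any]:
    # Stand-in for the real handler; it only echoes the parsed flag.
    return {"params": {"dry_run": payload.get("dry_run", True)}}


TASKS: Dict[str, Dict[str, Callable]] = {
    "cal_itp_import": {"handler": import_cal_itp_handler},
}


def dispatch(body: Dict[str, Any]) -> Tuple[Dict[str, Any], int]:
    """Route a parsed JSON body to its handler and map failures to status codes."""
    task = body.get("task")
    payload = body.get("payload") or {}
    if task not in TASKS:
        return {"error": f"Unknown task: {task}"}, 400
    try:
        return TASKS[task]["handler"](payload=payload), 200
    except Exception as exc:  # any handler failure becomes a 500
        return {"error": str(exc)}, 500
```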

4. `import_cal_itp_handler(payload)`

File: import_cal_itp_feeds.py

Parses dry_run from payload (default: True)
Calls _import_cal_itp(dry_run=dry_run)

Logs and returns summary dict:

{
  "message": "Cal-ITP import executed successfully.",
  "created_gtfs": 12,
  "updated_gtfs": 5,
  "created_rt": 8,
  "total_processed_items": 120,
  "params": {"dry_run": false}
}
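
A minimal sketch of the flag parsing described in this step, assuming the payload is a plain dict or None. `parse_dry_run` is a hypothetical helper, and the handling of string values such as "false" is an assumption rather than confirmed behaviour of the handler.

```python
from typing import Any, Dict, Optional


def parse_dry_run(payload: Optional[Dict[str, Any]]) -> bool:
    """Return the dry_run flag, defaulting to True when it is absent."""
    value = (payload or {}).get("dry_run", True)
    if isinstance(value, str):
        # Assumption: string flags such as "false" should not count as truthy.
        return value.strip().lower() not in {"false", "0", "no"}
    return bool(value)
```

Defaulting to True means a call with a missing or empty payload cannot publish downstream messages by accident.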

5. `_import_cal_itp(db_session, dry_run)` — Orchestrator

Decorated with @with_db_session (manages SQLAlchemy session lifecycle).

1. _fetch_cal_itp_datasets()        → raw list of dataset dicts from CKAN
2. _filter_cal_itp_records()        → filtered list (Bay Area + customer-facing rules)
3. for each dataset:
     _process_cal_itp_dataset()     → upsert feeds, accumulate changed IDs
     if batch boundary crossed:
       commit_changes() (partial)
4. _deprecate_stale_feeds()         → mark feeds not seen this run as deprecated
5. commit_changes() (final)
6. if dry_run: db_session.rollback() and skip all triggers

Batch size is controlled by COMMIT_BATCH_SIZE env var (default: 5).
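
The batching loop can be sketched like this. It is an illustration under assumptions: `process_in_batches` and its callables are hypothetical, and "batch boundary crossed" is interpreted here as "every COMMIT_BATCH_SIZE datasets", which the real code may define differently.

```python
import os
from typing import Any, Callable, Iterable, Optional


def process_in_batches(
    datasets: Iterable[Any],
    process_one: Callable[[Any], None],
    commit: Callable[[], None],
    batch_size: Optional[int] = None,
) -> int:
    """Process datasets, committing every `batch_size` items; return the commit count."""
    size = batch_size or int(os.getenv("COMMIT_BATCH_SIZE", "5"))
    pending = 0
    commits = 0
    for dataset in datasets:
        process_one(dataset)
        pending += 1
        if pending >= size:
            commit()  # partial commit at the batch boundary
            commits += 1
            pending = 0
    commit()  # final commit, which follows the stale-feed pass in the real flow
    return commits + 1
```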

6. Data Fetching — `_fetch_cal_itp_datasets()`

Endpoint: https://data.ca.gov/api/3/action/datastore_search_sql?sql=<encoded>
SQL query: ckan_query.sql — joins 4 CKAN datasets:

Dataset	UUID	Content
`gtfs_datasets`	`e4ca5bd4-...`	Feed URLs, entity types
`services`	`dbacfa9f-...`	Service / agency metadata
`provider_gtfs_data`	`ebe116fb-...`	Customer-facing flag, regional type
`organizations`	`677e1271-...`	Caltrans district name

Filters: is_public = 'Yes' AND at least one feed URL present
Returns: List of flat dicts containing service metadata + all feed URLs for that service
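
As a rough sketch of how the request URL could be assembled from the contents of ckan_query.sql, using only the standard library. `build_ckan_sql_url` is a hypothetical helper; the endpoint is the one listed above.

```python
from urllib.parse import urlencode

CKAN_SQL_ENDPOINT = "https://data.ca.gov/api/3/action/datastore_search_sql"


def build_ckan_sql_url(sql: str) -> str:
    """Return the datastore_search_sql URL with the query percent-encoded."""
    return f"{CKAN_SQL_ENDPOINT}?{urlencode({'sql': sql.strip()})}"
```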

7. Record Filtering — `_filter_cal_itp_records()`

Records are grouped by service_source_record_id. For each group:

Bay Area 511 services (detected by "Bay Area 511 Regional" in any name column):

Apply priority-based deduplication:
1. Regional Precursor Feed (preferred)
2. Regional Subfeed
3. Combined Regional Feed
4. If none match → keep all

All other services:

Keep only records where gtfs_service_data_customer_facing == true/yes/1
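
The grouping and filtering rules above might look like the sketch below. The column name `regional_feed_type` and the helper names are assumptions made for illustration; only `service_source_record_id` and `gtfs_service_data_customer_facing` come from the description.

```python
from collections import defaultdict
from typing import Any, Dict, List

BAY_AREA_MARKER = "Bay Area 511 Regional"
REGIONAL_PRIORITY = ["Regional Precursor Feed", "Regional Subfeed", "Combined Regional Feed"]
TRUTHY = {"true", "yes", "1"}
Record = Dict[str, Any]


def _is_bay_area(record: Record) -> bool:
    # Check every column whose key mentions "name" for the regional marker.
    return any(
        "name" in key and BAY_AREA_MARKER in str(value or "")
        for key, value in record.items()
    )


def _dedupe_by_priority(group: List[Record]) -> List[Record]:
    # Keep only the highest-priority tier present; keep all if no tier matches.
    for tier in REGIONAL_PRIORITY:
        matches = [r for r in group if r.get("regional_feed_type") == tier]
        if matches:
            return matches
    return group


def filter_records(records: List[Record]) -> List[Record]:
    """Group by service and apply the Bay Area or customer-facing rule per group."""
    groups: Dict[Any, List[Record]] = defaultdict(list)
    for record in records:
        groups[record.get("service_source_record_id")].append(record)

    kept: List[Record] = []
    for group in groups.values():
        if any(_is_bay_area(r) for r in group):
            kept.extend(_dedupe_by_priority(group))
        else:
            kept.extend(
                r for r in group
                if str(r.get("gtfs_service_data_customer_facing", "")).strip().lower() in TRUTHY
            )
    return kept
```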

8. Per-Dataset Processing — `_process_cal_itp_dataset()`

For each filtered dataset record:

a. Resource Expansion

Expand one dataset dict into 1–4 resource dicts:

1× GTFS Schedule (if schedule_dataset_url present)
Up to 3× GTFS-RT (trip_updates, vehicle_positions, service_alerts)
- Each gets "entity_type": ["{rt_type}"] (a single-element list)

Resources are sorted: schedule first, then RT feeds.
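
A minimal sketch of the expansion step, assuming the RT URL columns follow the `{type}_dataset_url` pattern listed in the validation step. `expand_resources` and the `kind`/`url` keys are hypothetical; appending the schedule resource first gives the schedule-then-RT order without a separate sort.

```python
from typing import Any, Dict, List

RT_TYPES = ("trip_updates", "vehicle_positions", "service_alerts")


def expand_resources(dataset: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the schedule resource (if any) followed by the RT resources."""
    resources: List[Dict[str, Any]] = []
    if dataset.get("schedule_dataset_url"):
        resources.append({"kind": "schedule", "url": dataset["schedule_dataset_url"]})
    for rt_type in RT_TYPES:
        url = dataset.get(f"{rt_type}_dataset_url")
        if url:
            # Each RT resource carries a single-element entity_type list.
            resources.append({"kind": "rt", "url": url, "entity_type": [rt_type]})
    return resources
```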

b. Validation — `_validate_required_cal_itp_fields()`

For each resource, validates required fields exist and are non-empty:

Schedule: schedule_source_record_id, schedule_gtfs_dataset_name, schedule_dataset_url
RT: {type}_source_record_id, {type}_gtfs_dataset_name, {type}_dataset_url

Raises InvalidCalItpFeedError on failure; resource is skipped.
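
The required-field check could be sketched as below. The exception class here is a stand-in with the same name as the one mentioned above, and `validate_required_fields` with its `prefix` argument is a hypothetical signature, not the real one.

```python
from typing import Any, Dict


class InvalidCalItpFeedError(ValueError):
    """Raised when a resource lacks a required field (stand-in for the real class)."""


def validate_required_fields(record: Dict[str, Any], prefix: str) -> None:
    """Raise if any of the three required `{prefix}_*` fields is missing or blank."""
    required = [
        f"{prefix}_source_record_id",
        f"{prefix}_gtfs_dataset_name",
        f"{prefix}_dataset_url",
    ]
    missing = [name for name in required if not str(record.get(name) or "").strip()]
    if missing:
        raise InvalidCalItpFeedError(f"Missing required fields: {', '.join(missing)}")
```

Here `prefix` would be `schedule` for a schedule resource or one of `trip_updates`, `vehicle_positions`, `service_alerts` for an RT resource.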

c. Stable ID Generation

cal_itp-{service_source_record_id}-{type_code}

Type codes: s (schedule), tu (trip updates), vp (vehicle positions), sa (service alerts)
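
For illustration, the ID pattern and type codes above translate to a few lines; `make_stable_id` is a hypothetical name.

```python
TYPE_CODES = {
    "schedule": "s",
    "trip_updates": "tu",
    "vehicle_positions": "vp",
    "service_alerts": "sa",
}


def make_stable_id(service_source_record_id: str, feed_type: str) -> str:
    """Build cal_itp-{service_source_record_id}-{type_code}."""
    return f"cal_itp-{service_source_record_id}-{TYPE_CODES[feed_type]}"
```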

d. Location Mapping — `_get_cal_itp_locations()`

Maps caltrans_district_name → Location DB row:

Country: United States (hardcoded)
State: California (hardcoded)
City: caltrans_district_name

e. GTFS Schedule Feed Processing

HEAD probe via _probe_head_format() — verify URL returns a ZIP
_delete_and_recreate_feed_if_type_changed() — handles type conflicts (delete + flush + recreate)
Fingerprint comparison:
- API fingerprint: (stable_id, feed_name, provider, producer_url)
- DB fingerprint: same fields read from existing row
- If equal → skip all writes (no-op)
Update common fields: feed_name, provider, producer_url, operational_status, locations
_ensure_cal_itp_external_id() — ensure Externalid row exists for this feed

f. GTFS-RT Feed Processing

Same create/update flow as schedule
_get_entity_types_from_resource() — maps RT type string to entity type codes
- ENTITY_TYPES_MAP: {"trip_updates": "tu", "vehicle_positions": "vp", "service_alerts": "sa"}
get_or_create_entity_type(db_session, et) — upserts Entitytype rows
Links RT feed → schedule feed via static_current_feed reference
RT fingerprint additionally includes static_refs and entity_types
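
The fingerprint comparison used for both schedule and RT feeds can be sketched with a tuple type. This is an assumption-laden illustration: `Fingerprint` and `needs_write` are hypothetical, and storing `static_refs` and `entity_types` as frozensets (so ordering does not matter) is a choice made here, not something the description confirms.

```python
from typing import FrozenSet, NamedTuple, Optional


class Fingerprint(NamedTuple):
    stable_id: str
    feed_name: str
    provider: str
    producer_url: str
    # RT-only fields; left empty for schedule feeds.
    static_refs: FrozenSet[str] = frozenset()
    entity_types: FrozenSet[str] = frozenset()


def needs_write(api_fp: Fingerprint, db_fp: Optional[Fingerprint]) -> bool:
    """True when the feed is new or any fingerprinted field differs."""
    return db_fp is None or api_fp != db_fp
```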

g. Error Handling Per Resource

Savepoint created before each resource
IntegrityError → rollback to savepoint, log, continue
Generic Exception → rollback to savepoint, log, continue
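
To show the savepoint pattern in isolation, here is a sketch that swaps the real SQLAlchemy session for the standard-library `sqlite3` module and raw SAVEPOINT statements. The `feed` table and `insert_feeds` are hypothetical; the point is only that a failed resource is rolled back without losing the rows written before it.

```python
import sqlite3
from typing import Iterable, List


def insert_feeds(conn: sqlite3.Connection, stable_ids: Iterable[str]) -> List[str]:
    """Insert each ID under its own savepoint; return the IDs that failed."""
    failed: List[str] = []
    conn.execute("BEGIN")
    for stable_id in stable_ids:
        conn.execute("SAVEPOINT resource")
        try:
            conn.execute("INSERT INTO feed (stable_id) VALUES (?)", (stable_id,))
            conn.execute("RELEASE SAVEPOINT resource")
        except sqlite3.IntegrityError:
            # Undo only this resource, then carry on with the next one.
            conn.execute("ROLLBACK TO SAVEPOINT resource")
            conn.execute("RELEASE SAVEPOINT resource")
            failed.append(stable_id)
    conn.execute("COMMIT")
    return failed


# isolation_level=None gives manual transaction control in sqlite3.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE feed (stable_id TEXT PRIMARY KEY)")
failed = insert_feeds(conn, ["a", "b", "a", "c"])
```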

9. Stale Feed Deprecation — `_deprecate_stale_feeds()`

After all datasets are processed:

Query all Feed rows where stable_id LIKE 'cal_itp-%'
Any with a stable ID not in processed_stable_ids → set status = "deprecated"
Ensures feeds that no longer appear in Cal-ITP data are cleaned up automatically
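
The stale-detection rule reduces to a set difference over stable IDs. A minimal sketch, with `find_stale_stable_ids` as a hypothetical helper operating on plain strings rather than Feed rows:

```python
from typing import Iterable, List, Set


def find_stale_stable_ids(
    db_stable_ids: Iterable[str],
    processed_stable_ids: Set[str],
    prefix: str = "cal_itp-",
) -> List[str]:
    """Return Cal-ITP stable IDs present in the DB but not seen in this run."""
    return sorted(
        sid for sid in db_stable_ids
        if sid.startswith(prefix) and sid not in processed_stable_ids
    )
```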

10. Commit & Downstream Triggers — `commit_changes()`

db_session.commit()
  │
  ├── for each feed in feeds_to_publish:
  │     trigger_dataset_download(feed, execution_id)
  │       └── Pub/Sub publish → datasets-batch-topic-{env}
  │             payload: {execution_id, producer_url, feed_stable_id, feed_id, ...}
  │
  └── if changed_feed_stable_ids:
        create_web_revalidation_task(changed_feed_stable_ids)
          └── Cloud Tasks enqueue → web_revalidation_task_queue
                payload: {"task": "revalidate_feed", "payload": {"feed_stable_id": "..."}}
                scheduled at next :00 or :30 boundary (30-min deduplication window)

On IntegrityError: rollback, log, re-raise (propagates to caller).
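
The ":00 or :30 boundary" scheduling in the diagram above can be computed as in this sketch. `next_half_hour` is a hypothetical helper, and treating a timestamp that falls exactly on a boundary as belonging to the following one is an assumption.

```python
from datetime import datetime, timedelta, timezone


def next_half_hour(now: datetime) -> datetime:
    """Return the next :00 or :30 boundary strictly after `now`'s half-hour bucket start."""
    base = now.replace(minute=0, second=0, microsecond=0)
    if now.minute < 30:
        return base + timedelta(minutes=30)
    return base + timedelta(hours=1)


example = next_half_hour(datetime(2026, 4, 24, 16, 12, 28, tzinfo=timezone.utc))
```

Requests for the same feed that resolve to the same boundary can then share one task, which is what keeps the revalidation fan-out bounded.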

11. Dry Run Mode

When dry_run=True (the default when called without a payload):

All DB writes happen in the session but are rolled back at the end
No Pub/Sub messages published
No Cloud Tasks enqueued
Returns the same summary dict so results can be inspected

Key Design Decisions

Decision	Rationale
Fingerprint-based diffing	Avoids unnecessary DB writes and downstream triggers on unchanged feeds
Savepoint per resource	Isolates per-resource failures; one bad feed doesn't abort the whole import
Batched commits (default 5)	Balances memory usage vs. DB round-trips for large imports
Stable IDs (`cal_itp-{id}-{type}`)	Enables idempotent upserts and stale detection across runs
Stale deprecation pass	Automatically cleans up feeds removed from Cal-ITP without manual intervention
Dry run default	Safe to invoke manually/in dev without side effects
Cloud Tasks time bucketing	Deduplicates revalidation requests within 30-minute windows to avoid fan-out storms

Entity Types Map

CKAN field	Entity type code	Description
`trip_updates`	`tu`	GTFS-RT trip updates
`vehicle_positions`	`vp`	GTFS-RT vehicle positions
`service_alerts`	`sa`	GTFS-RT service alerts

Copilot

Pull request overview

Adds a new Cal-ITP data import pipeline to the tasks executor, including the import implementation, CKAN query, tests, and a scheduled monthly execution in GCP.

Changes:

Introduces Cal-ITP import handler + CKAN SQL query for retrieving feed records.
Registers the new cal_itp_import task in the tasks executor and adds unit/e2e tests.
Adds a monthly Cloud Scheduler job to invoke the Cal-ITP import task.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.

Show a summary per file

File	Description
infra/functions-python/main.tf	Adds a monthly Cloud Scheduler job to call the tasks executor with `cal_itp_import`.
functions-python/tasks_executor/src/main.py	Registers the new `cal_itp_import` task and handler.
functions-python/tasks_executor/src/tasks/data_import/cal_itp/import_cal_itp_feeds.py	Implements Cal-ITP dataset retrieval, filtering, upsert logic, and orchestration/commit hooks.
functions-python/tasks_executor/src/tasks/data_import/cal_itp/ckan_query.sql	Provides the CKAN datastore SQL used to retrieve Cal-ITP feed records.
functions-python/tasks_executor/tests/tasks/data_import/test_cal_itp_import.py	Adds helper/unit tests and an end-to-end DB test for the Cal-ITP import flow.

ianktc · 2026-05-25T15:19:48Z

Recent PR commit changes and the comments they address:

PR Comments/Concerns

Removed the licensing code
Switched to the logging library and removed file logging

Dry Run rollback

I noticed that we don't perform a rollback when dry_run = True in the tdg import commit. I think one might be necessary because the @with_db_session decorator does not roll back by itself: it commits on its own if no exception is thrown. It closes the session with session.close() in the finally block, but by then it has already committed.

We don't currently commit changes on a dry run anywhere, with one exception. During per-dataset processing, on the first run with brand-new data, the feed is created in get_or_create_feed(), which adds and flushes on lines 105 and 106. So the first dry run with brand-new data writes the feed to the DB, which is the scenario I encountered.

Adding the rollback on dry_run may not be strictly necessary from this point onwards, because the case only occurs once...

Pagination

The SQL search endpoint itself doesn't support pagination, so we would need to use LIMIT directly in the SQL query or switch to the other endpoint, "datastore_search". If we switch, we will need to perform the JOIN logic ourselves in Python, whereas with the datastore_search_sql endpoint the JOIN is done on the CKAN server side. I added a debug log statement to record the response time and size so we can decide whether it's necessary to move to pagination. Currently we receive ~600 records from the SQL query in ~349 ms.
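
If pagination does become necessary, the LIMIT-in-SQL option could be sketched as follows. Everything here is hypothetical: `paged_sql`, `fetch_all`, and the injected `run_query` callable (which would wrap the HTTP call) are illustration only, and the approach assumes the base query has a stable ORDER BY so pages do not overlap.

```python
from typing import Any, Callable, Dict, Iterator, List

Row = Dict[str, Any]


def paged_sql(base_sql: str, page_size: int, offset: int) -> str:
    """Append LIMIT/OFFSET to a query that already has a stable ORDER BY."""
    return f"{base_sql.rstrip().rstrip(';')} LIMIT {page_size} OFFSET {offset}"


def fetch_all(
    base_sql: str,
    run_query: Callable[[str], List[Row]],
    page_size: int = 500,
) -> Iterator[Row]:
    """Yield rows page by page until a short page signals the end."""
    offset = 0
    while True:
        rows = run_query(paged_sql(base_sql, page_size, offset))
        yield from rows
        if len(rows) < page_size:
            return
        offset += page_size
```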

Reverse Geo-location concern

The fingerprint comparison between DB and API feeds doesn't include locations, so reverse geolocation overwriting the location won't cause a feed to be treated as updated.

ianktc added 4 commits April 24, 2026 11:21

feat: Cal-ITP import (#1642)

bbdd85d

feat: Cal-ITP import (#1642)

5b2fa0d

Merge branch 'feat/cal-itp-import' of github.com:MobilityData/mobility-feed-api into feat/cal-itp-import

74ea801

remove redirect for later PR

c798466

ianktc requested a review from Copilot April 24, 2026 16:12

Copilot started reviewing on behalf of ianktc April 24, 2026 16:13

Copilot AI reviewed Apr 24, 2026

ianktc and others added 2 commits May 11, 2026 15:40

address some copilot PR review comments

69b3147

Merge branch 'main' into feat/cal-itp-import

5c6d4b4

davidgamez reviewed May 12, 2026

Comment thread functions-python/tasks_executor/src/tasks/data_import/cal_itp/import_cal_itp_feeds.py Outdated

Comment thread functions-python/tasks_executor/src/tasks/data_import/cal_itp/import_cal_itp_feeds.py Outdated

ianktc and others added 3 commits May 21, 2026 16:59

remove logging to file, remove license code

8e0c317

add logging of response size and time

35ef0f9

Merge branch 'main' into feat/cal-itp-import

377d9d8

ianktc requested a review from davidgamez May 25, 2026 15:21

ianktc wants to merge 9 commits into main from feat/cal-itp-import

3 participants
