fix: gtfs_feeds endpoint performance #1652

Open
davidgamez wants to merge 1 commit into main from fix/endpoint_performance

@davidgamez davidgamez commented Apr 8, 2026

Summary:

Closes #1630

Problem

The /v1/gtfs_feeds endpoint was averaging ~2.3s in production, compared to ~100ms for
/v1/gbfs_feeds and 100–500ms for /v1/gtfs_rt_feeds.

The root cause was row explosion from applying SQLAlchemy joinedload to multiple
one-to-many relationships at once. get_gtfs_feeds_query issued a single SQL query
with 15 JOINs, including a 3-level deep chain:

Gtfsfeed
  └─ latest_dataset        (joinedload → JOIN)
       └─ validation_reports  (joinedload → JOIN)
            └─ features          (joinedload → JOIN)

…plus 5 more collection JOINs via get_joinedload_options():
locations, externalids, feedrelatedlinks, redirectingids → target,
officialstatushistories.

When multiple one-to-many relationships are loaded via JOINs simultaneously, the database
returns the Cartesian product of the related rows for each parent. For a feed with
3 locations, 2 validation reports, and 5 features per report, SQLAlchemy receives
3 × 2 × 5 = 30 rows for what is effectively 1 feed. The symptom was already
visible in code: _get_response() contained a Python-side dict-dedup
({feed.id: feed for feed in response}.values()) to throw away the duplicated ORM objects.
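The row multiplication described above can be reproduced without the ORM. The following is a minimal sketch using the stdlib sqlite3 module and a hypothetical, simplified schema (the table names feed, location, report, and feature are illustrative, and reports hang directly off the feed rather than off latest_dataset). It only illustrates the kind of SQL that joined loading produces; it is not the project's actual query.

```python
import sqlite3

# Hypothetical, simplified schema: one feed with 3 locations and
# 2 validation reports, each report having 5 features.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE feed (id INTEGER PRIMARY KEY);
    CREATE TABLE location (id INTEGER PRIMARY KEY, feed_id INTEGER);
    CREATE TABLE report (id INTEGER PRIMARY KEY, feed_id INTEGER);
    CREATE TABLE feature (id INTEGER PRIMARY KEY, report_id INTEGER);
""")
conn.execute("INSERT INTO feed (id) VALUES (1)")
conn.executemany("INSERT INTO location (feed_id) VALUES (?)", [(1,)] * 3)
conn.executemany("INSERT INTO report (feed_id) VALUES (?)", [(1,)] * 2)
report_ids = [row[0] for row in conn.execute("SELECT id FROM report")]
conn.executemany(
    "INSERT INTO feature (report_id) VALUES (?)",
    [(rid,) for rid in report_ids for _ in range(5)],
)

# Joining every collection in one query multiplies the row counts.
joined_rows = conn.execute("""
    SELECT feed.id, location.id, report.id, feature.id
    FROM feed
    LEFT JOIN location ON location.feed_id = feed.id
    LEFT JOIN report ON report.feed_id = feed.id
    LEFT JOIN feature ON feature.report_id = report.id
""").fetchall()

# Collapse duplicates by feed id, mirroring the dict-dedup idiom
# that _get_response() used on ORM objects.
unique_feed_ids = list({row[0]: row[0] for row in joined_rows}.values())

print(len(joined_rows))   # rows transferred for a single feed
print(unique_feed_ids)    # distinct feeds left after deduplication
```

With this data the joined query returns 30 rows, and the dedup step reduces them to a single feed, which is the waste the PR description refers to.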

In production, with hundreds of feeds each having multiple locations, external IDs,
validation reports, and features, this multiplied into a massive result set that had to be
transferred from the database, deserialized, and then mostly discarded.

The same pattern affected /v1/feeds (generic feed list) and /v1/gtfs_rt_feeds to a
lesser degree.

Fix

Replace joinedload with selectinload for all collection relationships.

selectinload issues a separate SELECT … WHERE id IN (…) per relationship rather than
SQL JOINs. For a page of N feeds and R relationships, this results in R + 1 queries,
each returning exactly the rows needed — no duplication, no Cartesian product.

# Before — 1 query, 15 JOINs, row count = N × M₁ × M₂ × … (Cartesian product)
feed_query.options(
    joinedload(Gtfsfeed.latest_dataset)
        .joinedload(Gtfsdataset.validation_reports)
        .joinedload(Validationreport.features),
    *get_joinedload_options(),   # 5 more collection JOINs
)

# After: ~10 focused queries, each returning only the related rows it needs (no duplication)
feed_query.options(
    selectinload(Gtfsfeed.latest_dataset)
        .selectinload(Gtfsdataset.validation_reports)
        .selectinload(Validationreport.features),
    *get_selectinload_options(),  # already existed; used by get_all_gtfs_feeds()
)
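To make the "R + 1 queries" claim concrete, here is a rough sketch of the select-in strategy written by hand against the stdlib sqlite3 module. The schema, the data, and the helper load_page_select_in are all hypothetical; SQLAlchemy's selectinload does this work internally, and this sketch only imitates the shape of the queries it emits.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE feed (id INTEGER PRIMARY KEY);
    CREATE TABLE location (id INTEGER PRIMARY KEY, feed_id INTEGER);
    CREATE TABLE externalid (id INTEGER PRIMARY KEY, feed_id INTEGER);
""")
# Made-up data: two feeds, each with 3 locations and 2 external ids.
conn.executemany("INSERT INTO feed (id) VALUES (?)", [(1,), (2,)])
conn.executemany(
    "INSERT INTO location (feed_id) VALUES (?)", [(1,)] * 3 + [(2,)] * 3
)
conn.executemany(
    "INSERT INTO externalid (feed_id) VALUES (?)", [(1,)] * 2 + [(2,)] * 2
)


def load_page_select_in(conn, child_tables):
    """Load the feeds first, then each child collection with one IN query."""
    queries = 0
    feed_ids = [row[0] for row in conn.execute("SELECT id FROM feed ORDER BY id")]
    queries += 1
    placeholders = ", ".join("?" for _ in feed_ids)
    children = {}
    for table in child_tables:  # table names are trusted constants in this sketch
        children[table] = conn.execute(
            f"SELECT id, feed_id FROM {table} WHERE feed_id IN ({placeholders})",
            feed_ids,
        ).fetchall()
        queries += 1
    return feed_ids, children, queries


feed_ids, children, queries = load_page_select_in(conn, ["location", "externalid"])
print(queries)                      # R + 1 queries for R relationships
print(len(children["location"]))    # one row per location, no duplicates
print(len(children["externalid"]))  # one row per external id, no duplicates
```

Here two relationships cost three queries in total, and each child query returns exactly as many rows as there are child records (6 locations, 4 external ids), so nothing has to be deduplicated afterwards.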

joinedload is retained only for scalar (many-to-one) relationships like
visualization_dataset, where a JOIN matches at most one related row per feed and
therefore cannot multiply rows.

selectinload does not multiply queries per feed. For a fixed set of relationships,
the query count does not depend on how many feeds a page returns, as long as the page
fits in one batch. When the IN list is large, SQLAlchemy splits it into batches, so the
count grows as O(N / batch size) rather than one query per feed.
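The batching behavior can be estimated with a few lines of arithmetic. This is a back-of-the-envelope sketch, not SQLAlchemy code: chunk_keys and queries_needed are hypothetical helpers, the default batch size of 500 follows what SQLAlchemy's documentation describes for select-in loading, and the model assumes every relationship hangs directly off the feed (nested relationships batch on their own parent keys).

```python
from math import ceil


def chunk_keys(keys, batch_size):
    """Split parent keys into batches; each batch becomes one IN (...) query."""
    return [keys[i:i + batch_size] for i in range(0, len(keys), batch_size)]


def queries_needed(n_feeds, n_relationships, batch_size=500):
    """Estimate: the main query plus one IN query per relationship per batch."""
    if n_feeds == 0:
        return 1  # only the main query runs when the page is empty
    return 1 + n_relationships * ceil(n_feeds / batch_size)


batches = chunk_keys(list(range(1200)), 500)
print([len(batch) for batch in batches])  # batch sizes for 1200 parent keys
print(queries_needed(100, 9))             # a typical API page fits in one batch
print(queries_needed(1200, 9))            # a large result set needs 3 batches
```

Under these assumptions a page of 100 feeds with 9 relationships needs 10 queries, in line with the "~10" figure quoted in this description, while 1200 feeds would need 28: the count grows with the number of batches, not with the number of feeds.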

Benchmark

Measured against the test database before and after:

| Metric | Before | After |
| --- | --- | --- |
| SQL queries issued | 1 | ~10 |
| JOINs in main query | 15 | 3 |
| Row duplication ratio | N × M₁ × M₂ × … | 1.0× (no duplication) |

Tested in QA with PROD data: response time was ~200ms per request.

Notes

  • get_selectinload_options() already existed in db_utils.py and was already used by
    get_all_gtfs_feeds() (batch processing). This change aligns the API list endpoints with
    the existing best practice in the codebase.
  • No schema or index changes required — this is purely an ORM loading strategy fix.
  • Pre-existing test failures (2 failed, 176 errors) are unrelated to this change and
    reproduce identically on the unmodified branch.

From our AI friend

This pull request refactors how SQLAlchemy relationship loading is handled in feed queries, primarily replacing joinedload with selectinload for collection relationships. This change is aimed at preventing row explosion issues (cartesian product) when loading multiple one-to-many associations, improving both performance and correctness. Additionally, some unused imports and obsolete code paths are removed to clean up the codebase.

Relationship loading improvements:

  • Replaced joinedload with selectinload for collection relationships in feed queries (e.g., Gtfsfeed.latest_dataset.validation_reports.features, Gtfsrealtimefeed.entitytypes, Gtfsrealtimefeed.gtfs_feeds) to prevent cartesian-product row explosion and improve query efficiency. (api/src/feeds/impl/feeds_api_impl.py, shared/common/db_utils.py) [1] [2] [3] [4]
  • Updated utility functions and imports to use get_selectinload_options instead of get_joinedload_options for consistency with the new loading strategy. (api/src/feeds/impl/feeds_api_impl.py, shared/common/db_utils.py) [1] [2] [3]

Code cleanup:

  • Removed unused imports and obsolete filter logic related to EntityType, Location, and associated filters, simplifying the code and reducing maintenance overhead. (api/src/feeds/impl/feeds_api_impl.py) [1] [2] [3]
  • Simplified the _get_response method to avoid unnecessary deduplication, as the new query structure prevents duplicate results. (api/src/feeds/impl/feeds_api_impl.py)

Expected behavior:

Explain and/or show screenshots for how you expect the pull request to work in your testing (in case other devices exhibit different behavior).

Testing tips:

Test this PR locally:

  • Populate DB
./scripts/

Please make sure these boxes are checked before submitting your pull request - thanks!

  • Run the unit tests with ./scripts/api-tests.sh to make sure you didn't break anything
  • Add or update any needed documentation to the repo
  • Format the title like "feat: [new feature short description]". The title must follow the Conventional Commits Specification (https://www.conventionalcommits.org/en/v1.0.0/).
  • Linked all relevant issues
  • Include screenshot(s) showing how this pull request works and fixes the issue(s)


return self._get_response(feed_query, GtfsRTFeedImpl)

entity_types_list = entity_types.split(",") if entity_types else None

Not related to this PR, but we had all this dead code...

Comment on lines -355 to +300
- response = [impl_cls.from_orm(feed) for feed in results]
- return list({feed.id: feed for feed in response}.values())
+ return [impl_cls.from_orm(feed) for feed in results]

This optimizes memory handling within the function.

@davidgamez davidgamez changed the title from "fix: gtfs_feeds endpoint performance— replace joinedload with selectinload" to "fix: gtfs_feeds endpoint performance" Apr 8, 2026
@davidgamez davidgamez marked this pull request as ready for review April 8, 2026 18:33

Development

Successfully merging this pull request may close these issues.

Investigate why gtfs_feeds API endpoint significantly longer than others
