fix: gtfs_feeds endpoint performance by davidgamez · Pull Request #1652 · MobilityData/mobility-feed-api

davidgamez · 2026-04-08T15:33:05Z

Summary:

Problem

The /v1/gtfs_feeds endpoint was averaging ~2.3s in production, compared to ~100ms for
/v1/gbfs_feeds and 100–500ms for /v1/gtfs_rt_feeds.

The root cause was row explosion caused by SQLAlchemy joinedload on multiple
one-to-many relationships simultaneously. get_gtfs_feeds_query issued a single SQL query
with 15 JOINs, including a 3-level deep chain:

Gtfsfeed
  └─ latest_dataset        (joinedload → JOIN)
       └─ validation_reports  (joinedload → JOIN)
            └─ features          (joinedload → JOIN)

…plus 5 more collection JOINs via get_joinedload_options():
locations, externalids, feedrelatedlinks, redirectingids → target,
officialstatushistories.

When multiple one-to-many relationships are loaded via JOINs simultaneously, the database
returns a Cartesian product. For a feed with 3 locations × 2 validation reports × 5 features,
SQLAlchemy receives 30 rows for what is effectively 1 feed. The symptom was already
visible in code: _get_response() contained a Python-side dict-dedup
({feed.id: feed for feed in response}.values()) to throw away the duplicated ORM objects.
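
The multiplication can be reproduced outside the ORM. The sketch below is a minimal illustration only: it uses the standard-library sqlite3 module and a hypothetical, simplified schema (`feed`, `location`, `report`, `feature` are stand-in table names, and the dataset level is collapsed), not the real models. It joins 3 locations and 2 reports with 5 features each onto one feed, then applies the same dict-based dedup idiom.

```python
import sqlite3

# Hypothetical, simplified tables; the real schema has more levels and columns.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE feed (id TEXT PRIMARY KEY);
    CREATE TABLE location (feed_id TEXT, name TEXT);
    CREATE TABLE report (id INTEGER PRIMARY KEY, feed_id TEXT);
    CREATE TABLE feature (report_id INTEGER, name TEXT);
    """
)
conn.execute("INSERT INTO feed VALUES ('feed-1')")
conn.executemany(
    "INSERT INTO location VALUES ('feed-1', ?)",
    [(f"loc-{i}",) for i in range(3)],
)
conn.executemany("INSERT INTO report VALUES (?, 'feed-1')", [(1,), (2,)])
conn.executemany(
    "INSERT INTO feature VALUES (?, ?)",
    [(report_id, f"feat-{i}") for report_id in (1, 2) for i in range(5)],
)

# One query, every collection joined at once: rows multiply per feed.
joined_rows = conn.execute(
    """
    SELECT feed.id
    FROM feed
    LEFT JOIN location ON location.feed_id = feed.id
    LEFT JOIN report ON report.feed_id = feed.id
    LEFT JOIN feature ON feature.report_id = report.id
    """
).fetchall()

# Same idea as the dict-dedup in _get_response(): keep one entry per feed id.
unique_feed_ids = list({row[0]: row for row in joined_rows})

print(len(joined_rows))   # 3 locations x 2 reports x 5 features = 30
print(unique_feed_ids)    # ['feed-1']
```

All 30 rows are transferred and parsed before 29 of them are thrown away, which is the cost the dedup was hiding.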

In production, with hundreds of feeds each having multiple locations, external IDs,
validation reports, and features, this multiplied into a massive result set that had to be
transferred from the database, deserialized, and then mostly discarded.

The same pattern affected /v1/feeds (generic feed list) and /v1/gtfs_rt_feeds to a
lesser degree.

Fix

Replace joinedload with selectinload for all collection relationships.

selectinload issues a separate SELECT … WHERE id IN (…) per relationship rather than
SQL JOINs. For a page of N feeds and R relationships, this results in R + 1 queries,
each returning exactly the rows needed — no duplication, no Cartesian product.

# Before — 1 query, 15 JOINs, row count = N × M₁ × M₂ × … (Cartesian product)
feed_query.options(
    joinedload(Gtfsfeed.latest_dataset)
        .joinedload(Gtfsdataset.validation_reports)
        .joinedload(Validationreport.features),
    *get_joinedload_options(),   # 5 more collection JOINs
)

# After: ~10 focused queries, each returning only the rows related to the N feeds on the page
feed_query.options(
    selectinload(Gtfsfeed.latest_dataset)
        .selectinload(Gtfsdataset.validation_reports)
        .selectinload(Validationreport.features),
    *get_selectinload_options(),  # already existed; used by get_all_gtfs_feeds()
)
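
To make the R + 1 shape concrete, here is a rough emulation of the selectin strategy in plain SQL. It is a sketch under assumptions, not what SQLAlchemy emits verbatim: the table names (`feed`, `location`, `externalid`, `feedrelatedlink`) and the helper `load_page_selectin_style` are hypothetical, and sqlite3 stands in for the real database.

```python
import sqlite3

# Hypothetical, simplified schema: three collection relationships per feed.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE feed (id INTEGER PRIMARY KEY);
    CREATE TABLE location (feed_id INTEGER, name TEXT);
    CREATE TABLE externalid (feed_id INTEGER, value TEXT);
    CREATE TABLE feedrelatedlink (feed_id INTEGER, url TEXT);
    """
)
feed_ids = [1, 2, 3, 4]
conn.executemany("INSERT INTO feed VALUES (?)", [(i,) for i in feed_ids])
# Rows per feed: 3 locations, 4 external ids, 5 related links.
for table, per_feed in (("location", 3), ("externalid", 4), ("feedrelatedlink", 5)):
    conn.executemany(
        f"INSERT INTO {table} VALUES (?, ?)",
        [(i, f"{table}-{n}") for i in feed_ids for n in range(per_feed)],
    )

# Fixed, trusted names, so interpolating them into SQL is safe here.
RELATIONSHIP_TABLES = ("location", "externalid", "feedrelatedlink")


def load_page_selectin_style(conn):
    """Load the feeds, then each collection with one IN query (R + 1 total)."""
    page = [row[0] for row in conn.execute("SELECT id FROM feed ORDER BY id")]
    query_count = 1
    placeholders = ", ".join("?" * len(page))
    related = {}
    for table in RELATIONSHIP_TABLES:
        sql = f"SELECT * FROM {table} WHERE feed_id IN ({placeholders})"
        related[table] = conn.execute(sql, page).fetchall()
        query_count += 1
    return page, related, query_count


page, related, query_count = load_page_selectin_style(conn)
rows_transferred = len(page) + sum(len(rows) for rows in related.values())

# For comparison: the row count of a single query joining all three collections.
joined_row_count = conn.execute(
    """
    SELECT COUNT(*) FROM feed
    LEFT JOIN location ON location.feed_id = feed.id
    LEFT JOIN externalid ON externalid.feed_id = feed.id
    LEFT JOIN feedrelatedlink ON feedrelatedlink.feed_id = feed.id
    """
).fetchone()[0]

print(query_count, rows_transferred, joined_row_count)  # 4 52 240
```

With these made-up numbers the selectin-style path transfers the sum of the collection sizes (52 rows over 4 queries), while the single joined query transfers their product (240 rows).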

joinedload is retained only for scalar (many-to-one) relationships like
visualization_dataset, where a JOIN matches at most one related row per feed and
therefore causes no row multiplication.

selectinload does not multiply database round trips per feed. For a fixed set of
relationships, the query count stays constant for any page whose IDs fit in one IN batch.
Larger ID sets are split into batches by SQLAlchemy (O(N/batch) queries per relationship),
never issued one-per-feed.
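
As a back-of-the-envelope check on that claim, the query count can be modelled as below. This is a simplified sketch, not SQLAlchemy's code: `selectin_query_count` and `chunked` are hypothetical helpers, nested relationships are treated as if they batch on the feed count, and the batch size is passed in explicitly (500 is used in the examples, which is the figure the SQLAlchemy documentation gives for selectin loading, but treat it as an assumption here).

```python
import math


def chunked(ids, batch_size):
    """Yield consecutive slices of ids, each at most batch_size long."""
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


def selectin_query_count(num_feeds, num_relationships, batch_size):
    """Estimate queries issued: 1 main query + one per relationship per batch."""
    if num_feeds == 0:
        return 1  # only the main query runs; there is nothing to load
    batches = math.ceil(num_feeds / batch_size)
    return 1 + num_relationships * batches


# A page of 100 feeds with 9 relationships fits in one batch.
print(selectin_query_count(100, 9, 500))    # 10
# 1200 feeds need 3 batches per relationship, still far from one query per feed.
print(selectin_query_count(1200, 9, 500))   # 28
print([len(c) for c in chunked(list(range(1200)), 500)])  # [500, 500, 200]
```

The relationship count of 9 is picked only so the first result lines up with the "~10" above; the actual number depends on which options the query sets.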

Benchmark

Measured against the test database before and after:

Metric	Before	After
SQL queries issued	1	~10
JOINs in main query	15	3
Row duplication ratio	N × M₁ × M₂ × …	1.0× (no duplication)

Tested in QA with PROD data; response times were ~200ms per request.

Notes

get_selectinload_options() already existed in db_utils.py and was already used by
get_all_gtfs_feeds() (batch processing). This change aligns the API list endpoints with
the existing best practice in the codebase.
No schema or index changes required — this is purely an ORM loading strategy fix.
Pre-existing test failures (2 failed, 176 errors) are unrelated to this change and
reproduce identically on the unmodified branch.

From our AI friend

This pull request refactors how SQLAlchemy relationship loading is handled in feed queries, primarily replacing joinedload with selectinload for collection relationships. This change is aimed at preventing row explosion issues (cartesian product) when loading multiple one-to-many associations, improving both performance and correctness. Additionally, some unused imports and obsolete code paths are removed to clean up the codebase.

Relationship loading improvements:

Replaced joinedload with selectinload for collection relationships in feed queries (e.g., Gtfsfeed.latest_dataset.validation_reports.features, Gtfsrealtimefeed.entitytypes, Gtfsrealtimefeed.gtfs_feeds) to prevent cartesian-product row explosion and improve query efficiency. (api/src/feeds/impl/feeds_api_impl.py, shared/common/db_utils.py)
Updated utility functions and imports to use get_selectinload_options instead of get_joinedload_options for consistency with the new loading strategy. (api/src/feeds/impl/feeds_api_impl.py, shared/common/db_utils.py)

Code cleanup:

Removed unused imports and obsolete filter logic related to EntityType, Location, and associated filters, simplifying the code and reducing maintenance overhead. (api/src/feeds/impl/feeds_api_impl.py)
Simplified the _get_response method to avoid unnecessary deduplication, as the new query structure prevents duplicate results. (api/src/feeds/impl/feeds_api_impl.py)

Expected behavior:

Explain and/or show screenshots for how you expect the pull request to work in your testing (in case other devices exhibit different behavior).

Testing tips:

Test this PR locally:

Populate DB

./scripts/

Please make sure these boxes are checked before submitting your pull request - thanks!

Run the unit tests with ./scripts/api-tests.sh to make sure you didn't break anything
Add or update any needed documentation to the repo
Format the title like "feat: [new feature short description]". Title must follow the Conventional Commit Specification (https://www.conventionalcommits.org/en/v1.0.0/).
Linked all relevant issues
Include screenshot(s) showing how this pull request works and fixes the issue(s)

davidgamez · 2026-04-08T15:36:05Z

api/src/feeds/impl/feeds_api_impl.py


        return self._get_response(feed_query, GtfsRTFeedImpl)

-        entity_types_list = entity_types.split(",") if entity_types else None


Not related to this PR, but we had all this dead code...

davidgamez · 2026-04-08T15:37:27Z

api/src/feeds/impl/feeds_api_impl.py

-        response = [impl_cls.from_orm(feed) for feed in results]
-        return list({feed.id: feed for feed in response}.values())
+        return [impl_cls.from_orm(feed) for feed in results]


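The dict-based dedup is no longer needed because the selectinload queries do not return duplicate feeds. Dropping it also avoids building an intermediate dict for every response.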

replace joinedload with selectinload

6004d5d

davidgamez changed the title ~~fix: gtfs_feeds endpoint performance— replace joinedload with selectinload~~ fix: gtfs_feeds endpoint performance Apr 8, 2026

davidgamez marked this pull request as ready for review April 8, 2026 18:33

davidgamez wants to merge 1 commit into main from fix/endpoint_performance
