Open
davidgamez commented Apr 8, 2026
    return self._get_response(feed_query, GtfsRTFeedImpl)
    entity_types_list = entity_types.split(",") if entity_types else None
Not related to this PR, but we had all this dead code...
davidgamez commented Apr 8, 2026
Comment on lines -355 to +300
    - response = [impl_cls.from_orm(feed) for feed in results]
    - return list({feed.id: feed for feed in response}.values())
    + return [impl_cls.from_orm(feed) for feed in results]
This optimizes memory handling within the function.
Summary:
Closes #1630
Problem
The `/v1/gtfs_feeds` endpoint was averaging ~2.3s in production, compared to ~100ms for `/v1/gbfs_feeds` and 100–500ms for `/v1/gtfs_rt_feeds`.
The root cause was row explosion caused by SQLAlchemy `joinedload` on multiple one-to-many relationships simultaneously.
`get_gtfs_feeds_query` issued a single SQL query with 15 JOINs, including a 3-level deep chain:
…plus 5 more collection JOINs via `get_joinedload_options()`: `locations`, `externalids`, `feedrelatedlinks`, `redirectingids → target`, `officialstatushistories`.
When multiple one-to-many relationships are loaded via JOINs simultaneously, the database returns a Cartesian product. For a feed with 3 locations × 2 validation reports × 5 features, SQLAlchemy receives 30 rows for what is effectively 1 feed. The symptom was already visible in code: `_get_response()` contained a Python-side dict-dedup (`{feed.id: feed for feed in response}.values()`) to throw away the duplicated ORM objects.
In production, with hundreds of feeds each having multiple locations, external IDs, validation reports, and features, this multiplied into a massive result set that had to be transferred from the database, deserialized, and then mostly discarded.
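The row arithmetic above can be reproduced with a minimal sketch using only the standard-library `sqlite3` module. The four-table schema (`feed`, `location`, `report`, `feature`) is a hypothetical simplification for illustration, not the project's real schema; it only mirrors the "3 locations × 2 validation reports × 5 features" example.

```python
import sqlite3

# Hypothetical, simplified schema: one feed with two independent
# one-to-many branches (locations, and reports -> features).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE feed (id TEXT PRIMARY KEY);
    CREATE TABLE location (id INTEGER PRIMARY KEY, feed_id TEXT);
    CREATE TABLE report (id INTEGER PRIMARY KEY, feed_id TEXT);
    CREATE TABLE feature (id INTEGER PRIMARY KEY, report_id INTEGER);
    """
)
conn.execute("INSERT INTO feed VALUES ('feed-1')")
conn.executemany("INSERT INTO location (feed_id) VALUES (?)", [("feed-1",)] * 3)
conn.executemany("INSERT INTO report (feed_id) VALUES (?)", [("feed-1",)] * 2)
# Reports get ids 1 and 2; attach 5 features to each.
conn.executemany(
    "INSERT INTO feature (report_id) VALUES (?)",
    [(report_id,) for report_id in (1, 2) for _ in range(5)],
)

# JOIN-everything style: every location row is paired with every
# report/feature row of the same feed.
joined_rows = conn.execute(
    """
    SELECT feed.id, location.id, report.id, feature.id
    FROM feed
    LEFT JOIN location ON location.feed_id = feed.id
    LEFT JOIN report ON report.feed_id = feed.id
    LEFT JOIN feature ON feature.report_id = report.id
    """
).fetchall()

# The Python-side dedup keeps one entry per feed id and drops the rest.
unique_feeds = list({row[0]: row for row in joined_rows}.values())

# One-query-per-relationship style: each query returns only its own rows.
feed_ids = [row[0] for row in conn.execute("SELECT id FROM feed")]
marks = ",".join("?" * len(feed_ids))
locations = conn.execute(
    f"SELECT id FROM location WHERE feed_id IN ({marks})", feed_ids
).fetchall()
reports = conn.execute(
    f"SELECT id FROM report WHERE feed_id IN ({marks})", feed_ids
).fetchall()
report_ids = [row[0] for row in reports]
report_marks = ",".join("?" * len(report_ids))
features = conn.execute(
    f"SELECT id FROM feature WHERE report_id IN ({report_marks})", report_ids
).fetchall()

print(len(joined_rows), len(unique_feeds))
print(len(feed_ids) + len(locations) + len(reports) + len(features))
```

In this toy setup the single JOIN query transfers 30 rows to describe one feed, while the four separate queries transfer 16 rows in total and need no dedup step.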
The same pattern affected `/v1/feeds` (generic feed list) and `/v1/gtfs_rt_feeds` to a lesser degree.
Fix
Replace `joinedload` with `selectinload` for all collection relationships.
`selectinload` issues a separate `SELECT … WHERE id IN (…)` per relationship rather than SQL JOINs. For a page of N feeds and R relationships, this results in R + 1 queries, each returning exactly the rows needed: no duplication, no Cartesian product.
`joinedload` is retained only for scalar (many-to-one) relationships like `visualization_dataset`, where the JOIN matches at most one related row per feed and so cannot multiply rows.
`selectinload` does not issue one query per feed. For a fixed set of relationships, the query count does not grow with each feed returned. For very large pages, SQLAlchemy splits the IN list into batches (O(N/batch) queries per relationship) rather than issuing one query per feed.
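As an illustration of the R + 1 behaviour, here is a minimal sketch assuming SQLAlchemy 1.4 or newer and an in-memory SQLite database. The `Feed`, `Publisher`, `Location`, and `ExternalId` models are hypothetical stand-ins, not the project's real models; the sketch only counts the SELECT statements emitted when two collections are loaded with `selectinload` and one scalar relationship with `joinedload`.

```python
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine, event
from sqlalchemy.orm import (
    Session,
    declarative_base,
    joinedload,
    relationship,
    selectinload,
)

Base = declarative_base()


class Publisher(Base):  # hypothetical many-to-one (scalar) target
    __tablename__ = "publisher"
    id = Column(Integer, primary_key=True)
    name = Column(String)


class Feed(Base):  # hypothetical stand-in for a feed model
    __tablename__ = "feed"
    id = Column(Integer, primary_key=True)
    publisher_id = Column(ForeignKey("publisher.id"))
    publisher = relationship(Publisher)
    locations = relationship("Location")
    external_ids = relationship("ExternalId")


class Location(Base):
    __tablename__ = "location"
    id = Column(Integer, primary_key=True)
    feed_id = Column(ForeignKey("feed.id"))


class ExternalId(Base):
    __tablename__ = "external_id"
    id = Column(Integer, primary_key=True)
    feed_id = Column(ForeignKey("feed.id"))


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

statements = []


@event.listens_for(engine, "before_cursor_execute")
def record_statement(conn, cursor, statement, parameters, context, executemany):
    # Collect every SQL string sent to the database.
    statements.append(statement)


# Seed 4 feeds, each with 3 locations and 2 external ids.
with Session(engine) as session:
    publisher = Publisher(name="demo")
    for _ in range(4):
        session.add(
            Feed(
                publisher=publisher,
                locations=[Location() for _ in range(3)],
                external_ids=[ExternalId() for _ in range(2)],
            )
        )
    session.commit()

# Query in a fresh session so nothing is already cached.
statements.clear()
with Session(engine) as session:
    feeds = (
        session.query(Feed)
        .options(
            joinedload(Feed.publisher),  # scalar: safe to JOIN
            selectinload(Feed.locations),  # collection: separate IN query
            selectinload(Feed.external_ids),  # collection: separate IN query
        )
        .all()
    )
    location_total = sum(len(feed.locations) for feed in feeds)

select_count = sum(
    1 for sql in statements if sql.lstrip().upper().startswith("SELECT")
)
print(len(feeds), location_total, select_count)
```

With two collection relationships (R = 2), the sketch emits three SELECT statements whether the page holds 4 feeds or 40, and the main query returns one row per feed.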
Benchmark
Measured against the test database before and after:
Tested in QA with PROD data: responses took ~200ms per request.
Notes
`get_selectinload_options()` already existed in `db_utils.py` and was already used by `get_all_gtfs_feeds()` (batch processing). This change aligns the API list endpoints with the existing best practice in the codebase.
reproduce identically on the unmodified branch.
From our AI friend
This pull request refactors how SQLAlchemy relationship loading is handled in feed queries, primarily replacing `joinedload` with `selectinload` for collection relationships. This change is aimed at preventing row explosion issues (Cartesian product) when loading multiple one-to-many associations, improving both performance and correctness. Additionally, some unused imports and obsolete code paths are removed to clean up the codebase.
Relationship loading improvements:
- Replaced `joinedload` with `selectinload` for collection relationships in feed queries (e.g., `Gtfsfeed.latest_dataset.validation_reports.features`, `Gtfsrealtimefeed.entitytypes`, `Gtfsrealtimefeed.gtfs_feeds`) to prevent Cartesian-product row explosion and improve query efficiency. (`api/src/feeds/impl/feeds_api_impl.py`, `shared/common/db_utils.py`)
- … `get_selectinload_options` instead of `get_joinedload_options` for consistency with the new loading strategy. (`api/src/feeds/impl/feeds_api_impl.py`, `shared/common/db_utils.py`)
Code cleanup:
- … `EntityType`, `Location`, and associated filters, simplifying the code and reducing maintenance overhead. (`api/src/feeds/impl/feeds_api_impl.py`)
- … `_get_response` method to avoid unnecessary deduplication, as the new query structure prevents duplicate results. (`api/src/feeds/impl/feeds_api_impl.py`)
Expected behavior:
Explain and/or show screenshots for how you expect the pull request to work in your testing (in case other devices exhibit different behavior).
Testing tips:
Test this PR locally:
Please make sure these boxes are checked before submitting your pull request - thanks!
… `./scripts/api-tests.sh` to make sure you didn't break anything