feat: optimize memory allocation when converting execution response to dataframe#1125
Merged
Martozar merged 1 commit intogooddata:masterfrom Sep 2, 2025
Merged
Conversation
lupko
reviewed
Sep 2, 2025
| LabelOverrides = dict[str, dict[str, dict[str, str]]] | ||
|
|
||
|
|
||
| class _Header(ABC): |
Contributor
There was a problem hiding this comment.
i think this should also have __slots__ = () so as not to pollute class hierarchy.
lupko
reviewed
Sep 2, 2025
Comment on lines
+127
to
+129
| _unique_headers: list[_Header] = field(factory=list) | ||
| _header_to_index: dict[_Header, int] = field(factory=dict) | ||
| _indexes: list[int] = field(factory=list) |
Contributor
There was a problem hiding this comment.
i wonder if this isn't unnecessarily complicated? in the end, the goal is to reuse _Header instances right? why not having a mapping dict[_Header, _Header] and ditch the indexes altogether?
c05773a to
fcceef4
Compare
lupko
reviewed
Sep 2, 2025
Comment on lines
+127
to
+129
|
|
||
| _headers: list[_Header] = field(factory=list) | ||
| _header_to_header: dict[_Header, _Header] = field(factory=dict) |
Contributor
There was a problem hiding this comment.
i'd rename this. _header_cache or _header_dedup or something like that. the header to header sounds like a name of some pop band :)
lupko
reviewed
Sep 2, 2025
Comment on lines
+145
to
+146
| h2h = self._header_to_header | ||
| headers = self._headers |
Contributor
There was a problem hiding this comment.
i suggest to ditch these and use the self fields directly.
fcceef4 to
5e3aaf7
Compare
…o dataframe Add `optimized` flag to DataFrameFactory to enable memory-optimized conversion of execution response to pandas dataframe. Without the flag, the conversion will run as usual, storing headers as a list of dictionaries. The optimized version only stores unique headers and reference them, preventing unnecessary memory allocations when lots of duplicated headers are processed. Note that the new behaviour is optional and turned off by default, so no existing usages should be affected. JIRA: CQ-1579 risk: low
5e3aaf7 to
ac0189f
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Add
optimizedflag to DataFrameFactory to enable memory-optimized conversion of execution response to pandas dataframe. Without the flag, the conversion will run as usual, storing headers as a list of dictionaries. The optimized version only stores unique headers and references them, thereby preventing unnecessary memory allocations when processing large numbers of duplicated headers. On large datasets, optimized conversation consumes up to 10x less memory (e.g. on 1M rows with 4 attributes original implementation consumed almost 2Gb, while optimized not more than 200Mb)Note that the new behaviour is optional and turned off by default, so no existing usages should be affected.
JIRA: CQ-1579
risk: low