fix(RecordExpander): replace deepcopy with shared reference to prevent O(N²) memory amplification#1009
Draft
devin-ai-integration[bot] wants to merge 1 commit intomainfrom
Draft
Conversation
…t O(N²) memory amplification When remain_original_record is enabled, RecordExpander previously deep-copied the entire parent record for every expanded child. For records with large nested arrays (e.g., Stripe invoice events with hundreds of line items), this caused O(N²) memory usage — each of N children got a full copy of the parent containing all N items. Replace copy.deepcopy(parent_record) with a direct reference. The original_record field is read-only in downstream transformations and then removed, making deep copying unnecessary. All expanded siblings now share the same parent reference. Co-Authored-By: bot_apk <apk@cognition.ai>
Contributor
Author
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
👋 Greetings, Airbyte Team Member!Here are some helpful tips and reminders for your convenience. 💡 Show Tips and TricksTesting This CDK VersionYou can test this version of the CDK using the following: # Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1777635782-fix-record-expander-deepcopy#egg=airbyte-python-cdk[dev]' --help
# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1777635782-fix-record-expander-deepcopyPR Slash CommandsAirbyte Maintainers can execute the following slash commands on your PR:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Replaces
copy.deepcopy(parent_record)with a direct shared reference inRecordExpanderwhenremain_original_record=True.Problem: When expanding records from a nested array (e.g., Stripe invoice events with many line items), the previous implementation deep-copied the entire parent record for every expanded child. For a parent with N children, this created N full copies of the parent (which itself contains all N children), resulting in O(N²) memory usage.
This appears to cause sync stalling for high-volume connectors like source-stripe, where invoice events can contain hundreds of line items. With 10 concurrent workers, the memory amplification is further compounded.
Fix: Since
original_recordis only read from in downstream transformations and then removed (viaRemoveFields), deep-copying is unnecessary. All expanded siblings now share the same parent reference, reducing memory from O(N²) to O(N).Affected code:
airbyte_cdk/sources/declarative/expanders/record_expander.py— both_apply_parent_context(dict items) and the inline non-dict branch.Declarative-First Evaluation
This fix modifies the CDK's
RecordExpandercomponent itself. No declarative YAML alternative exists — theRecordExpanderis the declarative component, and the performance issue is in its Python implementation.Test Coverage
test_record_expander_shares_parent_reference_for_dict_items— verifies all expanded dict records share the same parent reference (identity check withis)test_record_expander_shares_parent_reference_for_non_dict_items— same for non-dict itemstest_record_expander_large_expansion_memory_efficient— simulates a Stripe-like scenario with 500 line items and verifies shared referencesAll 41 tests in
test_dpath_extractor.pypass, including 17 pre-existing RecordExpander tests.Review & Testing Checklist for Human
original_recordmutates it in-place before it is removed. The shared reference is safe as long asoriginal_recordis treated as read-only.invoice_line_itemsandsubscription_itemsincremental streams work correctly after this change (they read fromoriginal_recordthen remove it).Notes
Related to https://github.com/airbytehq/oncall/issues/12136 (Stripe sync stalling after April 15).
The
RecordExpandercomponent was introduced in v7.17.0 via #859. Theremain_original_recordfeature was first used by source-stripe v6.0.0 (airbytehq/airbyte#76095) forinvoice_line_itemsandsubscription_itemsincremental streams.Link to Devin session: https://app.devin.ai/sessions/3e3a626cafe64a60a2ee62fa119d7ce5