fix(RecordExpander): replace deepcopy with shared reference to prevent O(N²) memory amplification #1009

Draft
devin-ai-integration[bot] wants to merge 1 commit into main from devin/1777635782-fix-record-expander-deepcopy


@devin-ai-integration
Contributor

Summary

Replaces copy.deepcopy(parent_record) with a direct shared reference in RecordExpander when remain_original_record=True.

Problem: When expanding records from a nested array (e.g., Stripe invoice events with many line items), the previous implementation deep-copied the entire parent record for every expanded child. For a parent with N children, this created N full copies of the parent (which itself contains all N children), resulting in O(N²) memory usage.
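The quadratic growth can be made concrete by counting distinct dict objects after expansion. The following is a minimal sketch, not the CDK code: `expand` and `count_distinct_dicts` are hypothetical helpers, and the `lines` key and record shape are assumptions chosen for illustration.

```python
import copy


def count_distinct_dicts(obj, seen=None):
    """Count unique dict objects reachable from obj, compared by identity."""
    seen = set() if seen is None else seen
    if isinstance(obj, dict):
        if id(obj) in seen:
            return 0
        seen.add(id(obj))
        return 1 + sum(count_distinct_dicts(v, seen) for v in obj.values())
    if isinstance(obj, list):
        return sum(count_distinct_dicts(v, seen) for v in obj)
    return 0


def expand(parent, key, deep):
    """Hypothetical stand-in for the expander: one output record per child."""
    out = []
    for child in parent[key]:
        record = dict(child)
        # deep=True mimics the old behaviour, deep=False the shared reference.
        record["original_record"] = copy.deepcopy(parent) if deep else parent
        out.append(record)
    return out


n = 100
parent = {"id": "evt_1", "lines": [{"id": i} for i in range(n)]}
deep_count = count_distinct_dicts(expand(parent, "lines", deep=True))
shared_count = count_distinct_dicts(expand(parent, "lines", deep=False))
print(deep_count, shared_count)  # 10200 201
```

With deep copies, each of the N output records carries its own parent copy plus N child copies, giving N² + 2N dicts in total. With a shared parent the count is 2N + 1.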

This appears to cause sync stalling for high-volume connectors like source-stripe, where invoice events can contain hundreds of line items. With 10 concurrent workers, the memory amplification is further compounded.

Fix: Since original_record is only read by downstream transformations and then removed (via RemoveFields), deep-copying it is unnecessary. All expanded siblings now share the same parent reference, reducing memory usage from O(N²) to O(N).

Affected code: airbyte_cdk/sources/declarative/expanders/record_expander.py — both _apply_parent_context (dict items) and the inline non-dict branch.
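The shape of the change across both branches might look roughly like the sketch below. This is an illustration under stated assumptions, not the actual contents of record_expander.py: the function name `expand_record`, its signature, and the `{"value": item}` wrapping for non-dict items are all hypothetical.

```python
from typing import Any, Dict, Iterator


def expand_record(
    parent: Dict[str, Any], key: str, remain_original_record: bool = True
) -> Iterator[Dict[str, Any]]:
    """Yield one record per item in parent[key] (simplified, hypothetical)."""
    for item in parent.get(key, []):
        if isinstance(item, dict):
            # Dict branch: shallow-copy the child so the parent's list is untouched.
            expanded = dict(item)
        else:
            # Non-dict branch: the wrapping shape here is an assumption.
            expanded = {"value": item}
        if remain_original_record:
            # Previously copy.deepcopy(parent); now every sibling shares one object.
            expanded["original_record"] = parent
        yield expanded


invoice = {"id": "in_1", "lines": [{"id": "li_1"}, "li_2"]}
records = list(expand_record(invoice, "lines"))
```

Both branches attach the same `parent` object, so the cost of retaining the original record no longer depends on the number of children.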

Declarative-First Evaluation

This fix modifies the CDK's RecordExpander component itself. No declarative YAML alternative exists — the RecordExpander is the declarative component, and the performance issue is in its Python implementation.

Test Coverage

  • test_record_expander_shares_parent_reference_for_dict_items — verifies all expanded dict records share the same parent reference (identity check with is)
  • test_record_expander_shares_parent_reference_for_non_dict_items — same for non-dict items
  • test_record_expander_large_expansion_memory_efficient — simulates a Stripe-like scenario with 500 line items and verifies shared references
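For readers unfamiliar with identity checks, the assertions in these tests could be sketched as follows. The `expand` helper is a hypothetical stand-in for RecordExpander, and the test names and record shapes are illustrative rather than copied from test_dpath_extractor.py.

```python
def expand(parent, key):
    """Hypothetical stand-in: attach the parent to every expanded item."""
    out = []
    for item in parent[key]:
        record = dict(item) if isinstance(item, dict) else {"value": item}
        record["original_record"] = parent
        out.append(record)
    return out


def test_shares_parent_reference_for_dict_items():
    parent = {"id": "in_1", "lines": [{"id": "li_1"}, {"id": "li_2"}]}
    records = expand(parent, "lines")
    # `is` checks object identity, which equality (==) would not distinguish
    # from a deep copy.
    assert all(r["original_record"] is parent for r in records)


def test_large_expansion_shares_one_parent():
    parent = {"id": "in_1", "lines": [{"id": f"li_{i}"} for i in range(500)]}
    records = expand(parent, "lines")
    assert len(records) == 500
    # All 500 records point at a single parent object.
    assert len({id(r["original_record"]) for r in records}) == 1
```

An equality assertion alone would also pass against the old deep-copy behaviour, which is why the identity check matters here.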

All 41 tests in test_dpath_extractor.py pass, including 17 pre-existing RecordExpander tests.

Review & Testing Checklist for Human

  • Verify that no downstream consumer of original_record mutates it in-place before it is removed. The shared reference is safe as long as original_record is treated as read-only.
  • Confirm source-stripe invoice_line_items and subscription_items incremental streams work correctly after this change (they read from original_record then remove it).
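The first checklist item matters because a shared reference means any in-place write is visible to every sibling. The sketch below demonstrates the hazard with hypothetical records, and shows one possible way to surface accidental writes while auditing, using the standard library's `types.MappingProxyType`. The proxy is not part of this PR and only guards the top level of the mapping.

```python
from types import MappingProxyType

parent = {"id": "in_1", "status": "open"}
sibling_a = {"id": "li_1", "original_record": parent}
sibling_b = {"id": "li_2", "original_record": parent}

# An in-place mutation made through one sibling leaks into the other.
sibling_a["original_record"]["status"] = "paid"
leaked = sibling_b["original_record"]["status"]
print(leaked)  # paid

# A read-only view rejects top-level writes, which can help locate mutators.
guarded = MappingProxyType(parent)
try:
    guarded["status"] = "void"
    write_blocked = False
except TypeError:
    write_blocked = True
print(write_blocked)  # True
```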

Notes

Related to https://github.com/airbytehq/oncall/issues/12136 (Stripe sync stalling after April 15).

The RecordExpander component was introduced in v7.17.0 via #859. The remain_original_record feature was first used by source-stripe v6.0.0 (airbytehq/airbyte#76095) for invoice_line_items and subscription_items incremental streams.

Link to Devin session: https://app.devin.ai/sessions/3e3a626cafe64a60a2ee62fa119d7ce5

…t O(N²) memory amplification

When remain_original_record is enabled, RecordExpander previously deep-copied
the entire parent record for every expanded child. For records with large nested
arrays (e.g., Stripe invoice events with hundreds of line items), this caused
O(N²) memory usage — each of N children got a full copy of the parent containing
all N items.

Replace copy.deepcopy(parent_record) with a direct reference. The original_record
field is read-only in downstream transformations and then removed, making deep
copying unnecessary. All expanded siblings now share the same parent reference.

Co-Authored-By: bot_apk <apk@cognition.ai>
@devin-ai-integration
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@github-actions

github-actions Bot commented May 1, 2026

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

💡 Show Tips and Tricks

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1777635782-fix-record-expander-deepcopy#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1777635782-fix-record-expander-deepcopy

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test - Runs connector tests with the updated CDK
  • /prerelease - Triggers a prerelease publish with default arguments
  • /poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
  • /poe <command> - Runs any poe command in the CDK environment
@github-actions

github-actions Bot commented May 1, 2026

PyTest Results (Fast)

4 043 tests  +3   4 032 ✅ +3   7m 51s ⏱️ +6s
    1 suites ±0      11 💤 ±0 
    1 files   ±0       0 ❌ ±0 

Results for commit 1735375. ± Comparison against base commit 886fcf8.

@github-actions

github-actions Bot commented May 1, 2026

PyTest Results (Full)

4 046 tests  +3   4 034 ✅ +3   10m 52s ⏱️ -31s
    1 suites ±0      12 💤 ±0 
    1 files   ±0       0 ❌ ±0 

Results for commit 1735375. ± Comparison against base commit 886fcf8.
