build: add ijson as a runtime dependency #1011
Draft
devin-ai-integration[bot] wants to merge 2 commits into main
Adds the ijson streaming JSON parser as a direct dependency so connectors that ship inside the source-declarative-manifest base image can stream-parse very large JSON response bodies without materializing the full document in memory. Motivation: source-amazon-seller-partner currently OOMs while reading GET_BRAND_ANALYTICS_SEARCH_TERMS_REPORT documents that can exceed 3 GB uncompressed. See airbytehq/oncall#12143.
👋 Greetings, Airbyte Team Member!
Here are some helpful tips and reminders for your convenience.
Testing This CDK Version
You can test this version of the CDK using the following:
# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1777722612-add-ijson-dep#egg=airbyte-python-cdk[dev]' --help
# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1777722612-add-ijson-dep
PyTest Results (Fast): 4 040 tests ±0, 4 028 ✅ -1, 6m 41s ⏱️ -1m 4s. Results for commit d2cd1ac. ± Comparison against base commit 886fcf8. This pull request skips 1 test.
Summary
Adds ijson as a direct runtime dependency of airbyte-cdk. ijson is a streaming JSON parser. Adding it as a CDK dep makes it available inside the source-declarative-manifest (SDM) base image so that manifest-only connectors with custom components can stream-parse very large JSON response bodies without materializing the entire document in memory.
Motivation
source-amazon-seller-partner is currently OOMing on GET_BRAND_ANALYTICS_SEARCH_TERMS_REPORT (and the other Brand Analytics reports) when reports exceed a few GB uncompressed. Its custom GzipJsonDecoder does:
1. response.content: buffers the full compressed payload.
2. gzip.decompress(...): materializes the full decompressed bytes (~3 GB).
3. .decode("iso-8859-1"): copies that into a Python string.
4. json.loads(document): parses the entire JSON tree.
For a ~3.2 GB report this easily peaks at 12–20 GB of memory, well past the 8 GB cap the customer is allocating. See airbytehq/oncall#12143 for the customer-facing context.
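The four steps above, next to a streaming alternative, can be sketched roughly as follows. This is an illustrative sketch, not the connector's actual code: the function names `decode_eagerly` and `iter_records`, and the `prefix` parameter, are hypothetical. The streaming path also assumes the payload is UTF-8-compatible, since `ijson` reads raw bytes. The ISO-8859-1 handling of the current decoder is not reproduced there.

```python
import gzip
import json
from typing import Any, BinaryIO, Iterator


def decode_eagerly(compressed: bytes) -> Any:
    """Mirror steps 2-4 above; `compressed` plays the role of response.content."""
    raw = gzip.decompress(compressed)    # full decompressed bytes in memory
    document = raw.decode("iso-8859-1")  # a second full copy, as a str
    return json.loads(document)          # plus the fully parsed tree


def iter_records(compressed: BinaryIO, prefix: str) -> Iterator[Any]:
    """Yield the JSON values found at `prefix`, one at a time (hypothetical helper)."""
    # Third-party import kept local so the eager path above runs without ijson.
    import ijson

    # GzipFile inflates incrementally as ijson pulls bytes from it, so neither
    # the decompressed document nor the full parsed tree is held at once.
    with gzip.GzipFile(fileobj=compressed, mode="rb") as inflated:
        yield from ijson.items(inflated, prefix)
```

For a document shaped like {"rows": [...]}, a prefix of "rows.item" would yield each array element in turn. With requests, the file-like input could be response.raw from a request made with stream=True. The key under which the real report stores its records is not stated here, so the prefix is left as a parameter.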
The follow-up connector PR will add a streaming GzipJsonDecoder variant in source-amazon-seller-partner/components.py that uses ijson to yield records one at a time. That PR depends on this CDK release shipping in a new SDM base image.
Why a direct dep, not optional
ijson is a small (single-digit MB) wheel with prebuilt binaries for the platforms the CDK supports. Several connectors already depend on it transitively (e.g. via unstructured/document loaders), so making it explicit is low-risk and unblocks streaming use cases for any connector that needs them.
Review & Testing Checklist for Human
- … ijson shows up in the new poetry.lock and that no other dependency had its resolved version changed.
- … import ijson and use the C backend (ijson.backends.yajl2_c).
Notes
- … composite_raw_decoder.py (e.g., a JsonItemsParser powered by ijson) so connectors don't have to roll their own. Intentionally deferred to keep this change minimal.
Link to Devin session: https://app.devin.ai/sessions/e31a7df6ebe54ce4a68e0eecc7117555
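For the checklist item about the C backend, one way a reviewer might check it inside the built image is sketched below. The helper name `c_backend_available` is hypothetical. It only tests whether the ijson.backends.yajl2_c module named in the checklist can be imported, and says nothing about which backend ijson selects by default.

```python
import importlib


def c_backend_available() -> bool:
    """Return True if the ijson C backend module can be imported here."""
    try:
        importlib.import_module("ijson.backends.yajl2_c")
    except ImportError:
        # Either ijson is not installed or its C extension was not built.
        return False
    return True


if __name__ == "__main__":
    print("yajl2_c importable:", c_backend_available())
```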