build: add ijson as a runtime dependency #1011

Draft
devin-ai-integration[bot] wants to merge 2 commits into main from devin/1777722612-add-ijson-dep

Summary

Adds ijson as a direct runtime dependency of airbyte-cdk.

ijson is a streaming JSON parser. Adding it as a CDK dep makes it available inside the source-declarative-manifest (SDM) base image so that manifest-only connectors with custom components can stream-parse very large JSON response bodies without materializing the entire document in memory.

Motivation

source-amazon-seller-partner is currently OOMing on GET_BRAND_ANALYTICS_SEARCH_TERMS_REPORT (and the other Brand Analytics reports) when reports exceed a few GB uncompressed. Its custom GzipJsonDecoder does:

  1. response.content — buffers the full compressed payload.
  2. gzip.decompress(...) — materializes the full decompressed bytes (~3 GB).
  3. .decode("iso-8859-1") — copies that into a Python string.
  4. json.loads(document) — parses the entire JSON tree.

For a ~3.2 GB report this easily peaks at 12–20 GB of memory, well past the 8 GB memory limit the customer has allocated. See airbytehq/oncall#12143 for the customer-facing context.
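
The four steps listed above can be sketched roughly as follows. This is a minimal reconstruction from the description, not the connector's actual GzipJsonDecoder: the function name `decode_buffered` and the sample payload are hypothetical, and the function takes the already-buffered bytes that `response.content` would return.

```python
import gzip
import json
from typing import Any


def decode_buffered(compressed: bytes) -> Any:
    """Hypothetical stand-in for the fully buffered decode path."""
    # Step 1 happened in the caller: `compressed` is the whole gzip payload.
    # Step 2: the entire decompressed document as one bytes object.
    decompressed = gzip.decompress(compressed)
    # Step 3: a second full-size copy, now as a Python str.
    document = decompressed.decode("iso-8859-1")
    # Step 4: the full parsed tree, built while both copies are still alive.
    return json.loads(document)


# Tiny made-up payload standing in for a multi-GB report.
sample = {"records": [{"id": 1}, {"id": 2}]}
payload = gzip.compress(json.dumps(sample).encode("iso-8859-1"))
parsed = decode_buffered(payload)
print(parsed)
```

Every intermediate object (compressed bytes, decompressed bytes, decoded string, parsed tree) is alive at the same time during `json.loads`, which is why the peak is several times the size of the document.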

The follow-up connector PR will add a streaming GzipJsonDecoder variant in source-amazon-seller-partner/components.py that uses ijson to yield records one at a time. That PR depends on this CDK release shipping in a new SDM base image.
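
As an illustration of the streaming approach, here is a minimal sketch, not the follow-up PR's implementation. The function name `iter_records`, the `records.item` prefix, and the sample payload are hypothetical. It also assumes the decompressed bytes are UTF-8, so the iso-8859-1 handling of the existing decoder is left out.

```python
import gzip
import io
import json
from typing import Any, BinaryIO, Iterator


def iter_records(compressed_stream: BinaryIO, prefix: str = "records.item") -> Iterator[Any]:
    """Yield one JSON record at a time from a gzip-compressed byte stream."""
    # Third-party dependency; imported lazily so this module loads without it.
    import ijson

    # GzipFile decompresses incrementally as the parser reads from it,
    # so the full decompressed document is never held in memory.
    with gzip.GzipFile(fileobj=compressed_stream, mode="rb") as decompressed:
        # ijson.items yields each object found under `prefix` once it is complete.
        yield from ijson.items(decompressed, prefix)


# Tiny made-up payload; the "records" key is an assumption for illustration.
raw = gzip.compress(json.dumps({"records": [{"id": 1}, {"id": 2}]}).encode("utf-8"))
```

With ijson installed, `list(iter_records(io.BytesIO(raw)))` returns the two sample records. In a connector, the stream would presumably be the raw HTTP response body rather than an in-memory buffer.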

Why a direct dep, not optional

ijson is a small (single-digit MB) wheel with prebuilt binaries for the platforms the CDK supports. Several connectors already depend on it transitively (e.g. via unstructured/document loaders), so making it explicit is low-risk and unblocks streaming use cases for any connector that needs them.

Review & Testing Checklist for Human

  • Verify ijson shows up in the new poetry.lock and that no other dependency had its resolved version changed.
  • Confirm CI passes (lint, format, mypy, unit tests, dependency / package install jobs).
  • Sanity-check that an SDM image rebuilt off this CDK can import ijson and use the C backend (ijson.backends.yajl2_c).

Notes

  • No public CDK API changes in this PR — it is purely a dependency addition.
  • A follow-up CDK PR could add a first-class streaming JSON parser to composite_raw_decoder.py (e.g., a JsonItemsParser powered by ijson) so connectors don't have to roll their own. Intentionally deferred to keep this change minimal.

Link to Devin session: https://app.devin.ai/sessions/e31a7df6ebe54ce4a68e0eecc7117555

Adds the ijson streaming JSON parser as a direct dependency so connectors that
ship inside the source-declarative-manifest base image can stream-parse very
large JSON response bodies without materializing the full document in memory.

Motivation: source-amazon-seller-partner currently OOMs while reading
GET_BRAND_ANALYTICS_SEARCH_TERMS_REPORT documents that can exceed 3 GB
uncompressed. See airbytehq/oncall#12143.

github-actions Bot commented May 2, 2026

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1777722612-add-ijson-dep#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1777722612-add-ijson-dep

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test - Runs connector tests with the updated CDK
  • /prerelease - Triggers a prerelease publish with default arguments
  • /poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
  • /poe <command> - Runs any poe command in the CDK environment

github-actions Bot commented May 2, 2026

PyTest Results (Fast)

4 040 tests (±0): 4 028 ✅ passed (-1), 12 💤 skipped (+1), 0 ❌ failed (±0)
1 suite (±0), 1 file (±0), duration 6m 41s ⏱️ (-1m 4s)

Results for commit d2cd1ac. ± Comparison against base commit 886fcf8.

This pull request skips 1 test.
unit_tests.sources.declarative.test_concurrent_declarative_source ‑ test_read_with_concurrent_and_synchronous_streams

github-actions Bot commented May 2, 2026

PyTest Results (Full)

4 043 tests (±0): 4 031 ✅ passed (±0), 12 💤 skipped (±0), 0 ❌ failed (±0)
1 suite (±0), 1 file (±0), duration 11m 42s ⏱️ (+19s)

Results for commit d2cd1ac. ± Comparison against base commit 886fcf8.
