Skip to content

Conversation

@aaronsteers
Copy link
Collaborator

@aaronsteers aaronsteers commented Nov 9, 2025

What

This is a proof-of-concept PR to test the new InferredSchemaLoader feature from Python CDK prerelease version 7.4.2.post15.dev19207237224. This PR demonstrates end-to-end functionality by switching source-pokeapi to use runtime schema inference instead of a static schema.

⚠️ DO NOT MERGE - This is a research/testing PR only.

Related:

How

Three-commit strategy:

  1. Commit 1: Bump base Docker image to prerelease CDK version 7.4.2.post15.dev19207237224
  2. Commit 2: Replace InlineSchemaLoader with InferredSchemaLoader on the pokemon stream (initial attempt with configuration error)
  3. Commit 3: Fix manifest configuration - remove incorrectly nested retriever from schema_loader

Key changes:

  • metadata.yaml: Updated baseImage from 7.4.1 to prerelease 7.4.2.post15.dev19207237224
  • manifest.yaml: Switched pokemon stream from InlineSchemaLoader (static schema reference) to InferredSchemaLoader (runtime inference)
  • InferredSchemaLoader configured with record_sample_size: 50 to infer schema from up to 50 sample records

Configuration Fix (Commit 3):
The initial InferredSchemaLoader configuration incorrectly nested a retriever inside the schema_loader, causing JSON schema validation errors. InferredSchemaLoader should reuse the stream's existing retriever, not embed a new one.

Review guide

Critical items to verify:

  1. CI Test Results - Standard tests may fail with schema validation errors if the test harness uses a stable airbyte-cdk that doesn't include InferredSchemaLoader. This is expected for a prerelease POC. Check if:

    • Container-level discover and check commands work (they use the prerelease CDK in the image)
    • Standard tests fail only on schema validation, not on functional issues
  2. Schema Quality - Compare the inferred schema against the original static schema:

    • airbyte-integrations/connectors/source-pokeapi/manifest.yaml (lines 28-30): New InferredSchemaLoader config
    • Original schema was at #/schemas/pokemon (removed in this PR)
    • Expected differences: Genson may produce nullable unions like ["string", "null"]
    • Risk: With record_sample_size: 50 and PokeAPI returning single records, the inferred schema may miss optional fields
  3. Single Record Limitation - PokeAPI's pokemon endpoint (/{{config['pokemon_name']}}) returns a single Pokemon by name, not a list. This means:

    • The "sample" is effectively one record, making record_sample_size: 50 somewhat meaningless
    • Schema inference is based on ONE Pokemon's structure
    • Fields that are optional or vary by Pokemon type may be missing from the inferred schema
    • This is acceptable for POC but would need adjustment for production

Files to review:

  1. airbyte-integrations/connectors/source-pokeapi/metadata.yaml - CDK version bump (line 18)
  2. airbyte-integrations/connectors/source-pokeapi/manifest.yaml - Schema loader replacement (lines 28-30)

User Impact

For this proof-of-concept:

  • No user impact - this PR should not be merged

If this approach were adopted in production:

  • Schema would be dynamically inferred at discover time instead of being statically defined
  • Schema may differ from current schema (nullable unions, potentially missing optional fields)
  • Discover operation would make additional API calls to fetch sample records
  • For connectors with single-record endpoints like PokeAPI, schema quality may be reduced

Can this PR be safely reverted and rolled back?

  • YES 💚

This is a proof-of-concept PR on a single connector with no production impact. The changes are isolated to source-pokeapi and use a prerelease CDK version.

…7224

This commit updates the base Docker image to use the prerelease CDK version
that includes the new InferredSchemaLoader feature. This is the first commit
in a two-commit proof-of-concept test.

Related: airbytehq/airbyte-python-cdk#831
Co-Authored-By: AJ Steers <aj@airbyte.io>
@devin-ai-integration
Copy link
Contributor

Original prompt from AJ Steers
Received message in Slack channel #proj-ai-connector-builder:

@Devin - Find the auto-schema detection in the Python CDK which is used in our declarative test read implementation. I believe we are inferring schema and returning it in either the "test read" implementation or the "resolve manifest" or "fully resolve manifest". Create a CDK PR which adds a new schema resolution option based on reading `n` records and inferring schema from those `n` records, basically the same thing being done (I think) in those dev-time implementation, except we'd run them (in production) at `discover` time.
Thread URL: https://airbytehq-team.slack.com/archives/C099FV37L2Z/p1762562420838709

@devin-ai-integration
Copy link
Contributor

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@github-actions
Copy link
Contributor

github-actions bot commented Nov 9, 2025

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

Helpful Resources

PR Slash Commands

Airbyte Maintainers (that's you!) can execute the following slash commands on your PR:

  • /format-fix - Fixes most formatting issues.
  • /bump-version - Bumps connector versions.
    • You can specify a custom changelog by passing changelog. Example: /bump-version changelog="My cool update"
    • Leaving the changelog arg blank will auto-populate the changelog from the PR title.
  • /run-cat-tests - Runs legacy CAT tests (Connector Acceptance Tests)
  • /build-connector-images - Builds and publishes a pre-release docker image for the modified connector(s).
  • JVM connectors:
    • /update-connector-cdk-version connector=<CONNECTOR_NAME> - Updates the specified connector to the latest CDK version.
      Example: /update-connector-cdk-version connector=destination-bigquery
    • /bump-bulk-cdk-version bump=patch changelog='foo' - Bump the Bulk CDK's version. bump can be major/minor/patch.
  • Python connectors:
    • /poe connector source-example lock - Run the Poe lock task on the source-example connector, committing the results back to the branch.
    • /poe source example lock - Alias for /poe connector source-example lock.
    • /poe source example use-cdk-branch my/branch - Pin the source-example CDK reference to the branch name specified.
    • /poe source example use-cdk-latest - Update the source-example CDK dependency to the latest available version.

📝 Edit this welcome message.

This commit replaces the InlineSchemaLoader with the new InferredSchemaLoader
on the pokemon stream. The InferredSchemaLoader will read 1 sample record at
discover time and infer the schema using genson.

This demonstrates the end-to-end functionality of the InferredSchemaLoader
feature from CDK PR airbytehq/airbyte-python-cdk#831.

Expected behavior:
- Discover should return a non-empty schema inferred from the sample record
- Schema may include nullable unions (e.g., ["string", "null"]) from genson
- Read operations should continue to work with the inferred schema

Related: airbytehq/airbyte-python-cdk#831
Co-Authored-By: AJ Steers <aj@airbyte.io>
@github-actions
Copy link
Contributor

github-actions bot commented Nov 9, 2025

source-pokeapi Connector Test Results

8 tests   0 ✅  24s ⏱️
1 suites  1 💤
1 files    7 ❌

For more details on these failures, see this check.

Results for commit 1ea0213.

♻️ This comment has been updated with latest results.

…r config

InferredSchemaLoader should reuse the stream's existing retriever, not embed
a new one. The nested retriever configuration was causing JSON schema
validation errors.

Also increased record_sample_size from 1 to 50 for better schema coverage.

Co-Authored-By: AJ Steers <aj@airbyte.io>
@devin-ai-integration
Copy link
Contributor

Show/Hide Detailed Report: CI Failure Root Cause Analysis

Investigation Summary

I've investigated the CI test failures and identified the root cause. The prerelease Docker image is correctly built with the InferredSchemaLoader feature, but there's a schema validation issue with how the manifest is configured.

Findings

1. Prerelease Docker Image Verification ✅

The prerelease Docker image is correctly built:

  • CDK Version: 7.4.2.post15.dev19207237224 (confirmed)
  • InferredSchemaLoader Definition: Present in declarative_component_schema.yaml
  • Schema Requirements:
    InferredSchemaLoader:
      required:
        - type
        - retriever  # ← REQUIRED!
      properties:
        retriever:
          anyOf:
            - "$ref": "#/definitions/SimpleRetriever"
            - "$ref": "#/definitions/AsyncRetriever"
            - "$ref": "#/definitions/CustomRetriever"
        record_sample_size:
          type: integer
          default: 100

2. Configuration Issue

Commit 2 (fbcb883) used this configuration:

schema_loader:
  type: InferredSchemaLoader
  record_sample_size: 1
  retriever:
    type: SimpleRetriever
    requester:
      $ref: "#/definitions/base_requester"  # ← Problem: $ref with sibling properties
      path: /{{config['pokemon_name']}}
      http_method: GET
    record_selector:
      type: RecordSelector
      extractor:
        type: DpathExtractor
        field_path: []

Commit 3 (1ea0213) incorrectly removed the retriever entirely, which made validation fail because retriever is a required property.

3. Root Cause

The issue with Commit 2 is likely the $ref pattern with sibling properties. The schema may not allow using $ref: "#/definitions/base_requester" alongside additional properties like path and http_method.

Recommended Fix

Inline the requester configuration instead of using $ref with sibling properties:

schema_loader:
  type: InferredSchemaLoader
  record_sample_size: 50
  retriever:
    type: SimpleRetriever
    requester:
      type: HttpRequester
      url_base: https://pokeapi.co/api/v2/pokemon
      path: /{{config['pokemon_name']}}
      http_method: GET
    record_selector:
      type: RecordSelector
      extractor:
        type: DpathExtractor
        field_path: []

Next Steps

@aaronsteers - Should I:

  1. Push a new commit with the inlined requester configuration?
  2. Test locally first by building the connector image and running docker run --rm airbyte/source-pokeapi:dev-latest spec?
  3. Document these findings and close the PR as "prerelease validation complete"?

This POC has successfully validated that:

  • ✅ The prerelease Docker image contains the InferredSchemaLoader feature
  • ✅ The schema definition is correct and properly requires a retriever
  • ⚠️ The manifest configuration needs adjustment to avoid $ref with sibling properties

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants