fix: validate empty CSV column names and improve mismatch error messages #1010

Draft
devin-ai-integration[bot] wants to merge 1 commit into main from devin/1777652889-fix-csv-empty-column-validation

@devin-ai-integration
Contributor

Summary

Resolves https://github.com/airbytehq/oncall/issues/12144:

When a CSV file has trailing empty columns (e.g., col1,col2,col3,,,), the CDK's _get_headers() method silently accepted empty-string column names. These propagated into the discovered schema and catalog, and when the platform deserialized the catalog, Jackson/Kotlin failed on io.airbyte.config.Field.name being non-nullable, producing an opaque "Init container error" with KotlinInvalidNullException.

This PR fixes the issue with three targeted changes:

  1. Validate empty column names in _get_headers(): After reading headers from the CSV file, check for empty or whitespace-only column names and raise AirbyteTracedException with failure_type=config_error. This surfaces the problem during discover so the customer gets a clear, actionable message.

  2. Improve existing mismatch error messages: Replace confusing internal-facing messages that referenced "resolved to None" with clear user-facing messages:

    • MISMATCHED_COLUMNS: "CSV data row contains more columns than the header row defines."
    • MISMATCHED_ROWS: "CSV data row contains fewer columns than the header row defines."

  3. Preserve specific error context in parse_records(): Previously, parse_records() caught RecordParseError and re-raised it with a generic ERROR_PARSING_RECORD message, discarding the specific column mismatch detail. It now passes through the original exception's message.
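
For reviewers, the header check in change 1 can be pictured roughly as follows. This is a minimal standalone sketch, not the code in csv_parser.py: read_validated_headers and HeaderValidationError are hypothetical names, the exception is a stand-in for AirbyteTracedException with failure_type=config_error, and the error wording is illustrative.

```python
import csv
import io
from typing import List


class HeaderValidationError(ValueError):
    """Hypothetical stand-in for AirbyteTracedException(failure_type=config_error)."""


def read_validated_headers(csv_text: str) -> List[str]:
    """Read the first CSV row and reject empty or whitespace-only column names."""
    reader = csv.reader(io.StringIO(csv_text))
    headers = next(reader, [])
    # Collect 1-based positions so the message points at the offending columns.
    empty_positions = [i + 1 for i, name in enumerate(headers) if not name.strip()]
    if empty_positions:
        raise HeaderValidationError(
            f"CSV header row contains empty column names at position(s) {empty_positions}. "
            "Remove the empty columns or name them."
        )
    return headers
```

With this sketch, a header row of col1,col2,col3,,, is rejected at discover time instead of producing empty-string field names in the catalog.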

Declarative-First Evaluation

N/A — This fix targets the file-based CDK (csv_parser.py), not a declarative connector manifest.

Breaking Change Evaluation

Not breaking. No schema, spec, stream, or state changes. This adds a validation guard that raises a config_error during discover for malformed CSV headers (which previously caused an opaque platform crash), and improves error message clarity.

Test Coverage

Added 7 new test cases (4 test functions, one of them parametrized over 4 inputs) in unit_tests/sources/file_based/file_types/test_csv_parser.py:

  • test_get_headers_raises_on_empty_column_names (parametrized, 4 cases: trailing, middle, leading empty columns, whitespace-only)
  • test_get_headers_accepts_valid_headers — confirms valid headers still work
  • test_read_data_raises_on_empty_column_names — end-to-end through read_data()
  • test_parse_records_preserves_mismatch_error_detail — confirms the re-raised error preserves specific mismatch detail

All 60 tests pass locally.

Review & Testing Checklist for Human

  • Verify the empty column name validation logic in _get_headers() correctly identifies all edge cases (trailing, middle, leading, whitespace-only)
  • Confirm the updated error messages in exceptions.py are clear for end-users
  • Verify the parse_records() error re-raise preserves the original detail without losing context
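
As a reference point for the last checklist item, here is a minimal sketch of what "preserving the original detail" on re-raise could look like. It is not the CDK code: RecordParseError is redefined here as a stand-in, read_rows is a hypothetical helper, and the "(line N)" suffix is an illustrative addition. Only the two message strings are taken from this PR.

```python
from typing import Dict, Iterable, Iterator, List


class RecordParseError(Exception):
    """Stand-in for the CDK's RecordParseError (redefined for this sketch)."""


MISMATCHED_COLUMNS = "CSV data row contains more columns than the header row defines."
MISMATCHED_ROWS = "CSV data row contains fewer columns than the header row defines."


def read_rows(headers: List[str], rows: Iterable[List[str]]) -> Iterator[Dict[str, str]]:
    """Hypothetical row reader that raises a specific error on a column-count mismatch."""
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header row
        if len(row) > len(headers):
            raise RecordParseError(f"{MISMATCHED_COLUMNS} (line {line_no})")
        if len(row) < len(headers):
            raise RecordParseError(f"{MISMATCHED_ROWS} (line {line_no})")
        yield dict(zip(headers, row))


def parse_records(headers: List[str], rows: Iterable[List[str]]) -> Iterator[Dict[str, str]]:
    """Re-raise parse errors with the original message instead of a generic one."""
    try:
        yield from read_rows(headers, rows)
    except RecordParseError as exc:
        # Carry the specific mismatch detail forward and keep the exception chain.
        raise RecordParseError(str(exc)) from exc
```

A check along the lines of test_parse_records_preserves_mismatch_error_detail would then assert that the caught exception's message still contains the mismatch wording.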

Notes

  • Prior art: PR airbytehq/airbyte#36237 added ignore_errors_on_fields_mismatch in the same file
  • The validation only applies to headers read from the file (not user-provided or autogenerated headers)

Link to Devin session: https://app.devin.ai/sessions/c0ac93b0ed1a401ba346b7fcc93bc41b

@devin-ai-integration
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@github-actions

github-actions Bot commented May 1, 2026

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

💡 Show Tips and Tricks

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1777652889-fix-csv-empty-column-validation#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1777652889-fix-csv-empty-column-validation

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test - Runs connector tests with the updated CDK
  • /prerelease - Triggers a prerelease publish with default arguments
  • /poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
  • /poe <command> - Runs any poe command in the CDK environment

@github-actions

github-actions Bot commented May 1, 2026

PyTest Results (Fast)

4 047 tests +7: 4 036 passed ✅ +7, 11 skipped 💤 ±0, 0 failed ❌ ±0
1 suite ±0, 1 file ±0, duration 7m 45s ⏱️ ±0s

Results for commit b177c07. ± Comparison against base commit 886fcf8.

@github-actions

github-actions Bot commented May 1, 2026

PyTest Results (Full)

4 050 tests +7: 4 038 passed ✅ +7, 12 skipped 💤 ±0, 0 failed ❌ ±0
1 suite ±0, 1 file ±0, duration 11m 29s ⏱️ +6s

Results for commit b177c07. ± Comparison against base commit 886fcf8.
