
GH-49058: [Python] Disallow non-UTF-8 bytes in custom metadata #49696

Open
nitrajen wants to merge 1 commit into apache:main from nitrajen:GH-49058-non-utf8-metadata


nitrajen commented Apr 9, 2026

Rationale for this change

Schema.fbs requires metadata keys and values to be valid UTF-8 strings, but PyArrow currently accepts arbitrary byte sequences without complaint. This means you can produce schemas that other implementations (e.g. Rust) will reject when reading.

What changes are included in this PR?

Added a UTF-8 check in KeyValueMetadata.__init__ before pushing keys/values into the C++ layer. Raises ValueError with a clear message if invalid bytes are passed.

Also replaced test_undecodable_metadata - which was testing that repr() survived bytes that shouldn't have been accepted - with a test that checks the correct behaviour.

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes - passing non-UTF-8 bytes as metadata keys or values now raises ValueError instead of silently succeeding.


github-actions bot commented Apr 9, 2026

⚠️ GitHub issue #49058 has been automatically assigned in GitHub to PR creator.

@nitrajen nitrajen marked this pull request as ready for review April 9, 2026 15:26
@nitrajen nitrajen requested review from AlenkaF, raulcd and rok as code owners April 9, 2026 15:26
Schema.fbs defines metadata keys and values as flatbuffer strings,
which are required to be valid UTF-8. PyArrow was silently accepting
arbitrary byte sequences, producing schemas that violate the spec and
break cross-language interoperability (e.g. Rust enforces UTF-8 via
String).

Add a UTF-8 check in KeyValueMetadata.__init__ before handing bytes
to the C++ layer. Only runs when the input is bytes, so existing
TypeError behaviour for invalid types (e.g. integers) is unchanged.
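
The type gating described in this commit message could look roughly like the sketch below. It is pure Python with a hypothetical function `validate_metadata_item` standing in for the per-item handling; the real conversion lives in PyArrow's Cython code and its exact error messages are not reproduced here.

```python
def validate_metadata_item(item):
    """Hypothetical stand-in for per-key/per-value handling of metadata."""
    if isinstance(item, bytes):
        # The UTF-8 check applies only to bytes input.
        try:
            item.decode("utf-8")
        except UnicodeDecodeError as exc:
            raise ValueError(f"Invalid UTF-8 in metadata: {item!r}") from exc
        return item
    if isinstance(item, str):
        # Encoding a str as UTF-8 produces valid UTF-8 bytes, so no check is needed.
        return item.encode("utf-8")
    # Non-bytes, non-str input is not touched by the new check and still
    # fails with TypeError (modelled here; assumed to match prior behaviour).
    raise TypeError(f"Expected bytes or str, got {type(item).__name__}")


for candidate in (b"ok", "ok", b"\x80", 42):
    try:
        print(repr(candidate), "->", validate_metadata_item(candidate))
    except (ValueError, TypeError) as err:
        print(repr(candidate), "->", type(err).__name__)
```

Keeping the check behind `isinstance(item, bytes)` means an integer never reaches `.decode`, so it cannot surface as an unrelated `AttributeError` or a misleading `ValueError`.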
@nitrajen nitrajen force-pushed the GH-49058-non-utf8-metadata branch from 0177a16 to 5ffbda7 Compare April 9, 2026 16:42
nitrajen commented Apr 9, 2026

Could someone trigger CI on this? I squashed the commits and force-pushed to clean up the branch.
