GH-49058: [Python] Disallow non-UTF-8 bytes in custom metadata #49696
Open
nitrajen wants to merge 1 commit into apache:main
Schema.fbs defines metadata keys and values as flatbuffer strings, which are required to be valid UTF-8. PyArrow was silently accepting arbitrary byte sequences, producing schemas that violate the spec and break cross-language interoperability (e.g. Rust enforces UTF-8 via String). Add a UTF-8 check in KeyValueMetadata.__init__ before handing bytes to the C++ layer. Only runs when the input is bytes, so existing TypeError behaviour for invalid types (e.g. integers) is unchanged.
Force-pushed from 0177a16 to 5ffbda7
nitrajen (Author):
Could someone trigger CI on this? I squashed the commits and force-pushed to clean up the branch.
Rationale for this change
Schema.fbs requires metadata keys and values to be valid UTF-8 strings, but PyArrow currently accepts arbitrary byte sequences without complaint. This means you can produce schemas that other implementations (e.g. Rust) will reject when reading.

What changes are included in this PR?
Added a UTF-8 check in KeyValueMetadata.__init__ before pushing keys/values into the C++ layer. It raises ValueError with a clear message if invalid bytes are passed.
Also replaced test_undecodable_metadata, which was testing that repr() survived bytes that shouldn't have been accepted, with a test that checks the correct behaviour.

Are these changes tested?
Yes.

Are there any user-facing changes?
Yes. Passing non-UTF-8 bytes as metadata keys or values now raises ValueError instead of silently succeeding.
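For reviewers who want to see the shape of the check without reading the diff, here is a minimal pure-Python sketch of the idea. It is not the code in this PR: the helper names `check_utf8` and `validate_metadata` and the error message wording are hypothetical, and the real check lives inside KeyValueMetadata.__init__. The sketch only validates values that are already bytes, which mirrors the stated intent of leaving the existing TypeError path for other invalid types untouched.

```python
def check_utf8(value, what):
    """Raise ValueError if `value` is a bytes object that is not valid UTF-8.

    Hypothetical helper for illustration; non-bytes inputs are returned
    unchanged so that any later type checking still sees them.
    """
    if isinstance(value, bytes):
        try:
            value.decode("utf-8")
        except UnicodeDecodeError as exc:
            raise ValueError(
                f"metadata {what} must be valid UTF-8, got {value!r}"
            ) from exc
    return value


def validate_metadata(metadata):
    """Check every key and value of a metadata mapping (hypothetical name)."""
    for key, value in metadata.items():
        check_utf8(key, "key")
        check_utf8(value, "value")
    return metadata


if __name__ == "__main__":
    # Valid UTF-8 bytes pass through untouched.
    validate_metadata({b"origin": "caf\u00e9".encode("utf-8")})

    # 0xff can never appear in well-formed UTF-8, so this is rejected.
    try:
        validate_metadata({b"origin": b"\xff\xfe"})
    except ValueError as err:
        print("rejected:", err)
```

Decoding and discarding the result is the simplest way to validate in Python; an implementation that cares about overhead on large metadata could validate in the C++ layer instead, but that is a design choice outside this sketch.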