Skip to content

Call align_buffers() in from_ffi, remove redundant call from arrow-pyarrow#10030

Open
mbutrovich wants to merge 3 commits into
apache:mainfrom
mbutrovich:fix_10028
Open

Call align_buffers() in from_ffi, remove redundant call from arrow-pyarrow#10030
mbutrovich wants to merge 3 commits into
apache:mainfrom
mbutrovich:fix_10028

Conversation

@mbutrovich
Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Rationale for this change

from_ffi / from_ffi_and_data_type (and therefore ArrowArrayStreamReader) panic inside ScalarBuffer::<i128>::from when an FFI producer hands over a Decimal128 buffer that is 8-byte aligned but not 16-byte aligned. The producer is spec-conformant — the C Data Interface only recommends 8-byte alignment — but align_of::<i128>() == 16 since Rust 1.77 on x86 (always on ARM), so arrow-rs's typed arrays require 16. JVM producers like arrow-java's NettyAllocationManager hit this regularly.

The IPC reader already handles this by calling ArrayData::align_buffers() on import (default of IpcReadOptions::require_alignment, see #5554), and arrow-pyarrow was patched the same way for #6471 / apache/arrow#43552. The C Data Interface entry points were the missing piece.

What changes are included in this PR?

  • arrow::ffi::from_ffi and from_ffi_and_data_type: call data.align_buffers() after consume(). No-op when buffers are already aligned; depends on Make ArrayData.align_buffers align child data buffers recursively #6462 making align_buffers recursive over child data.
  • arrow-pyarrow: drop the now-redundant array_data.align_buffers() call; it's covered by from_ffi.

Are these changes tested?

Yes. New regression test test_decimal128_under_aligned_round_trip in arrow-array/src/ffi.rs constructs an 8-aligned-not-16-aligned Decimal128 buffer via Buffer::from_vec(...).slice(8), imports through from_ffi, and asserts the resulting Decimal128Array values are correct. The test panics without the fix with the exact error from #10028.

Are there any user-facing changes?

No API changes. Behavior change: from_ffi / from_ffi_and_data_type (and ArrowArrayStreamReader::next) now silently realign under-aligned buffers instead of panicking. Already-aligned producers are unaffected; misaligned producers that previously panicked now succeed with a one-time copy of the offending buffer.

Comment thread arrow-array/src/ffi.rs
Copy link
Copy Markdown
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @mbutrovich

Copy link
Copy Markdown
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A note on force_validate behavior that's worth addressing (either in this PR or as an explicit follow-up):

In force_validate builds, the fix as written doesn't take effect. tmp.consume()? constructs ArrayData via ArrayData::new_unchecked, which delegates to ArrayDataBuilder::build. That builder honors cfg!(feature = "force_validate") at arrow-data/src/data.rs:2167 and runs validate_data() even when skip_validation is set. validate() then rejects misaligned BufferSpec::FixedWidth buffers at arrow-data/src/data.rs:842-867, so an 8-byte-aligned Decimal128 input fails with InvalidArgumentError before the newly added data.align_buffers() can run. The same applies recursively through consume_children / consume_child, since each child also goes through new_unchecked. The new regression test is gated with #[cfg(not(feature = "force_validate"))], so CI under that feature doesn't surface the gap.

To be clear, I don't think validate()'s alignment check is wrong. The 16-byte requirement is part of arrow-rs's own ArrayData invariant (see the doc at data.rs:2184), not a C Data Interface requirement, and force_validate is precisely the knob that enforces those internal invariants. Loosening it would compromise internal consistency. The issue is ordering on the import boundary: from_ffi accepts spec-conformant C Data Interface input, where 8-byte alignment is legal, and must realign before that input becomes an ArrayData subject to arrow-rs invariants. Today the order is "validate, then realign", which treats a protocol-legal input as an invariant violation.

This gap is pre-existing (before this PR, force_validate builds returned an error rather than panicking), so it's reasonable to defer. But two things would be good even if the deeper fix lands later:

  1. Make this an explicit known limitation in the PR description and/or a comment near the new align_buffers() call, so it's clear that from_ffi only realigns under cfg(not(feature = "force_validate")) today. Without that, the inline comment "a no-op when buffers are already aligned" reads as if alignment is handled in all builds.
  2. Consider tightening the doc on ArrayData::align_buffers to note that on import boundaries it must be called before validation, since the current doc only says it's "useful when interacting with buffers produced by other systems" without addressing ordering.

If you want to close the gap in this PR, the direction is to push alignment into ImportedArrowArray::consume (at every level, including the recursive consume_child path) by replacing ArrayData::new_unchecked with something like ArrayDataBuilder::new(...).buffers(...).child_data(...).align_buffers(true).skip_validation(true).build(). ArrayDataBuilder::build runs align_buffers before validate_data (data.rs:2162-2169), so this works correctly under force_validate while preserving the existing intent of skipping FFI validation on the hot path. The outer data.align_buffers() in from_ffi then becomes redundant and should be removed to avoid a duplicate recursive scan, and the new test's cfg(not(feature = "force_validate")) gate should come off so this case is actually covered in that build.

@mbutrovich
Copy link
Copy Markdown
Contributor Author

Thanks @viirya! Filed #10034 as a followup.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Align buffers when importing via from_ffi / ArrowArrayStreamReader

3 participants