refactor(arrow/avro): migrate from hamba/avro to twmb/avro by prochac · Pull Request #830 · apache/arrow-go

prochac · 2026-05-26T23:09:50Z

Rationale for this change

Closes #813

What changes are included in this PR?

Primarily, the lib was changed. New UUID type support was added. Parsing simplified, that's thanks to the new lib that doesn't use map[string]any wrapper.

Are these changes tested?

All tests pass, new test for the UUID type was added

Are there any user-facing changes?

The ArrowSchemaFromAvro function was deprecated, as it causes coupling with 3rd-party Avro module

Switch the avro reader from hamba/avro to twmb/avro for schema parsing and value decoding. The new library hands back typed Go values (e.g. *big.Rat for decimals, [16]byte for fixed-uuid) instead of []byte, which shortens the data-side helpers and lets us fix a stack of bugs the old []byte arms were masking. The exported arrow/avro surface is preserved: NewOCFReader, OCFReader and its methods, the Option/With… helpers, and ArrowSchemaFromAvro all keep their signatures. A new ArrowSchemaFromAvroJSON is added alongside as the recommended entry point — it doesn't couple callers to a particular Avro library through its type signature. ArrowSchemaFromAvro is now marked Deprecated and serializes the hamba schema via json.Marshal before re-parsing with twmb; the github.com/hamba/avro/v2 dependency is kept so downstream callers continue to compile. Newly supported logical types (the mapping had them commented out under hamba): - local-timestamp-millis - local-timestamp-micros - local-timestamp-nanos - timestamp-nanos Fixes: - fixed(16)+uuid rows are no longer silently dropped — the UUIDBuilder now handles [16]byte directly. - ArrowSchemaFromAvro (deprecated wrapper) preserves logical-type annotations on fixed (uuid/decimal/duration). The wrapper now uses json.Marshal, which dispatches to each schema type's MarshalJSON; hambaAvro.Schema.String() returns Avro Parsing Canonical Form, which strips logical types by spec, and was the wrong serializer to bridge between libraries. - Heterogeneous and non-nullable multi-branch unions (e.g. ["null", A, B], ["int", "string"]) used to silently mis-decode against the wrong branch or fall through to an opaque downstream nil-field error. They now fail upfront with "unsupported avro union at <path>". - Nullable unions written as ["T", "null"] (with null in the second position) now resolve to T as expected; the old code always took Types()[1] and produced a Null-typed column for that ordering. - Map-of-primitive schemas (e.g. {"type":"map","values":"int"}) now parse; the old code asserted the value type to NamedSchema and panicked on primitive values. - Unknown SchemaNode types used to leave a nameless nil-typed field that exploded in the record builder; now they fail with "unhandled avro type %q", surfaced as "invalid avro schema". - Duration's []byte arm read Uint16 with gaps and overflowed uint32 on the milli multiply — gone; only the avro.Duration arm remains. - append*Data helpers no longer silently no-op or fmt.Sprint-coerce unknown inputs; they return "unsupported value of type %T". - Enum-symbol mismatches in appendBinaryDictData surface a clear error instead of a generic dictionary builder failure. - Schema-conversion errors wrap with %w so callers can errors.Unwrap past arrow.ErrInvalid. OCF wire format is unchanged: twmb/avro/ocf supports the same codecs (null, deflate, snappy, zstd) and files written by hamba remain readable.

zeroshade

Can you rebase and resolve the conflicts please?

zeroshade · 2026-05-27T17:11:25Z

+	node.namedCache[s.Name] = s
+	if s.Namespace != "" {
+		node.namedCache[s.Namespace+"."+s.Name] = s
+	}


It's worth noting that the hamba SchemaCache keyed on full name only, so for schemas with two records named Foo in different namespaces a bare {"type": "Foo"} will resolve to whichever was defined latest as opposed to being an error (the previous behavior under hamba).

We should either restrict to full-name lookups or document the change in behavior here that where a malformed schema may now silently produce something where hamba would have errored.

zeroshade · 2026-05-27T17:13:54Z

-	case []byte:
-		buf := bytes.NewBuffer(dt)
-		if len(dt) <= 38 {
-			var intData int64
-			err := binary.Read(buf, binary.BigEndian, &intData)
-			if err != nil {
-				return err
-			}
-			b.Append(decimal128.FromI64(intData))
-		} else {
-			var bigIntData big.Int
-			b.Append(decimal128.FromBigInt(bigIntData.SetBytes(buf.Bytes())))
-		}
-	case map[string]any:


If anyone was using custom-decoders that emitted []byte or map[string]any etc. they are now going to see "unsupported value of type X" errors where old code would have accepted the input. Let's update the description on the PR (and therefore the resulting changelog later) to note this change in behavior for anyone who was using custom-decoders.

zeroshade · 2026-05-27T17:15:12Z

+		case 36:
+			return b.AppendValueFromString(string(dt))


let's add a comment here explaining why we have a case for 36 (i guess, hex-dash UUID string?)

zeroshade · 2026-05-27T17:16:37Z

+	for i := range s.Branches {
+		b := s.Branches[i]


Suggested change

for i := range s.Branches {

b := s.Branches[i]

for _, b := range s.Branches {

zeroshade · 2026-05-27T17:18:06Z

+	root := schema.Root()
 	n := newSchemaNode()
-	n.schema = schema
-	c := n.newChild(n.schema.(avro.NamedSchema).Name(), n.schema)
+	n.node = root
+	c := n.newChild(root.Name, root)


OCF requires a named record at the top, we should probably add a guard or explicit "must be named record" check here (the hamba code would have panic'd).

prochac requested a review from zeroshade as a code owner May 26, 2026 23:09

zeroshade requested changes May 27, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(arrow/avro): migrate from hamba/avro to twmb/avro#830

refactor(arrow/avro): migrate from hamba/avro to twmb/avro#830
prochac wants to merge 1 commit into
apache:mainfrom
prochac:refactor/avro-lib-migration

prochac commented May 26, 2026

Uh oh!

zeroshade left a comment

Uh oh!

zeroshade May 27, 2026

Uh oh!

zeroshade May 27, 2026

Uh oh!

zeroshade May 27, 2026

Uh oh!

zeroshade May 27, 2026

Uh oh!

zeroshade May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	for i := range s.Branches {
	b := s.Branches[i]
	for _, b := range s.Branches {

Conversation

prochac commented May 26, 2026

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

zeroshade left a comment

Choose a reason for hiding this comment

Uh oh!

zeroshade May 27, 2026

Choose a reason for hiding this comment

Uh oh!

zeroshade May 27, 2026

Choose a reason for hiding this comment

Uh oh!

zeroshade May 27, 2026

Choose a reason for hiding this comment

Uh oh!

zeroshade May 27, 2026

Choose a reason for hiding this comment

Uh oh!

zeroshade May 27, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants