Add JavaParquetSink execution operator for the Java platform by gknz · Pull Request #731 · apache/wayang

gknz · 2026-03-23T23:43:01Z

Java Parquet Sink

Hey everyone! I built a Java platform execution operator for the existing ParquetSink logical operator, which previously only had a Spark implementation (SparkParquetSink). Now, the optimizer can choose between Java and Spark when writing Parquet files.

What this PR adds

JavaParquetSink (wayang-java/operators/)

Writes Wayang Records to Parquet files using the parquet-avro library
Follows the same pattern/logic as SparkParquetSink
Infers Avro schema automatically by sampling up to 50 records
Uses RecordType field names when available, falls back to field0, field1, etc.
Uses Snappy compression
Handles file overwrite mode

ParquetSinkMapping (wayang-java/mapping/)

Connects the logical ParquetSink to JavaParquetSink
Registered in Java platform Mappings.java

Fluent API (DataQuantaBuilder.scala)

Added writeParquet method to DataQuantaBuilder so users can call it from the fluent API (from my understanding, this was missing — only the DataQuanta.scala layer existed)

Unit Tests (JavaParquetSinkTest.java)

testWriteStringRecords: verifies basic write and read-back
testWriteMixedTypeRecords: verifies type inference (Int, String, Double)
testWriteWithRecordType: verifies column names from RecordType are preserved in the Parquet schema
testOverwriteExistingFile: verifies overwrite mode replaces data

Testing

All 4 unit tests pass. I also tested the operator in a different pipeline (not included here) that writes hourly aggregated data to Parquet and reads it back for further processing, and it was working as intended,

- Add JavaParquetSink: writes Records to Parquet using parquet-avro with Snappy compression and automatic schema inference - Add ParquetSinkMapping and register in Java Mappings - Add writeParquet to DataQuantaBuilder for Java fluent API - Add JavaParquetSinkTest with 4 unit tests

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add JavaParquetSink execution operator for the Java platform#731

Add JavaParquetSink execution operator for the Java platform#731
gknz wants to merge 1 commit intoapache:mainfrom
gknz:feature/java-parquet-sink

gknz commented Mar 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

gknz commented Mar 23, 2026

Java Parquet Sink

What this PR adds

Testing

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant