
Add JavaParquetSink execution operator for the Java platform#731

Open

gknz wants to merge 1 commit into apache:main from gknz:feature/java-parquet-sink

Conversation


@gknz gknz commented Mar 23, 2026

Java Parquet Sink

Hey everyone! I built a Java platform execution operator for the existing ParquetSink logical operator, which previously only had a Spark implementation (SparkParquetSink). Now, the optimizer can choose between Java and Spark when writing Parquet files.

What this PR adds

JavaParquetSink (wayang-java/operators/)

  • Writes Wayang Records to Parquet files using the parquet-avro library
  • Follows the same pattern/logic as SparkParquetSink
  • Infers Avro schema automatically by sampling up to 50 records
  • Uses RecordType field names when available, falls back to field0, field1, etc.
  • Uses Snappy compression
  • Handles file overwrite mode
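To illustrate the inference step described above, here is a minimal, self-contained sketch of sampling-based schema inference. The class and method names (`SchemaInferenceSketch`, `avroType`, `inferFields`) are illustrative, not the actual JavaParquetSink internals, and the real operator builds Avro `Schema` objects via parquet-avro rather than type-name strings:

```java
import java.util.ArrayList;
import java.util.List;

public class SchemaInferenceSketch {
    // Map a sampled value to an Avro primitive type name (simplified).
    static String avroType(Object value) {
        if (value instanceof Integer) return "int";
        if (value instanceof Long) return "long";
        if (value instanceof Double) return "double";
        if (value instanceof Boolean) return "boolean";
        return "string"; // fall back to string for anything else
    }

    // Infer one type per column from up to `sampleSize` records, using the
    // given field names when present or generating field0, field1, ...
    static List<String> inferFields(List<Object[]> records, String[] names, int sampleSize) {
        int width = records.get(0).length;
        List<String> fields = new ArrayList<>();
        for (int i = 0; i < width; i++) {
            String type = "string";
            for (int r = 0; r < Math.min(sampleSize, records.size()); r++) {
                Object v = records.get(r)[i];
                if (v != null) { type = avroType(v); break; } // first non-null wins
            }
            String name = (names != null && i < names.length) ? names[i] : "field" + i;
            fields.add(name + ":" + type);
        }
        return fields;
    }

    public static void main(String[] args) {
        List<Object[]> sample = List.of(
                new Object[]{1, "a", 3.5},
                new Object[]{2, "b", 4.5});
        // Without a RecordType, names fall back to field0, field1, field2.
        System.out.println(inferFields(sample, null, 50));
        // -> [field0:int, field1:string, field2:double]
    }
}
```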

ParquetSinkMapping (wayang-java/mapping/)

  • Connects the logical ParquetSink to JavaParquetSink
  • Registered in Java platform Mappings.java

Fluent API (DataQuantaBuilder.scala)

  • Added writeParquet method to DataQuantaBuilder so users can call it from the fluent API (from my understanding, this was missing — only the DataQuanta.scala layer existed)
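For context, the sketch below shows the general fluent-sink pattern the new method follows: transformations return the builder, and the sink call is the terminal action. All names here (`FluentSinkSketch`, the return string) are hypothetical stand-ins, not Wayang's actual DataQuantaBuilder API, where the terminal call would hand the plan to the optimizer to choose between JavaParquetSink and SparkParquetSink:

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class FluentSinkSketch<T> {
    private final List<T> data;

    FluentSinkSketch(List<T> data) { this.data = data; }

    // Transformations return a builder so calls can be chained.
    <R> FluentSinkSketch<R> map(Function<T, R> f) {
        return new FluentSinkSketch<>(data.stream().map(f).collect(Collectors.toList()));
    }

    // Terminal sink: in the real API this would trigger plan execution
    // instead of returning a summary string.
    String writeParquet(String path) {
        return "wrote " + data.size() + " records to " + path;
    }

    public static void main(String[] args) {
        String result = new FluentSinkSketch<>(List.of(1, 2, 3))
                .map(i -> i * 2)
                .writeParquet("/tmp/out.parquet");
        System.out.println(result); // -> wrote 3 records to /tmp/out.parquet
    }
}
```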

Unit Tests (JavaParquetSinkTest.java)

  • testWriteStringRecords: verifies basic write and read-back
  • testWriteMixedTypeRecords: verifies type inference (Int, String, Double)
  • testWriteWithRecordType: verifies column names from RecordType are preserved in the Parquet schema
  • testOverwriteExistingFile: verifies overwrite mode replaces data
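The overwrite semantics checked by the last test can be sketched in isolation with plain `java.nio.file` (no Parquet involved): a second write to the same path must fully replace the first file's contents. The helper name `write` is illustrative, not the operator's actual method:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class OverwriteSketch {
    // Overwrite mode: drop any existing file, then write the new data,
    // so stale contents never survive a rewrite.
    static void write(Path path, String contents) throws IOException {
        Files.deleteIfExists(path);
        Files.writeString(path, contents);
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("sink", ".parquet");
        write(path, "old data");
        write(path, "new data");
        System.out.println(Files.readString(path)); // -> new data
        Files.deleteIfExists(path);
    }
}
```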

Testing

All 4 unit tests pass. I also tested the operator in a separate pipeline (not included here) that writes hourly aggregated data to Parquet and reads it back for further processing; it worked as intended.

- Add JavaParquetSink: writes Records to Parquet using parquet-avro
  with Snappy compression and automatic schema inference
- Add ParquetSinkMapping and register in Java Mappings
- Add writeParquet to DataQuantaBuilder for Java fluent API
- Add JavaParquetSinkTest with 4 unit tests
