Core, Data, Spark: Moving Spark to use the new FormatModel API#15328

Open
pvary wants to merge 3 commits into apache:main from pvary:spark_model
pvary commented Feb 15, 2026

Part of: #12298
Implementation of the new API: #12774

SparkFormatModel and related changes

      return super.newPositionDeleteWriter(file, spec, partition);
    } else {
      LOG.info(
          "Deprecated feature used. Position delete row schema is used to create the position delete writer.");

nit: should we mark this as @deprecated?

singhpk234 commented Feb 16, 2026

Can we run the Spark benchmarks to see how they turn out after these changes? https://github.com/apache/iceberg/tree/main/spark/v4.1/spark/src/jmh/java/org/apache/iceberg/spark

pvary commented Feb 16, 2026

> Can we run the Spark benchmarks to see how they turn out after these changes? https://github.com/apache/iceberg/tree/main/spark/v4.1/spark/src/jmh/java/org/apache/iceberg/spark

Added some new Parquet benchmarks that go through the registry: readUsingRegistryReader and readWithProjectionUsingRegistryReader in both the flat and nested reader benchmarks, and writeUsingRegistryWriter in both the flat and nested writer benchmarks:

Benchmark                                                                          Mode  Cnt  Score   Error  Units
SparkParquetReadersFlatDataBenchmark.readUsingIcebergReader                          ss    5  0.311 ± 0.005   s/op
SparkParquetReadersFlatDataBenchmark.readUsingIcebergReaderUnsafe                    ss    5  0.396 ± 0.018   s/op
SparkParquetReadersFlatDataBenchmark.readUsingRegistryReader                         ss    5  0.326 ± 0.049   s/op
SparkParquetReadersFlatDataBenchmark.readUsingSparkReader                            ss    5  0.408 ± 0.008   s/op
SparkParquetReadersFlatDataBenchmark.readWithProjectionUsingIcebergReader            ss    5  0.185 ± 0.018   s/op
SparkParquetReadersFlatDataBenchmark.readWithProjectionUsingIcebergReaderUnsafe      ss    5  0.363 ± 0.018   s/op
SparkParquetReadersFlatDataBenchmark.readWithProjectionUsingRegistryReader           ss    5  0.213 ± 0.026   s/op
SparkParquetReadersFlatDataBenchmark.readWithProjectionUsingSparkReader              ss    5  0.273 ± 0.019   s/op
SparkParquetReadersNestedDataBenchmark.readUsingIcebergReader                        ss    5  0.184 ± 0.018   s/op
SparkParquetReadersNestedDataBenchmark.readUsingIcebergReaderUnsafe                  ss    5  0.219 ± 0.026   s/op
SparkParquetReadersNestedDataBenchmark.readUsingRegistryReader                       ss    5  0.179 ± 0.035   s/op
SparkParquetReadersNestedDataBenchmark.readUsingSparkReader                          ss    5  0.223 ± 0.015   s/op
SparkParquetReadersNestedDataBenchmark.readWithProjectionUsingIcebergReader          ss    5  0.077 ± 0.010   s/op
SparkParquetReadersNestedDataBenchmark.readWithProjectionUsingIcebergReaderUnsafe    ss    5  0.137 ± 0.007   s/op
SparkParquetReadersNestedDataBenchmark.readWithProjectionUsingRegistryReader         ss    5  0.080 ± 0.006   s/op
SparkParquetReadersNestedDataBenchmark.readWithProjectionUsingSparkReader            ss    5  0.103 ± 0.003   s/op
SparkParquetWritersFlatDataBenchmark.writeUsingIcebergWriter                         ss    5  2.602 ± 0.064   s/op
SparkParquetWritersFlatDataBenchmark.writeUsingRegistryWriter                        ss    5  2.593 ± 0.074   s/op
SparkParquetWritersFlatDataBenchmark.writeUsingSparkWriter                           ss    5  2.594 ± 0.054   s/op
SparkParquetWritersNestedDataBenchmark.writeUsingIcebergWriter                       ss    5  1.559 ± 0.022   s/op
SparkParquetWritersNestedDataBenchmark.writeUsingRegistryWriter                      ss    5  1.569 ± 0.043   s/op
SparkParquetWritersNestedDataBenchmark.writeUsingSparkWriter                         ss    5  1.595 ± 0.046   s/op

The differences are barely noticeable in either direction. No real difference is expected, because the resulting readers and writers use the same code.
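
As a rough sanity check of that reading, the sketch below compares each Iceberg row with its registry counterpart and tests whether the score ± error intervals overlap. The scores and errors are copied from the table above; the class and method names are made up for illustration, and interval overlap is only a heuristic for "no noticeable difference", not a proper statistical test.

```java
// Hypothetical helper, not part of the PR: checks whether the
// [score - error, score + error] intervals of two benchmark rows overlap.
final class BenchmarkOverlap {

  // Each row: {icebergScore, icebergError, registryScore, registryError} in s/op,
  // copied from the JMH output above.
  static final double[][] ROWS = {
    {0.311, 0.005, 0.326, 0.049}, // flat data, read
    {0.185, 0.018, 0.213, 0.026}, // flat data, read with projection
    {0.184, 0.018, 0.179, 0.035}, // nested data, read
    {0.077, 0.010, 0.080, 0.006}, // nested data, read with projection
    {2.602, 0.064, 2.593, 0.074}, // flat data, write
    {1.559, 0.022, 1.569, 0.043}  // nested data, write
  };

  // True if the two closed intervals share at least one point.
  static boolean overlaps(double scoreA, double errorA, double scoreB, double errorB) {
    return scoreA - errorA <= scoreB + errorB && scoreB - errorB <= scoreA + errorA;
  }

  // True if every Iceberg/registry pair in ROWS has overlapping intervals.
  static boolean allOverlap() {
    for (double[] row : ROWS) {
      if (!overlaps(row[0], row[1], row[2], row[3])) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println("all intervals overlap: " + allOverlap());
  }
}
```

For the six pairs listed, every interval overlaps, which is consistent with the observation that the registry path does not change performance in a measurable way.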
