perf: Add ReflectionCache for Iceberg serialization optimization [iceberg] by Shekharrajak · Pull Request #3558 · apache/datafusion-comet

Shekharrajak · 2026-02-20T18:17:48Z

Which issue does this PR close?

Closes #3456.

Rationale for this change

PR #3298 added reflection caching optimizations for Iceberg serialization, but these were lost during subsequent refactoring in #3349 and #3443. The current code performs redundant Class.forName() and getMethod() calls for every task (tens of thousands of times for large tables), causing significant serialization overhead.

What changes are included in this PR?

Add ReflectionCache case class
Update serializePartitions() to create cache once and pass to helper methods
Update extractDeleteFilesList() and serializePartitionData() to use cached methods
Add field ID mapping cache to avoid redundant buildFieldIdMapping() calls per-task
Add CometIcebergSerializationBenchmark to measure serialization performance
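
A minimal sketch of the pattern the list above describes, not the PR's actual code. `SketchReflectionCache`, `SketchSerializer`, and `serializeAll` are hypothetical names, and `java.util.ArrayList` stands in for the Iceberg classes so the example runs without Iceberg on the classpath:

```scala
import java.lang.reflect.Method

// Hypothetical stand-in for the PR's cache: one resolved class and one resolved method.
final case class SketchReflectionCache(listClass: Class[_], sizeMethod: Method)

object SketchReflectionCache {
  // Resolve everything once. A JDK class stands in for the Iceberg classes here.
  def create(): SketchReflectionCache = {
    val cls = Class.forName("java.util.ArrayList")
    SketchReflectionCache(cls, cls.getMethod("size"))
  }
}

object SketchSerializer {
  // The helper receives the cache instead of doing its own class loading.
  private def sizeOf(task: AnyRef, cache: SketchReflectionCache): Int =
    cache.sizeMethod.invoke(task).asInstanceOf[Int]

  // The cache is created once per call and reused for every task.
  def serializeAll(tasks: Seq[AnyRef]): Seq[Int] = {
    val cache = SketchReflectionCache.create()
    tasks.map(sizeOf(_, cache))
  }
}
```

The point of the shape is that `Class.forName()` and `getMethod()` run once per `serializeAll` call, however many tasks are passed in.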

How are these changes tested?

Existing Iceberg integration tests ensure correctness is preserved

Benchmark:

Metric	Before	After	Improvement
serializePartitions()	7,235 ms	5,211 ms	28% faster
Class.forName()	233.5 ns	~0 ns	cached
getMethod()	18.2 ns	~0 ns	cached
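
For reference, the arithmetic behind the first row, assuming "28% faster" means a reduction in elapsed time (the object name is illustrative only):

```scala
object BenchmarkArithmetic {
  val beforeMs = 7235.0
  val afterMs = 5211.0

  // Fraction of elapsed time removed: (7235 - 5211) / 7235, roughly 0.28.
  val timeReduction: Double = (beforeMs - afterMs) / beforeMs

  // The same result expressed as a throughput ratio: 7235 / 5211, roughly 1.39x.
  val speedup: Double = beforeMs / afterMs
}
```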

mbutrovich

Thanks @Shekharrajak! First round of feedback. I need to dig into failure handling on this PR next week. Whenever you push your next commit we'll get coverage on the Iceberg tests.

spark/src/main/scala/org/apache/comet/serde/operator/CometIcebergNativeScan.scala

spark/src/test/scala/org/apache/spark/sql/benchmark/CometIcebergSerializationBenchmark.scala

spark/src/main/scala/org/apache/comet/iceberg/IcebergReflection.scala

Shekharrajak · 2026-02-21T04:03:00Z

spark/src/main/scala/org/apache/comet/iceberg/IcebergReflection.scala

+ * Cache for Iceberg reflection metadata to avoid repeated class loading and method lookups.
+ *
+ * This cache is created once per serializePartitions() call and passed to helper methods. It
+ * provides ~50% serialization speedup by eliminating redundant reflection operations that would


Cache at the right scope: create the cache once per serializePartitions() call, not per task. Pass the cache to the helpers so that helper methods don't do their own class loading.

Shekharrajak · 2026-02-21T17:49:41Z

GitHub CI checks look fine.

coderfender · 2026-02-22T05:48:53Z

I think we would benefit from some unit tests for the new cache case classes. Could we also initialize this lazily? I may be missing something, but 'structLikeClass' does not seem to be used anywhere after it is initialized in the cache constructor.

Shekharrajak · 2026-02-23T03:39:04Z

I think we would benefit from some unit tests for the new cache case classes

Added.

Shekharrajak · 2026-02-23T03:48:18Z

'structLikeClass' seems to be not used anywhere

Removed.

Also we could only init this lazily?

Lazy init would make sense for rarely used fields, but all current fields are used per task, so eager init is fine.
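
To make the eager versus lazy trade-off concrete, here is a small sketch. `LazySketchCache` and its `lookups` counter are hypothetical, and JDK methods stand in for the Iceberg ones:

```scala
import java.lang.reflect.Method

// Hypothetical cache with one eager field and one lazy field.
final class LazySketchCache {
  // Counts how many reflective lookups have actually run (illustration only).
  var lookups: Int = 0

  private def lookup(className: String, methodName: String): Method = {
    lookups += 1
    Class.forName(className).getMethod(methodName)
  }

  // Eager: resolved when the cache is constructed.
  val sizeMethod: Method = lookup("java.util.ArrayList", "size")

  // Lazy: resolved on first access, then memoized by Scala.
  lazy val isEmptyMethod: Method = lookup("java.util.ArrayList", "isEmpty")
}
```

If every field is read for every task, as stated above, the lazy variant only adds an initialization check per access and saves nothing.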

coderfender · 2026-02-23T21:38:05Z

spark/src/main/scala/org/apache/comet/iceberg/IcebergReflection.scala

+    specToJsonMethod: Method,
+    deleteContentMethod: Method,
+    deleteSpecIdMethod: Method,
+    deleteEqualityIdsMethod: Method)


I am not sure this cache is truly helpful, or whether the reward is worth the complexity it introduces. Here are my concerns; I would love to hear your opinion.

We have 20+ fields, which makes the class difficult to maintain and manage.

We sometimes cache methods and sometimes classes. Can we be consistent, and perhaps bucket these logically?

What happens if a class or method is not found? Should we throw an error or fail silently? What do you think is the path of least resistance here?

Why does the companion object extend Logging?

Why a factory instead of a singleton? Perhaps a single shared instance would be useful across calls?

Can we guarantee that this is thread-safe and only populated when Iceberg is involved?

Are we handling all supported Iceberg versions and their signatures?

nit: we might also want to rename the class to indicate that it belongs to the Iceberg subsystem.
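
Two of the questions above (failure handling, and factory versus singleton) can be sketched as follows. This is one possible shape under stated assumptions, not what the PR implements: `GuardedCache`, `create`, and `shared` are hypothetical names, and a JDK class stands in for Iceberg.

```scala
import java.lang.reflect.Method
import scala.util.{Failure, Success, Try}

final case class GuardedCache(sizeMethod: Method)

object GuardedCache {
  // Explicit-failure variant: a missing class or method becomes a Left with context
  // instead of an exception thrown mid-serialization or a silently null field.
  def create(className: String, methodName: String): Either[String, GuardedCache] =
    Try(Class.forName(className).getMethod(methodName)) match {
      case Success(m) => Right(GuardedCache(m))
      case Failure(e: ClassNotFoundException) => Left(s"class not found: ${e.getMessage}")
      case Failure(e: NoSuchMethodException) => Left(s"method not found: ${e.getMessage}")
      case Failure(e) => Left(s"reflection failed: $e")
    }

  // Singleton variant: a lazy val in an object is initialized at most once, on first
  // access, and Scala synchronizes that initialization across threads.
  lazy val shared: Either[String, GuardedCache] = create("java.util.ArrayList", "size")
}
```

Because `shared` is lazy, nothing is resolved unless some code path touches it, which is one way to keep the cache unpopulated when Iceberg is not involved.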

coderfender · 2026-02-23T21:40:09Z

spark/src/test/scala/org/apache/comet/iceberg/ReflectionCacheSuite.scala

+
+    val cache = ReflectionCache.create()
+
+    for (_ <- 1 to 10) {


I am not sure what the purpose of this loop, or of this test, is.

It was to check that the class/method is accessible on every access.

coderfender · 2026-02-23T21:41:00Z

spark/src/test/scala/org/apache/comet/iceberg/ReflectionCacheSuite.scala

+    }
+  }
+
+  test("ReflectionCache schemaToJsonMethod is accessible") {


Might be overkill to have this in a single test?

coderfender · 2026-02-23T21:44:40Z

spark/src/test/scala/org/apache/comet/iceberg/IcebergReflectionCacheSuite.scala

+    val cache1 = ReflectionCache.create()
+    val cache2 = ReflectionCache.create()
+
+    assert(cache1.contentScanTaskClass != null)


Not sure the tests truly check that the caches are independent. They should ultimately hold references to the same Iceberg methods / instances, right?
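
A quick way to see what two separately built caches would share, using a JDK class as a stand-in for the Iceberg ones (`CacheIdentitySketch` is a hypothetical name):

```scala
import java.lang.reflect.Method

object CacheIdentitySketch {
  val c1: Class[_] = Class.forName("java.util.ArrayList")
  val c2: Class[_] = Class.forName("java.util.ArrayList")
  val m1: Method = c1.getMethod("size")
  val m2: Method = c2.getMethod("size")

  // The same class loader yields the same Class object for a given name.
  val sameClassObject: Boolean = c1 eq c2

  // Method.equals compares declaring class, name, return type, and parameter types.
  val equalMethods: Boolean = m1 == m2

  // Reference identity of the Method objects themselves. OpenJDK's getMethod is
  // understood to return a fresh copy on each call, so this is typically false.
  val sameMethodObject: Boolean = m1 eq m2
}
```

So asserting non-null fields on `cache1` and `cache2` says little about independence: the classes are shared, and the methods are equal even when the `Method` objects differ.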

coderfender · 2026-02-23T21:47:21Z

spark/src/test/scala/org/apache/comet/iceberg/IcebergReflectionCacheSuite.scala

+    } catch {
+      case _: ClassNotFoundException => false
+    }
+  }


It would be great to see tests across various Iceberg versions and with missing or partial constructors, asserting the expected behavior. We might also want further tests that check the returned methods are executable and are not corrupted while the cache is being built.
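
A sketch of the kind of check suggested here, assuming JDK classes in place of Iceberg ones. `classAvailable` mirrors the catch-and-return-false pattern in the snippet above; `invokeNoArg` is a hypothetical helper that confirms a resolved method can actually be invoked:

```scala
import scala.util.Try

object ExecutableCheckSketch {
  // True if the class can be loaded, false if it is absent from the classpath.
  def classAvailable(name: String): Boolean =
    try {
      Class.forName(name)
      true
    } catch {
      case _: ClassNotFoundException => false
    }

  // Resolves a public no-arg method on the target and invokes it.
  // Returns None if the lookup or the invocation fails.
  def invokeNoArg(target: AnyRef, methodName: String): Option[AnyRef] =
    Try(target.getClass.getMethod(methodName).invoke(target)).toOption
}
```

A test built on this would compare the reflective result against a direct call, for example `invokeNoArg("abc", "length")` against `"abc".length`.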

Shekharrajak changed the title ~~Add ReflectionCache for Iceberg serialization optimization (#3456)~~ Add ReflectionCache for Iceberg serialization optimization Feb 20, 2026

Shekharrajak changed the title ~~Add ReflectionCache for Iceberg serialization optimization~~ perf: Add ReflectionCache for Iceberg serialization optimization Feb 20, 2026

mbutrovich changed the title ~~perf: Add ReflectionCache for Iceberg serialization optimization~~ perf: Add ReflectionCache for Iceberg serialization optimization [iceberg] Feb 20, 2026

mbutrovich self-requested a review February 20, 2026 20:13

mbutrovich requested changes Feb 20, 2026


Shekharrajak commented Feb 21, 2026


Shekharrajak force-pushed the feature/iceberg-serialization-optimizations-3456 branch from 4ad1055 to 3697a89 Compare February 21, 2026 05:14

Add ReflectionCache for Iceberg serialization optimization (apache#3456)

2840052

Shekharrajak force-pushed the feature/iceberg-serialization-optimizations-3456 branch from 3697a89 to 2840052 Compare February 21, 2026 05:16

Shekharrajak added 2 commits February 21, 2026 10:48

Remove verbose comments from ReflectionCache

accda68

Fix benchmark iteration pattern per review

b20f803

Add unit tests for ReflectionCache

ca23272

Remove unused structLikeClass from ReflectionCache

d058526

coderfender reviewed Feb 23, 2026


coderfender suggested changes Feb 23, 2026


Shekharrajak added 2 commits February 24, 2026 14:31

Remove unnecessary loop from ReflectionCacheSuite

19d7d28

Rename ReflectionCache to IcebergReflectionCache and fix test name

3bcdb01
