Skip to content

Comments

perf: Add ReflectionCache for Iceberg serialization optimization [iceberg]#3558

Open
Shekharrajak wants to merge 7 commits intoapache:mainfrom
Shekharrajak:feature/iceberg-serialization-optimizations-3456
Open

perf: Add ReflectionCache for Iceberg serialization optimization [iceberg]#3558
Shekharrajak wants to merge 7 commits intoapache:mainfrom
Shekharrajak:feature/iceberg-serialization-optimizations-3456

Conversation

@Shekharrajak
Copy link
Contributor

Which issue does this PR close?

Closes #3456.

Rationale for this change

PR #3298 added reflection caching optimizations for Iceberg serialization, but these were lost during subsequent refactoring in #3349 and #3443. The current code performs redundant Class.forName() and getMethod() calls for every task (tens of thousands of times for large tables), causing significant serialization overhead.

What changes are included in this PR?

  • Add ReflectionCache case class
  • Update serializePartitions() to create cache once and pass to helper methods
  • Update extractDeleteFilesList() and serializePartitionData() to use cached methods
  • Add field ID mapping cache to avoid redundant buildFieldIdMapping() calls per-task
  • Add CometIcebergSerializationBenchmark to measure serialization performance

How are these changes tested?

  • Existing Iceberg integration tests ensure correctness is preserved

Benchmark:

Metric Before After Improvement
serializePartitions() 7,235 ms 5,211 ms 28% faster
Class.forName() 233.5 ns ~0 ns cached
getMethod() 18.2 ns ~0 ns cached

@Shekharrajak Shekharrajak changed the title Add ReflectionCache for Iceberg serialization optimization (#3456) Add ReflectionCache for Iceberg serialization optimization Feb 20, 2026
@Shekharrajak Shekharrajak changed the title Add ReflectionCache for Iceberg serialization optimization perf: Add ReflectionCache for Iceberg serialization optimization Feb 20, 2026
@mbutrovich mbutrovich changed the title perf: Add ReflectionCache for Iceberg serialization optimization perf: Add ReflectionCache for Iceberg serialization optimization [iceberg] Feb 20, 2026
@mbutrovich mbutrovich self-requested a review February 20, 2026 20:13
Copy link
Contributor

@mbutrovich mbutrovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Shekharrajak! First round of feedback. I need to dig into failure handling on this PR next week. Whenever you push your next commit we'll get coverage on the Iceberg tests.

* Cache for Iceberg reflection metadata to avoid repeated class loading and method lookups.
*
* This cache is created once per serializePartitions() call and passed to helper methods. It
* provides ~50% serialization speedup by eliminating redundant reflection operations that would
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cache at the right scope - Create cache once per serializePartitions() call, not per-task and Pass cache to helpers: Don't let helper methods do their own class loading

@Shekharrajak Shekharrajak force-pushed the feature/iceberg-serialization-optimizations-3456 branch from 4ad1055 to 3697a89 Compare February 21, 2026 05:14
@Shekharrajak Shekharrajak force-pushed the feature/iceberg-serialization-optimizations-3456 branch from 3697a89 to 2840052 Compare February 21, 2026 05:16
@Shekharrajak
Copy link
Contributor Author

Github CI Checks are looking fine.

@coderfender
Copy link
Contributor

coderfender commented Feb 22, 2026

I think we would be benefit from some unit tests for the new cache case classes. Also we could only init this lazily? Not sure if I am missing something but 'structLikeClass' seems to be not used anywhere after init in the cache constructor?

@Shekharrajak
Copy link
Contributor Author

I think we would be benefit from some unit tests for the new cache case classe

Added

@Shekharrajak
Copy link
Contributor Author

'structLikeClass' seems to be not used anywhere

removed .

Also we could only init this lazily?

For rarely-used fields (but all current fields are used per-task, so eager init is fine)

specToJsonMethod: Method,
deleteContentMethod: Method,
deleteSpecIdMethod: Method,
deleteEqualityIdsMethod: Method)
Copy link
Contributor

@coderfender coderfender Feb 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if this cache is truly helpful or perhaps if the reward is worth the complexity being introduced. Here are my concerns and I would love to know your opinion

  1. We have ~ 20 + fields making it difficult to maintain and manage
  2. We are sometimes caching methods vs classes . Can we be consistent and perhaps logically bucket this ?
  3. What happens if a Class or method not is not found ? Would we rather throw an error or fail silently ? What do you think is the path of least resistance here ?
  4. Why did we extends Logging for the companion object ?
  5. Why factory instead of singleton approach ? Perhaps a single instance is useful across ?
  6. Can we guarantee that this is thread safe and only populated when iceberg is involved
  7. Are we catering / handling all supported iceberg versions and their signatures?
  8. nit : might also want to rename class to indicate that it is in the Iceberg subsystem


val cache = ReflectionCache.create()

for (_ <- 1 to 10) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what is the purpose of this loop or this test here ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was to check everytime we should be able to access the class/method .

}
}

test("ReflectionCache schemaToJsonMethod is accessible") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be an overkill to have this in a single test ?

val cache1 = ReflectionCache.create()
val cache2 = ReflectionCache.create()

assert(cache1.contentScanTaskClass != null)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if the tests are truly checking if the caches are independent ? The should ultimately be holding ref to the same iceberg methods / instances right ?

} catch {
case _: ClassNotFoundException => false
}
}
Copy link
Contributor

@coderfender coderfender Feb 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great to see tests across various iceberg versions , missing / partial constructors and asserting expected behavior. Also we might want to add further tests to see if the methods returned are executable and they aren't corrupted while building the cache

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

perf: Optimize CometIcebergNativeScan serialization

3 participants