HIVE-29451: Optimize MapWork to configure JobConf once per table #6317



What changes were proposed in this pull request?
This PR optimizes the configureJobConf method in MapWork.java to eliminate redundant job configuration calls during the map phase initialization.
Modified File: ql/src/java/org/apache/hadoop/hive/ql/plan/MapWork.java
Logic Change: Introduced a Set within the partition iteration loop to track which TableDesc objects have already been configured (see the sketch after this list).
Mechanism: The code now checks if a TableDesc has already been processed before invoking PlanUtils.configureJobConf(tableDesc, job).
Result: The configuration logic, which includes expensive operations like loading StorageHandlers via reflection, is now executed only once per unique table, rather than once per partition.
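For reference, a minimal sketch of the deduplication pattern inside MapWork.configureJobConf. The variable names and exact loop shape are illustrative rather than the literal diff; it assumes the loop iterates aliasToPartnInfo.values() as described above and that TableDesc can serve as a HashSet element.

```java
// Illustrative fragment of MapWork.configureJobConf(JobConf job), not the literal patch.
// Assumes TableDesc provides equals/hashCode suitable for use in a HashSet.
Set<TableDesc> configuredTables = new HashSet<>();
for (PartitionDesc partDesc : aliasToPartnInfo.values()) {
  TableDesc tableDesc = partDesc.getTableDesc();
  // HashSet.add returns false for an already-seen TableDesc, so the expensive
  // PlanUtils.configureJobConf call runs only once per unique table.
  if (tableDesc != null && configuredTables.add(tableDesc)) {
    PlanUtils.configureJobConf(tableDesc, job);
  }
}
```

Keying the set on the TableDesc itself (rather than, say, a table name string) keeps the change local to this method and preserves behavior for partitions that genuinely carry different descriptors.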
Why are the changes needed?
Performance Bottleneck in Job Initialization: Currently, the MapWork.configureJobConf method iterates over aliasToPartnInfo.values(), which contains an entry for every single partition participating in the scan. Inside this loop, it calls PlanUtils.configureJobConf for every partition.
The Issue:
Redundancy: If a query reads 10,000 partitions from the same table, PlanUtils.configureJobConf is called 10,000 times with the exact same TableDesc.
Expensive Operations: PlanUtils.configureJobConf invokes HiveUtils.getStorageHandler, which uses Java reflection (Class.forName) to load the storage handler class. Repeatedly performing reflection and credential handling for thousands of partitions that share the same TableDesc adds significant, avoidable overhead to the job setup phase (the pre-patch shape is sketched below).
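For contrast, the pre-patch loop looks roughly like the following. This is a simplified illustration based on the description above, not the exact existing code.

```java
// Pre-patch shape (simplified): the configuration call runs once per partition,
// even when thousands of partitions share the same TableDesc. Each call may
// resolve the table's storage handler via reflection (HiveUtils.getStorageHandler).
for (PartitionDesc partDesc : aliasToPartnInfo.values()) {
  PlanUtils.configureJobConf(partDesc.getTableDesc(), job);
}
```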
Impact of Fix:
Complexity Reduction: Reduces the configuration complexity from O(N) (where N is the number of partitions) to O(T) (where T is the number of unique tables).
Scalability: Significantly improves startup time for jobs scanning large numbers of partitions.
Safety: The worst-case scenario (single-partition reads) incurs only the negligible cost of a HashSet instantiation and a single add operation, preserving existing performance for small jobs.
Does this PR introduce any user-facing change?
No. This is an internal optimization to the MapWork plan generation phase. While users may experience faster job startup times for queries involving large numbers of partitions, there are no changes to the user interface, SQL syntax, or configuration properties.
How was this patch tested?
The patch was verified using local unit tests in the ql (Query Language) module to ensure no regressions were introduced by the optimization.
Build Verification: Ran a clean install on the ql module to ensure compilation and dependency integrity.
mvn clean install -pl ql -am -DskipTests
Unit Testing: Executed relevant tests in the ql module, specifically targeting the planning logic components to verify that MapWork configuration remains correct.
mvn test -pl ql -Dtest=TestMapWork
mvn test -pl ql -Dtest="org.apache.hadoop.hive.ql.plan.*"
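A targeted regression test along the following lines could also exercise the deduplicated path directly. This is a hypothetical sketch, not a test included in the patch; the use of Utilities.defaultTd and the PartitionDesc constructor are assumptions about the ql test utilities and may need adjustment.

```java
import java.util.LinkedHashMap;

import org.apache.hadoop.hive.ql.exec.Utilities;
import org.apache.hadoop.hive.ql.plan.MapWork;
import org.apache.hadoop.hive.ql.plan.PartitionDesc;
import org.apache.hadoop.hive.ql.plan.TableDesc;
import org.apache.hadoop.mapred.JobConf;
import org.junit.Test;

public class TestMapWorkConfigureJobConf {

  @Test
  public void testConfigureJobConfWithSharedTableDesc() {
    // Two partitions that share a single TableDesc; after the change,
    // PlanUtils.configureJobConf should effectively run once for this table.
    TableDesc sharedTable = Utilities.defaultTd;
    MapWork mapWork = new MapWork();
    mapWork.getAliasToPartnInfo().put("part1",
        new PartitionDesc(sharedTable, new LinkedHashMap<String, String>()));
    mapWork.getAliasToPartnInfo().put("part2",
        new PartitionDesc(sharedTable, new LinkedHashMap<String, String>()));

    // The call must still succeed and configure the JobConf exactly as a
    // single-table scan would; this guards against regressions in the loop.
    JobConf job = new JobConf();
    mapWork.configureJobConf(job);
  }
}
```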