fix: Use FileSystem on getPaths() instead of mapreduce.Job #19418
Open
JWuCines wants to merge 2 commits into apache:master
FrankChen021 (Member) reviewed on May 6, 2026 and left a comment:
| Severity | Findings |
|---|---|
| P0 | 0 |
| P1 | 1 |
| P2 | 1 |
| P3 | 0 |
| Total | 2 |
This is an automated review by Codex GPT-5
Fixes #19411.
Description
When using `index_parallel` with `HdfsInputSource` on a Kerberized HDFS cluster where the NameNode has KMS configured, the ingestion task unnecessarily attempts to acquire a KMS delegation token. This happens because `HdfsInputSource.getPaths()` uses `FileInputFormat.getSplits()` for path/glob expansion, which internally calls `TokenCache.obtainTokensForNamenodes()`, cascading into `KMSClientProvider.getDelegationToken()`. Druid's native ingestion authenticates directly via Kerberos TGT and never needs these delegation tokens.

Replaced FileInputFormat with direct FileSystem.globStatus() calls
Replaced the `FileInputFormat`/`Job`-based path expansion in `HdfsInputSource.getPaths()` with direct `FileSystem.globStatus()` calls. This achieves the same HDFS glob expansion without entering the MapReduce `TokenCache` code path, eliminating the unnecessary KMS contact.

The inner `HdfsFileInputFormat` helper class and all `org.apache.hadoop.mapreduce` imports have been removed. No other file in the `druid-hdfs-storage` module references the MapReduce API.

Preserved FileInputFormat filtering semantics
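A minimal sketch of two of the rules in this section, written against plain JDK types so that it runs without Hadoop on the classpath. The class `PathRules` and both method names are hypothetical and are not code from this PR. The name check models the `_`/`.` prefix rule; the splitter is a simplified model of escape-aware comma splitting that assumes backslash is the escape character and leaves the escape in the output. It is not a reimplementation of `org.apache.hadoop.util.StringUtils.split()`.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that models the rules in plain Java, not Hadoop's classes. */
final class PathRules {
  private PathRules() {}

  /** True when the last path component starts with '_' or '.'. */
  static boolean isHidden(String path) {
    String name = path.substring(path.lastIndexOf('/') + 1);
    return name.startsWith("_") || name.startsWith(".");
  }

  /**
   * Splits on commas that are not preceded by a backslash. Assumption: the
   * backslash is kept in the output and trailing empty entries are dropped.
   */
  static List<String> splitPaths(String commaSeparated) {
    List<String> parts = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    for (int i = 0; i < commaSeparated.length(); i++) {
      char c = commaSeparated.charAt(i);
      if (c == '\\' && i + 1 < commaSeparated.length()) {
        // Copy the escape and the escaped character without splitting.
        current.append(c).append(commaSeparated.charAt(++i));
      } else if (c == ',') {
        parts.add(current.toString());
        current.setLength(0);
      } else {
        current.append(c);
      }
    }
    parts.add(current.toString());
    // Drop trailing empty entries such as the one produced by "a,b,".
    while (!parts.isEmpty() && parts.get(parts.size() - 1).isEmpty()) {
      parts.remove(parts.size() - 1);
    }
    return parts;
  }
}
```

Under these assumptions, `PathRules.splitPaths("/a/*.json,/b/*.json")` yields two entries, and `PathRules.isHidden("/data/_SUCCESS")` is true.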
- Files whose names start with `_` or `.` are excluded, matching Hadoop's `FileInputFormat.hiddenFileFilter`.
- Directories are listed non-recursively, matching `FileInputFormat`'s default behavior when `mapreduce.input.fileinputformat.input.dir.recursive` is not set.
- Comma-separated path strings are split with `org.apache.hadoop.util.StringUtils.split()`, preserving the same comma-separation and escape behavior as `FileInputFormat.addInputPaths()`.

Updated documentation
Updated the `paths` property description in `docs/ingestion/input-sources.md` to document the non-recursive directory traversal behavior, hidden file filtering, and the use of glob patterns (e.g., `**/*.json`) for ingesting files from nested directories.

Added unit tests for getPaths() edge cases
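Several of the behaviors that the tests in this section check can be modelled on a local directory. The sketch below is an analogue built on `java.nio.file`; it is not the PR's code and does not call Hadoop's `FileSystem` API. `LocalPathLister` and `listFiles` are hypothetical names. It lists a directory one level deep and skips subdirectories, names starting with `_` or `.`, and zero-length files.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical local-filesystem model of the listing rules; not Druid or Hadoop code. */
final class LocalPathLister {
  private LocalPathLister() {}

  /** Returns the sorted names of visible, non-empty regular files directly under dir. */
  static List<String> listFiles(Path dir) throws IOException {
    List<String> names = new ArrayList<>();
    try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
      for (Path entry : entries) {
        String name = entry.getFileName().toString();
        if (name.startsWith("_") || name.startsWith(".")) {
          continue; // hidden by the prefix rule
        }
        if (!Files.isRegularFile(entry)) {
          continue; // subdirectories are not descended into
        }
        if (Files.size(entry) == 0) {
          continue; // zero-length files are excluded
        }
        names.add(name);
      }
    }
    Collections.sort(names);
    return names;
  }
}
```

Given a directory holding two non-empty JSON files, an empty file, a `_SUCCESS` marker, and a subdirectory with its own file, this model returns only the two non-empty JSON file names.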
Added a new `GetPathsTest` inner class to `HdfsInputSourceTest` with seven tests:
- `testGetPathsWithGlobMatchingNoFiles`: a glob matching no files returns an empty collection
- `testGetPathsFiltersZeroLengthFiles`: zero-length files are excluded, non-empty files are included
- `testGetPathsWithMultipleInputPaths`: multiple distinct glob patterns are resolved correctly
- `testGetPathsWithCommaSeparatedString`: comma-separated path strings are split and resolved
- `testGetPathsFiltersHiddenFiles`: files starting with `_` or `.` are excluded from glob results
- `testGetPathsDirectoryListsFilesNonRecursively`: subdirectories and hidden files within a directory are skipped
- `testGetPathsSkipsHiddenDirectories`: hidden directories matched by a glob are not descended into

Release note
Fixed an issue where `HdfsInputSource` with `index_parallel` unnecessarily contacted KMS when using Kerberized HDFS, causing task failures if KMS was unreachable. The fix replaces the internal use of Hadoop MapReduce `FileInputFormat` for path expansion with direct `FileSystem.globStatus()` calls, while preserving hidden file filtering and non-recursive directory listing semantics.

Key changed/added classes in this PR
- `HdfsInputSource`
- `HdfsInputSourceTest`
- `docs/ingestion/input-sources.md`

This PR has: