[#11353] fix(hive): fix three ClassLoader/Kerberos bugs causing HMS connection failure#11355
[#11353] fix(hive): fix three ClassLoader/Kerberos bugs causing HMS connection failure#11355JoegenUSTC wants to merge 7 commits into
Conversation
Bug 1: use stable baseLoader (HiveClientFactory.class.getClassLoader()) instead of TCCL in createHiveClientWithBackend(). TCCL is not stable across threads, causing UGI to be loaded by two different ClassLoaders so the TGT stored after login() is invisible at connection time. Bug 2: in createHiveClientImpl(), load HiveVersion from the isolated ClassLoader before calling getConstructor(). HiveClientImpl is a barrier class redefined via defineClass; its constructor expects the HiveVersion type from the isolated CL scope. Passing the system CL's HiveVersion causes NoSuchMethodException. Bug 3: in createHiveClientInternal(), the non-impersonation Kerberos branch was missing Subject binding. GSSAPI requires a JAAS Subject to be bound to the current thread (javax.security.auth.useSubjectCredsOnly is true by default). Wrap the createHiveClientImpl call with realUgi.doAs(), consistent with the impersonation path. Also add KerberosClient.getRealLoginUgi() as the public accessor needed by the doAs fix, and add 'Cannot find Hive jar directory' to the Hive2 fallback condition to handle environments where HIVE3 libs are absent.
|
@JoegenUSTC |
…sLoader/Kerberos fixes Unit tests (TestHiveClientFactoryClassLoader, no external dependencies): - testFactoryClassLoaderIsStableAcrossThreads: proves that HiveClientFactory.class.getClassLoader() returns the same instance across threads (rationale for Bug 1 fix). - testTCCLIsUnstableAcrossThreads: contrast test proving that TCCL differs per thread, confirming why using TCCL as baseLoader was a bug. - testCreateHiveClientImplLoadsHiveVersionFromIsolatedClassLoader: verifies that createHiveClientImpl() calls loadClass(HiveVersion) on the isolated ClassLoader before getConstructor(), not the system CL's HiveVersion.class directly. Uses a mock ClassLoader that returns the real HiveVersion for the loadClass call and throws a sentinel ClassNotFoundException for HiveClientImpl, proving the HiveVersion step succeeds (Bug 2 fix). - testSystemClHiveVersionCausesNoSuchMethodException: contrast test proving that passing the wrong CL's HiveVersion to getConstructor() throws NoSuchMethodException (demonstrates the original Bug 2 failure mode). Integration test (TestHive2HMSWithKerberosNoImpersonation, requires @tag("gravitino-docker-test")): - Extends TestHive2HMSWithKerberos but sets authentication.impersonation-enable=false to exercise the non-impersonation Kerberos path in createHiveClientInternal() — the path fixed by Bug 3 (missing realUgi.doAs()). The existing TestHive2HMSWithKerberos only covered impersonation=true.
Thanks for the suggestion! I've added tests in a follow-up commit ( Unit tests (
Integration test (
|
Code Coverage Report
Files
|
|
The CI fails. Can you take time to fix it? @JoegenUSTC |
…e2HMSWithKerberosNoImpersonation The previous startHiveContainer() override called super.startHiveContainer() and then re-created hiveClient via new HiveClientFactory(...). This was redundant because TestHive2HMSWithKerberos.startHiveContainer() already calls createHiveProperties() via polymorphism, so the overridden createHiveProperties() (with impersonation-enable=false) is used for the initial hiveClient creation. The redundant second HiveClientFactory construction caused a NullPointerException in CI when the Kerberos container initialization failed: the exception left hiveClient=null, and all test methods then failed with NPE on hiveClient.createDatabase(...). Fix: remove startHiveContainer() override entirely. Only createHiveProperties() needs to be overridden to set impersonation-enable=false.
The root cause was that The fix removes the All tests pass locally ( |
Done. |
@yuqi1129
Could you please re-run the failed jobs? Thank you! |
|
@yuqi1129 Thank you for reopening! It seems the CI workflows are still awaiting approval. Could you please approve them to run? Thank you! |
What changes were proposed in this pull request?
This PR fixes three independent bugs in
HiveClientFactorythat together prevent Hive Metastoreconnections from working when Kerberos authentication is enabled and impersonation is disabled.
Changes:
createHiveClientWithBackend()– UseHiveClientFactory.class.getClassLoader()asbaseLoaderinstead ofThread.currentThread().getContextClassLoader()(TCCL). TCCL isunstable across threads; using it causes
UserGroupInformationto be loaded by two differentClassLoaders, making the TGT stored after
login()invisible at connection time.createHiveClientImpl()– LoadHiveVersionfrom the isolatedHiveClientClassLoaderbefore calling
getConstructor().HiveClientImplis a barrier class redefined viadefineClassinside the isolated ClassLoader; its constructor expects theHiveVersiontypefrom the isolated ClassLoader scope. Passing the system ClassLoader's
HiveVersion.classcauses
NoSuchMethodException.createHiveClientInternal()– Add arealUgi.doAs()wrapper for the non-impersonationKerberos branch. The JVM default
javax.security.auth.useSubjectCredsOnly=truemeans GSSAPIonly finds credentials in the current thread's JAAS Subject context.
KerberosClient.login()stores the TGT in
realLoginUgi.subjectbut does not bind it to the current thread;ugi.doAs()is required to do so. The impersonation branch already does this correctly viacreateProxyHiveClientImpl; this fix brings the non-impersonation path in line.Also adds
KerberosClient.getRealLoginUgi()as the public accessor needed by fix #3, and adds"Cannot find Hive jar directory"to the Hive2 fallback condition to handle environments whereHIVE3 libs directory is absent.
Why are the changes needed?
Without these fixes, any Hive catalog configured with Kerberos authentication and impersonation
disabled fails to connect to HMS with one or more of:
All three bugs exist on the same code path and must be fixed together for the feature to work.
Fix: #11353
Does this PR introduce any user-facing change?
No API or configuration changes. This is a bug fix for an existing feature
(Kerberos HMS authentication without impersonation) that was silently broken.
How was this patch tested?
Verified on an internal deployment with a real Kerberos-secured HMS (non-impersonation mode):
GET /api/metalakes/{metalake}/catalogs/{catalog}/schemasreturns schema list correctly after fixGET /api/metalakes/{metalake}/catalogs/{catalog}/schemas/{schema}/tablesreturns table list correctly after fixGSSExceptionorNoSuchMethodExceptionin server logsUnit tests in
hive-metastore-commonall pass (./gradlew :catalogs:hive-metastore-common:test -PskipITs).Note: a full Kerberos integration test requires a KDC environment which is not available in
standard CI. A follow-up issue can track adding a Docker-based KDC integration test.