Skip to content

[#11353] fix(hive): fix three ClassLoader/Kerberos bugs causing HMS connection failure#11355

Open
JoegenUSTC wants to merge 7 commits into
apache:mainfrom
JoegenUSTC:bugfix-hive-classloader-kerberos
Open

[#11353] fix(hive): fix three ClassLoader/Kerberos bugs causing HMS connection failure#11355
JoegenUSTC wants to merge 7 commits into
apache:mainfrom
JoegenUSTC:bugfix-hive-classloader-kerberos

Conversation

@JoegenUSTC
Copy link
Copy Markdown

What changes were proposed in this pull request?

This PR fixes three independent bugs in HiveClientFactory that together prevent Hive Metastore
connections from working when Kerberos authentication is enabled and impersonation is disabled.

Changes:

  1. createHiveClientWithBackend() – Use HiveClientFactory.class.getClassLoader() as
    baseLoader instead of Thread.currentThread().getContextClassLoader() (TCCL). TCCL is
    unstable across threads; using it causes UserGroupInformation to be loaded by two different
    ClassLoaders, making the TGT stored after login() invisible at connection time.

  2. createHiveClientImpl() – Load HiveVersion from the isolated HiveClientClassLoader
    before calling getConstructor(). HiveClientImpl is a barrier class redefined via
    defineClass inside the isolated ClassLoader; its constructor expects the HiveVersion type
    from the isolated ClassLoader scope. Passing the system ClassLoader's HiveVersion.class
    causes NoSuchMethodException.

  3. createHiveClientInternal() – Add a realUgi.doAs() wrapper for the non-impersonation
    Kerberos branch. The JVM default javax.security.auth.useSubjectCredsOnly=true means GSSAPI
    only finds credentials in the current thread's JAAS Subject context. KerberosClient.login()
    stores the TGT in realLoginUgi.subject but does not bind it to the current thread;
    ugi.doAs() is required to do so. The impersonation branch already does this correctly via
    createProxyHiveClientImpl; this fix brings the non-impersonation path in line.

Also adds KerberosClient.getRealLoginUgi() as the public accessor needed by fix #3, and adds
"Cannot find Hive jar directory" to the Hive2 fallback condition to handle environments where
HIVE3 libs directory is absent.

Why are the changes needed?

Without these fixes, any Hive catalog configured with Kerberos authentication and impersonation
disabled fails to connect to HMS with one or more of:

GSS initiate failed
No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
java.lang.NoSuchMethodException: HiveClientImpl.<init>(HiveVersion, Properties)

All three bugs exist on the same code path and must be fixed together for the feature to work.

Fix: #11353

Does this PR introduce any user-facing change?

No API or configuration changes. This is a bug fix for an existing feature
(Kerberos HMS authentication without impersonation) that was silently broken.

How was this patch tested?

Verified on an internal deployment with a real Kerberos-secured HMS (non-impersonation mode):

  • GET /api/metalakes/{metalake}/catalogs/{catalog}/schemas returns schema list correctly after fix
  • GET /api/metalakes/{metalake}/catalogs/{catalog}/schemas/{schema}/tables returns table list correctly after fix
  • No GSSException or NoSuchMethodException in server logs

Unit tests in hive-metastore-common all pass (./gradlew :catalogs:hive-metastore-common:test -PskipITs).

Note: a full Kerberos integration test requires a KDC environment which is not available in
standard CI. A follow-up issue can track adding a Docker-based KDC integration test.

Bug 1: use stable baseLoader (HiveClientFactory.class.getClassLoader())
instead of TCCL in createHiveClientWithBackend(). TCCL is not stable
across threads, causing UGI to be loaded by two different ClassLoaders
so the TGT stored after login() is invisible at connection time.

Bug 2: in createHiveClientImpl(), load HiveVersion from the isolated
ClassLoader before calling getConstructor(). HiveClientImpl is a barrier
class redefined via defineClass; its constructor expects the HiveVersion
type from the isolated CL scope. Passing the system CL's HiveVersion
causes NoSuchMethodException.

Bug 3: in createHiveClientInternal(), the non-impersonation Kerberos
branch was missing Subject binding. GSSAPI requires a JAAS Subject to
be bound to the current thread (javax.security.auth.useSubjectCredsOnly
is true by default). Wrap the createHiveClientImpl call with
realUgi.doAs(), consistent with the impersonation path.

Also add KerberosClient.getRealLoginUgi() as the public accessor needed
by the doAs fix, and add 'Cannot find Hive jar directory' to the Hive2
fallback condition to handle environments where HIVE3 libs are absent.
@jerryshao jerryshao requested review from diqiu50 and yuqi1129 June 2, 2026 09:38
@yuqi1129
Copy link
Copy Markdown
Contributor

yuqi1129 commented Jun 2, 2026

@JoegenUSTC
Can you add some UTs and ITs to cover and verify your changes here?

…sLoader/Kerberos fixes

Unit tests (TestHiveClientFactoryClassLoader, no external dependencies):

- testFactoryClassLoaderIsStableAcrossThreads: proves that
  HiveClientFactory.class.getClassLoader() returns the same instance
  across threads (rationale for Bug 1 fix).

- testTCCLIsUnstableAcrossThreads: contrast test proving that TCCL
  differs per thread, confirming why using TCCL as baseLoader was a bug.

- testCreateHiveClientImplLoadsHiveVersionFromIsolatedClassLoader:
  verifies that createHiveClientImpl() calls loadClass(HiveVersion) on
  the isolated ClassLoader before getConstructor(), not the system CL's
  HiveVersion.class directly. Uses a mock ClassLoader that returns the
  real HiveVersion for the loadClass call and throws a sentinel
  ClassNotFoundException for HiveClientImpl, proving the HiveVersion
  step succeeds (Bug 2 fix).

- testSystemClHiveVersionCausesNoSuchMethodException: contrast test
  proving that passing the wrong CL's HiveVersion to getConstructor()
  throws NoSuchMethodException (demonstrates the original Bug 2 failure
  mode).

Integration test (TestHive2HMSWithKerberosNoImpersonation,
requires @tag("gravitino-docker-test")):

- Extends TestHive2HMSWithKerberos but sets
  authentication.impersonation-enable=false to exercise the
  non-impersonation Kerberos path in createHiveClientInternal() — the
  path fixed by Bug 3 (missing realUgi.doAs()). The existing
  TestHive2HMSWithKerberos only covered impersonation=true.
@JoegenUSTC
Copy link
Copy Markdown
Author

@JoegenUSTC Can you add some UTs and ITs to cover and verify your changes here?

Thanks for the suggestion!

I've added tests in a follow-up commit (9ec3efcf):

Unit tests (TestHiveClientFactoryClassLoader, no external dependencies, all 4 pass locally):

  • Two tests for Bug 1: one proves HiveClientFactory.class.getClassLoader() is stable across threads (the rationale for the fix), one contrast test shows TCCL is thread-dependent and therefore an unreliable baseLoader.
  • Two tests for Bug 2: one verifies that createHiveClientImpl() calls loadClass(HiveVersion) on the isolated ClassLoader before getConstructor() (so no NoSuchMethodException), one contrast test reproduces the original failure by passing a mismatched CL's HiveVersion to getConstructor() and asserting it throws NoSuchMethodException.

Integration test (TestHive2HMSWithKerberosNoImpersonation, @Tag("gravitino-docker-test")):

  • Extends the existing TestHive2HMSWithKerberos but sets authentication.impersonation-enable=false. This exercises exactly the non-impersonation Kerberos path fixed by Bug 3 (the missing realUgi.doAs()). The existing test only covered impersonation=true.
  • Requires a live KDC (Docker), so it runs under the gravitino-docker-test tag only.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 2, 2026

Code Coverage Report

Overall Project 66.77% +0.04% 🟢
Files changed 84.54% 🟢

Module Coverage
aliyun 1.72% 🔴
api 46.82% 🟢
authorization-common 85.96% 🟢
aws 3.66% 🔴
azure 2.47% 🔴
catalog-common 10.04% 🔴
catalog-fileset 80.33% 🟢
catalog-glue 66.08% 🟢
catalog-hive 79.33% 🟢
catalog-jdbc-clickhouse 80.02% 🟢
catalog-jdbc-common 45.31% 🟢
catalog-jdbc-doris 80.28% 🟢
catalog-jdbc-hologres 54.03% 🟢
catalog-jdbc-mysql 79.23% 🟢
catalog-jdbc-oceanbase 78.38% 🟢
catalog-jdbc-postgresql 82.29% 🟢
catalog-jdbc-starrocks 78.51% 🟢
catalog-kafka 77.01% 🟢
catalog-lakehouse-generic 44.89% 🟢
catalog-lakehouse-hudi 79.1% 🟢
catalog-lakehouse-iceberg 85.74% 🟢
catalog-lakehouse-paimon 79.29% 🟢
catalog-model 77.72% 🟢
cli 44.51% 🟢
client-java 77.91% 🟢
common 49.99% 🟢
core 82.49% 🟢
filesystem-hadoop3 76.97% 🟢
flink 0.0% 🔴
flink-common 41.2% 🟢
flink-runtime 0.0% 🔴
gcp 14.12% 🔴
hadoop-common 10.39% 🔴
hive-metastore-common 53.5% +3.88% 🟢
iceberg-common 56.54% 🟢
iceberg-rest-server 72.84% 🟢
idp-basic 85.99% 🟢
integration-test-common 0.0% 🔴
jobs 66.17% 🟢
lance-common 20.83% 🔴
lance-rest-server 60.27% 🟢
lineage 53.02% 🟢
optimizer 82.95% 🟢
optimizer-api 21.95% 🔴
server 85.73% 🟢
server-common 73.13% 🟢
spark 32.79% 🔴
spark-common 37.99% 🔴
trino-connector 39.44% 🔴
Files
Module File Coverage
hive-metastore-common HiveClientFactory.java 85.59% 🟢
KerberosClient.java 82.89% 🟢

@yuqi1129
Copy link
Copy Markdown
Contributor

yuqi1129 commented Jun 3, 2026

The CI fails. Can you take time to fix it? @JoegenUSTC

weijiajun and others added 2 commits June 3, 2026 16:08
…e2HMSWithKerberosNoImpersonation

The previous startHiveContainer() override called super.startHiveContainer()
and then re-created hiveClient via new HiveClientFactory(...). This was
redundant because TestHive2HMSWithKerberos.startHiveContainer() already
calls createHiveProperties() via polymorphism, so the overridden
createHiveProperties() (with impersonation-enable=false) is used for the
initial hiveClient creation.

The redundant second HiveClientFactory construction caused a
NullPointerException in CI when the Kerberos container initialization
failed: the exception left hiveClient=null, and all test methods then
failed with NPE on hiveClient.createDatabase(...).

Fix: remove startHiveContainer() override entirely. Only createHiveProperties()
needs to be overridden to set impersonation-enable=false.
@JoegenUSTC
Copy link
Copy Markdown
Author

The CI fails. Can you take time to fix it? @JoegenUSTC
@yuqi1129
Thanks for flagging this! I've fixed the CI failure in commit 11fbb32.

The root cause was that TestHive2HMSWithKerberosNoImpersonation.startHiveContainer() was
redundantly constructing a second HiveClientFactory after super.startHiveContainer(). Since
TestHive2HMSWithKerberos.startHiveContainer() already calls createHiveProperties() via
polymorphism, the overridden createHiveProperties() (with impersonation-enable=false) was
already in effect for the initial hiveClient creation. The redundant second construction failed
in the CI Kerberos environment, leaving hiveClient=null and causing NPE in all test methods.

The fix removes the startHiveContainer() override entirely — only createHiveProperties()
needs to be overridden.

All tests pass locally (./gradlew :catalogs:hive-metastore-common:test -PskipITs).
Could you please approve the CI workflows to run? Thank you!

@yuqi1129
Copy link
Copy Markdown
Contributor

yuqi1129 commented Jun 3, 2026

The CI fails. Can you take time to fix it? @JoegenUSTC
@yuqi1129
Thanks for flagging this! I've fixed the CI failure in commit 11fbb32.

The root cause was that TestHive2HMSWithKerberosNoImpersonation.startHiveContainer() was redundantly constructing a second HiveClientFactory after super.startHiveContainer(). Since TestHive2HMSWithKerberos.startHiveContainer() already calls createHiveProperties() via polymorphism, the overridden createHiveProperties() (with impersonation-enable=false) was already in effect for the initial hiveClient creation. The redundant second construction failed in the CI Kerberos environment, leaving hiveClient=null and causing NPE in all test methods.

The fix removes the startHiveContainer() override entirely — only createHiveProperties() needs to be overridden.

All tests pass locally (./gradlew :catalogs:hive-metastore-common:test -PskipITs). Could you please approve the CI workflows to run? Thank you!

Done.

@JoegenUSTC
Copy link
Copy Markdown
Author

The CI fails. Can you take time to fix it? @JoegenUSTC
@yuqi1129
Thanks for flagging this! I've fixed the CI failure in commit 11fbb32.

The root cause was that TestHive2HMSWithKerberosNoImpersonation.startHiveContainer() was redundantly constructing a second HiveClientFactory after super.startHiveContainer(). Since TestHive2HMSWithKerberos.startHiveContainer() already calls createHiveProperties() via polymorphism, the overridden createHiveProperties() (with impersonation-enable=false) was already in effect for the initial hiveClient creation. The redundant second construction failed in the CI Kerberos environment, leaving hiveClient=null and causing NPE in all test methods.
The fix removes the startHiveContainer() override entirely — only createHiveProperties() needs to be overridden.
All tests pass locally (./gradlew :catalogs:hive-metastore-common:test -PskipITs). Could you please approve the CI workflows to run? Thank you!

Done.

@yuqi1129
All three CI failures are transient infrastructure issues, unrelated to this PR's changes:

  1. catalog-fileset:test — Job was canceled by the CI runner (timeout), not a test failure.
  2. catalog-glue:test — Docker image pull timeout (motoserver/moto:5.1.4 not fetched within 2 minutes); trino-connector-435-439:compileJava — JDK toolchain download timeout (SocketTimeoutException). Both are network issues on the CI runner.
  3. trino-connector-473-478 — Job was canceled by the CI runner (timeout), not a test failure.

Could you please re-run the failed jobs? Thank you!

@yuqi1129 yuqi1129 closed this Jun 3, 2026
@yuqi1129 yuqi1129 reopened this Jun 3, 2026
@JoegenUSTC
Copy link
Copy Markdown
Author

@yuqi1129 Thank you for reopening! It seems the CI workflows are still awaiting approval. Could you please approve them to run? Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

2 participants