Skip to content

[SPARK-56448][CONNECT] Fix NPE on Spark Connect client restart due to ammonite compile cache#55720

Open
yadavay-amzn wants to merge 1 commit intoapache:masterfrom
yadavay-amzn:fix/spark_SPARK-56448
Open

[SPARK-56448][CONNECT] Fix NPE on Spark Connect client restart due to ammonite compile cache#55720
yadavay-amzn wants to merge 1 commit intoapache:masterfrom
yadavay-amzn:fix/spark_SPARK-56448

Conversation

@yadavay-amzn
Copy link
Copy Markdown

@yadavay-amzn yadavay-amzn commented May 6, 2026

What changes were proposed in this pull request?

Fixes SPARK-56448 (https://issues.apache.org/jira/browse/SPARK-56448).

The Spark Connect REPL uses Ammonite. Ammonite's default Storage.Folder
persists compiled predef classes under ~/.ammonite/<version>/cache. On a
subsequent REPL start from the same working directory, the cached
CodePredef class is reloaded but its reference to the per-session
ArgsPredef helper is stale, producing a NullPointerException during
predef initialization.

This PR switches the Connect REPL's compile cache to Storage.InMemory
so every session starts fresh and no stale cache is carried across
restarts. The ammonite.Main construction is extracted into a
package-private helper (ConnectRepl.newAmmoniteMain) to keep the
regression test localised to unit-level.

Why are the changes needed?

The stale-cache failure is a user-visible crash on every second
invocation of bin/spark-shell --remote sc://... from the same working
directory. Repro steps are on the JIRA.

Does this PR introduce any user-facing change?

There is one minor observable tradeoff: because the compile cache is
now in-memory rather than persisted, each REPL start recompiles the
predef instead of reading the cached classfiles. This adds ~a few
hundred milliseconds to subsequent REPL startups but eliminates the
NPE. We believe this is the correct tradeoff — a small startup cost
is preferable to a hard failure.

How was this patch tested?

Added ConnectReplSuite with a regression test that asserts the
ammonite.Main returned by ConnectRepl.newAmmoniteMain exposes no
on-disk cache directory (storageBackend.dirOpt.isEmpty) and is an
instance of Storage.InMemory.

The dirOpt check is an observable behavioural assertion: if the
Storage.InMemory() wiring is reverted, ammonite.Main falls back to
the default Storage.Folder(Defaults.ammoniteHome) and the assertion
fails with a clear message rather than silently compiling. I verified
this locally by temporarily reverting only the Storage.InMemory()
line and re-running the test; it fails with
Some(/home/<user>/.ammonite) was not empty. Restoring the fix makes
the test pass.

Targeted test run:

./build/sbt "connect-client-jvm/Test/testOnly org.apache.spark.sql.application.ConnectReplSuite"
...
[info] - SPARK-56448: Connect REPL uses an in-memory Ammonite storage backend (107 milliseconds)
[info] Tests: succeeded 1, failed 0

Also ran ./dev/scalastyle locally — passes.

Was this patch authored or co-authored using generative AI tooling?

Yes

… ammonite compile cache

The Spark Connect REPL uses Ammonite. Ammonite's default Storage.Folder
persists compiled predef classes under ~/.ammonite/<version>/cache. On
a subsequent REPL start from the same working directory, the cached
CodePredef class is reloaded but its reference to the per-session
ArgsPredef helper is stale, producing a NullPointerException during
predef initialization.

Use Storage.InMemory for the Connect REPL's compile cache so every
session starts fresh. Extract the Main construction into a package-
private helper to keep the test localised to unit-level.

Regression test added: asserts that the ammonite.Main returned by
ConnectRepl.newAmmoniteMain exposes no on-disk cache directory
(storageBackend.dirOpt.isEmpty) and is an instance of Storage.InMemory.
The dirOpt assertion is an observable behavioural check -- if the
Storage.InMemory wiring is reverted, ammonite.Main falls back to
Storage.Folder(~/.ammonite) and the test fails with a clear message
rather than silently compiling.
Copy link
Copy Markdown
Contributor

@attilapiros attilapiros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this test should be refined to check for the actual NPE rather than the internal state of dirOpt (while this state currently triggers the crash in this Ammonite version, it's an indirect check).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants