[SPARK-56964][INFRA] Share Maven precompile artifact across maven_test matrix by zhengruifeng · Pull Request #55766 · apache/spark

zhengruifeng · 2026-05-08T11:59:59Z

What changes were proposed in this pull request?

Follow-up to SPARK-56768 (#55726), which introduced the same kind of shared-precompile pattern for the SBT-driven build_and_test.yml. This PR applies the analogous optimization to .github/workflows/maven_test.yml - the reusable workflow that the scheduled build_maven*.yml jobs call to run Maven-based Scala tests across multiple JDK versions.

Each of the 12 matrix entries today runs three steps back-to-back:

1. mvn -DskipTests <profiles> clean install (~25-40m of redundant compile, identical across all entries)
2. mvn clean -pl assembly (small cleanup, conditional on module)
3. mvn -pl <TEST_MODULES> ... test (the actual per-entry test phase)

Step 1 is byte-equivalent across every matrix entry: same 9 Maven profiles, same -DskipTests, same -Djava.version=<input>. This PR factors it into a single precompile-maven job whose output every entry consumes.

Concrete changes

- New precompile-maven job runs mvn -DskipTests <profiles> clean install once on the same runs-on: ${{ inputs.os }} runner. The same shell wrapper, same MAVEN_OPTS, same profile set, same JAVA_VERSION/-ea substitution as the matrix entries use today.
- The job tars two pieces and uploads them as a multi-file artifact:
  - compile-target.tar.gz - all */target/ directories from the workspace.
  - compile-m2-spark.tar.gz - ~/.m2/repository/org/apache/spark/, needed by the matrix's mvn -pl X test to resolve cross-module Spark dependencies that aren't in the reactor.

  Artifact name: spark-maven-compile-<branch>-java<java>-<run_id>. The JDK is encoded in the name because build_maven.yml, build_maven_java21.yml, build_maven_java25.yml use different JDKs and bytecode is JDK-specific.
- The build matrix job adds precompile-maven to needs: and uses if: (!cancelled()) so the matrix runs even if precompile fails or is cancelled.
- New "Download precompiled artifact" / "Extract precompiled artifact" steps with the same optional/fallback design as the SBT version:
  - if: needs.precompile-maven.result == 'success' on download.
  - continue-on-error: true on both steps.
  - if: steps.download-precompiled.outcome == 'success' on extract.

Inside the existing "Run tests" bash, the mvn clean install line is gated:

if [ "${{ steps.extract-precompiled.outcome }}" = "success" ]; then
  echo "Reusing precompiled artifact, skipping local Maven clean install."
else
  ./build/mvn ... clean install
fi

The rest of the bash (the clean -pl assembly cleanup and the per-entry test invocations) is unchanged.

Optional: graceful fallback if precompile fails

Same pattern as the SBT extensions:

- precompile-maven is continue-on-error: true - a failed or cancelled precompile does not fail the workflow.
- Download/extract have continue-on-error: true and skip if the upstream step didn't succeed.
- The bash runs the original mvn clean install whenever the artifact wasn't usable.

So a precompile failure degrades to today's behavior, not a workflow failure.
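The fallback chain above can be sketched as plain bash. This is only an illustration: `decide_build_path` is a hypothetical helper, not part of the workflow, and it assumes the three gates behave as described (download runs only if precompile succeeded, extract runs only if download succeeded).

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the `if:` gates described above.
# Arguments are GitHub Actions result/outcome strings:
#   $1 = needs.precompile-maven.result
#   $2 = steps.download-precompiled.outcome (if the step were to run)
#   $3 = steps.extract-precompiled.outcome  (if the step were to run)
decide_build_path() {
  local precompile_result="$1" download_outcome="$2" extract_outcome="$3"

  # Download is attempted only when the precompile job succeeded.
  if [ "$precompile_result" != "success" ]; then
    download_outcome="skipped"
  fi
  # Extract is attempted only when the download step succeeded.
  if [ "$download_outcome" != "success" ]; then
    extract_outcome="skipped"
  fi

  # "Run tests" reuses the artifact only when extraction succeeded.
  if [ "$extract_outcome" = "success" ]; then
    echo "reuse"
  else
    echo "fallback"
  fi
}

decide_build_path success success success   # prints "reuse"
decide_build_path failure success success   # prints "fallback"
decide_build_path success failure success   # prints "fallback"
```

Any break in the chain lands on the same fallback branch, which is why a precompile failure looks like today's behavior to each matrix entry.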

Why two artifact files

Maven's mvn -pl X test resolves cross-module dependencies (other Spark modules) from ~/.m2/repository/org/apache/spark/ rather than from the workspace's target/. We need both:

- target/ so the matrix entry's main/test classes for module X are present (Maven sees they're up-to-date and skips re-compilation thanks to mtime preservation by tar).
- ~/.m2/repository/org/apache/spark/ so the artifact resolution for inter-module Spark deps doesn't fall back to "module not found" or trigger a recursive build.

The matrix entry extracts both into their respective locations (./*/target/... for the first, ~/.m2/repository/org/apache/spark/ for the second).
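A minimal round-trip sketch of the two tarballs, run against a scratch directory rather than a real checkout. The directory layout, file names, and exact tar/find invocations here are assumptions for illustration; the actual workflow steps may differ. It also checks that tar restores the original modification time on extraction.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical scratch layout standing in for the checkout and ~/.m2.
work="$(mktemp -d)"
src="$work/src"; dst="$work/dst"
m2_src="$work/m2-src"; m2_dst="$work/m2-dst"
mkdir -p "$src/core/target/classes" "$dst" \
         "$m2_src/repository/org/apache/spark/spark-core" "$m2_dst"
echo "class" > "$src/core/target/classes/A.class"
echo "jar"   > "$m2_src/repository/org/apache/spark/spark-core/core.jar"
# Give the "compiled" file an old, whole-second mtime.
touch -t 202001010000 "$src/core/target/classes/A.class"

# Precompile side: one tarball for the target/ trees,
# one for the Spark slice of the local Maven repository.
(cd "$src" && find . -type d -name target -prune -print0 \
  | tar -czf "$work/compile-target.tar.gz" --null -T -)
tar -czf "$work/compile-m2-spark.tar.gz" -C "$m2_src" repository/org/apache/spark

# Matrix side: extract each tarball into its own location.
tar -xzf "$work/compile-target.tar.gz" -C "$dst"
tar -xzf "$work/compile-m2-spark.tar.gz" -C "$m2_dst"

# Neither copy is newer than the other, so the mtime survived the round trip.
a="$src/core/target/classes/A.class"
b="$dst/core/target/classes/A.class"
if [ ! "$a" -nt "$b" ] && [ ! "$b" -nt "$a" ]; then
  echo "mtime preserved"
fi
```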

Measured savings

Comparing the apache/spark scheduled build_maven.yml run on 2026-05-17 (25992372470) against the validation push of this PR on 2026-05-20 (26153415924), both JDK 17 / Scala 2.13 / Hadoop 3:

| | Before | After | Δ |
|---|---:|---:|---:|
| Sum of 12 matrix entries | 17:58:04 | 9:44:11 | −8:13:53 |
| + new `precompile-maven` job | | 0:49:24 | |
| Total CI compute per run | 17:58:04 | 10:33:35 | −7:24:29 (−41%) |

Every matrix entry drops by 28–53 min (≈40 min average), matching the redundant mvn -DskipTests … clean install (~25–40 min) that this PR removes from each entry. Multiplied across the three scheduled Maven workflows (JDK 17 / 21 / 25), the daily saving is ~22 h of org-shared CI capacity.
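The totals in the table can be re-derived with a few lines of bash arithmetic. The helper names `to_seconds` and `to_hms` are made up for this sketch; the durations are the ones quoted above.

```shell
#!/usr/bin/env bash
# Convert H:MM:SS to seconds; 10# forces base 10 so "08" is not read as octal.
to_seconds() {
  local h m s
  IFS=: read -r h m s <<< "$1"
  echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Format a number of seconds back to H:MM:SS.
to_hms() {
  printf '%d:%02d:%02d\n' $(( $1 / 3600 )) $(( $1 % 3600 / 60 )) $(( $1 % 60 ))
}

before=$(to_seconds 17:58:04)        # sum of 12 matrix entries, before
after_matrix=$(to_seconds 9:44:11)   # sum of 12 matrix entries, after
precompile=$(to_seconds 0:49:24)     # new precompile-maven job

after_total=$(( after_matrix + precompile ))
saved=$(( before - after_total ))

echo "after total: $(to_hms "$after_total")"      # 10:33:35
echo "saved:       $(to_hms "$saved")"            # 7:24:29
echo "saved pct:   $(( saved * 100 / before ))%"  # 41% (integer division)
echo "3 workflows: $(to_hms $(( saved * 3 )))"    # 22:13:27
```

The last line simply triples the JDK 17 saving; it assumes the JDK 21 and JDK 25 workflows save about the same amount, which this PR did not measure separately.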

See this comment for the full per-entry breakdown and notes on the wall-clock trade-off (precompile + matrix is sequential, so end-to-end wall-clock grows by ~20 min on official infra; the much larger compute saving comes from removing the redundant compile from every matrix entry).

The sql/hive-thriftserver matrix entry has a special case ("To avoid a compilation loop ... run clean install instead") that re-runs clean install regardless. In the measured run that entry still saved ~39 min, likely because the cached ~/.m2/repository/org/apache/spark/ from the precompile artifact shortens its re-run.

Does this PR introduce any user-facing change?

No. CI infrastructure change only.

How was this patch tested?

Exercised end-to-end by validation run 26153415924 of build_maven.yml on the PR branch (JDK 17). Both expected log signatures appeared:

- precompile-maven job: [INFO] BUILD SUCCESS from Maven, plus the ls -lh compile-target.tar.gz compile-m2-spark.tar.gz line.
- Matrix entries' "Run tests" step: Reusing precompiled artifact, skipping local Maven clean install.

The fallback path (full mvn clean install when the artifact is missing or extraction fails) is preserved by continue-on-error: true on the precompile job and the download/extract steps; on that path each matrix entry runs mvn clean install itself, identical to today's behavior.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.7)

Follow-up to SPARK-56768. Adds a `precompile-maven` job to `maven_test.yml` that runs `mvn clean install -DskipTests` once and publishes the resulting `target/` trees plus `~/.m2/repository/org/apache/spark/` as a GitHub Actions artifact. Each of the 12 matrix entries now consumes that artifact instead of running its own `mvn clean install` from scratch.

The Maven version of the optimization differs from the SBT one in two places:

1. We tar two pieces and upload as a single multi-file artifact: `compile-target.tar.gz` (workspace target/ trees) and `compile-m2-spark.tar.gz` (the Spark portion of the local Maven repository, needed for cross-module dependency resolution at `mvn -pl X test` time).
2. The artifact name is JDK-tagged `spark-maven-compile-<branch>-java<java>-<run_id>` because the build_maven_*.yml callers use different JDKs (17, 21, 25) and each produces non-interchangeable bytecode.

Same optional/fallback design as SPARK-56768:

- `precompile-maven` is `continue-on-error: true`; a failure does not fail the workflow run.
- The matrix uses `if: (!cancelled())` so it runs even on precompile failure or cancellation.
- The "Download precompiled artifact" step is gated on `needs.precompile-maven.result == 'success'` and has `continue-on-error: true`.
- The "Extract precompiled artifact" step is gated on the download succeeding.
- Inside the "Run tests" bash, the local `mvn clean install` is run only when `steps.extract-precompiled.outcome != 'success'`. Otherwise the artifact's classes/jars are used directly. Logs `Reusing precompiled artifact, skipping local Maven clean install.` for visibility.

The hive-thriftserver special case (line ~228, "To avoid a compilation loop") still does its own `clean install` and is not touched by this PR; it does ~1 of 12 entries' worth of redundant work, which is acceptable.

Estimated saving: roughly 11 of the 12 matrix entries skip ~25-40m of Maven clean install each; netting ~300m+ of CI compute saved per scheduled run, per JDK.

Generated-by: Claude Code (Opus 4.7)

…connect

Eleven of the 12 matrix entries wipe assembly/target/ immediately after extraction via the existing `mvn clean -pl assembly` step (SPARK-51628, which exists to keep the SPARK-51600 fix path covered by the daily Maven test). Including assembly in the artifact wastes upload + download bandwidth for those 11 entries. This commit:

1. Excludes `assembly/` from the find pattern in the precompile-maven "Package compile output" step. Uses `-prune` so any nested target/ dirs under assembly are also excluded.
2. Adds an explicit `mvn install -pl assembly` step in the matrix entry's bash, gated on `MODULES_TO_TEST = "connect"` and the artifact reuse path. The connect entry is the only one that needs the assembly built (SPARK-51628 leaves it out of the cleanup for that reason); now we build it on demand instead of carrying it around for entries that throw it away.

The SPARK-51628 cleanup step (`mvn clean -pl assembly` for non-connect) still runs and is now a near-no-op for the reuse path; it remains a correctness guard for the fallback path that does run `clean install`.

Generated-by: Claude Code (Opus 4.7)
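A small sketch of the `-prune` idea on a scratch tree. The find expression below is an assumption about the general shape of such a pattern, not a copy of the workflow's "Package compile output" step, and `list_targets` is a hypothetical helper.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical module layout, including a nested target/ under assembly.
root="$(mktemp -d)"
mkdir -p "$root/core/target" "$root/sql/core/target" \
         "$root/assembly/target" "$root/assembly/sub/target"

# List target/ directories, skipping the whole ./assembly subtree.
# -prune stops find from descending into ./assembly at all, so nested
# target/ directories below it are excluded too.
list_targets() {
  (cd "$1" && find . -path ./assembly -prune -o -type d -name target -print | sort)
}

list_targets "$root"
# ./core/target
# ./sql/core/target
```

Filtering the output afterwards (for example with `grep -v`) would give the same list here, but `-prune` avoids walking the excluded subtree in the first place.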

Mirror the comment used on the existing matrix-job cache steps so a future maintainer knows the macOS gate on these new cache steps is a workaround for the upstream GHA hashFiles failure tracked in SPARK-54466 / actions/runner-images#13341, and can be removed once that issue is resolved. Generated-by: Claude Code (Opus 4.7)

The `mvn clean -pl assembly` step exists to wipe assembly/target/ so tests exercise the SPARK-51600 prepend fallback. On the precompile reuse path the assembly module is already excluded from the artifact, so the cleanup is a no-op (~5-10s of wasted Maven invocation per non-connect entry, ~50-100s per scheduled run). Move the cleanup into the fallback branch, where it's still needed. The reuse path's regression coverage is preserved by the artifact having no assembly to begin with. Generated-by: Claude Code (Opus 4.7)

REVERT BEFORE MERGE. Adds `push:` to the trigger list and removes the `if: github.repository == 'apache/spark'` job-level gate so each push to this branch on the fork fires build_maven.yml. This exercises maven_test.yml end-to-end with the precompile-maven changes from this PR. Generated-by: Claude Code (Opus 4.7)

zhengruifeng · 2026-05-21T10:58:52Z

Measured CI time: before vs. after

Comparing a recent scheduled build_maven.yml run on master against the validation push on this branch (both JDK 17, Scala 2.13, Hadoop 3).

Runs

Before: apache/spark scheduled run 25992372470 on master, 2026-05-17 (pre-optimization).
After: validation push run 26153415924 on this PR branch, 2026-05-20 (with the precompile artifact).

Per-matrix-entry duration (sorted by "before")

| Matrix entry | Before | After | Δ |
|---|---:|---:|---:|
| sql#core - other tests | 2:03:26 | 1:34:50 | −0:28:36 |
| sql#core - slow tests | 1:59:59 | 1:29:42 | −0:30:17 |
| sql#core - extended tests | 1:51:01 | 1:04:28 | −0:46:33 |
| sql#hive - other tests | 1:46:57 | 1:05:02 | −0:41:55 |
| connector#kafka-0-10, … | 1:41:22 | 0:58:30 | −0:42:52 |
| core,launcher,common, … | 1:39:35 | 0:57:00 | −0:42:35 |
| repl,sql#hive-thriftserver | 1:15:19 | 0:36:02 | −0:39:17 |
| mllib-local,mllib,sql#pipelines | 1:14:18 | 0:31:20 | −0:42:58 |
| connect | 1:13:42 | 0:20:21 | −0:53:21 |
| sql#hive - slow tests | 1:11:35 | 0:33:29 | −0:38:06 |
| sql#api,catalyst,yarn,k8s#core | 1:09:22 | 0:24:00 | −0:45:22 |
| graphx,streaming,hadoop-cloud | 0:51:28 | 0:09:27 | −0:42:01 |

Every entry drops by 28–53 min (≈40 min on average), matching the redundant mvn -DskipTests … clean install (~25–40 min) the PR removes from each matrix entry. The repl,sql#hive-thriftserver entry still saves ~39 min here despite the "compilation-loop" special case that re-runs clean install — likely because the cached ~/.m2/repository/org/apache/spark/ from the precompile artifact still shortens that re-run.

Aggregate

| Metric | Before | After | Δ |
|---|---:|---:|---:|
| Sum of matrix entries | 17:58:04 | 9:44:11 | −8:13:53 |
| + new `precompile-maven` job | | 0:49:24 | |
| Total CI compute per run | 17:58:04 | 10:33:35 | −7:24:29 (−41%) |
| Workflow wall-clock | 2:03:30 | 3:27:51 | +1:24:21 |

On the wall-clock delta: the +1h 24m is mostly fork-runner queueing — in the after-run, matrix jobs started in a stagger between 10:38 and 11:47 (slowest entry waited ~1h 17m for a runner), whereas on apache/spark all 12 entries start within 3 s. Netting out the queue and looking at precompile + longest reduced matrix entry ≈ 49 min + 1h 35m = ~2h 24m, vs. 2h 3m baseline — i.e. roughly +20 min sequential cost on official infra, in exchange for the ~7h 25m compute saving per scheduled run. Across the three scheduled Maven workflows (JDK 17 / 21 / 25), that's ~22 h of CI compute saved per day.
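The queue-free estimate above amounts to a critical-path calculation: precompile time plus the longest "after" matrix entry. A rough sketch, assuming all 12 entries start the moment precompile finishes (the `to_seconds` helper is hypothetical):

```shell
#!/usr/bin/env bash
# Convert H:MM:SS to seconds (base 10 forced to avoid octal parsing).
to_seconds() {
  local h m s
  IFS=: read -r h m s <<< "$1"
  echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

precompile=$(to_seconds 0:49:24)
baseline=$(to_seconds 2:03:30)   # "before" workflow wall-clock

# "After" durations of the 12 matrix entries from the table above.
entries=(1:34:50 1:29:42 1:04:28 1:05:02 0:58:30 0:57:00
         0:36:02 0:31:20 0:20:21 0:33:29 0:24:00 0:09:27)

longest=0
for e in "${entries[@]}"; do
  s=$(to_seconds "$e")
  if [ "$s" -gt "$longest" ]; then longest=$s; fi
done

critical=$(( precompile + longest ))
echo "critical path:  $(( critical / 60 )) min"               # 144 min (~2h 24m)
echo "extra vs base:  $(( (critical - baseline) / 60 )) min"  # 20 min
```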

This also confirms the PR description's "~315–325m (~5h) net saved per run" estimate is actually conservative on this run (measured ~7h 25m).

dongjoon-hyun

+1, LGTM. Thank you.

…t matrix

Closes #55766 from zhengruifeng/share-precompile-maven-test.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
(cherry picked from commit 74816d7)
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>

zhengruifeng · 2026-05-22T02:26:36Z

thanks, merged to master/4.x/4.2

LuciferYang · 2026-05-22T03:01:18Z

late LGTM

LuciferYang · 2026-05-22T03:03:04Z

A new Maven job has been launched to verify the effectiveness:

https://github.com/apache/spark/actions/runs/26265926910

zhengruifeng changed the title ~~[INFRA] Share Maven precompile artifact across maven_test matrix~~ [SPARK-56964][INFRA] Share Maven precompile artifact across maven_test matrix May 20, 2026

zhengruifeng added 5 commits May 20, 2026 09:19

zhengruifeng force-pushed the share-precompile-maven-test branch from c496b74 to c43e36c Compare May 20, 2026 09:20

zhengruifeng marked this pull request as ready for review May 21, 2026 08:58

zhengruifeng requested review from HyukjinKwon, LuciferYang, cloud-fan and dongjoon-hyun May 21, 2026 11:02

Revert "[INFRA][TEMP] Trigger build_maven.yml on push to validate this PR" (7d99f4e)

This reverts commit c43e36c.

HyukjinKwon approved these changes May 21, 2026

dongjoon-hyun approved these changes May 21, 2026

zhengruifeng closed this in 74816d7 May 22, 2026

zhengruifeng deleted the share-precompile-maven-test branch May 22, 2026 02:26

zhengruifeng wants to merge 6 commits into apache:master from zhengruifeng:share-precompile-maven-test
