Changes from all commits
43 commits
c6483e3
Move the apt-get installs up and more consistently use APT_INSTALL s…
sfc-gh-hkarau Apr 17, 2026
da16bb3
Fall back on install failure of poorly cached apt-get update
sfc-gh-hkarau Apr 20, 2026
dbb00f5
Try and fix the PKGS ref not flowing through to the other side of the ||
sfc-gh-hkarau Apr 20, 2026
4b65d11
Ok focal is dead dead, let's move to jammy
sfc-gh-hkarau Apr 20, 2026
f80cad7
Use deadsnakes for Python 3.8
sfc-gh-hkarau Apr 20, 2026
8acd5f3
Ok we need gpg-agent for add-apt-repository?
sfc-gh-hkarau Apr 20, 2026
737dd17
Use APT_INSTALL so we don't block forever (my bad)
sfc-gh-hkarau Apr 20, 2026
3d36203
Fix aptinstall usage
sfc-gh-hkarau Apr 20, 2026
08d60da
Use 3.8 pip bootstrap, install 3.9 venv and 3.8 venv support, pin bac…
sfc-gh-hkarau Apr 21, 2026
de4e4be
Pin back some more
sfc-gh-hkarau Apr 22, 2026
78227d3
Keep python3.8 but not via pypy3.8 since there's no pandas wheel for …
sfc-gh-hkarau Apr 28, 2026
7faca4e
While fixing it, the pypa pip bootstrap switched to 3.10 as its oldest version.
sfc-gh-hkarau May 1, 2026
6813916
ugh apt-get flakes
sfc-gh-hkarau May 2, 2026
c3fca50
This is kind of hacky but gh keeps timing out on add-apt-repository.
sfc-gh-hkarau May 5, 2026
e661592
Add a comment explaining the DDoS
sfc-gh-hkarau May 5, 2026
0a24e25
And try and fallback to Python src build since DDoS
sfc-gh-hkarau May 5, 2026
a92bfb9
Fix src bld
sfc-gh-hkarau May 5, 2026
ce3e152
Add /usr/local/bin to end of path for alt install.
sfc-gh-hkarau May 5, 2026
9e9660d
oh we also get 3.9 from deadsnakes....
sfc-gh-hkarau May 5, 2026
1699976
Ok fall back to fcix mirror iff regular archive is dead
sfc-gh-hkarau May 5, 2026
e70f9fc
When we install Python from src we don't get setuptools or venv
sfc-gh-hkarau May 5, 2026
28bd5fb
I wonder if maybe just the mariadb 10.5.12 container is too dead.
sfc-gh-hkarau May 5, 2026
b1f4d29
Cleanup
sfc-gh-hkarau May 5, 2026
9ba7f3d
Apparently R package installs can just silently fail, love that, let's…
sfc-gh-hkarau May 5, 2026
2db7bbd
Ok R apparently just silently fails and marks packages as installed w…
sfc-gh-hkarau May 5, 2026
494fb33
hmm mysql scheme auth
sfc-gh-hkarau May 5, 2026
ba94573
Bump mypy for the iceberg type erasure issue (otherwise we'll mark as…
sfc-gh-hkarau May 5, 2026
469c1f1
Python3.8 list
sfc-gh-hkarau May 6, 2026
abd303b
Use raw Python3.8 if present too.
sfc-gh-hkarau May 6, 2026
7ec1542
pin back some roxygen2 deps to work around the ! cannot set an attrib…
sfc-gh-hkarau May 6, 2026
94fa0b6
Add all dev deps for testing in 3.8/3.9
sfc-gh-hkarau May 6, 2026
9dd5e6d
typo
sfc-gh-hkarau May 6, 2026
5fab623
Retry docker image pulls in JDBC integration suites (#11)
holdenk May 7, 2026
b00d985
Add pyarrow to base container image and bump mypy version in the CI c…
holdenk May 7, 2026
8c59e78
Skip mypy following of pydantic to avoid 0.991 JsonValue crash (#13)
holdenk May 7, 2026
5b7d2b0
Back to previous version of mypy
holdenk May 7, 2026
8fc5b23
Work around roxygen2 bug with S3 metadata on R primitives (#15)
holdenk May 8, 2026
1bf34ac
Install python reqs
holdenk May 8, 2026
3eb8acc
Disable SparkR in CI; it's broken and has been for a while, in practice…
holdenk May 8, 2026
de6ab4f
Typo
holdenk May 8, 2026
2379399
hmmm does it pass without 3.8? It's just type errors in 3.8
holdenk May 8, 2026
1ec7e70
Pin back pandas and plotly to probably supported versions
sfc-gh-hkarau May 9, 2026
df10c5a
Change version spec in req file
holdenk May 10, 2026
11 changes: 8 additions & 3 deletions .github/workflows/build_and_test.yml
@@ -100,7 +100,7 @@ jobs:
\"build\": \"$build\",
\"pyspark\": \"$pyspark\",
\"pyspark-pandas\": \"$pandas\",
\"sparkr\": \"$sparkr\",
\"sparkr\": \"false\",
\"tpcds-1g\": \"$tpcds\",
\"docker-integration-tests\": \"$docker\",
\"scala-213\": \"$build\",
@@ -436,10 +436,14 @@ jobs:
with:
distribution: temurin
java-version: ${{ matrix.java }}
- name: List Python packages (Python 3.9, PyPy3)
- name: Install Python packages (Python 3.9, Python3.8)
run: |
python3.9 -m pip install -r ./dev/requirements.txt
python3.8 -m pip install -r ./dev/requirements.txt
- name: List Python packages (Python 3.9, Python3.8)
run: |
python3.9 -m pip list
pypy3 -m pip list
python3.8 -m pip list
- name: Install Conda for pip packaging test
if: ${{ matrix.modules == 'pyspark-errors' }}
run: |
@@ -542,6 +546,7 @@ jobs:
# R issues at docker environment
export TZ=UTC
export _R_CHECK_SYSTEM_CLOCK_=FALSE
Rscript -e "library(testthat); library(knitr); library(rmarkdown); library(markdown)"
./dev/run-tests --parallelism 1 --modules sparkr
- name: Upload test results to report
if: always()
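The `Rscript -e "library(...)"` line added to the workflow fails fast when a package is missing, since (per the commit messages) R package installs can report success while leaving the package unloadable. A minimal Python sketch of the same fail-fast, verify-by-importing idea — the function name and module list are illustrative, not from the PR:

```python
import importlib

def assert_importable(modules):
    """Fail fast if any required module cannot be imported.

    Mirrors the Rscript library() sanity check: an installer can report
    success while leaving a package unusable, so probe each import directly.
    """
    missing = []
    for name in modules:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    if missing:
        raise RuntimeError("missing modules: " + ", ".join(missing))
    return True

# Example: stdlib modules that are always present import cleanly.
assert_importable(["json", "csv", "sqlite3"])
```

Running a check like this at image-build time surfaces a broken install as a build failure instead of a confusing error deep inside the test run.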
@@ -119,6 +119,43 @@ abstract class DockerJDBCIntegrationSuite
private var pulled: Boolean = false
protected var jdbcUrl: String = _

// Number of retry attempts for transient Docker registry / daemon errors
// (e.g. 5xx responses from Docker Hub, which can be flaky in CI).
private val dockerOpMaxAttempts =
sys.props.getOrElse("spark.test.docker.retryAttempts", "5").toInt
private val dockerOpInitialBackoffMs =
sys.props.getOrElse("spark.test.docker.retryInitialBackoffMs", "2000").toLong

/**
* Retry a Docker operation that may transiently fail due to registry / daemon
* availability issues (HTTP 5xx, network glitches, etc.). Uses exponential backoff.
*/
private def retryOnDockerError[T](description: String)(op: => T): T = {
var attempt = 1
var backoff = dockerOpInitialBackoffMs
var lastError: Throwable = null
while (attempt <= dockerOpMaxAttempts) {
try {
return op
} catch {
case NonFatal(e) =>
lastError = e
if (attempt == dockerOpMaxAttempts) {
log.error(
s"Docker operation '$description' failed after $attempt attempt(s); giving up.", e)
} else {
log.warn(
s"Docker operation '$description' failed on attempt $attempt of " +
s"$dockerOpMaxAttempts; retrying in ${backoff}ms.", e)
Thread.sleep(backoff)
backoff = math.min(backoff * 2, 30000L)
}
}
attempt += 1
}
throw lastError
}

override def beforeAll(): Unit = runIfTestsEnabled(s"Prepare for ${this.getClass.getName}") {
super.beforeAll()
try {
@@ -140,17 +177,23 @@ abstract class DockerJDBCIntegrationSuite
// Ensure that the Docker image is installed:
docker.inspectImageCmd(db.imageName).exec()
} catch {
case e: NotFoundException =>
case _: NotFoundException =>
log.warn(s"Docker image ${db.imageName} not found; pulling image from registry")
docker.pullImageCmd(db.imageName)
.start()
.awaitCompletion(connectionTimeout.value.toSeconds, TimeUnit.SECONDS)
retryOnDockerError(s"pull image ${db.imageName}") {
docker.pullImageCmd(db.imageName)
.start()
.awaitCompletion(connectionTimeout.value.toSeconds, TimeUnit.SECONDS)
}
pulled = true
}

docker.pullImageCmd(db.imageName)
.start()
.awaitCompletion(connectionTimeout.value.toSeconds, TimeUnit.SECONDS)
// Re-pull to ensure we have the latest version of the image. The registry
// (e.g. Docker Hub) is occasionally flaky in CI with 5xx responses, so retry.
retryOnDockerError(s"pull image ${db.imageName}") {
docker.pullImageCmd(db.imageName)
.start()
.awaitCompletion(connectionTimeout.value.toSeconds, TimeUnit.SECONDS)
}

val hostConfig = HostConfig
.newHostConfig()
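The `retryOnDockerError` helper above is a standard retry loop with exponential backoff and a cap. A minimal Python sketch of the same pattern (function and parameter names are illustrative, not part of the PR):

```python
import time

def retry_with_backoff(op, description, max_attempts=5,
                       initial_backoff_s=2.0, max_backoff_s=30.0,
                       sleep=time.sleep):
    """Retry `op` on any exception, doubling the wait between attempts
    (capped at max_backoff_s), like the Scala retryOnDockerError helper.
    Re-raises the last error once max_attempts is exhausted."""
    backoff = initial_backoff_s
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except Exception:
            if attempt == max_attempts:
                raise  # give up: surface the final failure to the caller
            sleep(backoff)
            backoff = min(backoff * 2, max_backoff_s)

# Example: an operation that fails twice with a transient error, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("registry 503")
    return "pulled"

print(retry_with_backoff(flaky, "pull image", sleep=lambda s: None))  # pulled
```

Injecting `sleep` as a parameter keeps the backoff testable without real delays; the Scala version reads its attempt count and initial backoff from system properties for the same kind of tunability.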
@@ -46,7 +46,7 @@ class MariaDBKrbIntegrationSuite extends DockerKrbJDBCIntegrationSuite {
override val jdbcPort = 3306

override def getJdbcUrl(ip: String, port: Int): String =
s"jdbc:mysql://$ip:$port/mysql?user=$principal"
s"jdbc:mysql://$ip:$port/mysql?user=$principal&permitMysqlScheme"

override def getEntryPoint: Option[String] =
Some("/docker-entrypoint/mariadb_docker_entrypoint.sh")
104 changes: 75 additions & 29 deletions dev/infra/Dockerfile
@@ -15,60 +15,106 @@
# limitations under the License.
#

# Image for building and testing Spark branches. Based on Ubuntu 20.04.
# Image for building and testing Spark branches. Based on Ubuntu 22.04.
# See also in https://hub.docker.com/_/ubuntu
FROM ubuntu:focal-20221019
FROM ubuntu:jammy
Contributor:

Should we pin this?

Contributor Author:

I was going back and forth on this; given we do an apt-get update anyway, I personally think pinning it is actually counterproductive.


ENV FULL_REFRESH_DATE 20221118
SHELL ["/bin/bash", "-o", "pipefail", "-c"]

ENV FULL_REFRESH_DATE 20260420

ENV DEBIAN_FRONTEND noninteractive
ENV DEBCONF_NONINTERACTIVE_SEEN true

ARG APT_INSTALL="apt-get install --no-install-recommends -y"
ARG APT_INSTALL="apt-get install -y"

RUN apt-get clean
RUN apt-get update
RUN $APT_INSTALL software-properties-common git libxml2-dev pkg-config curl wget openjdk-8-jdk libpython3-dev python3-pip python3-setuptools python3.8 python3.9
RUN update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
ENV PATH "$PATH:/usr/local/bin"

RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python3.9
RUN timeout 5 bash -c 'exec 3<>/dev/tcp/archive.ubuntu.com/80 && printf "HEAD /ubuntu/ HTTP/1.1\r\nHost: archive.ubuntu.com\r\nConnection: close\r\n\r\n" >&3 && IFS= read -r s <&3 && [[ "$s" =~ ^HTTP/.*[[:space:]](2|3)[0-9][0-9] ]]' || find /etc/apt -type f \( -name '*.list' -o -name '*.sources' \) -exec sed -i.bak -e 's|archive\.ubuntu\.com|mirror.fcix.net|g' -e 's|security\.ubuntu\.com|mirror.fcix.net|g' {} +
RUN apt-get clean && apt-get update
RUN PKGS="software-properties-common git libxml2-dev pkg-config curl wget openjdk-8-jdk libpython3-dev python3-pip python3-setuptools build-essential gfortran libopenblas-dev liblapack-dev gpg gpg-agent software-properties-common gcc g++ make libc6-dev libffi-dev libcurl4-openssl-dev libssl-dev openssl zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev liblzma-dev tk-dev uuid-dev pandoc libuv1-dev libuv1"; $APT_INSTALL $PKGS || (apt-get update && $APT_INSTALL $PKGS)
RUN update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java

RUN add-apt-repository ppa:pypy/ppa
RUN apt update
RUN $APT_INSTALL gfortran libopenblas-dev liblapack-dev
RUN $APT_INSTALL build-essential
# We also want Python 3.8 since that's the oldest supported version for Spark 3.5
# Also ubuntu is under a DDoS so retry adding, and finally fallback to python.org 3.8 release
RUN ( \
(add-apt-repository -y ppa:deadsnakes/ppa || add-apt-repository -y ppa:deadsnakes/ppa) && \
(apt-get update || apt-get update) && \
PKGS="python3.8 python3.9 python3.9-venv python3.8-venv"; ($APT_INSTALL $PKGS || apt-get update && $APT_INSTALL $PKGS) \
) || \
(PYTHON_VERSION=3.8.20; \
curl -O https://www.python.org/ftp/python/${PYTHON_VERSION}/Python-${PYTHON_VERSION}.tgz && \
tar -xzf Python-${PYTHON_VERSION}.tgz && \
cd Python-${PYTHON_VERSION} && \
./configure --enable-shared --prefix=/usr/local LDFLAGS="-Wl,--rpath=/usr/local/lib" && \
make altinstall && \
cd .. && \
PYTHON_VERSION=3.9.25; \
curl -O https://www.python.org/ftp/python/${PYTHON_VERSION}/Python-${PYTHON_VERSION}.tgz && \
tar -xzf Python-${PYTHON_VERSION}.tgz && \
cd Python-${PYTHON_VERSION} && \
./configure --enable-shared --prefix=/usr/local LDFLAGS="-Wl,--rpath=/usr/local/lib" && \
make altinstall)

RUN mkdir -p /usr/local/pypy/pypy3.8 && \
curl -sqL https://downloads.python.org/pypy/pypy3.8-v7.3.11-linux64.tar.bz2 | tar xjf - -C /usr/local/pypy/pypy3.8 --strip-components=1 && \
ln -sf /usr/local/pypy/pypy3.8/bin/pypy /usr/local/bin/pypy3.8 && \
ln -sf /usr/local/pypy/pypy3.8/bin/pypy /usr/local/bin/pypy3
RUN curl -sS https://bootstrap.pypa.io/pip/3.9/get-pip.py | python3.9

RUN curl -sS https://bootstrap.pypa.io/get-pip.py | pypy3
RUN curl -sS https://bootstrap.pypa.io/pip/3.8/get-pip.py | python3.8

RUN $APT_INSTALL gnupg ca-certificates pandoc
RUN echo 'deb https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/' >> /etc/apt/sources.list
RUN echo 'deb https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/' >> /etc/apt/sources.list
RUN gpg --keyserver hkps://keyserver.ubuntu.com --recv-key E298A3A825C0D65DFD57CBB651716619E084DAB9
RUN gpg -a --export E084DAB9 | apt-key add -
RUN add-apt-repository 'deb https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/'
RUN apt update
RUN $APT_INSTALL r-base libcurl4-openssl-dev qpdf libssl-dev zlib1g-dev
RUN Rscript -e "install.packages(c('remotes', 'knitr', 'markdown', 'rmarkdown', 'testthat', 'e1071', 'survival', 'arrow', 'roxygen2', 'xml2'), repos='https://cloud.r-project.org/')"
RUN $APT_INSTALL r-base
RUN Rscript -e "install.packages(c('remotes'), repos='https://cloud.r-project.org/')"

RUN Rscript -e "remotes::install_cran('testthat');" && Rscript -e "library(testthat);"
# rmarkdown bits
RUN Rscript -e "remotes::install_cran('fs');library(fs)"
RUN Rscript -e "remotes::install_cran('sass');library(sass)"

# Install generic packages we let float

RUN Rscript -e " \
options(repos = c(CRAN = 'https://cloud.r-project.org/')); \
pkgs <- c('knitr', 'markdown', 'rmarkdown', 'e1071', 'survival', 'arrow', 'xml2'); \
remotes::install_cran(pkgs, upgrade = 'never'); \
missing <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]; \
if (length(missing)) stop('Missing R packages after install: ', paste(missing, collapse = ', ')); \
"

# See more in SPARK-39959, roxygen2 < 7.2.1
RUN apt-get install -y libcurl4-openssl-dev libgit2-dev libssl-dev libxml2-dev \
libfontconfig1-dev libharfbuzz-dev libfribidi-dev libfreetype6-dev libpng-dev \
libtiff5-dev libjpeg-dev
RUN Rscript -e "install.packages(c('remotes'), repos='https://cloud.r-project.org/')"
RUN Rscript -e "remotes::install_version('pkgload', version = '1.3.2', repos = 'https://cloud.r-project.org'); \
remotes::install_version('pkgbuild', version = '1.4.0', repos = 'https://cloud.r-project.org'); \
remotes::install_version('desc', version = '1.4.2', repos = 'https://cloud.r-project.org'); \
remotes::install_version('rlang', version = '1.1.1', repos = 'https://cloud.r-project.org'); \
remotes::install_version('cli', version = '3.6.1', repos = 'https://cloud.r-project.org'); \
remotes::install_version('purrr', version = '1.0.1', repos = 'https://cloud.r-project.org')"
RUN Rscript -e "remotes::install_version('roxygen2', version='7.2.0', repos='https://cloud.r-project.org')"

# Sanity check the R install
RUN Rscript -e " \
library(testthat); \
library(knitr); \
library(markdown); \
library(rmarkdown); \
library(roxygen2); \
library(xml2);"

# See more in SPARK-39735
ENV R_LIBS_SITE "/usr/local/lib/R/site-library:${R_LIBS_SITE}:/usr/lib/R/library"

RUN pypy3 -m pip install numpy 'pandas<=2.0.3' scipy coverage matplotlib
RUN python3.9 -m pip install 'numpy==1.25.1' 'pyarrow==12.0.1' 'pandas<=2.0.3' scipy unittest-xml-reporting plotly>=4.8 'mlflow>=2.3.1' coverage 'matplotlib==3.7.2' openpyxl 'memory-profiler==0.60.0' 'scikit-learn==1.1.*'
RUN python3.8 -m pip install setuptools virtualenv
RUN python3.9 -m pip install setuptools virtualenv

RUN python3.8 -m pip install --only-binary=pandas numpy pandas 'scipy<1.9' coverage 'matplotlib==3.7.2' 'mypy==0.982'
RUN python3.9 -m pip install 'numpy==1.25.1' 'pyarrow==12.0.1' 'pandas<=2.0.3' 'scipy<=1.10' unittest-xml-reporting 'plotly>=4.8' 'mlflow>=2.3.1' coverage 'matplotlib==3.7.2' openpyxl 'memory-profiler==0.60.0' 'scikit-learn==1.1.*' 'blinker==1.4' 'mypy==0.982'

# Add Python deps for Spark Connect.
RUN python3.9 -m pip install 'grpcio>=1.48,<1.57' 'grpcio-status>=1.48,<1.57' 'protobuf==3.20.3' 'googleapis-common-protos==1.56.4'

# Add torch as a testing dependency for TorchDistributor
RUN python3.9 -m pip install 'torch==2.0.1' 'torchvision==0.15.2' torcheval

# pyarrow
RUN python3.9 -m pip install 'pyarrow<13.0.0'
RUN python3.8 -m pip install 'pyarrow<13.0.0'
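The Dockerfile's mirror fallback hinges on a quick probe of `archive.ubuntu.com` over `/dev/tcp`: send an HTTP HEAD, read the status line, and rewrite the apt sources to `mirror.fcix.net` only if the response is not 2xx/3xx. A rough Python sketch of just that decision logic (hostnames match the Dockerfile; the function names are illustrative):

```python
import re

def status_line_ok(status_line):
    """True if an HTTP status line reports a 2xx or 3xx response,
    roughly matching the regex used in the Dockerfile's health check."""
    return re.match(r"^HTTP/\S+\s+[23]\d\d", status_line) is not None

def pick_mirror(primary_ok, primary="archive.ubuntu.com",
                fallback="mirror.fcix.net"):
    """Return the mirror to use: the primary when its probe succeeded,
    otherwise the fallback (what the sed rewrite effectively does)."""
    return primary if primary_ok else fallback

print(status_line_ok("HTTP/1.1 200 OK"))                   # True
print(status_line_ok("HTTP/1.1 503 Service Unavailable"))  # False
print(pick_mirror(status_line_ok("HTTP/1.1 503 x")))       # mirror.fcix.net
```

Falling back only on probe failure (rather than always using the mirror) keeps the default archive as the source of truth while surviving the outages the commit messages describe.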
6 changes: 3 additions & 3 deletions dev/requirements.txt
@@ -3,10 +3,10 @@ py4j

# PySpark dependencies (optional)
numpy
pyarrow<13.0.0
pandas
pyarrow<13.0.0,>=4.0.0
pandas<3,>=1.0.5
scipy
plotly
plotly<6
mlflow>=2.3.1
scikit-learn
matplotlib
6 changes: 6 additions & 0 deletions python/mypy.ini
@@ -166,6 +166,12 @@ ignore_missing_imports = True
[mypy-grpc.*]
ignore_missing_imports = True

; pydantic is pulled in transitively (e.g. via mlflow). mypy has issues
; serializing pydantic v2's recursive JsonValue type, so skip following it.
[mypy-pydantic.*]
ignore_missing_imports = True
follow_imports = skip

; Ignore errors for proto generated code
[mypy-pyspark.sql.connect.proto.*, pyspark.sql.connect.proto]
ignore_errors = True