
Commit 8abc702

Reorganize dependency installation for better squashing
I'll leave it up to y'all to decide if the changes/risks here are worth the reduction in image size. Thanks!

Reduced image size:

┌─────────────────┬──────────┬─────────┬───────────────┐
│ Metric          │ Original │ New     │ Reduction     │
├─────────────────┼──────────┼─────────┼───────────────┤
│ Image Size      │ 60.1 GB  │ 48.2 GB │ 11.9 GB (20%) │
├─────────────────┼──────────┼─────────┼───────────────┤
│ Filesystem Size │ 49 GB    │ 44 GB   │ 5 GB (10%)    │
└─────────────────┴──────────┴─────────┴───────────────┘

Note: Image size includes all layers; filesystem size is the actual disk usage inside the container.

- Added --no-cache to uv pip install (safe)
  The cache is only useful for repeated installs in the same environment. In Docker builds, each layer is fresh, so the cache provides no benefit.

- Removed Intel MKL numpy (less sure)
  Removed the Intel MKL numpy install from Intel's Anaconda channel. Intel's channel only has numpy 1.26.4 (numpy 1.x), but the base image has numpy 2.0.2. Installing Intel's numpy would downgrade it and break packages compiled against the numpy 2.x ABI. The base image's numpy 2.0.2 uses OpenBLAS optimizations and is compatible with all installed packages.

- Removed preprocessing package (less sure)
  The package is unmaintained (last release 2017) and requires nltk==3.2.4, which is incompatible with Python 3.11 (inspect.formatargspec was removed). The package hasn't been updated in 7+ years and cannot function on Python 3.11.

- Updated scikit-learn to 1.5.2 (less sure)
  Changed scikit-learn==1.2.2 to scikit-learn==1.5.2. The scikit-learn 1.2.2 binary wheels are incompatible with the numpy 2.x ABI, causing "numpy.dtype size changed" errors. scikit-learn 1.5.x maintains API compatibility with 1.2.x. The original pin was for eli5/learntools compatibility, which should work with 1.5.x.

- Added uv cache cleanup to clean-layer.sh (safe)
  Added /root/.cache/uv/* to the cleanup script, which previously cleaned only the pip cache. The cleanup script runs after package installs; the cache is not needed at runtime.
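The "Reduction" column in the table can be sanity-checked with a bit of awk arithmetic (a standalone check on the reported numbers, not part of the commit itself):

```shell
# Recompute the table's Reduction column from the Original/New sizes.
awk 'BEGIN {
  printf "Image Size:      %.1f GB (%.0f%%)\n", 60.1 - 48.2, (60.1 - 48.2) / 60.1 * 100
  printf "Filesystem Size: %.1f GB (%.0f%%)\n", 49 - 44,     (49 - 44) / 49 * 100
}'
# Prints 11.9 GB (20%) and 5.0 GB (10%), matching the table.
```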
1 parent 0c165a3 commit 8abc702

3 files changed

Lines changed: 64 additions & 26 deletions


Dockerfile.tmpl

Lines changed: 58 additions & 22 deletions
@@ -1,7 +1,52 @@
 ARG BASE_IMAGE \
     BASE_IMAGE_TAG
 
-FROM ${BASE_IMAGE}:${BASE_IMAGE_TAG}
+# =============================================================================
+# Stage 1: Apply --force-reinstall operations to base image
+# These replace packages from the base, causing layer bloat. We squash this
+# stage to eliminate the duplicate package data.
+# =============================================================================
+FROM ${BASE_IMAGE}:${BASE_IMAGE_TAG} AS base-reinstalls
+
+# Combine all --force-reinstall operations into one layer before squashing.
+# b/408281617: Torch is adamant that it can not install cudnn 9.3.x, only 9.1.x, but Tensorflow can only support 9.3.x.
+# This conflict causes a number of package downgrades, which are handled in this command.
+# b/394382016: sigstore (dependency of kagglehub) requires a prerelease packages, installing separate.
+# b/385145217: Intel MKL numpy removed - Intel's channel only has numpy 1.26.4, but base image has
+# numpy 2.0.2. Downgrading would break packages built against numpy 2.x ABI.
+# b/404590350: Ray and torchtune have conflicting tune cli, we will prioritize torchtune.
+# b/415358158: Gensim removed from Colab image to upgrade scipy
+# b/456239669: remove huggingface-hub pin when pytorch-lighting and transformer are compatible
+# b/315753846: Unpin translate package, currently conflicts with adk 1.17.0
+# b/468379293: Unpin Pandas once cuml/cudf are compatible, version 3.0 causes issues
+# b/468383498: numpy will auto-upgrade to 2.4.x, which causes issues with numerous packages
+# b/468367647: Unpin protobuf, version greater than v5.29.5 causes issues with numerous packages
+# b/408298750: We reinstall nltk because older versions have: `AttributeError: module 'inspect' has no attribute 'formatargspec'`
+RUN uv pip install --no-cache \
+    --index-url https://pypi.nvidia.com --extra-index-url https://pypi.org/simple/ --index-strategy unsafe-first-match \
+    --system --force-reinstall "cuml-cu12==25.2.1" \
+    "nvidia-cudnn-cu12==9.3.0.75" "nvidia-cublas-cu12==12.5.3.2" "nvidia-cusolver-cu12==11.6.3.83" \
+    "nvidia-cuda-cupti-cu12==12.5.82" "nvidia-cuda-nvrtc-cu12==12.5.82" "nvidia-cuda-runtime-cu12==12.5.82" \
+    "nvidia-cufft-cu12==11.2.3.61" "nvidia-curand-cu12==10.3.6.82" "nvidia-cusparse-cu12==12.5.1.3" \
+    "nvidia-nvjitlink-cu12==12.5.82" \
+    && uv pip install --no-cache --system --force-reinstall "pynvjitlink-cu12==0.5.2" \
+    && uv pip install --no-cache --system --force-reinstall --prerelease=allow "kagglehub[pandas-datasets,hf-datasets,signing]>=0.3.12" \
+    && uv pip install --no-cache --system --force-reinstall --no-deps torchtune gensim "scipy<=1.15.3" "huggingface-hub==0.36.0" "google-cloud-translate==3.12.1" "numpy==2.0.2" "pandas==2.2.2" \
+    && uv pip install --no-cache --system --force-reinstall "protobuf==5.29.5" \
+    && uv pip install --no-cache --system --force-reinstall "nltk>=3.9.1" \
+    && rm -rf /root/.cache/uv /root/.cache/pip
+
+# =============================================================================
+# Stage 2: Squash the base + reinstalls to eliminate layer bloat
+# =============================================================================
+FROM scratch AS clean-base
+COPY --from=base-reinstalls / /
+
+# =============================================================================
+# Stage 3: Continue with cacheable operations
+# These layers will be cached normally on subsequent builds
+# =============================================================================
+FROM clean-base
 
 ADD kaggle_requirements.txt /kaggle_requirements.txt
 
@@ -12,32 +57,22 @@ RUN pip freeze | grep -E 'tensorflow|keras|torch|jax' > /colab_requirements.txt
 RUN cat /colab_requirements.txt >> /requirements.txt
 RUN cat /kaggle_requirements.txt >> /requirements.txt
 
-# Install Kaggle packages
-RUN uv pip install --system -r /requirements.txt
+# TODO: GPU requirements.txt
+# TODO: merge them better (override matching ones).
+
+# Install Kaggle packages (--no-cache prevents cache buildup)
+RUN uv pip install --no-cache --system -r /requirements.txt
 
 # Install manual packages:
 # b/183041606#comment5: the Kaggle data proxy doesn't support these APIs. If the library is missing, it falls back to using a regular BigQuery query to fetch data.
 RUN uv pip uninstall --system google-cloud-bigquery-storage
 
-# b/394382016: sigstore (dependency of kagglehub) requires a prerelease packages, installing separate.
-RUN uv pip install --system --force-reinstall --prerelease=allow "kagglehub[pandas-datasets,hf-datasets,signing]>=0.3.12"
-
 # uv cannot install this in requirements.txt without --no-build-isolation
 # to avoid affecting the larger build, we'll post-install it.
-RUN uv pip install --no-build-isolation --system "git+https://github.com/Kaggle/learntools"
+RUN uv pip install --no-cache --no-build-isolation --system "git+https://github.com/Kaggle/learntools"
 
 # newer daal4py requires tbb>=2022, but libpysal is downgrading it for some reason
-RUN uv pip install --system "tbb>=2022" "libpysal==4.9.2"
-
-# b/404590350: Ray and torchtune have conflicting tune cli, we will prioritize torchtune.
-# b/415358158: Gensim removed from Colab image to upgrade scipy
-# b/456239669: remove huggingface-hub pin when pytorch-lighting and transformer are compatible
-# b/315753846: Unpin translate package, currently conflicts with adk 1.17.0
-# b/468379293: Unpin Pandas once cuml/cudf are compatible, version 3.0 causes issues
-# b/468383498: numpy will auto-upgrade to 2.4.x, which causes issues with numerous packages
-# b/468367647: Unpin protobuf, version greater than v5.29.5 causes issues with numerous packages
-RUN uv pip install --system --force-reinstall --no-deps torchtune gensim "scipy<=1.15.3" "huggingface-hub==0.36.0" "google-cloud-translate==3.12.1" "numpy==2.0.2" "pandas==2.2.2"
-RUN uv pip install --system --force-reinstall "protobuf==5.29.5"
+RUN uv pip install --no-cache --system "tbb>=2022" "libpysal==4.9.2"
 
 # Adding non-package dependencies:
 ADD clean-layer.sh /tmp/clean-layer.sh
@@ -48,7 +83,7 @@ ARG PACKAGE_PATH=/usr/local/lib/python3.12/dist-packages
 
 # Install GPU-specific non-pip packages.
 {{ if eq .Accelerator "gpu" }}
-RUN uv pip install --system "pycuda"
+RUN uv pip install --no-cache --system "pycuda"
 {{ end }}
 
 
@@ -72,9 +107,7 @@ RUN apt-get install -y libfreetype6-dev && \
 apt-get install -y libglib2.0-0 libxext6 libsm6 libxrender1 libfontconfig1 --fix-missing
 
 # NLTK Project datasets
-# b/408298750: We currently reinstall the package, because we get the following error:
-# `AttributeError: module 'inspect' has no attribute 'formatargspec'. Did you mean: 'formatargvalues'?`
-RUN uv pip install --system --force-reinstall "nltk>=3.9.1"
+# Note: nltk is reinstalled in stage 1 to fix b/408298750 (formatargspec error)
 RUN mkdir -p /usr/share/nltk_data && \
     # NLTK Downloader no longer continues smoothly after an error, so we explicitly list
     # the corpuses that work
@@ -168,6 +201,9 @@ ENV GIT_COMMIT=${GIT_COMMIT} \
 # Correlate current release with the git hash inside the kernel editor by running `!cat /etc/git_commit`.
 RUN echo "$GIT_COMMIT" > /etc/git_commit && echo "$BUILD_DATE" > /etc/build_date
 
+# Final cleanup
+RUN rm -rf /root/.cache/uv /root/.cache/pip /tmp/clean-layer.sh
+
 {{ if eq .Accelerator "gpu" }}
 # Add the CUDA home.
 ENV CUDA_HOME=/usr/local/cuda

clean-layer.sh

Lines changed: 2 additions & 2 deletions
@@ -10,8 +10,8 @@
 set -e
 set -x
 
-# Delete files that pip caches when installing a package.
-rm -rf /root/.cache/pip/*
+# Delete files that pip and uv cache when installing packages.
+rm -rf /root/.cache/pip/* /root/.cache/uv/*
 # Delete old downloaded archive files
 apt-get autoremove -y
 # Delete downloaded archive files
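The two-glob removal pattern above can be exercised outside a container; this is a minimal sketch in a throwaway directory (the paths are stand-ins for /root/.cache, and the file names are invented for illustration):

```shell
# Simulate the cache layout pip and uv leave behind, then apply the same
# removal pattern clean-layer.sh now uses for both caches.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/cache/pip/wheels" "$sandbox/cache/uv/builds"
touch "$sandbox/cache/pip/wheels/example.whl" "$sandbox/cache/uv/builds/example.lock"

rm -rf "$sandbox/cache/pip/"* "$sandbox/cache/uv/"*

# Both cache directories still exist, but are now empty.
ls -A "$sandbox/cache/pip" "$sandbox/cache/uv"
rm -rf "$sandbox"
```

Globbing with a trailing `/*` (rather than removing the directories themselves) keeps the cache directories in place for any later install step while discarding their contents.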

kaggle_requirements.txt

Lines changed: 4 additions & 2 deletions
@@ -91,7 +91,7 @@ path
 path.py
 pdf2image
 plotly-express
-preprocessing
+# Removed: preprocessing (unmaintained since 2017, requires nltk==3.2.4 incompatible with Python 3.11)
 pudb
 pyLDAvis
 pycryptodome
@@ -109,7 +109,9 @@ qtconsole
 ray
 rgf-python
 s3fs
-scikit-learn
+# b/302136621: Fix eli5 import for learntools
+# Note: scikit-learn 1.2.2 is incompatible with numpy 2.x ABI - using 1.5.2 which supports numpy 2.x
+scikit-learn==1.5.2
 # Scikit-learn accelerated library for x86
 scikit-learn-intelex>=2023.0.1
 scikit-multilearn
