TIKA-4703: Add Docker CI pipelines for tika-server and tika-grpc #2715

Open
nddipiazza wants to merge 3 commits into main from TIKA-4703-docker-ci

@nddipiazza
Contributor

Summary

Moves Docker build infrastructure into the main tika repo so that Docker image releases are tied directly to Tika releases, eliminating the need for cross-repo coordination with tika-docker and tika-grpc-docker.

  • Snapshot workflow (main branch push): builds and pushes apache/tika, apache/tika-full, and apache/tika-grpc snapshot images to Docker Hub
  • Release workflow (version tag push): builds and pushes versioned + latest tags for all three images
  • tika-server Dockerfiles: copied from tika-docker repo (source of truth), plus new Dockerfile.snapshot variants that use the Maven assembly output instead of downloading from Apache mirrors
  • tika-grpc docker-build: Dockerfile, entrypoint script, and build context assembly script
  • TikaGrpcServer: now falls back to a bundled empty default-tika-config.json from classpath when no -c flag is provided, matching standard Java application conventions
  • Tested locally: all three images (minimal, full, grpc) build and start successfully
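
The tagging rule described in the summary (snapshot builds push only a snapshot tag, release builds push a versioned tag plus latest) can be sketched as a small shell helper. This is an illustration only, not the workflow's actual logic: the function name is hypothetical, and treating a `-SNAPSHOT` suffix as the marker for a snapshot build is an assumption based on the example tag 4.0.0-SNAPSHOT.

```shell
#!/bin/sh
# Image names come from the PR summary. The function name and the rule
# "a -SNAPSHOT suffix means a snapshot build" are assumptions.
IMAGES="apache/tika apache/tika-full apache/tika-grpc"

# Print one image:tag reference per line for the given Tika version.
docker_tags_for_version() {
    version="$1"
    for image in $IMAGES; do
        echo "${image}:${version}"
        case "$version" in
            *-SNAPSHOT) ;;                 # snapshot builds leave "latest" alone
            *) echo "${image}:latest" ;;   # release builds also move "latest"
        esac
    done
}
```

Calling `docker_tags_for_version 4.0.0-SNAPSHOT` prints three references, one per image, while a release version such as `4.0.0` prints six.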

Required Setup

DOCKERHUB_USERNAME and DOCKERHUB_TOKEN secrets must be configured in the repo settings for the workflows to push images.

Test plan

  • tika-server minimal: HTTP 200 on port 9998, user 35002:35002
  • tika-server full: HTTP 200 on port 9998, user 35002:35002, ImageMagick verified
  • tika-grpc: gRPC server starts on port 9090, all plugins loaded, no config file required
  • Test Docker push to personal Docker Hub
  • Verify snapshot workflow triggers on main merge
  • Verify release workflow triggers on version tag
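
The first two checks in the test plan could be scripted roughly as follows. This is a sketch under stated assumptions, not the test function in docker-tool.sh: the helper names are hypothetical, the container is assumed to be already running with port 9998 published, and the request path `/` is an assumed endpoint that answers with HTTP 200.

```shell
#!/bin/sh
# Expected uid:gid pair taken from the test plan; helper names are hypothetical.
EXPECTED_UID_GID="35002:35002"

# Succeeds only when the reported HTTP status is 200.
check_http_status() {
    [ "$1" = "200" ]
}

# Succeeds only when the container's "uid:gid" matches the expected pair.
check_uid_gid() {
    [ "$1" = "$EXPECTED_UID_GID" ]
}

# Smoke-test a running tika-server container (defined here, not invoked).
# $1 = container name or id, $2 = host port mapped to 9998,
# $3 = request path (assumed to return 200; adjust as needed).
smoke_test_tika_server() {
    container="$1"
    port="${2:-9998}"
    path="${3:-/}"
    status="$(curl -s -o /dev/null -w '%{http_code}' "http://localhost:${port}${path}")" || return 1
    ids="$(docker exec "$container" sh -c 'echo "$(id -u):$(id -g)"')" || return 1
    check_http_status "$status" && check_uid_gid "$ids"
}
```

Keeping the two comparisons in separate functions lets them be exercised without Docker or a network connection; only `smoke_test_tika_server` needs a live container.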

🤖 Generated with Claude Code

nddipiazza and others added 3 commits March 27, 2026 09:02
Move Docker build infrastructure into the main tika repo so that
Docker image releases are tied directly to Tika releases rather than
requiring cross-repo coordination with tika-docker/tika-grpc-docker.

Snapshot workflow (main branch push):
- Builds tika-server minimal and full images from Maven output
- Builds tika-grpc image from Maven output
- Pushes snapshot tags to Docker Hub (e.g. 4.0.0-SNAPSHOT)

Release workflow (version tag push):
- Builds tika-server minimal/full from Apache mirror JARs with GPG
  verification (multi-arch: amd64, arm64, arm/v7, s390x)
- Builds tika-grpc from Maven output (multi-arch: amd64, arm64)
- Pushes versioned + latest tags to Docker Hub

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- tika-server snapshot Dockerfiles: use assembly tgz (thin JAR + lib/)
  instead of the thin JAR alone, matching the 4.x packaging model
- tika-grpc: bundle default-tika-config.json so the server starts
  without requiring a config volume mount
- tika-grpc: pass -c, -p, and --plugin-roots as CLI args instead of
  system properties so TikaGrpcServer actually picks them up
- tika-grpc: default port is now 9090 (configurable via TIKA_GRPC_PORT)

Tested locally: all three images (minimal, full, grpc) build and start
successfully.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

TikaGrpcServer now falls back to a bundled default-tika-config.json
from the classpath when no -c flag is provided, matching normal Java
application conventions. The default config is empty (no pre-configured
fetchers/emitters) — users configure these at runtime.

This removes the need for a separate config file in the Docker image.
The entrypoint only passes -c when TIKA_CONFIG env var is explicitly set.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
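
The entrypoint behaviour described in these commits (CLI args instead of system properties, port 9090 by default, -c passed only when TIKA_CONFIG is set) might look roughly like the argument builder below. It is a minimal sketch, not the script in this PR: the function name and the TIKA_PLUGIN_ROOTS variable are hypothetical, and /tika/plugins is assumed from the COPY destination in the Dockerfile under review.

```shell
#!/bin/sh
# Flag names (-c, -p, --plugin-roots) and the TIKA_GRPC_PORT / TIKA_CONFIG
# variables come from the commit messages. TIKA_PLUGIN_ROOTS and the
# function name are hypothetical.
build_tika_grpc_args() {
    # Default to port 9090 unless TIKA_GRPC_PORT overrides it.
    port="${TIKA_GRPC_PORT:-9090}"
    # Assumed plugin location, mirroring "COPY plugins/ /tika/plugins/".
    plugin_roots="${TIKA_PLUGIN_ROOTS:-/tika/plugins}"

    printf '%s\n' "-p" "$port" "--plugin-roots" "$plugin_roots"

    # Pass -c only when TIKA_CONFIG is set and non-empty, so the server
    # can otherwise fall back to its bundled default config.
    if [ -n "${TIKA_CONFIG:-}" ]; then
        printf '%s\n' "-c" "$TIKA_CONFIG"
    fi
}
```

The function prints one argument per line; a real entrypoint would hand these to the JVM command that starts TikaGrpcServer. Emitting them as a list keeps the "only pass -c when asked" rule easy to check in isolation.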
@nddipiazza
Contributor Author

Adding clarification: the Docker images are published to the existing Docker Hub repositories:

These are the same Docker Hub repos currently used by tika-docker and tika-grpc-docker — the GitHub Actions workflows will publish to the same locations, just automated from the main tika repo instead of manually.

@nddipiazza
Contributor Author

@bartek

&& apt-get clean -y \
&& rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

EXPOSE 9090

Use the ARG suggested previously to run as a nonroot user

Suggested change
- EXPOSE 9090
+ USER $UID_GID
+ EXPOSE 9090

# License for the specific language governing permissions and limitations under
# the License.

FROM ubuntu:plucky

The tika-server Dockerfiles run as USER 35002:35002 (matching the upstream tika-docker convention), but this Dockerfile has no USER directive. The gRPC server runs as root. docker-tool.sh even asserts 35002:35002 in its test function.

This should just need ARG UID_GID="35002:35002" added, as in the tika-server Dockerfile, with that ARG referenced in a USER directive.

Suggested change
- FROM ubuntu:plucky
+ # "random" uid/gid hopefully not used anywhere else
+ # This needs to be set globally and then referenced in
+ # the subsequent stages -- see TIKA-3912
+ ARG UID_GID="35002:35002"
+ FROM ubuntu:plucky

COPY plugins/ /tika/plugins/
COPY config/ /tika/config/
COPY bin/ /tika/bin
ARG JRE='openjdk-17-jre-headless'

The tika-server images default to openjdk-21-jre-headless. Any reason to pin grpc to 17? If intentional, might be worth a comment explaining why, otherwise someone will "fix" it later and potentially break something.

3 participants