Open
47 commits
82bfcd5
update the declared module path for forked slurm-operator
syutogether Jul 23, 2025
fe97a0e
fix the declared module path for our forked slurm-operator
syutogether Jul 23, 2025
3f42f59
Merge branch 'syu/tcl-1682-fix-module-path' of ssh://github.com/toget…
syutogether Jul 23, 2025
51bd8f1
Merge pull request #4 from togethercomputer/syu/tcl-1682-fix-module-path
syutogether Jul 23, 2025
f4bb007
Merge remote-tracking branch 'upstream/release-1.0' into release-1.0
eb3095 Nov 20, 2025
910e26f
add node-cordon, login chart, Helm/operator tweaks
jhu-svg Nov 24, 2025
6231fd6
Add shmSize and existingDataClaims support to loginset-cr.yaml
jhu-svg Jan 23, 2026
3f23377
fix update
jhu-svg Jan 24, 2026
085c2fa
fix: use initContainer to set correct permissions on slurmdbd.conf
jhu-svg Jan 27, 2026
6b78f4b
use tcp probe - fix slurmctl pod
jhu-svg Jan 27, 2026
b245a49
fix nodeset bugs
jhu-svg Jan 28, 2026
e654fe0
Add tolerateError for job list in GetNodeDeadlines
jhu-svg Jan 28, 2026
d3d8824
Merge pull request #15 from togethercomputer/slurm-1.0-together-changes
jhu-svg Jan 30, 2026
13facbf
TCL-3968 Fix login conflicts
eb3095 Feb 12, 2026
0c0124c
Merge pull request #16 from togethercomputer/ebenner/TCL-3968
eb3095 Feb 12, 2026
8b1aaee
TCL-4123 Fix login chart spec
eb3095 Feb 22, 2026
e763bc9
Merge pull request #17 from togethercomputer/ebenner/TCL-4123
eb3095 Feb 22, 2026
8bee33c
Fix slurm login template helpers
jhu-svg Mar 2, 2026
016e363
Add SACKD_OPTIONS env var to login deployment
jhu-svg Mar 3, 2026
760836d
Fix slurmctld reconfigure deadlock and login auth failures
jhu-svg Mar 6, 2026
75440e4
Add Linear ticket references (TCL-4401, TCL-4402, TCL-4403)
jhu-svg Mar 6, 2026
b95314c
Merge pull request #18 from togethercomputer/jhu/fix-slurm-login-dnsc…
jhu-svg Mar 11, 2026
57c7bdc
Fix login pod permissions: split config (644) from auth keys (600)
jhu-svg Mar 16, 2026
0406014
Merge pull request #19 from togethercomputer/jhu/fix-slurm-login-dnsc…
jhu-svg Mar 17, 2026
76cd497
fix login initContainer image to avoid Docker Hub rate limits
jhu-svg Mar 20, 2026
b20f5b8
Merge pull request #20 from togethercomputer/jhu/fix-login-init-image…
jhu-svg Mar 20, 2026
4a95674
TCL-5107: feat: add initScriptLogin and initScriptNodes values
eb3095 Mar 31, 2026
9910102
Merge pull request #21 from togethercomputer/ebenner/TCL-5107
eb3095 Apr 3, 2026
7037592
Fix scratch vols on slinky 1 (#22)
sagrawal-byte Apr 7, 2026
c1b2082
feat: add system defaults to buildSlurmConf (TCL-5588)
jhu-svg Apr 20, 2026
ac70382
feat: add ConstrainRAMSpace=yes to default cgroup.conf
jhu-svg Apr 21, 2026
3e252a9
feat: add JobRequeue=0 to system defaults
jhu-svg Apr 22, 2026
e76f20c
Merge pull request #23 from togethercomputer/TCL-5588/system-defaults…
jhu-svg Apr 24, 2026
27d15ad
TCL-5576: feat: add topology spread for login pods
eb3095 Apr 24, 2026
6b79afa
Merge pull request #24 from togethercomputer/ebenner/TCL-5576
jhu-svg Apr 24, 2026
5fd2366
TCL-5951: feat: add DeleteNode to SlurmControlInterface
jhu-svg May 4, 2026
96f9c16
TCL-5951: feat: reconcile orphaned Slurm node registrations
jhu-svg May 4, 2026
c95f729
fix: scope orphan cleanup to current NodeSet
jhu-svg May 4, 2026
14dcfdb
fix: avoid cross-NodeSet orphan cleanup
jhu-svg May 5, 2026
9bf337c
fix: use hostname template prefix for orphan node matching
jhu-svg May 5, 2026
3eb4716
Merge pull request #26 from togethercomputer/jhu/tcl-5951-reconcile-o…
jhu-svg May 8, 2026
b2ee203
Merge pull request #25 from togethercomputer/jhu/tcl-5951-add-delete-…
jhu-svg May 8, 2026
b97c976
[TCL-6165] feat: add --namespace flag for namespace-scoped manager (#27)
sagrawal-byte May 11, 2026
ec9afa8
TCL-6170: feat: Docker Hub CI and registry defaults (#28)
eb3095 May 11, 2026
c6e194f
TCL-6170 Fix github actions (#30)
eb3095 May 12, 2026
2afa522
Ebenner/tcl 6170 fix 2 (#31)
eb3095 May 12, 2026
bb96508
Fresh
eb3095 May 12, 2026
116 changes: 116 additions & 0 deletions .github/workflows/container-images-1.0.yaml
@@ -0,0 +1,116 @@
+# Build and push slurm-operator container images to Docker Hub (1.0 line).
+#
+# Tagging:
+# - push to slurm-1.0-together-changes: togethercomputer/slurm-operator:<VERSION> (from ./VERSION, no suffix)
+# - pull_request (same-repo only): ...:<VERSION>-dev-<short_sha>
+#
+# GitHub only evaluates workflows that exist on the ref being pushed. This file must live on
+# slurm-1.0-together-changes (not only on main), or pushes to that branch will not run any workflow.
+#
+# Required repository secrets:
+# - DOCKERHUB_USERNAME
+# - DOCKERHUB_TOKEN
+# Optional:
+# - ROBOT_GITHUB_TOKEN
+
+name: Container images v1.0
+
+on:
+  pull_request:
+    branches:
+      - slurm-1.0-together-changes
+  push:
+    branches:
+      - slurm-1.0-together-changes
+  workflow_dispatch:
+
+env:
+  REGISTRY: togethercomputer
+  RELEASE_BRANCH: slurm-1.0-together-changes
+
+concurrency:
+  group: container-images-1.0-${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}
+  cancel-in-progress: true
+
+jobs:
+  build-and-push:
+    runs-on: ubuntu-latest
+    permissions:
+      contents: read
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+        with:
+          ref: ${{ github.event.pull_request.head.sha || github.sha }}
+          fetch-depth: 0
+
+      - name: Set image VERSION for docker-bake
+        id: image-version
+        run: |
+          set -euo pipefail
+          BASE="$(tr -d ' \n\r\t' < VERSION)"
+          SHORT="$(git rev-parse --short HEAD)"
+          if [ "${{ github.event_name }}" = "pull_request" ]; then
+            IMG="${BASE}-dev-${SHORT}"
+          else
+            IMG="${BASE}"
+          fi
+          echo "version=${IMG}" >> "$GITHUB_OUTPUT"
+          echo "Image VERSION (make/docker-bake): ${IMG}"
+
+      - name: Whether to push to Docker Hub
+        id: should-push
+        env:
+          RELEASE_REF: refs/heads/${{ env.RELEASE_BRANCH }}
+        run: |
+          if [ "${{ github.event_name }}" = "push" ] && [ "${{ github.ref }}" = "${RELEASE_REF}" ]; then
+            echo "push=true" >> "$GITHUB_OUTPUT"
+          elif [ "${{ github.event_name }}" = "pull_request" ] && \
+               [ "${{ github.event.pull_request.head.repo.full_name }}" = "${{ github.repository }}" ]; then
+            echo "push=true" >> "$GITHUB_OUTPUT"
+          elif [ "${{ github.event_name }}" = "workflow_dispatch" ] && [ "${{ github.ref }}" = "${RELEASE_REF}" ]; then
+            echo "push=true" >> "$GITHUB_OUTPUT"
+          else
+            echo "push=false" >> "$GITHUB_OUTPUT"
+          fi
+
+      - name: Set up Go
+        uses: actions/setup-go@v5
+        with:
+          go-version-file: go.mod
+
+      - name: Configure git for private Go modules
+        run: |
+          git config --global url."https://together-robot:${ROBOT_GITHUB_TOKEN}@github.com/togethercomputer/".insteadOf "https://github.com/togethercomputer/"
+        env:
+          ROBOT_GITHUB_TOKEN: ${{ secrets.ROBOT_GITHUB_TOKEN }}
+
+      - name: Set up Helm
+        uses: azure/setup-helm@v4
+        with:
+          version: v3.16.3
+
+      - name: Set up QEMU
+        uses: docker/setup-qemu-action@v3
+
+      - name: Set up Docker Buildx
+        uses: docker/setup-buildx-action@v3
+
+      - name: Log in to Docker Hub
+        if: steps.should-push.outputs.push == 'true'
+        uses: docker/login-action@v3
+        with:
+          username: ${{ secrets.DOCKERHUB_USERNAME }}
+          password: ${{ secrets.DOCKERHUB_TOKEN }}
+
+      - name: Build (images + charts, no registry push)
+        if: steps.should-push.outputs.push != 'true'
+        run: make build VERSION="${{ steps.image-version.outputs.version }}"
+
+      - name: Package charts (push path)
+        if: steps.should-push.outputs.push == 'true'
+        run: make build-chart
+
+      - name: Build and push images
+        if: steps.should-push.outputs.push == 'true'
+        run: make push-images VERSION="${{ steps.image-version.outputs.version }}"
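Reviewer note: the two shell steps in this workflow (image tag and push decision) can be restated as pure functions, which makes the rules easier to check outside CI. The following is a minimal Go sketch and not part of the PR. The `Event` struct, its field names, and the function names are hypothetical; the release ref is hard-coded from the workflow's `RELEASE_BRANCH` value.

```go
package main

import (
	"fmt"
	"strings"
)

// Event holds the few GitHub context values the workflow reads.
// The type and field names are hypothetical, chosen for this sketch.
type Event struct {
	Name       string // github.event_name
	Ref        string // github.ref
	HeadRepo   string // github.event.pull_request.head.repo.full_name
	Repository string // github.repository
}

// releaseRef is refs/heads/<RELEASE_BRANCH> as set in the workflow env.
const releaseRef = "refs/heads/slurm-1.0-together-changes"

// imageVersion removes spaces, tabs, CR and LF from the VERSION file
// contents (the same set as `tr -d ' \n\r\t'`) and, for pull requests,
// appends "-dev-<short sha>".
func imageVersion(versionFile, shortSHA, eventName string) string {
	base := strings.NewReplacer(" ", "", "\n", "", "\r", "", "\t", "").Replace(versionFile)
	if eventName == "pull_request" {
		return base + "-dev-" + shortSHA
	}
	return base
}

// shouldPush reproduces the if/elif chain of the "should-push" step.
func shouldPush(e Event) bool {
	switch e.Name {
	case "push", "workflow_dispatch":
		return e.Ref == releaseRef
	case "pull_request":
		// Only same-repository pull requests push; forks build locally.
		return e.HeadRepo == e.Repository
	default:
		return false
	}
}

func main() {
	pr := Event{
		Name:       "pull_request",
		HeadRepo:   "togethercomputer/slurm-operator",
		Repository: "togethercomputer/slurm-operator",
	}
	fmt.Println(imageVersion("1.0.10\n", "bb96508", pr.Name), shouldPush(pr))
}
```

A pull request from a fork takes the `push=false` path, so it runs `make build` and never reaches the Docker Hub login step.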
4 changes: 4 additions & 0 deletions .gitignore
@@ -137,3 +137,7 @@ values-*.yaml
 .helm_ls_cache
 docs/build/html
 *debug_bin*
+test-values.yaml
+
+# codemogger db
+.codemogger/
5 changes: 3 additions & 2 deletions Makefile
@@ -62,7 +62,7 @@ version-match: version ## Check if versions are consistent.
 .PHONY: all
 all: build ## Build slurm-operator.

-REGISTRY ?= slinky.slurm.net
+REGISTRY ?= togethercomputer
 BUILDER ?= project-v3-builder

 .PHONY: build
@@ -81,7 +81,8 @@ build-chart: ## Build charts.
 push: push-images push-charts ## Push OCI packages.

 .PHONY: push-images
-push-images: build-images ## Push container images.
+push-images: ## Build and push container images (single buildx bake --push).
+	- $(CONTAINER_TOOL) buildx create --name $(BUILDER)
 	REGISTRY=$(REGISTRY) VERSION=$(VERSION) $(CONTAINER_TOOL) buildx bake --builder=$(BUILDER) --push

 .PHONY: push-charts
16 changes: 8 additions & 8 deletions PROJECT
@@ -9,7 +9,7 @@ plugins:
   manifests.sdk.operatorframework.io/v2: {}
   scorecard.sdk.operatorframework.io/v2: {}
 projectName: slurm-operator
-repo: github.com/SlinkyProject/slurm-operator
+repo: github.com/togethercomputer/slurm-operator
 resources:
 - api:
     crdVersion: v1beta1
@@ -18,7 +18,7 @@ resources:
   domain: slurm.net
   group: slinky
   kind: Controller
-  path: github.com/SlinkyProject/slurm-operator/api/v1beta1
+  path: github.com/togethercomputer/slurm-operator/api/v1beta1
   version: v1beta1
   webhooks:
     validation: true
@@ -30,7 +30,7 @@ resources:
   domain: slurm.net
   group: slinky
   kind: RestApi
-  path: github.com/SlinkyProject/slurm-operator/api/v1beta1
+  path: github.com/togethercomputer/slurm-operator/api/v1beta1
   version: v1beta1
   webhooks:
     validation: true
@@ -42,7 +42,7 @@ resources:
   domain: slurm.net
   group: slinky
   kind: Accounting
-  path: github.com/SlinkyProject/slurm-operator/api/v1beta1
+  path: github.com/togethercomputer/slurm-operator/api/v1beta1
   version: v1beta1
   webhooks:
     validation: true
@@ -54,7 +54,7 @@ resources:
   domain: slurm.net
   group: slinky
   kind: NodeSet
-  path: github.com/SlinkyProject/slurm-operator/api/v1beta1
+  path: github.com/togethercomputer/slurm-operator/api/v1beta1
   version: v1beta1
   webhooks:
     validation: true
@@ -66,7 +66,7 @@ resources:
   domain: slurm.net
   group: slinky
   kind: LoginSet
-  path: github.com/SlinkyProject/slurm-operator/api/v1beta1
+  path: github.com/togethercomputer/slurm-operator/api/v1beta1
   version: v1beta1
   webhooks:
     validation: true
@@ -78,7 +78,7 @@ resources:
   domain: slurm.net
   group: slinky
   kind: Token
-  path: github.com/SlinkyProject/slurm-operator/api/v1beta1
+  path: github.com/togethercomputer/slurm-operator/api/v1beta1
   version: v1beta1
   webhooks:
     validation: true
@@ -90,6 +90,6 @@ resources:
   domain: slurm.net
   group: slinky
   kind: Controller
-  path: github.com/SlinkyProject/slurm-operator/api/v1beta1
+  path: github.com/togethercomputer/slurm-operator/api/v1beta1
   version: v1beta1
 version: "3"
6 changes: 3 additions & 3 deletions README.md
@@ -3,9 +3,9 @@
 <div align="center">

 [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg?style=for-the-badge)](./LICENSES/Apache-2.0.txt)
-[![Tag](https://img.shields.io/github/v/tag/SlinkyProject/slurm-operator?style=for-the-badge)](https://github.com/SlinkyProject/slurm-operator/tags/)
-[![Go-Version](https://img.shields.io/github/go-mod/go-version/SlinkyProject/slurm-operator?style=for-the-badge)](./go.mod)
-[![Last-Commit](https://img.shields.io/github/last-commit/SlinkyProject/slurm-operator?style=for-the-badge)](https://github.com/SlinkyProject/slurm-operator/commits/)
+[![Tag](https://img.shields.io/github/v/tag/togethercomputer/slurm-operator?style=for-the-badge)](https://github.com/togethercomputer/slurm-operator/tags/)
+[![Go-Version](https://img.shields.io/github/go-mod/go-version/togethercomputer/slurm-operator?style=for-the-badge)](./go.mod)
+[![Last-Commit](https://img.shields.io/github/last-commit/togethercomputer/slurm-operator?style=for-the-badge)](https://github.com/togethercomputer/slurm-operator/commits/)

 </div>
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
-1.0.0
+1.0.10
23 changes: 20 additions & 3 deletions cmd/manager/main.go
@@ -18,6 +18,7 @@ import (
 	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
 	clientgoscheme "k8s.io/client-go/kubernetes/scheme"
 	ctrl "sigs.k8s.io/controller-runtime"
+	"sigs.k8s.io/controller-runtime/pkg/cache"
 	"sigs.k8s.io/controller-runtime/pkg/event"
 	"sigs.k8s.io/controller-runtime/pkg/healthz"
 	"sigs.k8s.io/controller-runtime/pkg/log/zap"
@@ -55,6 +56,7 @@ type Flags struct {
 	metricsAddr string
 	secureMetrics bool
 	enableHTTP2 bool
+	namespace string
 }

 func parseFlags(flags *Flags) {
@@ -81,6 +83,9 @@ func parseFlags(flags *Flags) {
 		"If set the metrics endpoint is served securely")
 	flag.BoolVar(&flags.enableHTTP2, "enable-http2", false,
 		"If set, HTTP/2 will be enabled for the metrics and webhook servers")
+	flag.StringVar(&flags.namespace, "namespace", "",
+		"If set, the operator only watches Slinky resources in this namespace. "+
+			"Empty (the default) watches all namespaces.")
 	flag.Parse()
 }

@@ -107,17 +112,29 @@ func main() {
 		tlsOpts = append(tlsOpts, disableHTTP2)
 	}

-	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
+	leaderElectionID := "0033bda7.slinky.slurm.net"
+
+	mgrOpts := ctrl.Options{
 		Scheme: scheme,
 		Metrics: server.Options{
 			TLSOpts: tlsOpts,
 			BindAddress: flags.metricsAddr,
 		},
 		HealthProbeBindAddress: flags.probeAddr,
 		LeaderElection: flags.enableLeaderElection,
-		LeaderElectionID: "0033bda7.slinky.slurm.net",
+		LeaderElectionID: leaderElectionID,
 		LeaderElectionReleaseOnCancel: true,
-	})
+	}
+
+	// Restrict informers to a single namespace and give the leader-election
+	// lock a unique name so multiple operator instances can coexist on the
+	// same cluster without racing over the same CRs.
+	if flags.namespace != "" {
+		mgrOpts.Cache.DefaultNamespaces = map[string]cache.Config{flags.namespace: {}}
+		mgrOpts.LeaderElectionID = leaderElectionID + "." + flags.namespace
+	}
+
+	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), mgrOpts)
 	if err != nil {
 		setupLog.Error(err, "unable to start manager")
 		os.Exit(1)
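Reviewer note: the `--namespace` change has two parts, parsing the flag and deriving a per-namespace leader-election lock name. The sketch below isolates both using only the standard library so it can run without controller-runtime; it is an illustration, not the PR's code. `parseNamespace` and `leaderElectionID` are hypothetical helper names, and the namespaces `team-a` and `team-b` are made up.

```go
package main

import (
	"flag"
	"fmt"
)

// baseLeaderElectionID is the lock name used in the diff above.
const baseLeaderElectionID = "0033bda7.slinky.slurm.net"

// leaderElectionID keeps the base name for a cluster-wide manager and
// appends ".<namespace>" for a namespace-scoped one, as the diff does.
func leaderElectionID(namespace string) string {
	if namespace == "" {
		return baseLeaderElectionID
	}
	return baseLeaderElectionID + "." + namespace
}

// parseNamespace parses only the --namespace flag from args. It uses a
// private FlagSet so the global flag state is left untouched.
func parseNamespace(args []string) (string, error) {
	fs := flag.NewFlagSet("manager", flag.ContinueOnError)
	ns := fs.String("namespace", "", "namespace to watch; empty watches all namespaces")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return *ns, nil
}

func main() {
	for _, args := range [][]string{{}, {"--namespace=team-a"}, {"--namespace", "team-b"}} {
		ns, err := parseNamespace(args)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("namespace=%q lock=%s\n", ns, leaderElectionID(ns))
	}
}
```

Two operators started with different `--namespace` values end up with different lock names, so neither blocks the other from becoming leader, while an operator started without the flag keeps the original lock name.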
8 changes: 4 additions & 4 deletions docker-bake.hcl
@@ -4,7 +4,7 @@
 ################################################################################

 variable "REGISTRY" {
-  default = "ghcr.io/slinkyproject"
+  default = "togethercomputer"
 }

 variable "VERSION" {
@@ -22,15 +22,15 @@ target "_common" {
   labels = {
     # Ref: https://github.com/opencontainers/image-spec/blob/v1.0/annotations.md
     "org.opencontainers.image.authors" = "slinky@schedmd.com"
-    "org.opencontainers.image.documentation" = "https://github.com/SlinkyProject/slurm-operator"
+    "org.opencontainers.image.documentation" = "https://github.com/togethercomputer/slurm-operator"
     "org.opencontainers.image.license" = "Apache-2.0"
     "org.opencontainers.image.vendor" = "SchedMD LLC."
     "org.opencontainers.image.version" = "${VERSION}"
-    "org.opencontainers.image.source" = "https://github.com/SlinkyProject/slurm-operator"
+    "org.opencontainers.image.source" = "https://github.com/togethercomputer/slurm-operator"
     # Ref: https://docs.redhat.com/en/documentation/red_hat_software_certification/2025/html/red_hat_openshift_software_certification_policy_guide/assembly-requirements-for-container-images_openshift-sw-cert-policy-introduction#con-image-metadata-requirements_openshift-sw-cert-policy-container-images
     "vendor" = "SchedMD LLC."
     "version" = "${VERSION}"
-    "release" = "https://github.com/SlinkyProject/slurm-operator"
+    "release" = "https://github.com/togethercomputer/slurm-operator"
   }
 }
10 changes: 5 additions & 5 deletions docs/index.rst
@@ -345,12 +345,12 @@ limitations under the License.

 .. |License| image:: https://img.shields.io/badge/License-Apache_2.0-blue.svg?style=for-the-badge
    :target: ./LICENSES/Apache-2.0.txt
-.. |Tag| image:: https://img.shields.io/github/v/tag/SlinkyProject/slurm-operator?style=for-the-badge
-   :target: https://github.com/SlinkyProject/slurm-operator/tags/
-.. |Go-Version| image:: https://img.shields.io/github/go-mod/go-version/SlinkyProject/slurm-operator?style=for-the-badge
+.. |Tag| image:: https://img.shields.io/github/v/tag/togethercomputer/slurm-operator?style=for-the-badge
+   :target: https://github.com/togethercomputer/slurm-operator/tags/
+.. |Go-Version| image:: https://img.shields.io/github/go-mod/go-version/togethercomputer/slurm-operator?style=for-the-badge
    :target: ./go.mod
-.. |Last-Commit| image:: https://img.shields.io/github/last-commit/SlinkyProject/slurm-operator?style=for-the-badge
-   :target: https://github.com/SlinkyProject/slurm-operator/commits/
+.. |Last-Commit| image:: https://img.shields.io/github/last-commit/togethercomputer/slurm-operator?style=for-the-badge
+   :target: https://github.com/togethercomputer/slurm-operator/commits/

 .. toctree::
    :maxdepth: 2
2 changes: 1 addition & 1 deletion docs/versioning.md
@@ -58,4 +58,4 @@ any kind (e.g., component flag changes).
 [semver]: https://semver.org/
 [slurm-bridge]: https://github.com/SlinkyProject/slurm-bridge
 [slurm-client]: https://github.com/SlinkyProject/slurm-client
-[slurm-operator]: https://github.com/SlinkyProject/slurm-operator
+[slurm-operator]: https://github.com/togethercomputer/slurm-operator
2 changes: 1 addition & 1 deletion go.mod
@@ -1,4 +1,4 @@
-module github.com/SlinkyProject/slurm-operator
+module github.com/togethercomputer/slurm-operator

 go 1.25.0

2 changes: 1 addition & 1 deletion helm/slurm-operator-crds/Chart.yaml
@@ -13,7 +13,7 @@ home: https://slinky.schedmd.com/
 icon: https://github.com/SlinkyProject/docs/blob/main/docs/_static/images/slinky.svg

 sources:
-- https://github.com/SlinkyProject/slurm-operator
+- https://github.com/togethercomputer/slurm-operator

 maintainers:
 - name: SchedMD LLC.
2 changes: 1 addition & 1 deletion helm/slurm-operator-crds/README.md
@@ -14,5 +14,5 @@ Slurm Operator CRDs

 ## Source Code

-* <https://github.com/SlinkyProject/slurm-operator>
+* <https://github.com/togethercomputer/slurm-operator>

2 changes: 1 addition & 1 deletion helm/slurm-operator/Chart.yaml
@@ -15,7 +15,7 @@ home: https://slinky.schedmd.com/
 icon: https://github.com/SlinkyProject/docs/blob/main/docs/_static/images/slinky.svg

 sources:
-- https://github.com/SlinkyProject/slurm-operator
+- https://github.com/togethercomputer/slurm-operator

 maintainers:
 - name: SchedMD LLC.
2 changes: 1 addition & 1 deletion helm/slurm-operator/README.md
@@ -14,7 +14,7 @@ Slurm Operator

 ## Source Code

-* <https://github.com/SlinkyProject/slurm-operator>
+* <https://github.com/togethercomputer/slurm-operator>

 ## Requirements

1 change: 1 addition & 0 deletions helm/slurm-operator/templates/operator/rbac.yaml
@@ -45,6 +45,7 @@ rules:
 - list
 - watch
 - patch
+- update
 - apiGroups:
 - ""
 resources:
2 changes: 1 addition & 1 deletion helm/slurm/Chart.yaml
@@ -17,7 +17,7 @@ icon: https://github.com/SlinkyProject/docs/blob/main/docs/_static/images/slurm-
 sources:
 - https://github.com/SchedMD/slurm
 - https://github.com/SlinkyProject/containers
-- https://github.com/SlinkyProject/slurm-operator
+- https://github.com/togethercomputer/slurm-operator

 maintainers:
 - name: SchedMD LLC.