
feat(operator): Ingress + Gateway API controller (synapse-as-ingress, cert-manager HTTP-01) #20

Merged: pigri merged 6 commits into main from feat/operator-ingress-gateway on May 18, 2026


@pigri pigri commented May 17, 2026

Summary

Adds --ingress-mode to the synapse-operator so it runs as a Kubernetes Ingress + Gateway API controller (sidecar in the synapse-proxy pod). This lets synapse fully replace a separate ingress controller (e.g. Traefik) — including serving cert-manager's ephemeral HTTP-01 / gatewayHTTPRoute ACME solver objects, so automated Let's Encrypt issuance works end-to-end through synapse.

What it does

  • IngressReconciler (controllers/ingress_controller.go): watches Ingresses whose spec.ingressClassName matches --ingress-class (default synapse); backends resolved via in-cluster DNS; renders synapse legacy upstreams.yaml. ACME challenge paths are emitted as an internal_paths entry keyed exactly as synapse's built-in default so they override the empty embedded ACME server (the crux of cert-manager HTTP-01 succeeding through synapse). The first commit writes in place (truncate) so synapse's inotify filewatch, which ignores rename/move events, hot-reloads (~2s); a later commit in this PR replaces that with an atomic tmp + rename write followed by a SIGHUP to synapse. --render-once one-shot mode primes the file from an initContainer before synapse starts.
  • GatewayReconciler (controllers/gateway.go): GatewayClass (controllerName gen0sec.com/synapse) + Gateways + attached HTTPRoutes (incl. cert-manager's ephemeral gatewayHTTPRoute solver) merged into the same upstreams.yaml; sets the Accepted/ResolvedRefs/Programmed status conditions cert-manager waits on. Tolerant of the Gateway API CRDs being absent (gated by --gateway-api).
  • main.go: scheme + flags wiring; config-hash controller path unchanged when not in --ingress-mode.
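
To illustrate the backend resolution mentioned above, here is a minimal sketch of building an in-cluster DNS address of the form svc.ns.svc.cluster.local:port. The helper name `backendAddress` is hypothetical and is not taken from this PR; the real resolver also handles named ports, which this sketch omits.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// backendAddress is a hypothetical helper: it joins a Service name, its
// namespace and a numeric port into the in-cluster DNS form
// <svc>.<ns>.svc.cluster.local:<port>.
func backendAddress(service, namespace string, port int) string {
	host := fmt.Sprintf("%s.%s.svc.cluster.local", service, namespace)
	return net.JoinHostPort(host, strconv.Itoa(port))
}

func main() {
	fmt.Println(backendAddress("web", "default", 8080))
}
```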

Tests

controllers/*_test.go (no envtest, fake client): ACME→internal_paths override + host upstream rendering, deterministic output, in-place writeIfChanged, ingress-class filtering, backend DNS/port resolution, Gateway HTTPRoute parsing + status conditions, foreign-class/foreign-GatewayClass isolation.

go build ./... && go vet ./... && go test ./... all green. New dependency: sigs.k8s.io/gateway-api v1.2.1.

Verified in a live k3s cluster (out of band)

Traefik fully removed; fresh-ACME-account real HTTP-01 challenge (no cached authz) → cert-manager solver Ingress → operator render → synapse hot-reload → challenge valid → Let's Encrypt cert issued; app served HTTPS 200. Gateway API routing proven; the cert-manager gatewayHTTPRoute solver path proven via a faithful simulation (cert-manager v1.20.2 does not enable its own gatewayHTTPRoute solver — external, unrelated to this controller).

pigri added 6 commits May 17, 2026 19:00
… cert-manager HTTP-01)

Adds an `--ingress-mode` to the operator binary so it runs as a
Kubernetes Ingress *and* Gateway API controller (sidecar in the
synapse-proxy pod), letting synapse fully replace a separate ingress
controller (e.g. Traefik) — including serving cert-manager's
ephemeral HTTP-01 / gatewayHTTPRoute ACME solver objects.

- `controllers/ingress_controller.go` — `IngressReconciler`: watches
  Ingresses whose `spec.ingressClassName` matches `--ingress-class`
  (default `synapse`), resolves backends by in-cluster DNS
  (`svc.ns.svc.cluster.local:port`), and renders the synapse legacy
  `upstreams.yaml`. ACME challenge paths
  (`/.well-known/acme-challenge/…`) are emitted as an `internal_paths`
  entry keyed exactly as synapse's built-in default so they OVERRIDE
  the (empty) embedded ACME server — this is what makes cert-manager
  HTTP-01 succeed through synapse. Writes IN PLACE (truncate) so
  synapse's inotify filewatch (which ignores rename/move events)
  hot-reloads (~2s). `--render-once` one-shot mode for an initContainer
  that primes the file before synapse starts (synapse drops file
  events in its first 2s).
- `controllers/gateway.go` — `GatewayReconciler` folded into the same
  render: a GatewayClass with controllerName `gen0sec.com/synapse`,
  Gateways of that class, and attached HTTPRoutes (incl. cert-manager's
  ephemeral `gatewayHTTPRoute` solver) merge into the same
  upstreams.yaml; sets the GatewayClass/Gateway/HTTPRoute status
  conditions cert-manager waits on. Tolerant of the Gateway API CRDs
  being absent.
- `main.go` — scheme + flags wiring; `--gateway-api` gates the Gateway
  API watches (require the CRDs); config-hash controller path
  unchanged when not in ingress-mode.
- Unit tests (no envtest): ACME→internal_paths override and host
  upstream rendering, deterministic output, in-place writeIfChanged,
  class filtering, backend DNS/port resolution, Gateway HTTPRoute
  parsing + status conditions, foreign-class isolation.

Dependency: `sigs.k8s.io/gateway-api v1.2.1`.
`go build/vet/test ./...` green.
…ttings

The controller previously emitted every backend as a single plain-HTTP
server with no tunables. This adds per-object annotation support and
Gateway-native translation, mapping onto the synapse legacy v1
upstreams fields (synapse-utils structs.rs) that already exist.

Annotations (`synapse.gen0sec.com/<k>`, with an
`nginx.ingress.kubernetes.io/` compat subset for backend-protocol /
ssl-redirect), applied per Ingress/HTTPRoute object:

  backend-protocol HTTP|HTTPS -> ssl_enabled (TLS to upstream)
  http2                       -> http2_enabled
  force-https | ssl-redirect  -> https_proxy_enabled (force_https)
  connect/read/write/idle-timeout (sec) -> *_timeout
  healthcheck                 -> healthcheck
  disable-access-log          -> disable_access_log
  request-headers/response-headers (csv) -> request_headers/response_headers
  sticky-sessions             -> top-level sticky_sessions

Gateway API extras:
  - HTTPRoute backendRef weights -> weighted server pool
    ({ address, weight } form; bare string when unweighted)
  - Request/ResponseHeaderModifier filters -> header injection
  - URLRewrite / header-remove have no v1 per-path equivalent and are
    left unmodified (not silently faked)

New `controllers/routes.go` holds the route model + annotation parser
+ the (now richer, still deterministic) renderUpstreams. The
cert-manager ACME-solver path stays a plain `internal_paths` override
(no knobs) — unchanged behavior, so HTTP-01 is unaffected. ssl_enabled
still defaults false (prior behavior) unless backend-protocol says
otherwise.
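
A rough sketch of how the annotation mapping above could be parsed, assuming a small subset of the keys. The type `routeOpts`, the helpers `lookup` and `parseOpts`, and the rule that the synapse prefix wins over the nginx compat prefix are all assumptions made for illustration; the actual parser lives in controllers/routes.go and may differ.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// routeOpts is a hypothetical subset of the per-route settings listed above.
type routeOpts struct {
	SSLEnabled     bool // from backend-protocol
	HTTP2Enabled   bool // from http2
	ForceHTTPS     bool // from force-https or ssl-redirect
	ConnectTimeout int  // seconds, from connect-timeout; 0 means unset
}

const (
	synapsePrefix = "synapse.gen0sec.com/"
	nginxPrefix   = "nginx.ingress.kubernetes.io/"
)

// lookup reads key under the synapse prefix and, for the compat subset only,
// falls back to the nginx prefix. This precedence is an assumption.
func lookup(ann map[string]string, key string, nginxCompat bool) (string, bool) {
	if v, ok := ann[synapsePrefix+key]; ok {
		return v, true
	}
	if nginxCompat {
		v, ok := ann[nginxPrefix+key]
		return v, ok
	}
	return "", false
}

// parseOpts maps annotations onto routeOpts. A nil map yields the defaults,
// so ssl_enabled stays false unless backend-protocol says HTTPS.
func parseOpts(ann map[string]string) routeOpts {
	var o routeOpts
	if v, ok := lookup(ann, "backend-protocol", true); ok {
		o.SSLEnabled = strings.EqualFold(v, "HTTPS")
	}
	if v, ok := lookup(ann, "http2", false); ok {
		o.HTTP2Enabled, _ = strconv.ParseBool(v)
	}
	if v, ok := lookup(ann, "force-https", false); ok {
		o.ForceHTTPS, _ = strconv.ParseBool(v)
	} else if v, ok := lookup(ann, "ssl-redirect", true); ok {
		o.ForceHTTPS, _ = strconv.ParseBool(v)
	}
	if v, ok := lookup(ann, "connect-timeout", false); ok {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			o.ConnectTimeout = n
		}
	}
	return o
}

func main() {
	o := parseOpts(map[string]string{
		nginxPrefix + "backend-protocol":  "HTTPS",
		synapsePrefix + "connect-timeout": "5",
	})
	fmt.Printf("%+v\n", o)
}
```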

Tests extended: annotation parsing (+ nginx compat + nil), weighted
servers, header injection, sticky, deterministic output, end-to-end
Ingress-with-annotation and Gateway-weights-and-filters render. go
build/vet/test ./... green; no new dependencies.
…UP reload

Two robustness fixes found while testing annotation support in k3s.

#1 Deterministic route precedence (was: last-writer-wins, informer-
order-dependent — an unannotated Gateway HTTPRoute could clobber an
annotated Ingress route for the same host+path):
  - Ingress list and HTTPRoute list are sorted by namespace/name;
    Ingresses are processed before HTTPRoutes.
  - `addRoute` is first-writer-wins: an already-set (host,path) is
    NOT overwritten; it returns false and the caller logs a
    deterministic conflict. ACME backend also keeps the first.
  ⇒ reproducible output regardless of informer ordering; Ingress
    beats Gateway on conflict.
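
The precedence rule above can be sketched as follows. This is an illustration under assumptions: the `source` and `routeKey` types and the `build` function are hypothetical, and only `addRoute` and its first-writer-wins contract come from the commit message.

```go
package main

import (
	"fmt"
	"sort"
)

type routeKey struct{ Host, Path string }

// source is a hypothetical flattened view of one Ingress or HTTPRoute rule.
type source struct {
	Kind, Namespace, Name string // Kind is "Ingress" or "HTTPRoute"
	Key                   routeKey
	Backend               string
}

// addRoute is first-writer-wins: it never overwrites an existing (host, path)
// and reports false so the caller can log the conflict.
func addRoute(routes map[routeKey]string, k routeKey, backend string) bool {
	if _, exists := routes[k]; exists {
		return false
	}
	routes[k] = backend
	return true
}

// build orders inputs as Ingresses first, then HTTPRoutes, each sorted by
// namespace/name, so the result does not depend on informer ordering.
func build(srcs []source) (map[routeKey]string, int) {
	sorted := append([]source(nil), srcs...)
	sort.SliceStable(sorted, func(i, j int) bool {
		a, b := sorted[i], sorted[j]
		if a.Kind != b.Kind {
			return a.Kind == "Ingress"
		}
		if a.Namespace != b.Namespace {
			return a.Namespace < b.Namespace
		}
		return a.Name < b.Name
	})
	routes := map[routeKey]string{}
	conflicts := 0
	for _, s := range sorted {
		if !addRoute(routes, s.Key, s.Backend) {
			conflicts++
		}
	}
	return routes, conflicts
}

func main() {
	key := routeKey{Host: "a.test", Path: "/"}
	routes, conflicts := build([]source{
		{Kind: "HTTPRoute", Namespace: "default", Name: "dup", Key: key, Backend: "plain:80"},
		{Kind: "Ingress", Namespace: "default", Name: "web", Key: key, Backend: "annotated:80"},
	})
	fmt.Println(routes[key], conflicts)
}
```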

#2 Robust upstreams reload (was: in-place truncate write → synapse's
filewatch could read a torn (empty or partial) file and, because of its
2s reload debounce, never re-read it → 502 "Upstream not found"; dynamic
updates were unreliable without a pod restart):
  - `writeIfChanged` is atomic again (tmp + rename) so a concurrent
    synapse read can never see a torn/empty file.
  - After a changed render the sidecar SIGHUPs the co-located synapse
    process (`signal.go`: scan /proc for argv0 basename "synapse",
    skip self/operator; syscall.Kill SIGHUP). synapse's SIGHUP
    handler broadcasts a reload and the upstreams filewatch re-reads
    with NO debounce — deterministic, independent of inotify event
    types/timing. (synapse SIGHUP verified reload-only; Pingora
    upgrade is off for this deployment so it ignores SIGHUP.)
  - New `IngressReconciler.SignalReload` (set for the long-running
    sidecar; off for the --render-once initContainer where synapse
    isn't running). Deployment gains `shareProcessNamespace: true`
    so the sidecar can see/signal synapse (same uid 65532).
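
A minimal sketch of the tmp + rename pattern described above, assuming the temp file is created in the same directory as the target so the rename stays on one filesystem. The body of `writeIfChanged` here is an illustration and not the PR's code; the temp-file pattern and file mode are assumptions, and `demo` exists only to exercise it.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"path/filepath"
)

// writeIfChanged replaces path with data only when the content differs.
// It writes a temp file next to the target and renames it over the target,
// so a concurrent reader sees either the old or the new file, never a
// partial one. It reports whether a write happened.
func writeIfChanged(path string, data []byte) (bool, error) {
	if old, err := os.ReadFile(path); err == nil && bytes.Equal(old, data) {
		return false, nil
	}
	tmp, err := os.CreateTemp(filepath.Dir(path), ".upstreams-*.tmp")
	if err != nil {
		return false, err
	}
	defer os.Remove(tmp.Name()) // harmless once the rename has succeeded
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return false, err
	}
	if err := tmp.Chmod(0o644); err != nil { // mode is an assumption
		tmp.Close()
		return false, err
	}
	if err := tmp.Close(); err != nil {
		return false, err
	}
	if err := os.Rename(tmp.Name(), path); err != nil {
		return false, err
	}
	return true, nil
}

// demo writes the same content twice into a scratch directory.
func demo() (first, second bool, content string, err error) {
	dir, err := os.MkdirTemp("", "upstreams-demo")
	if err != nil {
		return
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "upstreams.yaml")
	if first, err = writeIfChanged(path, []byte("a: 1\n")); err != nil {
		return
	}
	if second, err = writeIfChanged(path, []byte("a: 1\n")); err != nil {
		return
	}
	b, err := os.ReadFile(path)
	return first, second, string(b), err
}

func main() {
	fmt.Println(demo())
}
```

A caller would send the reload signal only when the first return value is true, which matches the "after a changed render" wording above.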

Tests: first-writer-wins, ns/name-deterministic conflict across input
order, /proc-parameterized findReloadTargets (matches synapse, skips
self/operator). go build/vet/test ./... green. No new deps.

Verified live in k3s: dynamic annotation change with NO pod restart →
SIGHUP reload, HTTP 200 (no 502), new value applied; duplicate
unannotated HTTPRoute does not clobber the annotated Ingress route.
…ress/gateway

Follow-up improvements on the synapse-as-ingress/Gateway controller,
all unit-tested and verified live in k3s.

#1 Gateway API backendRef weight semantics (conformance fix). Was:
   weight 0 treated as equal-weight (0-weight backend still received
   traffic). Now: if any backendRef sets a weight, an unset weight
   defaults to 1 and weight 0 means "receive no traffic" (the backend
   is excluded); only when NO weight is set anywhere is the pool
   rendered unweighted. (k3s: a 0-weight backend is excluded; explicit
   weights and unset->1 are honored.)
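
The weight rules in #1 can be sketched like this. The `backend` and `server` types and the `pool` function are hypothetical names for illustration; only the semantics (unset defaults to 1 when any weight is set, 0 excludes, no weights anywhere means unweighted) come from the text above.

```go
package main

import "fmt"

// backend mirrors an HTTPRoute backendRef: a nil Weight means "unset".
type backend struct {
	Address string
	Weight  *int32
}

// server is one rendered pool entry; Weighted selects the
// { address, weight } form over the bare-string form.
type server struct {
	Address  string
	Weight   int32
	Weighted bool
}

// pool applies the weight semantics described in #1.
func pool(backends []backend) []server {
	anyWeight := false
	for _, b := range backends {
		if b.Weight != nil {
			anyWeight = true
			break
		}
	}
	var out []server
	for _, b := range backends {
		if !anyWeight {
			out = append(out, server{Address: b.Address}) // unweighted pool
			continue
		}
		w := int32(1) // unset defaults to 1 once any weight is present
		if b.Weight != nil {
			w = *b.Weight
		}
		if w == 0 {
			continue // weight 0 receives no traffic, so it is excluded
		}
		out = append(out, server{Address: b.Address, Weight: w, Weighted: true})
	}
	return out
}

func main() {
	three, zero := int32(3), int32(0)
	fmt.Println(pool([]backend{{"a:80", &three}, {"b:80", nil}, {"c:80", &zero}}))
}
```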

#2 Path-match fidelity. rulePaths now honors HTTPRoute Path.Type:
   PathPrefix as-is, Exact used best-effort + warning (synapse v1
   matches longest-prefix), RegularExpression dropped + warning (no
   regex path support, and NOT defaulted to "/"), and header/method/
   queryParam match conditions warned (dropped — synapse v1 routes on
   host+path only). Ingress Exact pathType likewise warned. No more
   silent mis-routing.
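
A sketch of the path-type handling in #2, reduced to a single match. The function name `rulePath`, its signature, the warning strings, and treating an empty type as PathPrefix are assumptions for illustration; the PR's rulePaths works on full HTTPRoute rules.

```go
package main

import "fmt"

// rulePath decides what to do with one HTTPRoute path match.
// It returns the path to render, whether the rule is kept, and a warning
// (empty when there is nothing to warn about).
func rulePath(matchType, value string) (path string, keep bool, warning string) {
	switch matchType {
	case "", "PathPrefix": // empty type treated as PathPrefix (assumption)
		return value, true, ""
	case "Exact":
		// Best effort: the proxy matches longest-prefix, so warn.
		return value, true, "Exact path is served as a longest-prefix match"
	case "RegularExpression":
		// Dropped, and deliberately not defaulted to "/".
		return "", false, "regex path match is unsupported; rule dropped"
	default:
		return "", false, "unknown path match type " + matchType
	}
}

func main() {
	for _, t := range []string{"PathPrefix", "Exact", "RegularExpression"} {
		p, keep, warn := rulePath(t, "/api")
		fmt.Printf("%s: path=%q keep=%v warning=%q\n", t, p, keep, warn)
	}
}
```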

#3 IngressClass resolution. isOurs now follows Kubernetes precedence:
   explicit spec.ingressClassName wins; otherwise the legacy
   kubernetes.io/ingress.class annotation; otherwise the default
   IngressClass (ingressclass.kubernetes.io/is-default-class=true with
   spec.controller == ours, resolved once per render). Foreign classes
   are still ignored.
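
The precedence in #3 can be sketched as a three-step check. The `ingress` struct is a hypothetical stand-in for the Kubernetes Ingress type, and `defaultIsOurs` stands for the once-per-render lookup of the default IngressClass; the signature is an assumption.

```go
package main

import "fmt"

// ingress is a hypothetical, trimmed stand-in for networking/v1 Ingress.
type ingress struct {
	ClassName   *string // spec.ingressClassName; nil when unset
	Annotations map[string]string
}

const legacyClassAnnotation = "kubernetes.io/ingress.class"

// isOurs follows the precedence described in #3:
// explicit class, then the legacy annotation, then the default IngressClass.
func isOurs(ing ingress, ourClass string, defaultIsOurs bool) bool {
	if ing.ClassName != nil {
		return *ing.ClassName == ourClass
	}
	if v, ok := ing.Annotations[legacyClassAnnotation]; ok {
		return v == ourClass
	}
	return defaultIsOurs
}

func main() {
	foreign := "traefik"
	fmt.Println(isOurs(ingress{ClassName: &foreign}, "synapse", true))
	fmt.Println(isOurs(ingress{Annotations: map[string]string{legacyClassAnnotation: "synapse"}}, "synapse", false))
	fmt.Println(isOurs(ingress{}, "synapse", true))
}
```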

#4 Ingress .status.loadBalancer publishing. Opt-in
   --publish-status-address (csv; IP vs hostname auto-classified),
   patched onto matched Ingresses only on change. Empty = no status
   (never a bogus address).
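
A minimal sketch of the IP-vs-hostname classification in #4, assuming the flag value is a comma-separated string. The `lbIngress` type and `parseStatusAddresses` function are hypothetical; the real code patches the Kubernetes load-balancer status type.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// lbIngress is a hypothetical stand-in for one .status.loadBalancer.ingress
// entry: exactly one of IP or Hostname is set.
type lbIngress struct{ IP, Hostname string }

// parseStatusAddresses splits a csv value and classifies each entry.
// An empty input yields no entries, so no status is published.
func parseStatusAddresses(csv string) []lbIngress {
	var out []lbIngress
	for _, f := range strings.Split(csv, ",") {
		f = strings.TrimSpace(f)
		if f == "" {
			continue
		}
		if net.ParseIP(f) != nil {
			out = append(out, lbIngress{IP: f})
		} else {
			out = append(out, lbIngress{Hostname: f})
		}
	}
	return out
}

func main() {
	fmt.Printf("%+v\n", parseStatusAddresses("203.0.113.7, lb.example.com"))
}
```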

#7 Observability. Prometheus metrics on controller-runtime's Registry
   (render/changed/errors/reloads/conflicts/unsupported/backend-
   unresolved counters; hosts/routes/last-render/ready gauges).
   Kubernetes Events: Normal Programmed on matched Ingresses on a
   changed render; Warning RouteConflict / BackendUnresolved /
   UnsupportedMatch. Readiness gated on the first successful render via
   a manager Runnable primer (flips even on a zero-Ingress cluster, so
   the proxy isn't advertised ready before synapse has upstreams).
   Event recorder is nil-safe for --render-once.

#8 Service watch. The controller now also watches Services (and
   IngressClasses) so a named-port / Service change re-renders without
   waiting for an unrelated Ingress event.

#9 Reload robustness. --reload-process-name makes the SIGHUP target
   configurable (default "synapse"). A leading+trailing debouncer
   (--reload-debounce, default 500ms) collapses SIGHUP bursts (e.g.
   cert-manager solver churn) while GUARANTEEING the final state is
   applied (trailing edge always fires).
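
One way a leading+trailing debouncer like the one in #9 could look. This is a sketch under assumptions: the `debouncer` type, its fields, and the `burst` helper are hypothetical, and the PR's reloadDebouncer may be structured differently. The first trigger fires immediately, triggers inside the window are coalesced, and one trailing fire applies the final state.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

type debouncer struct {
	mu      sync.Mutex
	window  time.Duration
	fire    func()
	active  bool // a window is currently open
	pending bool // at least one trigger arrived during the open window
}

// Trigger fires immediately on the leading edge and records triggers that
// arrive while a window is open. A window <= 0 disables debouncing.
func (d *debouncer) Trigger() {
	if d.window <= 0 {
		d.fire()
		return
	}
	d.mu.Lock()
	if d.active {
		d.pending = true
		d.mu.Unlock()
		return
	}
	d.active = true
	d.mu.Unlock()
	d.fire() // leading edge
	time.AfterFunc(d.window, d.expire)
}

// expire runs when a window closes. If anything was coalesced it fires once
// more (trailing edge) and opens a new window; otherwise it goes idle.
func (d *debouncer) expire() {
	d.mu.Lock()
	if !d.pending {
		d.active = false
		d.mu.Unlock()
		return
	}
	d.pending = false
	d.mu.Unlock()
	d.fire() // trailing edge
	time.AfterFunc(d.window, d.expire)
}

// burst sends n triggers back to back and returns how often fire ran.
func burst(n, windowMs int) int32 {
	var fired int32
	d := &debouncer{
		window: time.Duration(windowMs) * time.Millisecond,
		fire:   func() { atomic.AddInt32(&fired, 1) },
	}
	for i := 0; i < n; i++ {
		d.Trigger()
	}
	time.Sleep(3*d.window + 10*time.Millisecond) // let the trailing edge run
	return atomic.LoadInt32(&fired)
}

func main() {
	fmt.Println(burst(6, 20), burst(3, 0))
}
```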

Tests: weight semantics (0/unset/none), rulePaths (exact/regex/
header), isOurs precedence (explicit/legacy/default/foreign), default
IngressClass render, publishStatus IP/hostname + idempotency,
ReadyCheck gating, reloadDebouncer leading/trailing + window<=0,
findReloadTargets named process. go build/vet/test ./... green
(race-clean, gofmt-clean). No new direct deps beyond promoting
prometheus/client_golang (already an indirect dep) to direct.

k3s e2e (host synapse + in-cluster operator sidecar): 33 use-case
assertions — weight-0 exclusion, Exact/regex/header warnings, legacy-
annotation & default-class capture, foreign-class ignored, Ingress
status address, /readyz + full metrics family, Programmed/RouteConflict
/BackendUnresolved events, named-port resolution, Service-only-change
re-render, debounced SIGHUP burst (6 changes -> coalesced, final value
live), public domain still serving 200 end-to-end.
…ica safe)

With >1 synapse-proxy replica each pod runs its own ingress sidecar.
Per-pod work — listing, rendering /shared/upstreams.yaml, and SIGHUP-
reloading the co-located synapse — MUST run on every replica and is
NEVER gated. But the SHARED cluster status (GatewayClass/Gateway/
HTTPRoute RouteParentStatus + Ingress .status.loadBalancer) was being
written by every replica, so N sidecars raced and churned the same
objects (optimistic-lock conflicts, wasted API writes, log noise).

Now an opt-in Lease-based election picks a single status writer:

  - IngressReconciler.IsLeader gates ONLY the four shared-status
    Status().Update sites (GatewayClass, Gateway, acceptRoute's
    HTTPRoute RouteParentStatus, publishStatus). The condition/route
    computation still runs everywhere (it feeds the per-pod render);
    only the API write is gated. nil gate ⇒ always leader, so
    --render-once, single-replica, and unit tests are byte-for-byte
    unchanged.
  - controllers/leader.go: a client-go LeaseLock elector run as a
    manager Runnable with NeedLeaderElection()=false, so it runs on
    EVERY replica (independent of the manager's own --leader-elect,
    which would otherwise stop non-leaders from rendering). It keeps
    a LeaderGate atomic flag in sync and re-contends after losing the
    lease (former leader can reacquire).
  - main.go: --status-leader-election (default OFF — single-replica
    behavior unchanged), --status-leader-election-id (lease name),
    --leader-election-namespace (defaults to $POD_NAMESPACE then
    "default"); identity from $POD_NAME then hostname.
  - RBAC: coordination.k8s.io/leases (kubebuilder marker).
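
The gating described above can be sketched without any Kubernetes dependencies. `LeaderGate`, `IsLeader` and `leader()` are named in the commit message, but their exact shapes here are assumptions, and the `reconciler` type with its counters is purely illustrative. The elector that flips the flag is omitted.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// LeaderGate is an atomic flag kept in sync by the (omitted) lease elector.
type LeaderGate struct{ leader atomic.Bool }

func (g *LeaderGate) Set(v bool)     { g.leader.Store(v) }
func (g *LeaderGate) IsLeader() bool { return g.leader.Load() }

// reconciler is a hypothetical stand-in for IngressReconciler.
type reconciler struct {
	IsLeader     func() bool // nil means "always leader"
	renders      int
	statusWrites int
}

// leader treats a nil gate as leader, so single-replica and one-shot
// runs behave as before.
func (r *reconciler) leader() bool { return r.IsLeader == nil || r.IsLeader() }

// reconcile always does the per-pod render; only the shared status
// write is gated on leadership.
func (r *reconciler) reconcile() {
	r.renders++ // per-pod work, never gated
	if r.leader() {
		r.statusWrites++ // stands in for a Status().Update call
	}
}

func main() {
	gate := &LeaderGate{}
	r := &reconciler{IsLeader: gate.IsLeader}
	r.reconcile() // not leader: render only
	gate.Set(true)
	r.reconcile() // leader: render and status write
	fmt.Println(r.renders, r.statusWrites)
}
```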

Tests: LeaderGate flag/IsLeader; reconciler leader() (nil⇒leader,
false⇒not); a non-leader still programs routes into the model but
writes NO HTTPRoute/Gateway/Ingress status. go build/vet/test ./...
green (race-clean, gofmt-clean). No dep changes (client-go
leaderelection is already vendored transitively).

k3s e2e at replicas=2 (8/8): exactly one Lease holder among live
pods; BOTH replicas render their own upstreams locally (render not
gated); HTTPRoute/Gateway/Ingress shared status written by the leader;
killing the lease holder fails leadership over to the survivor, which
continues writing status; public domain still serving 200 throughout;
replicas restored to 1.
…per-SNI

The proxy previously served ONE statically-mounted cert. The ingress
sidecar now projects every referenced Kubernetes TLS Secret into
synapse's certificates dir so synapse serves the right cert per SNI
hostname, supporting many domains/certs.

  - Ingress spec.tls[] (Hosts[]→stem; empty Hosts ⇒ that Ingress's
    rule hosts) and Gateway listener tls.certificateRefs[] (Terminate;
    stem = listener hostname, else Secret-derived) are collected into
    the render model.
  - certs.go projects each kubernetes.io/tls Secret as
    <stem>.crt/<stem>.key into the operator-owned CertsOutDir
    (--certs-out). A concrete host uses the hostname as the file-stem
    so synapse's name_map exact match wins (the a33f3fc <host>.crt
    precedence); wildcard/no-host certs get a Secret-derived stem and
    rely on synapse's wildcard SAN match. The per-host `certificate:`
    is also emitted in upstreams.yaml (synapse SNI precedence #1,
    upstreams_cert_map), as defense-in-depth.
  - The dir is OPERATOR-OWNED: certs no longer backed by a Secret are
    pruned, so removing an Ingress/Secret converges (SNI then falls
    back to the configured default cert). Writes are in place (NOT
    tmp+rename) and only on a real content change: synapse's cert
    watcher (synapse-utils tools.rs `watch_folder`) only re-scans on
    inotify Create/Modify(Data)/Remove and ignores rename/MOVED_TO —
    in-place write is the pattern that watcher is designed around (it
    sleeps 500ms post-event so both files settle, pairs only files
    that parse, and keeps existing SslContexts via ArcSwap, so a torn
    cert is just retried — no 502, unlike the upstreams torn-read
    case). Projection is per-pod (never leader-gated, like the
    upstreams render).
  - New Secret watch (filtered to kubernetes.io/tls) so cert
    rotation re-projects without an unrelated Ingress event. Metrics
    synapse_operator_certs / synapse_operator_cert_errors_total; a
    missing / non-TLS Secret is skipped (logged + metric), never
    failing the render. RBAC: core/secrets get;list;watch.
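
A sketch of the file-stem choice described above: a concrete host uses the hostname, while wildcard and host-less certs fall back to a Secret-derived stem. `certStem` is named in the tests below, but this signature, the `namespace-name` form of the Secret-derived stem, and the `sanitize` rules are assumptions for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// sanitize lowercases ASCII letters and replaces anything outside
// [a-z0-9.-] with '_' so the stem is safe as a file name (assumed rules).
func sanitize(s string) string {
	return strings.Map(func(r rune) rune {
		switch {
		case r >= 'a' && r <= 'z', r >= '0' && r <= '9', r == '.', r == '-':
			return r
		case r >= 'A' && r <= 'Z':
			return r + ('a' - 'A')
		}
		return '_'
	}, s)
}

// certStem picks the base name for <stem>.crt / <stem>.key.
func certStem(host, secretNamespace, secretName string) string {
	if host != "" && !strings.HasPrefix(host, "*.") {
		return sanitize(host) // concrete host: exact-name match can win
	}
	// Wildcard or no host: Secret-derived stem (form is an assumption).
	return sanitize(secretNamespace + "-" + secretName)
}

func main() {
	fmt.Println(certStem("a.tls.test", "default", "a-tls"))
	fmt.Println(certStem("*.tls.test", "default", "wild"))
	fmt.Println(certStem("", "default", "g0s-test-tls"))
}
```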

Tests: certStem (host/wildcard/none/sanitize), addCert first-writer-
wins, per-host certificate: emission, writeFileIfChanged idempotent,
pruneCerts (stale .crt/.key/.tmp only), projectCerts (project /
idempotent / rotate / prune / missing / non-TLS / disabled /
not-leader-gated), render() Ingress spec.tls collection, renderGateways
listener certRef collection. go build/vet/test ./... green (race-
clean, gofmt-clean). No dep changes.

k3s e2e (operator-owned /certs, static mount removed): real cert-
manager g0s-test-tls projected as <domain>.crt/.key; public domain
serves 200 over the projected cert; pruning removes unreferenced
certs and SNI falls back to default; cert metrics exposed; per-SNI
cert selection serves each domain's own cert for every cert present
at synapse start (verified: after a pod rollout a.tls.test SNI →
its own serial, default unchanged).

KNOWN LIMITATION (synapse-side, fix tracked separately, NOT this
repo): with acme.enabled=false synapse binds the TLS listener to a
STARTUP snapshot (start.rs:400-421) and never consults the inotify-
rebuilt certificates_arc ArcSwap (start.rs:430-441) at handshake, so
a cert ADDED/ROTATED after synapse booted is not served until a pod
restart (the render-once initContainer reprojects on boot, so a
rollout applies cert-set changes). True no-restart hot reload needs a
synapse change (make TlsAccept resolve against the live ArcSwap;
optionally also honor Modify(Name) in watch_folder). The operator
side here is complete and correct.
@pigri pigri merged commit 595aaa4 into main May 18, 2026
4 checks passed
@pigri pigri deleted the feat/operator-ingress-gateway branch May 18, 2026 11:42