Skip to content

feat(tls/acme): cert-manager coexistence, k8s hot-reload, WAF ends_with, version unify#360

Merged
pigri merged 18 commits into
mainfrom
feat/tls-certbot-and-default-cert
Jun 3, 2026
Merged

feat(tls/acme): cert-manager coexistence, k8s hot-reload, WAF ends_with, version unify#360
pigri merged 18 commits into
mainfrom
feat/tls-certbot-and-default-cert

Conversation

@pigri
Copy link
Copy Markdown
Contributor

@pigri pigri commented Jun 3, 2026

Interrelated TLS / ACME / WAF / packaging fixes. Kept as one PR because the commits share Cargo.lock, workspace-wide Cargo.toml (version unification), and feature-flag changes — splitting would just create churn/conflicts. Commits are individually scoped; grouped by theme below.

1. TLS cert loading

  • certbot-layout cert loading + explicit default cert/key path (be0ce01): listdir now discovers live/<domain>/{fullchain,privkey}.pem (not just flat <name>.crt), skips READMEs/subdirs gracefully (no panics); adds proxy.default_cert_path/default_key_path (SNI fallback that can live outside the scanned dir). Also enables pingora-core/openssl on synapse-utils' proxy feature so the crate builds standalone.

2. External-ACME / cert-manager coexistence (validated in-cluster, LE staging HTTP-01)

  • Don't intercept /.well-known/acme-challenge/* when synapse isn't the ACME client (affc7c7): gate on acme.enabled, add independent internal_services.serve_acme_challenge, and fall through to upstream on a token synapse doesn't hold.
  • Configurable embedded ACME bind_ip (b00f9a0) — default 127.0.0.1; set 0.0.0.0 for cross-pod reach.
  • Don't run the certificate worker when embedded ACME is disabled (5389308) — kills the recurring "No domains found" poll/log.
  • Skip the WAF for the ACME HTTP-01 challenge path (24b4968) — strict base64url token match so it can't be a WAF bypass.
  • Make file-watch hot-reload work on Kubernetes (60d8c05) — react to the ConfigMap ..data atomic-swap marker; also filter the cert watcher to .crt/.key/.pem.

3. WAF ends_with

  • Register ends_with via the upstream wirefilter-engine 0.7.2 (which now re-exports EndsWithFunction); the interim local shim (b79dc87) is removed in 7d24a0c. Fixes WP-03/WP-13/OWASP-A05 et al. that silently failed to load.
  • Real production-rule tests (f6bca69, 996ca98): assert the real rules compile and that OWASP-A05 blocks /server-status/*.bak//.git/*; WP-13's threat.score gate is exercised.

4. Log hygiene (per-request/per-lookup chatter → debug)

  • a9f0e2a (XDP-skels/access-log-perf/HTTP-fingerprints/heartbeats), 68f091a ("No domains" → debug), e858de7 (JA4 BPF map hit/miss → debug), b5637ca (skip BPF ClientHello lookup for port-0 events).

5. Release / versioning

  • Unify crate versions via workspace inheritance (b21860f): single [workspace.package].version; fixes synapse --version reporting 0.6.1 instead of 0.7.0.

Dependency

Requires wirefilter-engine 0.7.2 (published to the gen0sec registry; master CI green).

Test plan

  • cargo build --releasesynapse --version reports 0.7.0
  • synapse-waf real-rule tests pass (compile + OWASP-A05 block + WP-13 gate)
  • listdir unit tests (flat + certbot layouts) pass
  • hot-reload verified in k8s: ConfigMap patch reloads with no pod restart
  • cert-manager HTTP-01 issued an LE-staging cert end-to-end through synapse (edge + Tier-2)
  • CI green on this branch

pigri added 18 commits June 2, 2026 23:04
Three related cert-loading improvements:

1. certbot/Let's Encrypt layout support. tools::listdir now also discovers
   per-domain subdirs holding fullchain.pem + privkey.pem (e.g.
   /etc/letsencrypt/live/<domain>/), in addition to the flat <name>.crt /
   <name>.key layout. READMEs, subdirs without the pair, and .crt-without-.key
   are skipped, not errors (the old code panicked on read errors and silently
   loaded zero certs when pointed at a dir of subdirs, falling back to the
   default cert for every site). A new cert_logical_name() in tls.rs maps
   certbot's generic fullchain.pem to its parent-dir domain so certs aren't all
   keyed as the stem "fullchain" (which collided in cert_name_map and broke
   dedicated-cert precedence). Added listdir unit tests.

2. Fix the standalone build of synapse-utils. The proxy feature now enables
   pingora-core/openssl (the OpenSSL backend synapse-app already uses).
   Without it, building synapse-utils alone selected pingora's noop_tls stub,
   so tls.rs failed to compile (missing the `ssl` module + TlsSettings cipher
   methods) and the crate's tests couldn't run. Pre-existing; surfaced while
   validating (1).

3. New option: proxy.default_cert_path + proxy.default_key_path
   (env PROXY_DEFAULT_CERT_PATH / PROXY_DEFAULT_KEY_PATH). An explicit path to
   the SNI-fallback cert/key, loaded alongside the scanned certs so it may live
   outside the scanned directory (e.g. /etc/synapse/certs/default.* while certs
   are scanned from /etc/letsencrypt/live). Default selection priority is
   explicit path > name (default_certificate) > first valid cert; both paths
   must be set together. Wired through ProxyConfig, AppConfig, the synapse-app
   CLI layer, and both the initial load and the inotify-reload watcher.
…CME client

synapse registered /.well-known/acme-challenge/* as an internal path
unconditionally and 404'd unknown tokens, so an external ACME client
(certbot webroot) was broken even with proxy.acme.enabled=false — two
ACME clients fought over one path and synapse always won.

- Gate the route on embedded ACME: the built-in challenge internal_path is
  registered only when proxy.acme.enabled && internal_services
  .serve_acme_challenge. When off, the path routes to the upstream like any
  other (parceyaml::set_serve_acme_challenge, wired in synapse-app startup).
- New independent toggle internal_services.serve_acme_challenge (default
  true; env INTERNAL_SERVICES_SERVE_ACME_CHALLENGE) so captcha/health can
  stay enabled while freeing the challenge path for certbot.
- Fall through instead of 404: even when the route is registered, get_host
  keeps the internal match only if synapse holds the token file
  (<storage>/well-known/acme-challenge/<token>, sanitized); otherwise it
  forwards to the upstream — so embedded ACME and certbot can coexist.

Implements fixes #1, #2, #4 from the coexistence analysis.
The embedded ACME HTTP-01 server bind was hardcoded to 127.0.0.1, so in
k8s the validation request couldn't reach it cross-pod (loopback is
pod-local). Add proxy.acme.bind_ip (default 127.0.0.1, env ACME_BIND_IP)
and use it instead of the hardcoded literal; set 0.0.0.0 to expose the
HTTP-01 server beyond the pod. Default preserves existing behavior.
…abled

The CertificateWorker was registered whenever Redis was initialized (e.g.
Redis is configured for captcha) regardless of acme.enabled. With
acme.enabled=false the ACME cert store is empty, so the worker polled
every refresh interval and logged "No domains found in upstreams.yaml,
skipping certificate fetch" at WARN every ~30s.

Gate registration on config.proxy.acme.enabled (in addition to the
existing Redis + non-agent checks). The worker's only job is syncing
ACME-issued certs from Redis; externally-managed certs (cert-manager's
mounted Secret) load via the cert-dir watcher, not this worker, so
nothing else is affected.
The app log was dominated by lines emitted on every request or lookup at
INFO. Downgrade them so production INFO logs carry signal, not telemetry:

- tcp_fingerprint: "no XDP skels loaded" fired on every ClientHello
  lookup. The condition (passthrough/no-BPF) is fixed for the process
  lifetime, so log it once at debug instead of per-lookup.
- access_log_perf (access_log_emit + access_log_pipeline): per-request
  profiling JSON -> debug (target kept, so RUST_LOG=access_log_perf=debug
  re-enables it).
- proxyhttp "HTTP fingerprints ...": per-request echo -> debug (the
  fingerprints are already in the structured access log / events).
- afpacket + ssl_uprobe 30s heartbeats -> debug.

All reversible via log level; no behavior change.
Align certificate.rs:471 with the expiration path (1393), which already
logs the same empty-domain condition at debug. Keeps the line quiet even
when ACME is enabled but an issuer has no domains yet (e.g. fresh boot),
not just when the worker is gated off.
…ert watcher

The upstreams file-watcher only reloaded when the event path ended in
"yaml". Kubernetes updates a mounted ConfigMap/Secret with an atomic
symlink swap (writes a ..<timestamp> dir, repoints the ..data symlink);
the mounted *.yaml is itself a symlink and never fires, so the filter
dropped every in-cluster change and upstreams hot-reload silently never
happened (had to restart the pod). Confirmed live: a ConfigMap patch on
the Tier-2 proxy produced zero reloads.

The cert watcher had the opposite problem: no path filter at all, so any
unrelated file in the cert dir triggered a reload.

Add a shared reload_relevant_path(path, exts) helper and apply it to both:
- matches names ending in one of exts (bare-metal: file edited in place),
- AND any name starting with ".." (the k8s atomic-update marker: ..data /
  ..<timestamp>), which is what makes hot-reload fire in-cluster.
Cert watcher now filters to .crt/.key/.pem; config watcher uses .yaml/.yml
(both still catch ..data). Unit-tested incl. the bare-letter false
positive (".yaml" not "yaml").
The WAF inspected /.well-known/acme-challenge/* and could block the
Let's Encrypt challenge (false positives breaking cert issuance/renewal).
Only /cgi-bin/captcha/verify was skipped before.

Add a path-based skip in both the request and response WAF phases so it
covers embedded ACME (served internally) and external clients
(certbot/cert-manager, forwarded to the upstream/solver).

The skip is strict to avoid becoming a WAF bypass: it matches only
/.well-known/acme-challenge/<token> where <token> is a single pure
base64url segment ([A-Za-z0-9_-]+, RFC 8555). Any '/', '.', '%' or other
char (traversal / encoded-slash attempts) fails the check and is
inspected by the WAF as normal. Unit-tested incl. those bypass attempts.
All crates share one product version; they were not independently
versioned, but only the root [package].version was being bumped on
release, so the sub-crates drifted (synapse-app etc. stuck at 0.6.1,
some infra crates at 0.1.0). Since clap's `version` comes from
synapse-app's CARGO_PKG_VERSION, `synapse --version` reported 0.6.1
instead of the released 0.7.0.

Move the version to [workspace.package].version and make every crate
inherit it with `version.workspace = true`. Now there is a single field
to bump (cargo-release updates it; release.toml doc updated), every crate
stays in lockstep, and `synapse --version` reports 0.7.0. Safe: no
internal path-dep pins a version. Verified: cargo metadata shows all
crates at 0.7.0 and the rebuilt binary reports `synapse 0.7.0`.
lookup_and_store_client_hello logged at INFO on every ClientHello lookup:
BPF map hit/miss ("BPF map miss for <ip> (key not found)"), invalid
payload_len, captured payload (with byte dump), and stored fingerprint.
A map miss is a normal/expected condition (passthrough, no captured
handshake), so these fired per request and flooded the log. Move all five
to debug; the genuine map-lookup error stays at warn.
The kernel pump (kernel_pump.rs) and SSL uprobe call
lookup_and_store_client_hello for every drained packet/event, including
portless ones (ICMP, or events with src_port=0). A ClientHello is always
keyed by a real TCP source port, so a port-0 key can never match — those
lookups did pointless BPF work and logged a "BPF map miss for <ip>:0"
every time. Guard with an early `client_port == 0` return so port-0
events are skipped entirely (covers all callers). Real-port misses are
unchanged (normal cache/probe misses, already at debug).
Shipped WAF rules (WP-03, WP-13, OWASP-A05, and others) use ends_with(),
but it was unregistered — the gen0sec wirefilter-engine 0.7.1 fork ships
functions/ends_with.rs yet never re-exports EndsWithFunction, so it was
commented out. Those rules failed to parse ("Filter parsing error" at the
first ends_with token) and silently never loaded, leaving real protection
(wp-config exposure, plugin/theme enumeration, exposed endpoints) off.

Add a local EndsWithFunction shim in synapse-waf mirroring the upstream
impl using only wirefilter's public API, and register it. Verified: a
regression test compiles ends_with expressions against the scheme, and
those rules now load. The proper long-term fix is to re-export the
function from the fork (then this shim can be dropped).
wirefilter-engine 0.7.2 now re-exports EndsWithFunction (fixed in the
gen0sec fork). Bump the dep and register wirefilter::EndsWithFunction
directly, removing the local shim added as the interim workaround. The
ends_with regression test is kept and passes against the upstream
function, so WP-03 / WP-13 / OWASP-A05 and other ends_with rules load.
Add unit tests using the exact platform rules that previously failed to
load (they use ends_with). real_world_waf_rules_all_compile asserts all
three compile (request_rule_count == 3) — a direct regression guard for
the function-registration gap that silently dropped them.
real_owasp_a05_blocks_exposed_endpoints asserts OWASP-A05 blocks
/server-status (contains), *.bak (ends_with) and /.git/* (starts_with)
while allowing /. The toy fixture rules didn't exercise real expressions.
Set the WP-13 const to the production expression (threat.score gt 1) and
add real_wp_13_plugin_enum_is_threat_gated: a /wp-content/plugins/.../
readme.txt request matches the path + ends_with clauses but is NOT
blocked at the default threat score (0 not gt 1), exercising the
threat-score gate. (A true block needs score > 1, sourced from threat
intel at eval time — no unit-test seam; covered structurally here.)
… tests

Two clippy -D warnings failures: Certificates::new/new_with_sni_callback
took &Vec<CertificateConfig> (clippy::ptr_arg) — switch to
&[CertificateConfig] (+ configs.to_vec()); and the listdir_tests module
used .unwrap() (repo bans it via disallowed_methods) — add
#[allow(clippy::disallowed_methods)] per the existing test-module
convention. Validated locally: clippy (default + classifier) -D warnings,
cargo fmt --check, and synapse-utils/synapse-waf tests all pass.
The workspace-version-inheritance commit added a Cargo.toml comment
("Single source of truth for the product version…"); the builders'
`grep -m1 'version' Cargo.toml | cut -d' ' -f3` then matched that comment
and extracted "source", so dch got `--newversion "source-1"` →
"source-1 is not a valid version" and every build-install job failed.

Anchor the grep to a `version = "..."` line and sed out the quoted value,
ignoring comments and the `version.workspace = true` inheritance lines.
Verified: extraction now yields 0.7.0.
@pigri pigri merged commit 4efaba0 into main Jun 3, 2026
25 checks passed
@pigri pigri deleted the feat/tls-certbot-and-default-cert branch June 3, 2026 09:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant