feat(tls/acme): cert-manager coexistence, k8s hot-reload, WAF ends_with, version unify#360
Merged
Conversation
Three related cert-loading improvements: 1. certbot/Let's Encrypt layout support. tools::listdir now also discovers per-domain subdirs holding fullchain.pem + privkey.pem (e.g. /etc/letsencrypt/live/<domain>/), in addition to the flat <name>.crt / <name>.key layout. READMEs, subdirs without the pair, and .crt-without-.key are skipped, not errors (the old code panicked on read errors and silently loaded zero certs when pointed at a dir of subdirs, falling back to the default cert for every site). A new cert_logical_name() in tls.rs maps certbot's generic fullchain.pem to its parent-dir domain so certs aren't all keyed as the stem "fullchain" (which collided in cert_name_map and broke dedicated-cert precedence). Added listdir unit tests. 2. Fix the standalone build of synapse-utils. The proxy feature now enables pingora-core/openssl (the OpenSSL backend synapse-app already uses). Without it, building synapse-utils alone selected pingora's noop_tls stub, so tls.rs failed to compile (missing the `ssl` module + TlsSettings cipher methods) and the crate's tests couldn't run. Pre-existing; surfaced while validating (1). 3. New option: proxy.default_cert_path + proxy.default_key_path (env PROXY_DEFAULT_CERT_PATH / PROXY_DEFAULT_KEY_PATH). An explicit path to the SNI-fallback cert/key, loaded alongside the scanned certs so it may live outside the scanned directory (e.g. /etc/synapse/certs/default.* while certs are scanned from /etc/letsencrypt/live). Default selection priority is explicit path > name (default_certificate) > first valid cert; both paths must be set together. Wired through ProxyConfig, AppConfig, the synapse-app CLI layer, and both the initial load and the inotify-reload watcher.
…CME client synapse registered /.well-known/acme-challenge/* as an internal path unconditionally and 404'd unknown tokens, so an external ACME client (certbot webroot) was broken even with proxy.acme.enabled=false — two ACME clients fought over one path and synapse always won. - Gate the route on embedded ACME: the built-in challenge internal_path is registered only when proxy.acme.enabled && internal_services .serve_acme_challenge. When off, the path routes to the upstream like any other (parceyaml::set_serve_acme_challenge, wired in synapse-app startup). - New independent toggle internal_services.serve_acme_challenge (default true; env INTERNAL_SERVICES_SERVE_ACME_CHALLENGE) so captcha/health can stay enabled while freeing the challenge path for certbot. - Fall through instead of 404: even when the route is registered, get_host keeps the internal match only if synapse holds the token file (<storage>/well-known/acme-challenge/<token>, sanitized); otherwise it forwards to the upstream — so embedded ACME and certbot can coexist. Implements fixes #1, #2, #4 from the coexistence analysis.
The embedded ACME HTTP-01 server bind was hardcoded to 127.0.0.1, so in k8s the validation request couldn't reach it cross-pod (loopback is pod-local). Add proxy.acme.bind_ip (default 127.0.0.1, env ACME_BIND_IP) and use it instead of the hardcoded literal; set 0.0.0.0 to expose the HTTP-01 server beyond the pod. Default preserves existing behavior.
…abled The CertificateWorker was registered whenever Redis was initialized (e.g. Redis is configured for captcha) regardless of acme.enabled. With acme.enabled=false the ACME cert store is empty, so the worker polled every refresh interval and logged "No domains found in upstreams.yaml, skipping certificate fetch" at WARN every ~30s. Gate registration on config.proxy.acme.enabled (in addition to the existing Redis + non-agent checks). The worker's only job is syncing ACME-issued certs from Redis; externally-managed certs (cert-manager's mounted Secret) load via the cert-dir watcher, not this worker, so nothing else is affected.
The app log was dominated by lines emitted on every request or lookup at INFO. Downgrade them so production INFO logs carry signal, not telemetry: - tcp_fingerprint: "no XDP skels loaded" fired on every ClientHello lookup. The condition (passthrough/no-BPF) is fixed for the process lifetime, so log it once at debug instead of per-lookup. - access_log_perf (access_log_emit + access_log_pipeline): per-request profiling JSON -> debug (target kept, so RUST_LOG=access_log_perf=debug re-enables it). - proxyhttp "HTTP fingerprints ...": per-request echo -> debug (the fingerprints are already in the structured access log / events). - afpacket + ssl_uprobe 30s heartbeats -> debug. All reversible via log level; no behavior change.
Align certificate.rs:471 with the expiration path (1393), which already logs the same empty-domain condition at debug. Keeps the line quiet even when ACME is enabled but an issuer has no domains yet (e.g. fresh boot), not just when the worker is gated off.
…ert watcher
The upstreams file-watcher only reloaded when the event path ended in
"yaml". Kubernetes updates a mounted ConfigMap/Secret with an atomic
symlink swap (writes a ..<timestamp> dir, repoints the ..data symlink);
the mounted *.yaml is itself a symlink and never fires, so the filter
dropped every in-cluster change and upstreams hot-reload silently never
happened (had to restart the pod). Confirmed live: a ConfigMap patch on
the Tier-2 proxy produced zero reloads.
The cert watcher had the opposite problem: no path filter at all, so any
unrelated file in the cert dir triggered a reload.
Add a shared reload_relevant_path(path, exts) helper and apply it to both:
- matches names ending in one of exts (bare-metal: file edited in place),
- AND any name starting with ".." (the k8s atomic-update marker: ..data /
..<timestamp>), which is what makes hot-reload fire in-cluster.
Cert watcher now filters to .crt/.key/.pem; config watcher uses .yaml/.yml
(both still catch ..data). Unit-tested incl. the bare-letter false
positive (".yaml" not "yaml").
The WAF inspected /.well-known/acme-challenge/* and could block the Let's Encrypt challenge (false positives breaking cert issuance/renewal). Only /cgi-bin/captcha/verify was skipped before. Add a path-based skip in both the request and response WAF phases so it covers embedded ACME (served internally) and external clients (certbot/cert-manager, forwarded to the upstream/solver). The skip is strict to avoid becoming a WAF bypass: it matches only /.well-known/acme-challenge/<token> where <token> is a single pure base64url segment ([A-Za-z0-9_-]+, RFC 8555). Any '/', '.', '%' or other char (traversal / encoded-slash attempts) fails the check and is inspected by the WAF as normal. Unit-tested incl. those bypass attempts.
All crates share one product version; they were not independently versioned, but only the root [package].version was being bumped on release, so the sub-crates drifted (synapse-app etc. stuck at 0.6.1, some infra crates at 0.1.0). Since clap's `version` comes from synapse-app's CARGO_PKG_VERSION, `synapse --version` reported 0.6.1 instead of the released 0.7.0. Move the version to [workspace.package].version and make every crate inherit it with `version.workspace = true`. Now there is a single field to bump (cargo-release updates it; release.toml doc updated), every crate stays in lockstep, and `synapse --version` reports 0.7.0. Safe: no internal path-dep pins a version. Verified: cargo metadata shows all crates at 0.7.0 and the rebuilt binary reports `synapse 0.7.0`.
lookup_and_store_client_hello logged at INFO on every ClientHello lookup:
BPF map hit/miss ("BPF map miss for <ip> (key not found)"), invalid
payload_len, captured payload (with byte dump), and stored fingerprint.
A map miss is a normal/expected condition (passthrough, no captured
handshake), so these fired per request and flooded the log. Move all five
to debug; the genuine map-lookup error stays at warn.
The kernel pump (kernel_pump.rs) and SSL uprobe call lookup_and_store_client_hello for every drained packet/event, including portless ones (ICMP, or events with src_port=0). A ClientHello is always keyed by a real TCP source port, so a port-0 key can never match — those lookups did pointless BPF work and logged a "BPF map miss for <ip>:0" every time. Guard with an early `client_port == 0` return so port-0 events are skipped entirely (covers all callers). Real-port misses are unchanged (normal cache/probe misses, already at debug).
Shipped WAF rules (WP-03, WP-13, OWASP-A05, and others) use ends_with(),
but it was unregistered — the gen0sec wirefilter-engine 0.7.1 fork ships
functions/ends_with.rs yet never re-exports EndsWithFunction, so it was
commented out. Those rules failed to parse ("Filter parsing error" at the
first ends_with token) and silently never loaded, leaving real protection
(wp-config exposure, plugin/theme enumeration, exposed endpoints) off.
Add a local EndsWithFunction shim in synapse-waf mirroring the upstream
impl using only wirefilter's public API, and register it. Verified: a
regression test compiles ends_with expressions against the scheme, and
those rules now load. The proper long-term fix is to re-export the
function from the fork (then this shim can be dropped).
wirefilter-engine 0.7.2 now re-exports EndsWithFunction (fixed in the gen0sec fork). Bump the dep and register wirefilter::EndsWithFunction directly, removing the local shim added as the interim workaround. The ends_with regression test is kept and passes against the upstream function, so WP-03 / WP-13 / OWASP-A05 and other ends_with rules load.
Add unit tests using the exact platform rules that previously failed to load (they use ends_with). real_world_waf_rules_all_compile asserts all three compile (request_rule_count == 3) — a direct regression guard for the function-registration gap that silently dropped them. real_owasp_a05_blocks_exposed_endpoints asserts OWASP-A05 blocks /server-status (contains), *.bak (ends_with) and /.git/* (starts_with) while allowing /. The toy fixture rules didn't exercise real expressions.
Set the WP-13 const to the production expression (threat.score gt 1) and add real_wp_13_plugin_enum_is_threat_gated: a /wp-content/plugins/.../ readme.txt request matches the path + ends_with clauses but is NOT blocked at the default threat score (0 not gt 1), exercising the threat-score gate. (A true block needs score > 1, sourced from threat intel at eval time — no unit-test seam; covered structurally here.)
… tests Two clippy -D warnings failures: Certificates::new/new_with_sni_callback took &Vec<CertificateConfig> (clippy::ptr_arg) — switch to &[CertificateConfig] (+ configs.to_vec()); and the listdir_tests module used .unwrap() (repo bans it via disallowed_methods) — add #[allow(clippy::disallowed_methods)] per the existing test-module convention. Validated locally: clippy (default + classifier) -D warnings, cargo fmt --check, and synapse-utils/synapse-waf tests all pass.
The workspace-version-inheritance commit added a Cargo.toml comment
("Single source of truth for the product version…"); the builders'
`grep -m1 'version' Cargo.toml | cut -d' ' -f3` then matched that comment
and extracted "source", so dch got `--newversion "source-1"` →
"source-1 is not a valid version" and every build-install job failed.
Anchor the grep to a `version = "..."` line and sed out the quoted value,
ignoring comments and the `version.workspace = true` inheritance lines.
Verified: extraction now yields 0.7.0.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Interrelated TLS / ACME / WAF / packaging fixes. Kept as one PR because the commits share
Cargo.lock, workspace-wideCargo.toml(version unification), and feature-flag changes — splitting would just create churn/conflicts. Commits are individually scoped; grouped by theme below.1. TLS cert loading
be0ce01):listdirnow discoverslive/<domain>/{fullchain,privkey}.pem(not just flat<name>.crt), skips READMEs/subdirs gracefully (no panics); addsproxy.default_cert_path/default_key_path(SNI fallback that can live outside the scanned dir). Also enablespingora-core/opensslon synapse-utils'proxyfeature so the crate builds standalone.2. External-ACME / cert-manager coexistence (validated in-cluster, LE staging HTTP-01)
/.well-known/acme-challenge/*when synapse isn't the ACME client (affc7c7): gate onacme.enabled, add independentinternal_services.serve_acme_challenge, and fall through to upstream on a token synapse doesn't hold.bind_ip(b00f9a0) — default127.0.0.1; set0.0.0.0for cross-pod reach.5389308) — kills the recurring "No domains found" poll/log.24b4968) — strict base64url token match so it can't be a WAF bypass.60d8c05) — react to the ConfigMap..dataatomic-swap marker; also filter the cert watcher to.crt/.key/.pem.3. WAF
ends_withends_withvia the upstreamwirefilter-engine 0.7.2(which now re-exportsEndsWithFunction); the interim local shim (b79dc87) is removed in7d24a0c. Fixes WP-03/WP-13/OWASP-A05 et al. that silently failed to load.f6bca69,996ca98): assert the real rules compile and that OWASP-A05 blocks/server-status/*.bak//.git/*; WP-13'sthreat.scoregate is exercised.4. Log hygiene (per-request/per-lookup chatter → debug)
a9f0e2a(XDP-skels/access-log-perf/HTTP-fingerprints/heartbeats),68f091a("No domains" → debug),e858de7(JA4 BPF map hit/miss → debug),b5637ca(skip BPF ClientHello lookup for port-0 events).5. Release / versioning
b21860f): single[workspace.package].version; fixessynapse --versionreporting0.6.1instead of0.7.0.Dependency
Requires
wirefilter-engine 0.7.2(published to the gen0sec registry; master CI green).Test plan
cargo build --release→synapse --versionreports0.7.0listdirunit tests (flat + certbot layouts) pass