Add TLS support to Prometheus metrics server#3322
Add TLS support to Prometheus metrics server#3322jkhelil wants to merge 1 commit intoknative:mainfrom
Conversation
|
@jkhelil: The label(s) DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
Hi @jkhelil. Thanks for your PR. I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3322 +/- ##
==========================================
+ Coverage 74.84% 74.85% +0.01%
==========================================
Files 189 189
Lines 8287 8347 +60
==========================================
+ Hits 6202 6248 +46
- Misses 1844 1854 +10
- Partials 241 245 +4 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
5492d68 to
545d035
Compare
dprotaso
left a comment
There was a problem hiding this comment.
You'll need to gofumpt the files you are touching (https://github.com/mvdan/gofumpt) to satisfy the linter
| if s.http.TLSConfig != nil || (s.certFile != "" && s.keyFile != "") { | ||
| return s.http.ListenAndServeTLS(s.certFile, s.keyFile) | ||
| } |
There was a problem hiding this comment.
Is there a case when a TLS config would not be used with ListenAndServeTLS ?
| if tlsConfig := prometheus.TLSConfigFromContext(ctx); tlsConfig != nil { | ||
| opts = append(opts, prometheus.WithTLSConfig(tlsConfig)) | ||
| } | ||
|
|
There was a problem hiding this comment.
Drop context check here.
If anything we should add options to Config if env vars overrides aren't enough - but it seems like they should be since we namespaced them
|
/ok-to-test |
| http: &http.Server{ | ||
| Addr: addr, | ||
| Handler: mux, | ||
| // https://medium.com/a-journey-with-go/go-understand-and-mitigate-slowloris-attack-711c1b1403f6 |
There was a problem hiding this comment.
I would probably leave this comment here.
| opts = append(opts, prometheus.WithTLSConfig(tlsConfig)) | ||
| } | ||
|
|
||
| server, err := prometheus.NewServer(opts...) |
There was a problem hiding this comment.
Existing error: missing error check. If NewServer fails, server is nil but the go routine below executes anyway and panics.
| func (s *Server) ListenAndServe() { | ||
| s.http.ListenAndServe() | ||
| func (s *Server) ListenAndServe() error { | ||
| if s.http.TLSConfig != nil || (s.certFile != "" && s.keyFile != "") { |
There was a problem hiding this comment.
This condition will silently fallback to s.http.ListenAndServe() if certFile or keyFile are not set. Don't think that is intended, I'd rather return an error that both must be set or none.
You could validate in the validate function where other options are validated.
There was a problem hiding this comment.
nit
This one is unused. Can be removed.
|
|
||
| go func() { | ||
| server.ListenAndServe() | ||
| if err := server.ListenAndServe(); err != nil { |
There was a problem hiding this comment.
Filter for http.ErrServerClosed which is returned on graceful shutdown which is not an error. See injection/sharedmain/main.go for example.
| if err := server.ListenAndServe(); err != nil { | |
| if err := server.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) { |
|
|
||
| func (s *Server) ListenAndServe() { | ||
| s.http.ListenAndServe() | ||
| func (s *Server) ListenAndServe() error { |
There was a problem hiding this comment.
ListenAndServe is a misleading name it actually also serves TLS traffic. Lets add a docs comment to the method signature stating that it can also serve TLS if TLS config or certificates are provided.
Random thoughts: I would actually prefer to separate ListenAndServe and ListenAndServeTLS like it is done by net/http own API design. The caller should make the choice which method to use depending on the options it's having. But that would have bigger implications on the refactoring => we would need to pull out the env var checks to the caller and I don't think it's worth it here.
There was a problem hiding this comment.
I don't think it's worth it here.
Yeah that's the big issue - it pushes the ownership on the caller to pick TLS. The original intent was to have ListenAndServe mimic the http.Server
What we could do is have ListenAndServe - ListenAndServeTLS and a new method that auto picks TLS if the settings are present. eg. ListenAndServeAutoTLS ? Though unsure about the name is best fits
There was a problem hiding this comment.
What we could do is have ListenAndServe - ListenAndServeTLS and a new method that auto picks TLS if the settings are present.
Yes, that would be quite close to what I was talking about but it means expanding the public API by 2 public methods. I wouldn't mind, it's a good idea 😸
545d035 to
bd1765e
Compare
|
Added a comment - #3322 (comment) |
6906ce7 to
308dd8a
Compare
twoGiants
left a comment
There was a problem hiding this comment.
Thank for the rework! Looks great 👍
I left a few more comments.
| @@ -71,22 +98,44 @@ func NewServer(opts ...ServerOption) (*Server, error) { | |||
| Addr: addr, | |||
| Handler: mux, | |||
| // https://medium.com/a-journey-with-go/go-understand-and-mitigate-slowloris-attack-711c1b1403f6 | |||
There was a problem hiding this comment.
This comment is for line 102.
| ) | ||
|
|
||
| func buildPrometheus(_ context.Context, cfg Config) (sdkmetric.Reader, shutdownFunc, error) { | ||
| func buildPrometheus(ctx context.Context, cfg Config) (sdkmetric.Reader, shutdownFunc, error) { |
There was a problem hiding this comment.
nit: Can stay blank if it's not used or use it below in line 66 and get the logger from ctx and log an error.
| func buildPrometheus(ctx context.Context, cfg Config) (sdkmetric.Reader, shutdownFunc, error) { | |
| func buildPrometheus(_ context.Context, cfg Config) (sdkmetric.Reader, shutdownFunc, error) { |
| prometheusTLSCertEnvName = "METRICS_PROMETHEUS_TLS_CERT" | ||
| prometheusTLSKeyEnvName = "METRICS_PROMETHEUS_TLS_KEY" | ||
| prometheusTLSClientAuthEnvName = "METRICS_PROMETHEUS_TLS_CLIENT_AUTH" | ||
| prometheusTLSClientCAFileEnv = "METRICS_PROMETHEUS_TLS_CLIENT_CA_FILE" |
There was a problem hiding this comment.
| prometheusTLSClientCAFileEnv = "METRICS_PROMETHEUS_TLS_CLIENT_CA_FILE" | |
| prometheusTLSClientCAFileEnvName = "METRICS_PROMETHEUS_TLS_CLIENT_CA_FILE" |
| return fmt.Errorf("%s must be set when %s is %q (client certs cannot be validated without a CA)", | ||
| prometheusTLSClientCAFileEnv, prometheusTLSClientAuthEnvName, auth) | ||
| } | ||
|
|
There was a problem hiding this comment.
validate() allows: TLS cert+key set, clientAuth="none", clientCAFile="/some/path". Then applyPrometheusClientAuth() returns early for "none" and the CA file is never loaded.
Operator sets a CA file path but it does nothing which is confusing. Either reject the combination in validation or warn.
I see two scenarios where this could be an issue:
- Sets
METRICS_PROMETHEUS_TLS_CLIENT_CA_FILE=/etc/tls/ca.pembut forgets to setMETRICS_PROMETHEUS_TLS_CLIENT_AUTH=> expectsmTLSgets plainTLS. - Sets METRICS_PROMETHEUS_TLS_CLIENT_AUTH=none plus a CA file by mistake and the CA file does nothing, but no error signals the misconfiguration.
This would fix it:
| if tlsEnabled && (auth == "" || auth == "none") && strings.TrimSpace(o.clientCAFile) != "" { | |
| return fmt.Errorf("%s is set but %s is %q; set %s to %q or %q to use client certificate verification", | |
| prometheusTLSClientCAFileEnv, prometheusTLSClientAuthEnvName, auth, prometheusTLSClientAuthEnvName, "optional", "require") | |
| } |
|
|
||
| t.Run("optional with valid client CA file sets ClientAuth and ClientCAs", func(t *testing.T) { | ||
| caFile := createTempCACertFile(t) | ||
| t.Cleanup(func() { _ = os.Remove(caFile) }) |
There was a problem hiding this comment.
Move this line into createTempCACertFile, then you don't have to call it here and below in line 231. Cleanup runs when the test that owns t finishes, so it should be fine.
- Add WithTLSConfig() and WithTLSCertFiles() server options
- Support METRICS_TLS_CERT and METRICS_TLS_KEY env vars
308dd8a to
a65b291
Compare
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: jkhelil, twoGiants The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
This PR adds native TLS/HTTPS support to the Prometheus metrics server in
knative.dev/pkg/observability/metrics/prometheus, enabling secure metrics collection without requiring additional sidecars or proxies.Problem
Many production environments require encrypted metrics endpoints for compliance and security, but the current Prometheus server only supports plain HTTP. Existing solutions have significant drawbacks:
kube-rbac-proxy limitations:
Post-Quantum Cryptography (PQC) Readiness:
Solution
Native TLS support in the Prometheus server provides:
Changes
Add certFile and keyFile on the server (populated from env).
Env vars: METRICS_PROMETHEUS_TLS_CERT, METRICS_PROMETHEUS_TLS_KEY for cert/key paths; TLS settings (MinVersion, MaxVersion, CipherSuites, CurvePreferences) via METRICS_PROMETHEUS_TLS_* using knative.dev/pkg/network/tls.DefaultConfigFromEnv("METRICS_PROMETHEUS_").
Add mTLS (client certificate verification) via two new env vars:
METRICS_PROMETHEUS_TLS_CLIENT_AUTH: controls client cert policy (none, optional, require).
METRICS_PROMETHEUS_TLS_CLIENT_CA_FILE: path to the CA bundle used to verify client certificates (e.g. the
API: ListenAndServe() (plain HTTP), ListenAndServeTLS(certFile, keyFile) (TLS), and Serve() (use TLS when both cert and key are set, otherwise plain HTTP).
Validation: In validate(), require both cert and key set or both unset; return a clear error otherwise (no silent fallback to plain HTTP when only one is set). Also validates mTLS configuration: clientAuth must be one of none, optional, or require; setting mTLS env vars without TLS enabled is an error; optional or require without a client CA file is an error.
Godoc updated for TLS and mTLS behavior and env vars.
observability/metrics/prometheus_enabled.go
Call server.Serve() so TLS vs plain is chosen from config (e.g. when using sharedmain and env-based config). Check NewServer error before starting the serve goroutine to prevent nil-pointer panics. Ignore http.ErrServerClosed when logging server errors so graceful shutdown is not reported as an error.
observability/metrics/prometheus/server_test.go
Tests for TLS from env (cert/key + METRICS_PROMETHEUS_TLS_MIN_VERSION and related vars).
Configuration
Cert/key (required for TLS): METRICS_PROMETHEUS_TLS_CERT, METRICS_PROMETHEUS_TLS_KEY (file paths). Both must be set or both unset.
TLS parameters (optional, via network/tls): METRICS_PROMETHEUS_TLS_MIN_VERSION, METRICS_PROMETHEUS_TLS_MAX_VERSION, METRICS_PROMETHEUS_TLS_CIPHER_SUITES, METRICS_PROMETHEUS_TLS_CURVE_PREFERENCES. Defaults (e.g. TLS 1.3) apply when not set.
Client authentication (optional, requires TLS): METRICS_PROMETHEUS_TLS_CLIENT_AUTH (none, optional, require), METRICS_PROMETHEUS_TLS_CLIENT_CA_FILE (path to CA bundle for verifying client certs). optional or require requires METRICS_PROMETHEUS_TLS_CLIENT_CA_FILE to be set. Setting either without TLS cert/key is a validation error.
Host/port: Existing METRICS_PROMETHEUS_HOST, METRICS_PROMETHEUS_PORT unchanged.
/kind enhancement
Fixes #
Release Note
Docs