Skip to content

Add TLS support to Prometheus metrics server#3322

Open
jkhelil wants to merge 1 commit intoknative:mainfrom
jkhelil:add-metrics-tls
Open

Add TLS support to Prometheus metrics server#3322
jkhelil wants to merge 1 commit intoknative:mainfrom
jkhelil:add-metrics-tls

Conversation

@jkhelil
Copy link
Contributor

@jkhelil jkhelil commented Feb 24, 2026

This PR adds native TLS/HTTPS support to the Prometheus metrics server in knative.dev/pkg/observability/metrics/prometheus, enabling secure metrics collection without requiring additional sidecars or proxies.

Problem

Many production environments require encrypted metrics endpoints for compliance and security, but the current Prometheus server only supports plain HTTP. Existing solutions have significant drawbacks:

  1. kube-rbac-proxy limitations:

    • Deprecated and unmaintained (archived by Brancz in 2023)
    • Requires sidecar container for each component (15+ containers in multi-component operators)
    • Increased memory overhead and operational complexity
    • Additional maintenance burden for certificate rotation
  2. Post-Quantum Cryptography (PQC) Readiness:

    • Organizations need to configure TLS settings (MinVersion, MaxVersion, CipherSuites) to prepare for PQC migration
    • Current solution provides no programmatic control over TLS configuration

Solution

Native TLS support in the Prometheus server provides:

  • Zero additional containers or sidecars
  • Minimal memory footprint
  • Built-in support for all Knative components
  • Full programmatic control over TLS configuration (PQC-ready)
  • Simple certificate file-based configuration for basic use cases
  • Backward compatible (plain HTTP when TLS not configured)

Changes

  • Add certFile and keyFile on the server (populated from env).
    Env vars: METRICS_PROMETHEUS_TLS_CERT, METRICS_PROMETHEUS_TLS_KEY for cert/key paths; TLS settings (MinVersion, MaxVersion, CipherSuites, CurvePreferences) via METRICS_PROMETHEUS_TLS_* using knative.dev/pkg/network/tls.DefaultConfigFromEnv("METRICS_PROMETHEUS_").

  • Add mTLS (client certificate verification) via two new env vars:
    METRICS_PROMETHEUS_TLS_CLIENT_AUTH: controls client cert policy (none, optional, require).
    METRICS_PROMETHEUS_TLS_CLIENT_CA_FILE: path to the CA bundle used to verify client certificates (e.g. the

  • API: ListenAndServe() (plain HTTP), ListenAndServeTLS(certFile, keyFile) (TLS), and Serve() (use TLS when both cert and key are set, otherwise plain HTTP).

  • Validation: In validate(), require both cert and key set or both unset; return a clear error otherwise (no silent fallback to plain HTTP when only one is set). Also validates mTLS configuration: clientAuth must be one of none, optional, or require; setting mTLS env vars without TLS enabled is an error; optional or require without a client CA file is an error.

  • Godoc updated for TLS and mTLS behavior and env vars.

  • observability/metrics/prometheus_enabled.go
    Call server.Serve() so TLS vs plain is chosen from config (e.g. when using sharedmain and env-based config). Check NewServer error before starting the serve goroutine to prevent nil-pointer panics. Ignore http.ErrServerClosed when logging server errors so graceful shutdown is not reported as an error.

  • observability/metrics/prometheus/server_test.go
    Tests for TLS from env (cert/key + METRICS_PROMETHEUS_TLS_MIN_VERSION and related vars).

Configuration

  • Cert/key (required for TLS): METRICS_PROMETHEUS_TLS_CERT, METRICS_PROMETHEUS_TLS_KEY (file paths). Both must be set or both unset.

  • TLS parameters (optional, via network/tls): METRICS_PROMETHEUS_TLS_MIN_VERSION, METRICS_PROMETHEUS_TLS_MAX_VERSION, METRICS_PROMETHEUS_TLS_CIPHER_SUITES, METRICS_PROMETHEUS_TLS_CURVE_PREFERENCES. Defaults (e.g. TLS 1.3) apply when not set.

  • Client authentication (optional, requires TLS): METRICS_PROMETHEUS_TLS_CLIENT_AUTH (none, optional, require), METRICS_PROMETHEUS_TLS_CLIENT_CA_FILE (path to CA bundle for verifying client certs). optional or require requires METRICS_PROMETHEUS_TLS_CLIENT_CA_FILE to be set. Setting either without TLS cert/key is a validation error.

  • Host/port: Existing METRICS_PROMETHEUS_HOST, METRICS_PROMETHEUS_PORT unchanged.

/kind enhancement

Fixes #

Release Note

The Prometheus metrics server in knative.dev/pkg now supports native HTTPS and mTLS. TLS is configured via environment variables: set METRICS_PROMETHEUS_TLS_CERT and METRICS_PROMETHEUS_TLS_KEY for certificate and key file paths, and optionally METRICS_PROMETHEUS_TLS_MIN_VERSION, METRICS_PROMETHEUS_TLS_MAX_VERSION, METRICS_PROMETHEUS_TLS_CIPHER_SUITES, and METRICS_PROMETHEUS_TLS_CURVE_PREFERENCES (using knative.dev/pkg/network/tls) for Post-Quantum Cryptography (PQC)–ready tuning. For mutual TLS (mTLS), set METRICS_PROMETHEUS_TLS_CLIENT_AUTH (none, optional, or require) and METRICS_PROMETHEUS_TLS_CLIENT_CA_FILE (path to the CA bundle used to verify client certificates). This enables operators to enforce client certificate authentication on the metrics endpoint,

Docs


@knative-prow
Copy link

knative-prow bot commented Feb 24, 2026

@jkhelil: The label(s) kind/<kind> cannot be applied, because the repository doesn't have them.

Details

In response to this:

This PR adds native TLS/HTTPS support to the Prometheus metrics server in knative.dev/pkg/observability/metrics/prometheus, enabling secure metrics collection without requiring additional sidecars or proxies.

Problem

Many production environments require encrypted metrics endpoints for compliance and security, but the current Prometheus server only supports plain HTTP. Existing solutions have significant drawbacks:

  1. kube-rbac-proxy limitations:
  • Deprecated and unmaintained (archived by Brancz in 2023)
  • Requires sidecar container for each component (15+ containers in multi-component operators)
  • Increased memory overhead and operational complexity
  • Additional maintenance burden for certificate rotation
  1. Post-Quantum Cryptography (PQC) Readiness:
  • Organizations need to configure TLS settings (MinVersion, MaxVersion, CipherSuites) to prepare for PQC migration
  • Current solution provides no programmatic control over TLS configuration

Solution

Native TLS support in the Prometheus server provides:

  • Zero additional containers or sidecars
  • Minimal memory footprint
  • Built-in support for all Knative components
  • Full programmatic control over TLS configuration (PQC-ready)
  • Simple certificate file-based configuration for basic use cases
  • Backward compatible (plain HTTP when TLS not configured)

Changes

  1. observability/metrics/prometheus/server.go (+71, -8 lines)
  • Add crypto/tls import
  • Add METRICS_TLS_CERT and METRICS_TLS_KEY constants
  • Extend Server struct with certFile, keyFile fields
  • Extend options struct with tlsConfig, certFile, keyFile
  • Update ListenAndServe() to return error and implement TLS priority logic
  • Add WithTLSConfig() and WithTLSCertFiles() server options
  • Add ContextWithTLSConfig() and TLSConfigFromContext() context helpers
  1. observability/metrics/prometheus/server_test.go (+118 lines)
  • Test WithTLSConfig() option
  • Test WithTLSCertFiles() option
  • Test environment variable configuration
  • Test environment variable priority over options
  • Test context-based TLS injection
  • Test TLS priority logic
  1. observability/metrics/prometheus_enabled.go (+11, -2 lines)
  • Update buildPrometheus() to use ctx context.Context (not _)

  • Retrieve TLSConfig from context if present

  • Handle error return from ListenAndServe()

/kind

Fixes #

Release Note


Docs


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@knative-prow knative-prow bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 24, 2026
@knative-prow
Copy link

knative-prow bot commented Feb 24, 2026

Hi @jkhelil. Thanks for your PR.

I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@knative-prow knative-prow bot requested review from Leo6Leo and skonto February 24, 2026 10:01
@codecov
Copy link

codecov bot commented Feb 24, 2026

Codecov Report

❌ Patch coverage is 73.13433% with 18 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.85%. Comparing base (5504026) to head (a65b291).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
observability/metrics/prometheus/server.go 74.19% 12 Missing and 4 partials ⚠️
observability/metrics/prometheus_enabled.go 60.00% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3322      +/-   ##
==========================================
+ Coverage   74.84%   74.85%   +0.01%     
==========================================
  Files         189      189              
  Lines        8287     8347      +60     
==========================================
+ Hits         6202     6248      +46     
- Misses       1844     1854      +10     
- Partials      241      245       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Member

@dprotaso dprotaso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll need to gofumpt the files you are touching (https://github.com/mvdan/gofumpt) to satisfy the linter

Comment on lines +89 to +91
if s.http.TLSConfig != nil || (s.certFile != "" && s.keyFile != "") {
return s.http.ListenAndServeTLS(s.certFile, s.keyFile)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a case when a TLS config would not be used with ListenAndServeTLS ?

Comment on lines +58 to +61
if tlsConfig := prometheus.TLSConfigFromContext(ctx); tlsConfig != nil {
opts = append(opts, prometheus.WithTLSConfig(tlsConfig))
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop context check here.

If anything we should add options to Config if env vars overrides aren't enough - but it seems like they should be since we namespaced them

@dprotaso
Copy link
Member

dprotaso commented Mar 5, 2026

/ok-to-test

@knative-prow knative-prow bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 5, 2026
Copy link
Contributor

@twoGiants twoGiants left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this @jkhelil ! 😸 👍

I have a few small remarks below.

http: &http.Server{
Addr: addr,
Handler: mux,
// https://medium.com/a-journey-with-go/go-understand-and-mitigate-slowloris-attack-711c1b1403f6
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably leave this comment here.

opts = append(opts, prometheus.WithTLSConfig(tlsConfig))
}

server, err := prometheus.NewServer(opts...)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Existing error: missing error check. If NewServer fails, server is nil but the go routine below executes anyway and panics.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkhelil mind adding a fix for this?

func (s *Server) ListenAndServe() {
s.http.ListenAndServe()
func (s *Server) ListenAndServe() error {
if s.http.TLSConfig != nil || (s.certFile != "" && s.keyFile != "") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This condition will silently fallback to s.http.ListenAndServe() if certFile or keyFile are not set. Don't think that is intended, I'd rather return an error that both must be set or none.

You could validate in the validate function where other options are validated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit

This one is unused. Can be removed.


go func() {
server.ListenAndServe()
if err := server.ListenAndServe(); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filter for http.ErrServerClosed which is returned on graceful shutdown which is not an error. See injection/sharedmain/main.go for example.

Suggested change
if err := server.ListenAndServe(); err != nil {
if err := server.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {


func (s *Server) ListenAndServe() {
s.http.ListenAndServe()
func (s *Server) ListenAndServe() error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ListenAndServe is a misleading name it actually also serves TLS traffic. Lets add a docs comment to the method signature stating that it can also serve TLS if TLS config or certificates are provided.


Random thoughts: I would actually prefer to separate ListenAndServe and ListenAndServeTLS like it is done by net/http own API design. The caller should make the choice which method to use depending on the options it's having. But that would have bigger implications on the refactoring => we would need to pull out the env var checks to the caller and I don't think it's worth it here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's worth it here.

Yeah that's the big issue - it pushes the ownership on the caller to pick TLS. The original intent was to have ListenAndServe mimic the http.Server

What we could do is have ListenAndServe - ListenAndServeTLS and a new method that auto picks TLS if the settings are present. eg. ListenAndServeAutoTLS ? Though unsure about the name is best fits

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What we could do is have ListenAndServe - ListenAndServeTLS and a new method that auto picks TLS if the settings are present.

Yes, that would be quite close to what I was talking about but it means expanding the public API by 2 public methods. I wouldn't mind, it's a good idea 😸

@dprotaso
Copy link
Member

Added a comment - #3322 (comment)

@jkhelil jkhelil changed the title Add TLS support to Prometheus metrics server WIP Add TLS support to Prometheus metrics server Mar 13, 2026
@knative-prow knative-prow bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 13, 2026
@jkhelil jkhelil force-pushed the add-metrics-tls branch 2 times, most recently from 6906ce7 to 308dd8a Compare March 13, 2026 07:11
@jkhelil jkhelil changed the title WIP Add TLS support to Prometheus metrics server Add TLS support to Prometheus metrics server Mar 13, 2026
@knative-prow knative-prow bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 13, 2026
Copy link
Contributor

@twoGiants twoGiants left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank for the rework! Looks great 👍

I left a few more comments.

@@ -71,22 +98,44 @@ func NewServer(opts ...ServerOption) (*Server, error) {
Addr: addr,
Handler: mux,
// https://medium.com/a-journey-with-go/go-understand-and-mitigate-slowloris-attack-711c1b1403f6
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is for line 102.

)

func buildPrometheus(_ context.Context, cfg Config) (sdkmetric.Reader, shutdownFunc, error) {
func buildPrometheus(ctx context.Context, cfg Config) (sdkmetric.Reader, shutdownFunc, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Can stay blank if it's not used or use it below in line 66 and get the logger from ctx and log an error.

Suggested change
func buildPrometheus(ctx context.Context, cfg Config) (sdkmetric.Reader, shutdownFunc, error) {
func buildPrometheus(_ context.Context, cfg Config) (sdkmetric.Reader, shutdownFunc, error) {

prometheusTLSCertEnvName = "METRICS_PROMETHEUS_TLS_CERT"
prometheusTLSKeyEnvName = "METRICS_PROMETHEUS_TLS_KEY"
prometheusTLSClientAuthEnvName = "METRICS_PROMETHEUS_TLS_CLIENT_AUTH"
prometheusTLSClientCAFileEnv = "METRICS_PROMETHEUS_TLS_CLIENT_CA_FILE"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
prometheusTLSClientCAFileEnv = "METRICS_PROMETHEUS_TLS_CLIENT_CA_FILE"
prometheusTLSClientCAFileEnvName = "METRICS_PROMETHEUS_TLS_CLIENT_CA_FILE"

return fmt.Errorf("%s must be set when %s is %q (client certs cannot be validated without a CA)",
prometheusTLSClientCAFileEnv, prometheusTLSClientAuthEnvName, auth)
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

validate() allows: TLS cert+key set, clientAuth="none", clientCAFile="/some/path". Then applyPrometheusClientAuth() returns early for "none" and the CA file is never loaded.

Operator sets a CA file path but it does nothing which is confusing. Either reject the combination in validation or warn.

I see two scenarios where this could be an issue:

  • Sets METRICS_PROMETHEUS_TLS_CLIENT_CA_FILE=/etc/tls/ca.pem but forgets to set METRICS_PROMETHEUS_TLS_CLIENT_AUTH => expects mTLS gets plain TLS.
  • Sets METRICS_PROMETHEUS_TLS_CLIENT_AUTH=none plus a CA file by mistake and the CA file does nothing, but no error signals the misconfiguration.

This would fix it:

Suggested change
if tlsEnabled && (auth == "" || auth == "none") && strings.TrimSpace(o.clientCAFile) != "" {
return fmt.Errorf("%s is set but %s is %q; set %s to %q or %q to use client certificate verification",
prometheusTLSClientCAFileEnv, prometheusTLSClientAuthEnvName, auth, prometheusTLSClientAuthEnvName, "optional", "require")
}


t.Run("optional with valid client CA file sets ClientAuth and ClientCAs", func(t *testing.T) {
caFile := createTempCACertFile(t)
t.Cleanup(func() { _ = os.Remove(caFile) })
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this line into createTempCACertFile, then you don't have to call it here and below in line 231. Cleanup runs when the test that owns t finishes, so it should be fine.

    - Add WithTLSConfig() and WithTLSCertFiles() server options
    - Support METRICS_TLS_CERT and METRICS_TLS_KEY env vars
Copy link
Contributor

@twoGiants twoGiants left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the solid re-work! Looks great @jkhelil 😸 👍

/approve
/lgtm

@knative-prow knative-prow bot added the lgtm Indicates that a PR is ready to be merged. label Mar 16, 2026
@knative-prow
Copy link

knative-prow bot commented Mar 16, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jkhelil, twoGiants
Once this PR has been reviewed and has the lgtm label, please assign leo6leo for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants