Skip to content

feat: Administrate supernode via MCP#29

Open
gonzalezzfelipe wants to merge 11 commits into
mainfrom
feat/mcp
Open

feat: Administrate supernode via MCP#29
gonzalezzfelipe wants to merge 11 commits into
mainfrom
feat/mcp

Conversation

@gonzalezzfelipe
Copy link
Copy Markdown
Contributor

@gonzalezzfelipe gonzalezzfelipe commented May 15, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Added MCP server component with authentication, authorization, and audit capabilities.
    • Introduced workload management tools for Kubernetes deployments and stateful sets.
    • Added Vault integration for runtime and operator secret management.
    • Added extension catalog with support for Cardano Node Relay and Dolos extensions.
    • New session store with SQLite persistence option.
  • Documentation

    • Updated guides for local MCP server development and deployment.
    • Added documentation for workload management and output port-forwarding workflows.
  • Chores

    • Added CI/CD pipelines for MCP server building, linting, and testing.
    • Extended control-plane Helm chart with MCP component configuration.

Review Change Stack

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 15, 2026

📝 Walkthrough

Walkthrough

Adds a new Rust MCP server with HTTP endpoints, Kubernetes/Helm/Vault integrations, session storage, and an extensive tool suite. Wires CI to lint/test/build/publish artifacts. Extends the control-plane Helm chart to deploy the MCP server and updates documentation and scripts accordingly.

Changes

MCP server crate, runtime, and CI

Layer / File(s) Summary
CI/build, packaging, and ignore rules
.github/workflows/*, mcp-server/Cargo.toml, mcp-server/Dockerfile, .gitignore, mcp-server/.dockerignore
Sets up lint/test/build matrix, multi-arch image publishing, Docker ignores, crate manifest, and runtime image.
HTTP entrypoint, router, and configuration/errors
mcp-server/src/main.rs, mcp-server/src/server.rs, mcp-server/src/config.rs, mcp-server/src/errors.rs, mcp-server/src/helm.rs
Bootstraps Axum server and MCP service, config/env loading, health endpoints, graceful shutdown, and Helm install API.
Auth, policy checks, and auditing
mcp-server/src/auth/*, mcp-server/src/policy/*, mcp-server/src/audit/*
Defines auth context, policy enforcement of scopes/approvals, and audit events with tracing sink.
Extension catalog, resources, and prompts
mcp-server/src/catalog/*, mcp-server/src/resources/*, mcp-server/src/prompts/*
Provides embedded extension catalog, MCP resources (status/catalog), and static prompt catalog.
Kubernetes client and Helm release discovery
mcp-server/src/k8s/*
Implements a typed kube client, pod exec/logs, and Helm v1 Secret decoding to list latest releases.
Vault client, paths, and redaction
mcp-server/src/vault/*
Vault KV v2 client, secure path validation, and secret redaction wrappers.
SQLite session store
mcp-server/src/session/*
SQLite-backed session store with TTL and WAL mode.
Tools surface, router, and common helpers
mcp-server/src/tools/{mod.rs,router.rs,common.rs,supernode.rs,vault.rs}
Defines tool metadata, router skeleton, common result helpers, and core supernode/vault tool definitions.
Workloads tool set and registries/outputs/summaries/defaults
mcp-server/src/tools/workloads/*, mcp-server/src/tools/k8s_summaries.rs
Workloads list/get/install planning, dynamic tools, outputs derivation, registries, and resource summaries/defaults.
Workloads logs/metrics tools
mcp-server/src/tools/workloads/{logs.rs,metrics.rs}
Implements log retrieval/target resolution and metrics target mapping.
Dynamic tool discovery, full router dispatch, and MCP server
mcp-server/src/tools/{dynamic.rs,router.rs}, mcp-server/src/mcp.rs
Dynamic tool discovery from Helm releases, complete dispatcher logic, and MCP handler with periodic refresh and auditing.

Control-plane Helm integration, scripts, and docs

Layer / File(s) Summary
Control-plane Helm templates and values for MCP
extensions/control-plane/templates/stage4-0*.yaml, extensions/control-plane/values.yaml
Adds PVC/SA/RBAC/StatefulSet/Service and values for deploying supernode-mcp.
post_install.sh Vault token and kubectl context
extensions/control-plane/scripts/post_install.sh
Creates MCP Vault policy/token/Secret with optional kubectl context, and updates output/verify steps.
Control-plane/Bootstrap documentation updates
extensions/control-plane/README.md, bootstrap/kind/README.md
Documents MCP values, post-install reconciliation, and a local Kind workflow.
Cardano-node metrics role field
extensions/cardano-node/templates/configmap-metrics.yaml
Adds role=relay to emitted metrics.
Skills docs for MCP-driven workflows
skills/*
Rewrites relay/Dolos deployment guides around MCP tools and adds output port-forwarding guide.
Slim base image Helm install step
.github/image/Dockerfile
Minimizes packages and installs Helm via script in the build image.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

  • txpipe/metis#25 — Edits in Cardano-node metrics templates; overlaps with metrics ConfigMap changes adding role="relay".

Suggested reviewers

  • mduthey
  • scarmuega

Poem

In burrows of bytes, I thump with cheer,
A server now speaks, its tools draw near.
Helm charts hatch, pods blink awake—
Vault whispers safe, logs stream like a lake.
From Kind to cloud I hop, deploy—
Relay and Dolos, a rabbit’s joy. 🐇✨

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/mcp

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

🟠 Major comments (20)
extensions/control-plane/templates/stage4-02-clusterrole-supernode-mcp.yaml-10-36 (1)

10-36: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Cluster-wide RBAC is over-privileged for a “readonly” role.

This role currently allows mutating cluster resources (including full secret CRUD) across all namespaces. That’s a high-risk permission set for the default deployment path and doesn’t match the role intent.

Please split permissions into:

  • a namespace-scoped Role/RoleBinding for mutable workload operations in the target namespace, and
  • a minimal ClusterRole only for truly cluster-scoped reads (if required).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@extensions/control-plane/templates/stage4-02-clusterrole-supernode-mcp.yaml`
around lines 10 - 36, The ClusterRole is over-privileged: remove all mutating
verbs (create, update, patch, delete) and secret CRUD from the cluster-scoped
role and instead create a namespace-scoped Role + RoleBinding for mutable
workload ops. Concretely, change the ClusterRole entries that currently list
resources like
"serviceaccounts","services","persistentvolumeclaims","configmaps","secrets","deployments","statefulsets","podmonitors"
and verbs ["create","get","list","watch","update","patch","delete"] to a minimal
ClusterRole that only has get/list/watch for truly cluster-scoped reads (e.g.,
"namespaces","storageclasses","pods" as needed) and ensure "pods/log" and
"pods/exec" remain namespace-scoped; then add a Role (target namespace) that
grants create/get/list/watch/update/patch/delete for
"serviceaccounts","services","persistentvolumeclaims","configmaps","secrets","deployments","statefulsets","podmonitors"
and bind it to the same subject via a RoleBinding. Use the resource names and
verbs in the diff to locate the rules to modify and add corresponding
Role/RoleBinding objects.
extensions/control-plane/values.yaml-166-174 (1)

166-174: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Harden defaults: avoid enabling MCP with auth.mode: trusted by default.

Shipping enabled + trusted auth as defaults exposes an unauthenticated admin surface to any in-cluster caller unless users override values.

Consider defaulting to one of:

  • enabled: false, or
  • a non-trusted auth mode with explicit credentials configuration.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@extensions/control-plane/values.yaml` around lines 166 - 174, Change the
insecure default values in the Helm values file: do not ship enabled: true with
auth.mode: trusted. Update the defaults (the keys referenced: enabled,
auth.mode, and related image/replicaCount) so that either enabled: false is the
default OR auth.mode is a non-trusted mode that requires explicit credentials
(and add placeholder credential configuration fields) so the MCP does not expose
an unauthenticated admin surface out-of-the-box; adjust values.yaml accordingly
and document the required credentials for any non-disabled default.
extensions/cardano-node/templates/configmap-metrics.yaml-459-459 (1)

459-459: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not hardcode role to relay in emitted metrics.

The script computes node_role from METIS_NODE_ROLE, but the output now always reports "relay", which mislabels block-producer nodes.

Suggested fix
-    jq -n \
+    jq -n \
+      --arg nodeRole "$node_role" \
@@
-          role: "relay",
+          role: (nullable_string($nodeRole) // "relay"),
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@extensions/cardano-node/templates/configmap-metrics.yaml` at line 459, The
emitted metrics hardcode role: "relay" instead of using the computed node role;
update the template in configmap-metrics.yaml so the role field uses the
computed variable (node_role / METIS_NODE_ROLE) rather than the literal "relay".
Locate the metrics emission block that sets role: "relay" and replace that
hardcoded string with the template variable that holds the result of
METIS_NODE_ROLE (e.g., use the node_role/template variable used earlier) so
block-producer nodes are labeled correctly.
extensions/control-plane/templates/stage4-00-pvc-supernode-mcp.yaml-1-24 (1)

1-24: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Remove the standalone PVC; it’s unused with the current StatefulSet flow.

When persistence.enabled=true and existingClaim is empty, the StatefulSet already provisions session-store via volumeClaimTemplates. This template creates a second PVC (supernode-mcp) that is never mounted and can still consume storage.

Suggested fix
-{{- if and .Values.supernodeMcp.enabled .Values.supernodeMcp.persistence.enabled (not .Values.supernodeMcp.persistence.existingClaim) }}
-apiVersion: v1
-kind: PersistentVolumeClaim
-metadata:
-  name: supernode-mcp
-  namespace: '{{ .Values.namespace }}'
-  labels:
-    app.kubernetes.io/name: supernode-mcp
-    app.kubernetes.io/component: mcp
-{{- with .Values.supernodeMcp.persistence.annotations }}
-  annotations:
-{{ toYaml . | indent 4 }}
-{{- end }}
-spec:
-  accessModes:
-{{ toYaml .Values.supernodeMcp.persistence.accessModes | indent 4 }}
-{{- $storageClass := default .Values.storageClass .Values.supernodeMcp.persistence.storageClass }}
-{{- if $storageClass }}
-  storageClassName: {{ $storageClass }}
-{{- end }}
-  resources:
-    requests:
-      storage: {{ .Values.supernodeMcp.persistence.size }}
-{{- end }}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@extensions/control-plane/templates/stage4-00-pvc-supernode-mcp.yaml` around
lines 1 - 24, This template creates an unused PersistentVolumeClaim named
"supernode-mcp" which duplicates storage when the StatefulSet already provisions
"session-store" via volumeClaimTemplates; remove this PersistentVolumeClaim
resource (the apiVersion: v1 kind: PersistentVolumeClaim block that references
.Values.supernodeMcp.persistence and creates name: supernode-mcp) so PVCs are
only created by the StatefulSet's volumeClaimTemplates, or alternatively gate
rendering behind a new flag (e.g., .Values.supernodeMcp.useStandalonePVC) and
update the top conditional to require that flag before emitting the
supernode-mcp PVC.
.github/image/Dockerfile-5-5 (1)

5-5: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid piping mutable remote Helm installer scripts into bash; pin and verify Helm artifact instead.

Line 5 introduces a supply-chain and reproducibility gap. The Helm installer script cannot verify checksums (requires openssl, which is not installed on line 4), and no version is pinned—the script always fetches from the main branch. This creates a non-deterministic, unverified artifact in your build.

Use the suggested diff approach: download and verify a pinned Helm tarball explicitly, then extract it. This ensures reproducibility and eliminates reliance on a mutable remote script.

Suggested hardening diff
 RUN apt-get update \
     && apt-get install -y --no-install-recommends ca-certificates curl \
-    && curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash \
+    && HELM_VERSION="v3.19.0" \
+    && curl -fsSLo /tmp/helm.tgz "https://get.helm.sh/helm-${HELM_VERSION}-linux-${TARGETARCH}.tar.gz" \
+    && curl -fsSLo /tmp/helm.tgz.sha256 "https://get.helm.sh/helm-${HELM_VERSION}-linux-${TARGETARCH}.tar.gz.sha256" \
+    && echo "$(cat /tmp/helm.tgz.sha256)  /tmp/helm.tgz" | sha256sum -c - \
+    && tar -xzf /tmp/helm.tgz -C /tmp \
+    && mv "/tmp/linux-${TARGETARCH}/helm" /usr/local/bin/helm \
+    && chmod +x /usr/local/bin/helm \
+    && rm -rf /tmp/helm.tgz /tmp/helm.tgz.sha256 /tmp/linux-${TARGETARCH} \
     && apt-get purge -y --auto-remove curl \
     && rm -rf /var/lib/apt/lists/*
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/image/Dockerfile at line 5, Replace the unsafe piped installer
command ("curl -fsSL
https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash")
with a pinned, verifiable Helm release flow: choose a specific Helm version
(e.g., vX.Y.Z), download the corresponding release tarball from GitHub Releases,
fetch the official sha256 checksum for that exact version, verify the downloaded
tarball checksum with a checksum tool (openssl/sha256sum), extract the helm
binary and install it to the target location, and remove the downloaded
artifacts; ensure any required tooling for verification (openssl or coreutils)
is installed in the image before verification to make the process reproducible
and supply-chain safe.
extensions/control-plane/scripts/post_install.sh-34-41 (1)

34-41: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Make MCP Vault provisioning opt-in.

Line 34 defaults MCP_VAULT_ENABLED to true, so a normal post-install now mints a long-lived orphan token with write access to the runtime KV even when supernode-mcp is not enabled. That silently expands the credential footprint for existing clusters.

Suggested fix
-MCP_VAULT_ENABLED="${MCP_VAULT_ENABLED:-true}"
+MCP_VAULT_ENABLED="${MCP_VAULT_ENABLED:-false}"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@extensions/control-plane/scripts/post_install.sh` around lines 34 - 41, The
script currently defaults MCP_VAULT_ENABLED to true causing automatic Vault
token provisioning; change the default to opt-in by setting
MCP_VAULT_ENABLED="${MCP_VAULT_ENABLED:-false}" and ensure any conditional
checks that rely on MCP_VAULT_ENABLED (the Vault provisioning branch that uses
MCP_VAULT_POLICY_NAME, MCP_VAULT_TOKEN_SECRET_NAME, MCP_VAULT_TOKEN_SECRET_KEY,
MCP_VAULT_TOKEN_DISPLAY_NAME, MCP_VAULT_TOKEN_TTL, MCP_VAULT_TOKEN_PERIOD,
MCP_VAULT_TOKEN_ORPHAN) respect the new default so provisioning only runs when
MCP_VAULT_ENABLED is explicitly enabled.
extensions/control-plane/scripts/post_install.sh-212-237 (1)

212-237: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Revoke the MCP token if Secret creation fails.

Lines 229-237 create the Vault token before the kubectl apply pipeline. If that pipeline fails, the script exits and cleanup only removes the temp file, leaving a live orphan token behind.

Suggested fix
+MCP_VAULT_SECRET_APPLIED="false"
+
 cleanup() {
   if [[ -n "${PORT_FORWARD_PID}" ]] && kill -0 "${PORT_FORWARD_PID}" >/dev/null 2>&1; then
     kill "${PORT_FORWARD_PID}" >/dev/null 2>&1 || true
     wait "${PORT_FORWARD_PID}" >/dev/null 2>&1 || true
   fi
   rm -f "${PORT_FORWARD_LOG}"
   if [[ -n "${MCP_VAULT_TOKEN_FILE}" ]]; then
+    if [[ "${MCP_VAULT_SECRET_APPLIED}" != "true" ]]; then
+      vault token revoke "$(cat "${MCP_VAULT_TOKEN_FILE}")" >/dev/null 2>&1 || true
+    fi
     rm -f "${MCP_VAULT_TOKEN_FILE}"
   fi
 }
 ...
   vault token create "${token_args[@]}" >"$MCP_VAULT_TOKEN_FILE"

   kubectl "${KUBECTL_ARGS[@]}" -n "$NAMESPACE" create secret generic "$MCP_VAULT_TOKEN_SECRET_NAME" \
     "--from-file=${MCP_VAULT_TOKEN_SECRET_KEY}=${MCP_VAULT_TOKEN_FILE}" \
     --dry-run=client \
     -o yaml \
     | kubectl "${KUBECTL_ARGS[@]}" -n "$NAMESPACE" apply -f -
+  MCP_VAULT_SECRET_APPLIED="true"
 fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@extensions/control-plane/scripts/post_install.sh` around lines 212 - 237, The
script currently creates the Vault token via the vault token create call and
writes it to MCP_VAULT_TOKEN_FILE but doesn't revoke it if the kubectl apply
pipeline fails; change the flow so after running kubectl ... apply -f - you
check the pipeline exit status and if it failed, read the token from
MCP_VAULT_TOKEN_FILE and revoke it with the vault token revoke command (using
the token string read from MCP_VAULT_TOKEN_FILE), then remove the temp file;
alternatively add a local failure handler/trap that revokes the token on error
using the same MCP_VAULT_TOKEN_FILE token contents so no live orphan token
remains when kubectl apply fails.
.github/workflows/test_mcp_server.yml-25-25 (1)

25-25: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Update to actions/checkout@v4 for runner compatibility.

The actions/checkout@v3 action runner is outdated. As per static analysis, update to v4 for compatibility with current GitHub Actions runners.

🔄 Proposed fix
       - name: Checkout
-        uses: actions/checkout@v3
+        uses: actions/checkout@v4
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/test_mcp_server.yml at line 25, Replace the outdated
checkout action reference "uses: actions/checkout@v3" with the newer
"actions/checkout@v4" in the workflow so the runner uses the v4-compatible
checkout action; locate the line containing the "uses: actions/checkout@v3"
entry and update it to reference v4.
.github/workflows/check_mcp_server.yml-25-25 (1)

25-25: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Update to actions/checkout@v4 for runner compatibility.

The actions/checkout@v3 action runner is outdated. As per static analysis, update to v4 for compatibility with current GitHub Actions runners.

🔄 Proposed fix
       - name: Checkout
-        uses: actions/checkout@v3
+        uses: actions/checkout@v4
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/check_mcp_server.yml at line 25, Update the GitHub Actions
workflow step that currently uses actions/checkout@v3 to actions/checkout@v4:
locate the workflow step with the uses entry "actions/checkout@v3" and change it
to "actions/checkout@v4" to ensure runner compatibility with current GitHub
Actions environments.
mcp-server/Dockerfile-11-15 (1)

11-15: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Pin Helm version instead of using the latest script.

Downloading and executing the Helm install script via curl | bash without version pinning is a security risk. If the script or Helm upstream is compromised, or if breaking changes are introduced, builds could fail or include malicious code.

🔒 Proposed fix with pinned version
 RUN apt-get update \
-    && apt-get install -y --no-install-recommends ca-certificates curl \
-    && curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash \
-    && apt-get purge -y --auto-remove curl \
+    && apt-get install -y --no-install-recommends ca-certificates curl wget \
+    && wget https://get.helm.sh/helm-v3.14.0-linux-amd64.tar.gz \
+    && tar -zxvf helm-v3.14.0-linux-amd64.tar.gz \
+    && mv linux-amd64/helm /usr/local/bin/helm \
+    && rm -rf helm-v3.14.0-linux-amd64.tar.gz linux-amd64 \
+    && apt-get purge -y --auto-remove curl wget \
     && rm -rf /var/lib/apt/lists/*

Note: Update the version number to match your requirements. Verify the latest stable release.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/Dockerfile` around lines 11 - 15, Replace the unpinned "curl |
bash" Helm install in the RUN line with a pinned-version installation: add a
HELM_VERSION build-arg or ENV, download the specific release tarball for that
version (from
github.com/helm/helm/releases/download/v${HELM_VERSION}/helm-v${HELM_VERSION}-linux-amd64.tar.gz),
verify it against a known SHA256 checksum, extract the binary and move it to
/usr/local/bin, and only then remove the tarball and cleanup apt lists; do not
pipe the upstream install script to bash and ensure checksum verification to
prevent supply-chain risks (update the existing RUN command that currently
invokes the Helm install script).
.github/workflows/build_mcp_server.yml-14-14 (1)

14-14: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Remove continue-on-error from the build job.

Allowing the build job to fail silently defeats the purpose of CI and can mask critical build failures. If the build fails, the workflow should fail.

🔧 Proposed fix
   build:
-    continue-on-error: true

     strategy:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/build_mcp_server.yml at line 14, The CI build job
currently sets the YAML key continue-on-error: true which allows failures to be
ignored; remove that key (or set it to false) from the build job definition so
the workflow fails on build errors—look for the build job block that contains
continue-on-error and delete or change that setting to ensure the job's failures
surface.
.github/workflows/build_mcp_server.yml-41-43 (1)

41-43: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Update to a supported Rust toolchain action.

The actions-rs/toolchain@v1 action is deprecated and its runner is too old for current GitHub Actions. As per static analysis, update to a maintained alternative.

🔄 Proposed fix using dtolnay/rust-toolchain
       - name: Install stable toolchain
-        uses: actions-rs/toolchain@v1
+        uses: dtolnay/rust-toolchain@stable
         with:
           toolchain: stable
+          targets: ${{ matrix.target }}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/build_mcp_server.yml around lines 41 - 43, Replace the
deprecated actions-rs/toolchain@v1 usage with a maintained Rust toolchain action
(e.g., dtolnay/rust-toolchain) in the workflow: update the uses entry that
currently references actions-rs/toolchain@v1 and keep the intended
channel/toolchain configuration (toolchain: stable) in the new action's inputs,
adjusting the input key names if the alternative action expects different field
names; then run the workflow to ensure the runner and toolchain install step
succeed.
mcp-server/src/k8s/helm_releases.rs-61-67 (1)

61-67: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

A single malformed Helm secret should not fail the entire discovery response.

collect::<Result<Vec<_>, _>>()? makes discovery brittle: one bad secret blocks all valid releases.

Proposed fix (best-effort decode)
-        Ok(latest_releases(
-            secrets
-                .items
-                .iter()
-                .map(decode_helm_release_secret)
-                .collect::<Result<Vec<_>, _>>()?,
-            include_control_plane,
-        ))
+        let decoded = secrets
+            .items
+            .iter()
+            .filter_map(|secret| decode_helm_release_secret(secret).ok())
+            .collect::<Vec<_>>();
+        Ok(latest_releases(decoded, include_control_plane))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/k8s/helm_releases.rs` around lines 61 - 67, The current use of
collect::<Result<Vec<_>, _>>()? over secrets.items with
decode_helm_release_secret makes the whole discovery fail on one bad secret;
change the flow so you best-effort decode each item: iterate secrets.items,
attempt decode_helm_release_secret for each (calling the same function), collect
only the successful Ok values into a Vec (e.g., via filter_map or flat_map
equivalent) and skip or log Err results, then pass that Vec into latest_releases
along with include_control_plane so a single malformed secret does not break the
entire response.
mcp-server/src/k8s/helm_releases.rs-255-259 (1)

255-259: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not exclude all releases just because they are in the control-plane namespace.

The namespace check causes false positives and hides legitimate non-control-plane releases in that namespace.

Proposed fix
 fn is_control_plane_release(release: &HelmReleaseSummary) -> bool {
     release.name == CONTROL_PLANE_RELEASE_NAME
         || release.chart.name.as_deref() == Some(CONTROL_PLANE_RELEASE_NAME)
-        || release.namespace == CONTROL_PLANE_RELEASE_NAME
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/k8s/helm_releases.rs` around lines 255 - 259, The function
is_control_plane_release in helm_releases.rs is incorrectly treating any release
in the CONTROL_PLANE_RELEASE_NAME namespace as the control plane, causing false
positives; remove the namespace equality check (the condition "||
release.namespace == CONTROL_PLANE_RELEASE_NAME") and only consider release.name
and release.chart.name.as_deref() == Some(CONTROL_PLANE_RELEASE_NAME) when
determining control-plane releases (adjust or add tests for HelmReleaseSummary
if needed).
mcp-server/src/helm.rs-60-60 (1)

60-60: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Harden Helm values-file handling for secrets.

Lines 84-97 write raw values to disk with default permissions, and Line 60 ignores cleanup failures. If values contain secrets, this risks leakage on shared nodes. Create files with restrictive perms (0600), prefer secure temp-file primitives, and surface/log delete failures.

Also applies to: 84-97

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/helm.rs` at line 60, The code currently writes raw Helm values
to disk with default permissions and silently ignores cleanup
(std::fs::remove_file(&values_path)); fix by creating the values file with
restrictive permissions and using a secure temp-file primitive: replace the raw
write at the "write raw values to disk" block (lines 84-97) with a
tempfile::NamedTempFile (or tempfile::Builder) to create an atomic,
securely-permissioned temporary file (or explicitly create with OpenOptions +
set_mode(0o600) on Unix), write the values into that temp file, then atomically
persist or move it to values_path if needed; also change the remove_file call
that currently ignores errors to check the Result and log or return the error
(e.g., using processLogger/tracing::error) instead of discarding it so deletion
failures are surfaced.
mcp-server/src/helm.rs-59-62 (1)

59-62: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add a process-level timeout around Helm execution.

Line 59 can block indefinitely if the Helm process hangs; --timeout does not guarantee subprocess termination in all failure modes. Wrap command execution with tokio::time::timeout and return a structured timeout error.

Suggested change
+use tokio::time::{timeout, Duration};
...
-    let output = Command::new(&helm_bin).args(&args).output().await;
+    let output = timeout(
+        Duration::from_secs(15 * 60),
+        Command::new(&helm_bin).args(&args).output(),
+    )
+    .await
+    .map_err(|_| HelmInstallError::Execute(std::io::Error::new(
+        std::io::ErrorKind::TimedOut,
+        "helm command timed out",
+    )))?;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/helm.rs` around lines 59 - 62, The Helm command execution can
hang indefinitely; wrap the await on
Command::new(&helm_bin).args(&args).output().await in tokio::time::timeout with
a reasonable Duration and map a timeout result to a new structured
HelmInstallError variant (e.g., HelmInstallError::Timeout) so callers get a
clear error; ensure you still remove the temp values file (values_path) and
convert the successful output via output.map_err(HelmInstallError::Execute) as
before, returning the timeout error when tokio::time::timeout returns Err.
mcp-server/src/policy/scopes.rs-23-40 (1)

23-40: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Prevent maintenance errors when adding new scopes.

The all() method manually lists every enum variant. When a new Scope is added to the enum (lines 7-20), the developer must remember to also add it to this array. The compiler won't catch omissions, creating a subtle bug where the new scope won't be included in policy checks that rely on all().

🔧 Consider using a macro or const array

One approach is to use a declarative macro that generates both the enum and the all() method from a single source of truth:

macro_rules! define_scopes {
    ($($variant:ident),* $(,)?) => {
        #[derive(Debug, Clone, Copy, Eq, PartialEq, Ord, PartialOrd, Serialize)]
        #[serde(rename_all = "kebab-case")]
        pub enum Scope {
            $($variant,)*
        }

        impl Scope {
            pub fn all() -> BTreeSet<Self> {
                [$(Self::$variant,)*].into_iter().collect()
            }
        }
    };
}

define_scopes! {
    Discover,
    Debug,
    WorkloadsInstall,
    WorkloadsUpgrade,
    WorkloadsDelete,
    VaultRuntimeMetadata,
    VaultRuntimeRead,
    VaultRuntimeWrite,
    VaultOperatorMetadata,
    VaultOperatorWrite,
    VaultOperatorRead,
    Admin,
}

Alternatively, use the strum crate with EnumIter to generate iteration automatically.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/policy/scopes.rs` around lines 23 - 40, The current
Scope::all() method manually lists every Scope variant so newly added variants
can be omitted; change the design to generate the enum and its all() collection
from a single source of truth (e.g., introduce a define_scopes! macro or use
strum::EnumIter) so adding a variant updates both enum and all() automatically;
update the Scope declaration and the impl Scope::all() to be produced by that
macro (or derive EnumIter and implement all() by iterating over Variant::iter())
and remove the hand-maintained array of variants in all().
mcp-server/src/tools/router.rs-772-777 (1)

772-777: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Clamp user-provided limit to a bounded maximum.

list_params accepts any u32 limit, so callers can request very large lists and create avoidable API pressure.

Proposed fix
 fn list_params(arguments: Option<&JsonObject>, default_limit: Option<u32>) -> ResourceListParams {
+    const MAX_LIMIT: u32 = 200;
+    let limit = optional_u32(arguments, "limit")
+        .map(|v| v.min(MAX_LIMIT))
+        .or(default_limit);
+
     ResourceListParams {
         label_selector: optional_string(arguments, "labelSelector"),
         field_selector: optional_string(arguments, "fieldSelector"),
-        limit: optional_u32(arguments, "limit").or(default_limit),
+        limit,
     }
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/tools/router.rs` around lines 772 - 777, list_params currently
accepts any u32 for limit which can cause unbounded requests; change its limit
construction to clamp user-provided values to a safe maximum: introduce a
MAX_LIST_LIMIT constant (e.g., 1000) and compute limit as
optional_u32(arguments, "limit").map(|l|
l.min(MAX_LIST_LIMIT)).or(default_limit) (or equivalent), keeping the rest of
ResourceListParams fields the same; reference the function name list_params, the
types ResourceListParams and optional_u32, and the default_limit parameter when
making this change.
mcp-server/src/tools/router.rs-869-877 (1)

869-877: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Schema type validation is incomplete for JSON Schema primitive types.

Unknown type values currently pass validation (_ => true), so declared array/null constraints are silently ignored.

Proposed fix
         let matches = match expected_type {
             "boolean" => value.is_boolean(),
             "integer" => value.as_i64().is_some(),
             "number" => value.as_f64().is_some(),
             "object" => value.is_object(),
             "string" => value.is_string(),
-            _ => true,
+            "array" => value.is_array(),
+            "null" => value.is_null(),
+            _ => false,
         };
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/tools/router.rs` around lines 869 - 877, The type-check branch
around schema.get("type").and_then(Value::as_str) currently treats unknown types
as valid (the _ => true arm) and omits "array" and "null"; update the match for
expected_type in the same block to include "array" => value.is_array() and
"null" => value.is_null(), and change the fallback arm from _ => true to _ =>
false so unknown/unsupported type strings fail validation; keep the rest of the
arms ("boolean", "integer", "number", "object", "string") as-is and ensure you
update the variable referenced as matches to use the new logic.
mcp-server/src/tools/dynamic.rs-33-48 (1)

33-48: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

On error, keep existing definitions instead of clearing them.

When discover_definitions() fails (e.g., Kubernetes temporarily unavailable), the current code returns an empty Vec, which causes all dynamic tools to disappear from the tool list. This degrades user experience—during transient failures, users see tools vanish and may think they're permanently gone. The function should preserve the existing definitions when refresh fails.

🔧 Proposed fix to preserve definitions on error
 pub(crate) async fn refresh(&self) -> bool {
     let definitions = match discover_definitions().await {
         Ok(definitions) => definitions,
         Err(error) => {
             tracing::debug!(%error, "failed to refresh dynamic MCP tools");
-            Vec::new()
+            return false;
         }
     };

     let mut current = self.definitions.write().await;
     let changed = tool_signature(current.as_ref()) != tool_signature(&definitions);
     if changed {
         *current = Arc::new(definitions);
     }
     changed
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/tools/dynamic.rs` around lines 33 - 48, The refresh function
currently replaces definitions with an empty Vec on discover_definitions()
error; instead, when discover_definitions() returns Err, log the error and
return false without modifying self.definitions so existing dynamic tools are
preserved. Update the logic in refresh: call discover_definitions().await, and
on Err(error) do tracing::debug!(%error, "failed to refresh dynamic MCP tools")
(or similar) and return false; only compute tool_signature and update *current =
Arc::new(definitions) when discover_definitions() returns Ok. Reference:
refresh, discover_definitions, self.definitions, and tool_signature.
🟡 Minor comments (7)
mcp-server/src/resources/uri.rs-29-31 (1)

29-31: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add validation to ensure consistency with parser.

The function creates URIs without validating that extension_id is non-empty and contains no / characters. The parser at line 23 rejects such URIs, creating an inconsistency where this function can produce URIs that parse() would reject.

🛡️ Proposed fix to add validation
 pub fn extension_catalog_entry_uri(extension_id: &str) -> String {
+    assert!(!extension_id.is_empty() && !extension_id.contains('/'),
+        "extension_id must be non-empty and cannot contain '/'");
     format!("{EXTENSION_CATALOG_ENTRY_PREFIX}{extension_id}")
 }

Alternatively, return Option<String> to handle invalid input gracefully:

-pub fn extension_catalog_entry_uri(extension_id: &str) -> String {
+pub fn extension_catalog_entry_uri(extension_id: &str) -> Option<String> {
+    if extension_id.is_empty() || extension_id.contains('/') {
+        return None;
+    }
-    format!("{EXTENSION_CATALOG_ENTRY_PREFIX}{extension_id}")
+    Some(format!("{EXTENSION_CATALOG_ENTRY_PREFIX}{extension_id}"))
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/resources/uri.rs` around lines 29 - 31, The
extension_catalog_entry_uri function can produce URIs that the parser rejects;
change its signature to return Option<String>, validate that extension_id is
non-empty and does not contain '/' (e.g., if extension_id.is_empty() ||
extension_id.contains('/') return None), and otherwise return
Some(format!("{EXTENSION_CATALOG_ENTRY_PREFIX}{extension_id}")); also update any
callers to handle the Option and keep behavior consistent with the parse()
function used elsewhere.
extensions/control-plane/scripts/post_install.sh-263-271 (1)

263-271: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Print the correct controller kind in the restart instructions.

Line 271 tells users to restart a Deployment, but this PR deploys supernode-mcp as a StatefulSet. The command shown here will fail as written.

Suggested fix
-Wire the Secret into the Helm-managed MCP Deployment with:
+Wire the Secret into the Helm-managed MCP StatefulSet with:
   helm upgrade ${RELEASE_NAME} . \\
     --namespace ${NAMESPACE} \\
     --reuse-values \\
     --set supernodeMcp.vault.tokenSecretRef.name=${MCP_VAULT_TOKEN_SECRET_NAME} \\
     --set supernodeMcp.vault.tokenSecretRef.key=${MCP_VAULT_TOKEN_SECRET_KEY}

-If the MCP Deployment is already running, restart it after the Helm upgrade:
-  kubectl ${KUBECTL_CONTEXT_TEXT}-n ${NAMESPACE} rollout restart deployment/supernode-mcp
+If the MCP StatefulSet is already running, restart it after the Helm upgrade:
+  kubectl ${KUBECTL_CONTEXT_TEXT}-n ${NAMESPACE} rollout restart statefulset/supernode-mcp
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@extensions/control-plane/scripts/post_install.sh` around lines 263 - 271, The
restart instruction currently references a Deployment but the controller is a
StatefulSet; update the kubectl rollback/restart command shown in the
post-install instructions so it targets the correct kind by replacing "rollout
restart deployment/supernode-mcp" with "rollout restart
statefulset/supernode-mcp" (the line that contains the kubectl ... rollout
restart command for supernode-mcp in post_install.sh).
.github/workflows/test_mcp_server.yml-18-18 (1)

18-18: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Rename job from lint to test for clarity.

The job is named lint but actually runs tests via cargo test. This is misleading and inconsistent with the workflow name.

📝 Proposed fix
 jobs:
-  lint:
+  test:
     runs-on: ubuntu-latest
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/test_mcp_server.yml at line 18, The job currently declared
as lint should be renamed to test to reflect that it runs cargo test: update the
job identifier and any job name fields from "lint" to "test" in the workflow
(the job block starting with "lint:"), and adjust any references or dependent
job names accordingly so the job id and display name match the actual command
(cargo test).
bootstrap/kind/README.md-112-112 (1)

112-112: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix workload type references from Deployment to StatefulSet.

The rollout status commands reference deployment/supernode-mcp (lines 112 and 126), but the control-plane Helm chart deploys supernode-mcp as a StatefulSet (defined in stage4-04-statefulset-supernode-mcp.yaml). These commands should use statefulset/supernode-mcp instead.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@bootstrap/kind/README.md` at line 112, The README currently references
deployment/supernode-mcp but the chart actually creates a StatefulSet; update
both occurrences that mention deployment/supernode-mcp to
statefulset/supernode-mcp so commands like kubectl -n control-plane rollout
status target the correct resource (supernode-mcp) and ensure any other
rollout/status examples use statefulset/... instead of deployment/....
mcp-server/src/config.rs-72-77 (1)

72-77: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Reject non-positive session TTL values.

Line 76 currently accepts 0 and negative TTLs. Validate TTL as strictly positive during config parsing to avoid silent misconfiguration.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/config.rs` around lines 72 - 77, The session_ttl_seconds()
parser currently accepts zero and negative values; after parsing the env var in
session_ttl_seconds(), validate that the parsed i64 is strictly > 0 and return
an Err(ConfigError::InvalidSessionTtl(value)) (or appropriate ConfigError
variant) when value <= 0 instead of returning Ok(Some(value)); implement this by
mapping the parsed result to check the positivity (e.g., replace the direct
parse mapping with a closure that parses then checks value > 0 and returns an
Err on invalid values) while keeping the existing transpose() flow.
mcp-server/src/tools/router.rs-81-82 (1)

81-82: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use the invoked tool name when mapping vault write/patch errors.

vault.runtime.patch currently reports failures under "vault.runtime.write", which mislabels audit/error details and complicates debugging.

Proposed fix
-            "vault.runtime.write" => vault_runtime_write(arguments, WriteMode::Replace).await,
-            "vault.runtime.patch" => vault_runtime_write(arguments, WriteMode::Patch).await,
+            "vault.runtime.write" => {
+                vault_runtime_write("vault.runtime.write", arguments, WriteMode::Replace).await
+            }
+            "vault.runtime.patch" => {
+                vault_runtime_write("vault.runtime.patch", arguments, WriteMode::Patch).await
+            }

 async fn vault_runtime_write(
+    tool_name: &str,
     arguments: Option<&JsonObject>,
     default_mode: WriteMode,
 ) -> CallToolResult {
@@
-        Err(error) => return vault_error("vault.runtime.write", error),
+        Err(error) => return vault_error(tool_name, error),
@@
-        Err(error) => return vault_error("vault.runtime.write", error),
+        Err(error) => return vault_error(tool_name, error),
@@
-        Err(error) => vault_error("vault.runtime.write", error),
+        Err(error) => vault_error(tool_name, error),
     }
 }

Also applies to: 116-145

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/tools/router.rs` around lines 81 - 82, The error/audit entries
for the vault write/patch branches are mislabelled because both match arms call
vault_runtime_write(arguments, WriteMode::...) but still record the tool as
"vault.runtime.write"; update the router match arms for "vault.runtime.write"
and "vault.runtime.patch" so the invoked tool name is passed into or set on the
call site (e.g., pass an extra parameter like tool_name or set a context) before
calling vault_runtime_write, or make vault_runtime_write accept and use the tool
name when composing errors/audit logs; ensure both branches
("vault.runtime.write" and "vault.runtime.patch") supply their correct string so
failures are mapped to the actual invoked tool name.
mcp-server/src/tools/workloads/mod.rs-24-25 (1)

24-25: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

workloads.list input schema is missing supported filter fields.

The handler accepts namespace, labelSelector, fieldSelector, and limit, but schema currently rejects/omits them. This creates avoidable client-contract drift.

Proposed fix
-            input_schema: r#"{"type":"object","properties":{"includeControlPlane":{"type":"boolean"}},"additionalProperties":false}"#,
+            input_schema: r#"{"type":"object","properties":{"namespace":{"type":"string"},"includeControlPlane":{"type":"boolean"},"labelSelector":{"type":"string"},"fieldSelector":{"type":"string"},"limit":{"type":"integer","minimum":1,"maximum":200}},"additionalProperties":false}"#,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/tools/workloads/mod.rs` around lines 24 - 25, The input_schema
for the workloads.list handler currently omits supported filter fields, causing
client/schema drift; update the input_schema JSON (the value assigned to
input_schema in the workloads.list handler) to include properties for
"namespace" (string), "labelSelector" (string), "fieldSelector" (string) and
"limit" (integer), keep additionalProperties as false and ensure the root type
remains "object" so the handler's accepted params (namespace, labelSelector,
fieldSelector, limit) validate against the schema.
🧹 Nitpick comments (8)
mcp-server/src/catalog/mod.rs (1)

42-44: ⚡ Quick win

Add is_empty() method to complement len().

Rust convention recommends providing an is_empty() method whenever len() is exposed, which avoids the clippy::len_without_is_empty lint and improves API ergonomics.

✨ Proposed addition
     pub fn len(&self) -> usize {
         self.extensions.len()
     }
+
+    pub fn is_empty(&self) -> bool {
+        self.extensions.is_empty()
+    }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/catalog/mod.rs` around lines 42 - 44, Add a complementary
is_empty() method to the same impl that exposes len(): implement pub fn
is_empty(&self) -> bool that forwards to self.extensions.is_empty() (or returns
self.extensions.len() == 0) so callers can check emptiness and to satisfy the
clippy::len_without_is_empty lint; add it alongside the existing pub fn
len(&self) -> usize.
mcp-server/Dockerfile (1)

3-7: ⚡ Quick win

Optimize Docker layer caching by separating dependency and source builds.

The current approach copies all source files before building, which invalidates the cache on every source change. Building dependencies in a separate layer first would significantly speed up rebuilds.

⚡ Proposed optimization
 WORKDIR /app
 COPY Cargo.toml Cargo.lock ./
-COPY src ./src

+# Build dependencies first (creates a cache layer)
+RUN mkdir src && echo "fn main() {}" > src/main.rs \
+    && cargo build --release \
+    && rm -rf src
+
+# Now copy real source and build
+COPY src ./src
 RUN cargo build --release

This technique creates a dummy main.rs, builds dependencies, then replaces with real source. Dependencies are cached until Cargo.toml or Cargo.lock change.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/Dockerfile` around lines 3 - 7, The Dockerfile copies the entire
src before building which busts the layer cache on any source change; to fix,
implement a multi-step build: first COPY Cargo.toml and Cargo.lock and add a
minimal/dummy src/main.rs (or touch a placeholder) then RUN cargo build
--release to cache dependency compilation, then COPY the real src/ (replacing
the dummy) and RUN cargo build --release again to produce the final artifact;
update the Dockerfile steps that reference COPY Cargo.toml Cargo.lock ./, COPY
src ./src and RUN cargo build --release to follow this dependency-first pattern
so rebuilding only invalidates when Cargo.toml or Cargo.lock change.
.github/workflows/build_mcp_server.yml (1)

94-95: ⚡ Quick win

Consider removing the unmaintained Docker layer caching action.

The satackey/action-docker-layer-caching action is deprecated and unmaintained. Docker Buildx (already in use at line 78) provides built-in caching via BuildKit that is more reliable and actively maintained.

♻️ Proposed fix
       - uses: satackey/action-docker-layer-caching@v0.0.11
-        continue-on-error: true

Remove these lines entirely. The docker/build-push-action@v5 already uses BuildKit caching.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/build_mcp_server.yml around lines 94 - 95, Remove the
deprecated satackey/action-docker-layer-caching step: delete the action entry
"uses: satackey/action-docker-layer-caching@v0.0.11" and its "continue-on-error:
true" configuration from the workflow; rely on the existing
docker/build-push-action@v5 / Docker Buildx caching already configured (the
buildx step referenced in the workflow) instead of adding this unmaintained
action.
.github/workflows/test_mcp_server.yml (1)

23-28: ⚡ Quick win

Consider adding Rust caching to speed up CI.

The build workflow uses Swatinem/rust-cache@v2 to cache dependencies. Adding the same caching here would speed up test runs.

⚡ Proposed addition
     steps:
       - name: Checkout
         uses: actions/checkout@v4
+
+      - uses: Swatinem/rust-cache@v2
+        with:
+          shared-key: "test"
+          workspaces: mcp-server -> target

       - name: Test
         run: cargo test
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/test_mcp_server.yml around lines 23 - 28, Add a Rust
dependency cache step before the existing "Test" step to speed up CI: insert a
step (e.g., name: "Setup Rust cache") that uses Swatinem/rust-cache@v2 and
configures caching for cargo (registry, index, and target/build artifacts) keyed
on Cargo.lock (or equivalent) so subsequent "Test" (cargo test) runs reuse
dependencies/build outputs; reference the new step name "Setup Rust cache" and
the existing "Test" step when placing it in the workflow.
.github/workflows/check_mcp_server.yml (1)

23-28: ⚡ Quick win

Consider adding Rust caching to speed up CI.

The build workflow uses Swatinem/rust-cache@v2 to cache dependencies. Adding the same caching here would speed up the clippy checks.

⚡ Proposed addition
     steps:
       - name: Checkout
         uses: actions/checkout@v4
+
+      - uses: Swatinem/rust-cache@v2
+        with:
+          shared-key: "lint"
+          workspaces: mcp-server -> target

       - name: Clippy check lints
         run: cargo clippy -- -D warnings
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/check_mcp_server.yml around lines 23 - 28, The workflow
currently runs the "Clippy check lints" step without restoring a Rust/Cargo
cache, which slows CI; add a step using Swatinem/rust-cache@v2 before the clippy
run to restore and save Cargo cache (registry, git index, and target) so
subsequent runs are faster — insert a new step named e.g. "Restore Rust cache"
that uses Swatinem/rust-cache@v2 with appropriate inputs (rust-version or
default) and ensure the cache step runs before the existing "Clippy check lints"
step and will also save the cache after the job completes.
mcp-server/src/audit/event.rs (1)

60-60: ⚡ Quick win

Consider whether secret_values_included should be parameterized.

The field is currently hardcoded to false in the factory method. If there are scenarios where secret values should be included in audit events (e.g., in non-production environments or with special audit flags), you may want to add a parameter to from_policy_outcome rather than hardcoding this value.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/audit/event.rs` at line 60, The factory currently hardcodes
secret_values_included: false in from_policy_outcome; make it configurable by
adding a parameter (e.g., secret_values_included: bool) to the
from_policy_outcome signature, use that parameter when constructing the Event
(replace the hardcoded false), and update all call sites to pass the appropriate
flag (or provide an overloaded/optional default wrapper that supplies false for
existing callers); ensure the field and any docs/tests reflect the new
parameter.
mcp-server/src/session/sqlite.rs (1)

77-94: ⚖️ Poor tradeoff

Expired session cleanup runs on every load.

The delete_expired call on line 80 runs for every load() invocation. If expired sessions accumulate, this could add latency to session lookups. Consider whether periodic background cleanup (via a separate task or cron job) would be more efficient for production workloads with high session churn.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/session/sqlite.rs` around lines 77 - 94, Expired-session
cleanup is being executed on every load (delete_expired inside load), which can
add latency; remove the delete_expired call from async fn load(&self, ...) and
instead perform cleanup from a dedicated background mechanism that invokes
delete_expired via the existing with_connection(...) helper: either spawn a
periodic async task (e.g., tokio::spawn) at store initialization to call
delete_expired on an interval, or implement a lightweight probabilistic/interval
guard inside the session store (track last_cleanup timestamp in the store struct
and only run delete_expired when enough time has passed) so load() only performs
the SELECT logic and returns state_json without running delete_expired each
call.
mcp-server/src/tools/workloads/dolos.rs (1)

150-155: ⚡ Quick win

Consider logging scale-back failures during error recovery.

When wait_for_dolos_pods_absent times out or PVC deletion fails, the code attempts to restore the original replica count but silently ignores errors with let _ = .... If this recovery fails, the StatefulSet remains scaled to zero, and there's no indication in the logs. While returning the original error to the user is correct, logging recovery failures would aid debugging.

📊 Add recovery failure logging
 if let Err(error) = wait_for_dolos_pods_absent(&client, &namespace, &release_name).await {
-    let _ = client
+    if let Err(recovery_error) = client
         .scale_stateful_set(&namespace, &stateful_set_name, original_replicas)
-        .await;
+        .await
+    {
+        tracing::warn!(
+            %recovery_error,
+            namespace,
+            stateful_set = stateful_set_name,
+            "failed to restore StatefulSet replicas after timeout"
+        );
+    }
     return error;
 }

Also applies to: 161-165

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/tools/workloads/dolos.rs` around lines 150 - 155, The recovery
path that calls client.scale_stateful_set(&namespace, &stateful_set_name,
original_replicas).await currently ignores errors; change it to capture the
result and, on Err(e), log a clear recovery-failure message including the
namespace, stateful_set_name and original_replicas along with the recovery error
before returning the original error from wait_for_dolos_pods_absent; apply the
same change to the other identical block (lines 161-165) so scale-back failures
are recorded for debugging.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b5f3be41-6308-4a52-b115-be6598276fb5

📥 Commits

Reviewing files that changed from the base of the PR and between 637efa2 and da7019c.

⛔ Files ignored due to path filters (1)
  • mcp-server/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (70)
  • .github/image/Dockerfile
  • .github/workflows/build_mcp_server.yml
  • .github/workflows/check_mcp_server.yml
  • .github/workflows/test_mcp_server.yml
  • .gitignore
  • bootstrap/kind/README.md
  • extensions/cardano-node/templates/configmap-metrics.yaml
  • extensions/control-plane/README.md
  • extensions/control-plane/scripts/post_install.sh
  • extensions/control-plane/templates/stage4-00-pvc-supernode-mcp.yaml
  • extensions/control-plane/templates/stage4-01-serviceaccount-supernode-mcp.yaml
  • extensions/control-plane/templates/stage4-02-clusterrole-supernode-mcp.yaml
  • extensions/control-plane/templates/stage4-03-clusterrolebinding-supernode-mcp.yaml
  • extensions/control-plane/templates/stage4-04-statefulset-supernode-mcp.yaml
  • extensions/control-plane/templates/stage4-05-service-supernode-mcp.yaml
  • extensions/control-plane/values.yaml
  • mcp-server/.dockerignore
  • mcp-server/Cargo.toml
  • mcp-server/Dockerfile
  • mcp-server/src/audit/event.rs
  • mcp-server/src/audit/mod.rs
  • mcp-server/src/audit/sink.rs
  • mcp-server/src/auth/mod.rs
  • mcp-server/src/catalog/cardano_node_relay.rs
  • mcp-server/src/catalog/dolos.rs
  • mcp-server/src/catalog/extension.rs
  • mcp-server/src/catalog/mod.rs
  • mcp-server/src/catalog/schema.rs
  • mcp-server/src/config.rs
  • mcp-server/src/errors.rs
  • mcp-server/src/helm.rs
  • mcp-server/src/k8s/client.rs
  • mcp-server/src/k8s/helm_releases.rs
  • mcp-server/src/k8s/mod.rs
  • mcp-server/src/main.rs
  • mcp-server/src/mcp.rs
  • mcp-server/src/policy/approvals.rs
  • mcp-server/src/policy/mod.rs
  • mcp-server/src/policy/roles.rs
  • mcp-server/src/policy/scopes.rs
  • mcp-server/src/prompts/catalog.rs
  • mcp-server/src/prompts/mod.rs
  • mcp-server/src/resources/mod.rs
  • mcp-server/src/resources/router.rs
  • mcp-server/src/resources/uri.rs
  • mcp-server/src/server.rs
  • mcp-server/src/session/mod.rs
  • mcp-server/src/session/sqlite.rs
  • mcp-server/src/tools/common.rs
  • mcp-server/src/tools/dynamic.rs
  • mcp-server/src/tools/k8s_summaries.rs
  • mcp-server/src/tools/mod.rs
  • mcp-server/src/tools/router.rs
  • mcp-server/src/tools/supernode.rs
  • mcp-server/src/tools/vault.rs
  • mcp-server/src/tools/workloads/dolos.rs
  • mcp-server/src/tools/workloads/install.rs
  • mcp-server/src/tools/workloads/logs.rs
  • mcp-server/src/tools/workloads/metrics.rs
  • mcp-server/src/tools/workloads/mod.rs
  • mcp-server/src/tools/workloads/outputs.rs
  • mcp-server/src/tools/workloads/registry.rs
  • mcp-server/src/vault/client.rs
  • mcp-server/src/vault/mod.rs
  • mcp-server/src/vault/paths.rs
  • mcp-server/src/vault/redaction.rs
  • skills/README.md
  • skills/cardano-relay-setup.md
  • skills/dolos-supernode-deployment.md
  • skills/workload-output-port-forward.md

Comment on lines +18 to +27
include:
- release_for: Linux-x86_64
build_on: ubuntu-22.04
target: x86_64-unknown-linux-gnu
args: "--locked --release"

- release_for: Linux-arm64
build_on: ubuntu-22.04-arm
target: "aarch64-unknown-linux-gnu"
args: "--locked --release"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Add missing ext property to matrix entries.

Lines 50-51 and 57 reference matrix.ext, but the matrix entries do not define this property, causing a workflow syntax error.

🐛 Proposed fix
         include:
           - release_for: Linux-x86_64
             build_on: ubuntu-22.04
             target: x86_64-unknown-linux-gnu
             args: "--locked --release"
+            ext: ""

           - release_for: Linux-arm64
             build_on: ubuntu-22.04-arm
             target: "aarch64-unknown-linux-gnu"
             args: "--locked --release"
+            ext: ""
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
include:
- release_for: Linux-x86_64
build_on: ubuntu-22.04
target: x86_64-unknown-linux-gnu
args: "--locked --release"
- release_for: Linux-arm64
build_on: ubuntu-22.04-arm
target: "aarch64-unknown-linux-gnu"
args: "--locked --release"
include:
- release_for: Linux-x86_64
build_on: ubuntu-22.04
target: x86_64-unknown-linux-gnu
args: "--locked --release"
ext: ""
- release_for: Linux-arm64
build_on: ubuntu-22.04-arm
target: "aarch64-unknown-linux-gnu"
args: "--locked --release"
ext: ""
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/build_mcp_server.yml around lines 18 - 27, The matrix
entries under include (the objects with keys release_for, build_on, target,
args) are missing the ext property referenced elsewhere as matrix.ext; add an
ext field to each included matrix entry (e.g., add ext: "<appropriate
extension>" to the Linux-x86_64 and Linux-arm64 include objects) so matrix.ext
resolves correctly where used on lines referencing matrix.ext.

Comment on lines +63 to +68
strategy:
matrix:
include:
- tags: ghcr.io/txpipe/metis-supernode-mcp,ghcr.io/txpipe/metis-supernode-mcp:${{ github.sha }}
binary: supernode-mcp

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Move dynamic tag generation out of the matrix definition.

GitHub Actions does not interpolate expressions like ${{ github.sha }} inside matrix definitions. The SHA tag will be literal text, not the actual commit SHA.

🔧 Proposed fix

Move tag generation to a step that sets an output or environment variable, then reference it in the build step:

     strategy:
       matrix:
         include:
-          - tags: ghcr.io/txpipe/metis-supernode-mcp,ghcr.io/txpipe/metis-supernode-mcp:${{ github.sha }}
-            binary: supernode-mcp
+          - binary: supernode-mcp

     permissions:
       contents: read
       packages: write

     steps:
       - name: Checkout repository
         uses: actions/checkout@v4
+
+      - name: Prepare tags
+        id: tags
+        run: |
+          echo "tags=ghcr.io/txpipe/metis-supernode-mcp,ghcr.io/txpipe/metis-supernode-mcp:${{ github.sha }}" >> $GITHUB_OUTPUT

       - name: Set up Docker Buildx
         uses: docker/setup-buildx-action@v3

Then update line 107:

-          tags: ${{ matrix.tags }}
+          tags: ${{ steps.tags.outputs.tags }}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/build_mcp_server.yml around lines 63 - 68, The matrix
currently hardcodes tags including the expression `${{ github.sha }}` which
won’t be interpolated; move dynamic tag generation out of the matrix by
replacing the matrix "tags" entry with a static placeholder (or remove the
SHA-specific tag) and add a prior step that computes the tags string using `${{
github.sha }}` and exposes it as an output or env var (e.g., set-output/tag or
echo to GITHUB_ENV). Then update the build/publish step that uses the matrix
`tags` (the step that references `tags`/`binary`) to instead consume the
computed output/env variable so the actual commit SHA is included in the runtime
tag list.

Comment thread mcp-server/src/server.rs
Comment on lines +52 to +55
let auth_context = match auth_mode {
AuthMode::Trusted => AuthContext::trusted(),
AuthMode::OAuth => anyhow::bail!("MCP_AUTH_MODE=oauth is not implemented yet"),
};
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Fail closed when secure auth is unavailable.

Line 54 rejects OAuth, so the server effectively runs only in trusted mode. Combined with non-loopback binds, this creates a privileged unauthenticated control surface. Add a startup guard that forbids trusted mode on non-loopback addresses (or require explicit insecure opt-in).

Suggested mitigation
 fn router(config: Config, cancellation_token: CancellationToken) -> anyhow::Result<Router> {
     let auth_mode = config.auth_mode;
+    if matches!(auth_mode, AuthMode::Trusted) && !config.bind_addr.ip().is_loopback() {
+        anyhow::bail!(
+            "trusted auth mode is only allowed on loopback interfaces; configure secure auth before exposing the service"
+        );
+    }
     let auth_context = match auth_mode {
         AuthMode::Trusted => AuthContext::trusted(),
         AuthMode::OAuth => anyhow::bail!("MCP_AUTH_MODE=oauth is not implemented yet"),
     };
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/server.rs` around lines 52 - 55, The code allows
AuthMode::Trusted unconditionally, which creates an unauthenticated control
surface when the server is bound to non-loopback addresses; update startup to
detect when auth_mode == AuthMode::Trusted and the server bind/listen address
(e.g., bind_addr or listen_addr used when starting the server) is not a loopback
IP, and then fail-fast with an error (or require an explicit insecure opt-in
flag/env var such as INSECURE_ALLOW_TRUSTED) instead of proceeding; ensure this
guard lives alongside the existing auth_mode match where AuthContext::trusted()
is created so the process exits on unsafe configuration.

Comment on lines +66 to +70
pub async fn runtime_metadata(
&self,
path: &VaultPath,
) -> Result<VaultSecretMetadata, VaultError> {
let metadata_response = self.get_metadata(path).await?;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Enforce Runtime path kind in runtime-scoped Vault APIs.

These methods currently accept operator paths too, which breaks the runtime/operator boundary and can allow unintended secret domain access.

Proposed fix
 use crate::vault::paths::VaultPath;
+use crate::vault::paths::VaultPathKind;
@@
     pub async fn runtime_metadata(
         &self,
         path: &VaultPath,
     ) -> Result<VaultSecretMetadata, VaultError> {
+        ensure_runtime_path(path)?;
         let metadata_response = self.get_metadata(path).await?;
         let secret_response = self.get_secret(path).await?;
@@
     pub async fn write_runtime_secret(
         &self,
         path: &VaultPath,
         secret: &SecretObject,
         mode: WriteMode,
     ) -> Result<VaultWriteReceipt, VaultError> {
+        ensure_runtime_path(path)?;
         let data = match mode {
@@
 }
+
+fn ensure_runtime_path(path: &VaultPath) -> Result<(), VaultError> {
+    if path.kind() != VaultPathKind::Runtime {
+        return Err(VaultError::Path(crate::vault::paths::VaultPathError::NotAllowed));
+    }
+    Ok(())
+}

Also applies to: 87-92

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/vault/client.rs` around lines 66 - 70, The runtime-scoped
Vault APIs (e.g., runtime_metadata) currently accept any VaultPath; update these
functions to validate that the provided VaultPath.kind == VaultPathKind::Runtime
(or equivalent enum) and return a VaultError (or Err) if not Runtime to prevent
operator paths from being used; add the same validation at the start of other
runtime methods referenced (e.g., runtime_get, runtime_put around lines 87–92)
so each function checks the path kind before calling
get_metadata/get_secret/put_secret and aborts with a clear error when the path
is not Runtime.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant