feat: Administrate supernode via MCP#29
Conversation
📝 WalkthroughWalkthroughAdds a new Rust MCP server with HTTP endpoints, Kubernetes/Helm/Vault integrations, session storage, and an extensive tool suite. Wires CI to lint/test/build/publish artifacts. Extends the control-plane Helm chart to deploy the MCP server and updates documentation and scripts accordingly. ChangesMCP server crate, runtime, and CI
Control-plane Helm integration, scripts, and docs
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related PRs
Suggested reviewers
Poem
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
|
There was a problem hiding this comment.
Actionable comments posted: 4
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
🟠 Major comments (20)
extensions/control-plane/templates/stage4-02-clusterrole-supernode-mcp.yaml-10-36 (1)
10-36:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftCluster-wide RBAC is over-privileged for a “readonly” role.
This role currently allows mutating cluster resources (including full secret CRUD) across all namespaces. That’s a high-risk permission set for the default deployment path and doesn’t match the role intent.
Please split permissions into:
- a namespace-scoped
Role/RoleBindingfor mutable workload operations in the target namespace, and- a minimal
ClusterRoleonly for truly cluster-scoped reads (if required).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@extensions/control-plane/templates/stage4-02-clusterrole-supernode-mcp.yaml` around lines 10 - 36, The ClusterRole is over-privileged: remove all mutating verbs (create, update, patch, delete) and secret CRUD from the cluster-scoped role and instead create a namespace-scoped Role + RoleBinding for mutable workload ops. Concretely, change the ClusterRole entries that currently list resources like "serviceaccounts","services","persistentvolumeclaims","configmaps","secrets","deployments","statefulsets","podmonitors" and verbs ["create","get","list","watch","update","patch","delete"] to a minimal ClusterRole that only has get/list/watch for truly cluster-scoped reads (e.g., "namespaces","storageclasses","pods" as needed) and ensure "pods/log" and "pods/exec" remain namespace-scoped; then add a Role (target namespace) that grants create/get/list/watch/update/patch/delete for "serviceaccounts","services","persistentvolumeclaims","configmaps","secrets","deployments","statefulsets","podmonitors" and bind it to the same subject via a RoleBinding. Use the resource names and verbs in the diff to locate the rules to modify and add corresponding Role/RoleBinding objects.extensions/control-plane/values.yaml-166-174 (1)
166-174:⚠️ Potential issue | 🟠 Major | ⚡ Quick winHarden defaults: avoid enabling MCP with
auth.mode: trustedby default.Shipping enabled + trusted auth as defaults exposes an unauthenticated admin surface to any in-cluster caller unless users override values.
Consider defaulting to one of:
enabled: false, or- a non-trusted auth mode with explicit credentials configuration.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@extensions/control-plane/values.yaml` around lines 166 - 174, Change the insecure default values in the Helm values file: do not ship enabled: true with auth.mode: trusted. Update the defaults (the keys referenced: enabled, auth.mode, and related image/replicaCount) so that either enabled: false is the default OR auth.mode is a non-trusted mode that requires explicit credentials (and add placeholder credential configuration fields) so the MCP does not expose an unauthenticated admin surface out-of-the-box; adjust values.yaml accordingly and document the required credentials for any non-disabled default.extensions/cardano-node/templates/configmap-metrics.yaml-459-459 (1)
459-459:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDo not hardcode
roleto relay in emitted metrics.The script computes
node_rolefromMETIS_NODE_ROLE, but the output now always reports"relay", which mislabels block-producer nodes.Suggested fix
- jq -n \ + jq -n \ + --arg nodeRole "$node_role" \ @@ - role: "relay", + role: (nullable_string($nodeRole) // "relay"),🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@extensions/cardano-node/templates/configmap-metrics.yaml` at line 459, The emitted metrics hardcode role: "relay" instead of using the computed node role; update the template in configmap-metrics.yaml so the role field uses the computed variable (node_role / METIS_NODE_ROLE) rather than the literal "relay". Locate the metrics emission block that sets role: "relay" and replace that hardcoded string with the template variable that holds the result of METIS_NODE_ROLE (e.g., use the node_role/template variable used earlier) so block-producer nodes are labeled correctly.extensions/control-plane/templates/stage4-00-pvc-supernode-mcp.yaml-1-24 (1)
1-24:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRemove the standalone PVC; it’s unused with the current StatefulSet flow.
When
persistence.enabled=trueandexistingClaimis empty, the StatefulSet already provisionssession-storeviavolumeClaimTemplates. This template creates a second PVC (supernode-mcp) that is never mounted and can still consume storage.Suggested fix
-{{- if and .Values.supernodeMcp.enabled .Values.supernodeMcp.persistence.enabled (not .Values.supernodeMcp.persistence.existingClaim) }} -apiVersion: v1 -kind: PersistentVolumeClaim -metadata: - name: supernode-mcp - namespace: '{{ .Values.namespace }}' - labels: - app.kubernetes.io/name: supernode-mcp - app.kubernetes.io/component: mcp -{{- with .Values.supernodeMcp.persistence.annotations }} - annotations: -{{ toYaml . | indent 4 }} -{{- end }} -spec: - accessModes: -{{ toYaml .Values.supernodeMcp.persistence.accessModes | indent 4 }} -{{- $storageClass := default .Values.storageClass .Values.supernodeMcp.persistence.storageClass }} -{{- if $storageClass }} - storageClassName: {{ $storageClass }} -{{- end }} - resources: - requests: - storage: {{ .Values.supernodeMcp.persistence.size }} -{{- end }}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@extensions/control-plane/templates/stage4-00-pvc-supernode-mcp.yaml` around lines 1 - 24, This template creates an unused PersistentVolumeClaim named "supernode-mcp" which duplicates storage when the StatefulSet already provisions "session-store" via volumeClaimTemplates; remove this PersistentVolumeClaim resource (the apiVersion: v1 kind: PersistentVolumeClaim block that references .Values.supernodeMcp.persistence and creates name: supernode-mcp) so PVCs are only created by the StatefulSet's volumeClaimTemplates, or alternatively gate rendering behind a new flag (e.g., .Values.supernodeMcp.useStandalonePVC) and update the top conditional to require that flag before emitting the supernode-mcp PVC..github/image/Dockerfile-5-5 (1)
5-5:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAvoid piping mutable remote Helm installer scripts into bash; pin and verify Helm artifact instead.
Line 5 introduces a supply-chain and reproducibility gap. The Helm installer script cannot verify checksums (requires openssl, which is not installed on line 4), and no version is pinned—the script always fetches from the
mainbranch. This creates a non-deterministic, unverified artifact in your build.Use the suggested diff approach: download and verify a pinned Helm tarball explicitly, then extract it. This ensures reproducibility and eliminates reliance on a mutable remote script.
Suggested hardening diff
RUN apt-get update \ && apt-get install -y --no-install-recommends ca-certificates curl \ - && curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash \ + && HELM_VERSION="v3.19.0" \ + && curl -fsSLo /tmp/helm.tgz "https://get.helm.sh/helm-${HELM_VERSION}-linux-${TARGETARCH}.tar.gz" \ + && curl -fsSLo /tmp/helm.tgz.sha256 "https://get.helm.sh/helm-${HELM_VERSION}-linux-${TARGETARCH}.tar.gz.sha256" \ + && echo "$(cat /tmp/helm.tgz.sha256) /tmp/helm.tgz" | sha256sum -c - \ + && tar -xzf /tmp/helm.tgz -C /tmp \ + && mv "/tmp/linux-${TARGETARCH}/helm" /usr/local/bin/helm \ + && chmod +x /usr/local/bin/helm \ + && rm -rf /tmp/helm.tgz /tmp/helm.tgz.sha256 /tmp/linux-${TARGETARCH} \ && apt-get purge -y --auto-remove curl \ && rm -rf /var/lib/apt/lists/*🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/image/Dockerfile at line 5, Replace the unsafe piped installer command ("curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash") with a pinned, verifiable Helm release flow: choose a specific Helm version (e.g., vX.Y.Z), download the corresponding release tarball from GitHub Releases, fetch the official sha256 checksum for that exact version, verify the downloaded tarball checksum with a checksum tool (openssl/sha256sum), extract the helm binary and install it to the target location, and remove the downloaded artifacts; ensure any required tooling for verification (openssl or coreutils) is installed in the image before verification to make the process reproducible and supply-chain safe.extensions/control-plane/scripts/post_install.sh-34-41 (1)
34-41:⚠️ Potential issue | 🟠 Major | ⚡ Quick winMake MCP Vault provisioning opt-in.
Line 34 defaults
MCP_VAULT_ENABLEDtotrue, so a normal post-install now mints a long-lived orphan token with write access to the runtime KV even whensupernode-mcpis not enabled. That silently expands the credential footprint for existing clusters.Suggested fix
-MCP_VAULT_ENABLED="${MCP_VAULT_ENABLED:-true}" +MCP_VAULT_ENABLED="${MCP_VAULT_ENABLED:-false}"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@extensions/control-plane/scripts/post_install.sh` around lines 34 - 41, The script currently defaults MCP_VAULT_ENABLED to true causing automatic Vault token provisioning; change the default to opt-in by setting MCP_VAULT_ENABLED="${MCP_VAULT_ENABLED:-false}" and ensure any conditional checks that rely on MCP_VAULT_ENABLED (the Vault provisioning branch that uses MCP_VAULT_POLICY_NAME, MCP_VAULT_TOKEN_SECRET_NAME, MCP_VAULT_TOKEN_SECRET_KEY, MCP_VAULT_TOKEN_DISPLAY_NAME, MCP_VAULT_TOKEN_TTL, MCP_VAULT_TOKEN_PERIOD, MCP_VAULT_TOKEN_ORPHAN) respect the new default so provisioning only runs when MCP_VAULT_ENABLED is explicitly enabled.extensions/control-plane/scripts/post_install.sh-212-237 (1)
212-237:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRevoke the MCP token if Secret creation fails.
Lines 229-237 create the Vault token before the
kubectl applypipeline. If that pipeline fails, the script exits and cleanup only removes the temp file, leaving a live orphan token behind.Suggested fix
+MCP_VAULT_SECRET_APPLIED="false" + cleanup() { if [[ -n "${PORT_FORWARD_PID}" ]] && kill -0 "${PORT_FORWARD_PID}" >/dev/null 2>&1; then kill "${PORT_FORWARD_PID}" >/dev/null 2>&1 || true wait "${PORT_FORWARD_PID}" >/dev/null 2>&1 || true fi rm -f "${PORT_FORWARD_LOG}" if [[ -n "${MCP_VAULT_TOKEN_FILE}" ]]; then + if [[ "${MCP_VAULT_SECRET_APPLIED}" != "true" ]]; then + vault token revoke "$(cat "${MCP_VAULT_TOKEN_FILE}")" >/dev/null 2>&1 || true + fi rm -f "${MCP_VAULT_TOKEN_FILE}" fi } ... vault token create "${token_args[@]}" >"$MCP_VAULT_TOKEN_FILE" kubectl "${KUBECTL_ARGS[@]}" -n "$NAMESPACE" create secret generic "$MCP_VAULT_TOKEN_SECRET_NAME" \ "--from-file=${MCP_VAULT_TOKEN_SECRET_KEY}=${MCP_VAULT_TOKEN_FILE}" \ --dry-run=client \ -o yaml \ | kubectl "${KUBECTL_ARGS[@]}" -n "$NAMESPACE" apply -f - + MCP_VAULT_SECRET_APPLIED="true" fi🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@extensions/control-plane/scripts/post_install.sh` around lines 212 - 237, The script currently creates the Vault token via the vault token create call and writes it to MCP_VAULT_TOKEN_FILE but doesn't revoke it if the kubectl apply pipeline fails; change the flow so after running kubectl ... apply -f - you check the pipeline exit status and if it failed, read the token from MCP_VAULT_TOKEN_FILE and revoke it with the vault token revoke command (using the token string read from MCP_VAULT_TOKEN_FILE), then remove the temp file; alternatively add a local failure handler/trap that revokes the token on error using the same MCP_VAULT_TOKEN_FILE token contents so no live orphan token remains when kubectl apply fails..github/workflows/test_mcp_server.yml-25-25 (1)
25-25:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUpdate to
actions/checkout@v4for runner compatibility.The
actions/checkout@v3action runner is outdated. As per static analysis, update to v4 for compatibility with current GitHub Actions runners.🔄 Proposed fix
- name: Checkout - uses: actions/checkout@v3 + uses: actions/checkout@v4🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/test_mcp_server.yml at line 25, Replace the outdated checkout action reference "uses: actions/checkout@v3" with the newer "actions/checkout@v4" in the workflow so the runner uses the v4-compatible checkout action; locate the line containing the "uses: actions/checkout@v3" entry and update it to reference v4..github/workflows/check_mcp_server.yml-25-25 (1)
25-25:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUpdate to
actions/checkout@v4for runner compatibility.The
actions/checkout@v3action runner is outdated. As per static analysis, update to v4 for compatibility with current GitHub Actions runners.🔄 Proposed fix
- name: Checkout - uses: actions/checkout@v3 + uses: actions/checkout@v4🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/check_mcp_server.yml at line 25, Update the GitHub Actions workflow step that currently uses actions/checkout@v3 to actions/checkout@v4: locate the workflow step with the uses entry "actions/checkout@v3" and change it to "actions/checkout@v4" to ensure runner compatibility with current GitHub Actions environments.mcp-server/Dockerfile-11-15 (1)
11-15:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPin Helm version instead of using the latest script.
Downloading and executing the Helm install script via
curl | bashwithout version pinning is a security risk. If the script or Helm upstream is compromised, or if breaking changes are introduced, builds could fail or include malicious code.🔒 Proposed fix with pinned version
RUN apt-get update \ - && apt-get install -y --no-install-recommends ca-certificates curl \ - && curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash \ - && apt-get purge -y --auto-remove curl \ + && apt-get install -y --no-install-recommends ca-certificates curl wget \ + && wget https://get.helm.sh/helm-v3.14.0-linux-amd64.tar.gz \ + && tar -zxvf helm-v3.14.0-linux-amd64.tar.gz \ + && mv linux-amd64/helm /usr/local/bin/helm \ + && rm -rf helm-v3.14.0-linux-amd64.tar.gz linux-amd64 \ + && apt-get purge -y --auto-remove curl wget \ && rm -rf /var/lib/apt/lists/*Note: Update the version number to match your requirements. Verify the latest stable release.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@mcp-server/Dockerfile` around lines 11 - 15, Replace the unpinned "curl | bash" Helm install in the RUN line with a pinned-version installation: add a HELM_VERSION build-arg or ENV, download the specific release tarball for that version (from github.com/helm/helm/releases/download/v${HELM_VERSION}/helm-v${HELM_VERSION}-linux-amd64.tar.gz), verify it against a known SHA256 checksum, extract the binary and move it to /usr/local/bin, and only then remove the tarball and cleanup apt lists; do not pipe the upstream install script to bash and ensure checksum verification to prevent supply-chain risks (update the existing RUN command that currently invokes the Helm install script)..github/workflows/build_mcp_server.yml-14-14 (1)
14-14:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRemove
continue-on-errorfrom the build job.Allowing the build job to fail silently defeats the purpose of CI and can mask critical build failures. If the build fails, the workflow should fail.
🔧 Proposed fix
build: - continue-on-error: true strategy:🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/build_mcp_server.yml at line 14, The CI build job currently sets the YAML key continue-on-error: true which allows failures to be ignored; remove that key (or set it to false) from the build job definition so the workflow fails on build errors—look for the build job block that contains continue-on-error and delete or change that setting to ensure the job's failures surface..github/workflows/build_mcp_server.yml-41-43 (1)
41-43:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUpdate to a supported Rust toolchain action.
The
actions-rs/toolchain@v1action is deprecated and its runner is too old for current GitHub Actions. As per static analysis, update to a maintained alternative.🔄 Proposed fix using dtolnay/rust-toolchain
- name: Install stable toolchain - uses: actions-rs/toolchain@v1 + uses: dtolnay/rust-toolchain@stable with: toolchain: stable + targets: ${{ matrix.target }}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/build_mcp_server.yml around lines 41 - 43, Replace the deprecated actions-rs/toolchain@v1 usage with a maintained Rust toolchain action (e.g., dtolnay/rust-toolchain) in the workflow: update the uses entry that currently references actions-rs/toolchain@v1 and keep the intended channel/toolchain configuration (toolchain: stable) in the new action's inputs, adjusting the input key names if the alternative action expects different field names; then run the workflow to ensure the runner and toolchain install step succeed.mcp-server/src/k8s/helm_releases.rs-61-67 (1)
61-67:⚠️ Potential issue | 🟠 Major | ⚡ Quick winA single malformed Helm secret should not fail the entire discovery response.
collect::<Result<Vec<_>, _>>()?makes discovery brittle: one bad secret blocks all valid releases.Proposed fix (best-effort decode)
- Ok(latest_releases( - secrets - .items - .iter() - .map(decode_helm_release_secret) - .collect::<Result<Vec<_>, _>>()?, - include_control_plane, - )) + let decoded = secrets + .items + .iter() + .filter_map(|secret| decode_helm_release_secret(secret).ok()) + .collect::<Vec<_>>(); + Ok(latest_releases(decoded, include_control_plane))🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@mcp-server/src/k8s/helm_releases.rs` around lines 61 - 67, The current use of collect::<Result<Vec<_>, _>>()? over secrets.items with decode_helm_release_secret makes the whole discovery fail on one bad secret; change the flow so you best-effort decode each item: iterate secrets.items, attempt decode_helm_release_secret for each (calling the same function), collect only the successful Ok values into a Vec (e.g., via filter_map or flat_map equivalent) and skip or log Err results, then pass that Vec into latest_releases along with include_control_plane so a single malformed secret does not break the entire response.mcp-server/src/k8s/helm_releases.rs-255-259 (1)
255-259:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDo not exclude all releases just because they are in the
control-planenamespace.The namespace check causes false positives and hides legitimate non-control-plane releases in that namespace.
Proposed fix
fn is_control_plane_release(release: &HelmReleaseSummary) -> bool { release.name == CONTROL_PLANE_RELEASE_NAME || release.chart.name.as_deref() == Some(CONTROL_PLANE_RELEASE_NAME) - || release.namespace == CONTROL_PLANE_RELEASE_NAME }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@mcp-server/src/k8s/helm_releases.rs` around lines 255 - 259, The function is_control_plane_release in helm_releases.rs is incorrectly treating any release in the CONTROL_PLANE_RELEASE_NAME namespace as the control plane, causing false positives; remove the namespace equality check (the condition "|| release.namespace == CONTROL_PLANE_RELEASE_NAME") and only consider release.name and release.chart.name.as_deref() == Some(CONTROL_PLANE_RELEASE_NAME) when determining control-plane releases (adjust or add tests for HelmReleaseSummary if needed).mcp-server/src/helm.rs-60-60 (1)
60-60:⚠️ Potential issue | 🟠 Major | ⚡ Quick winHarden Helm values-file handling for secrets.
Lines 84-97 write raw values to disk with default permissions, and Line 60 ignores cleanup failures. If values contain secrets, this risks leakage on shared nodes. Create files with restrictive perms (0600), prefer secure temp-file primitives, and surface/log delete failures.
Also applies to: 84-97
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@mcp-server/src/helm.rs` at line 60, The code currently writes raw Helm values to disk with default permissions and silently ignores cleanup (std::fs::remove_file(&values_path)); fix by creating the values file with restrictive permissions and using a secure temp-file primitive: replace the raw write at the "write raw values to disk" block (lines 84-97) with a tempfile::NamedTempFile (or tempfile::Builder) to create an atomic, securely-permissioned temporary file (or explicitly create with OpenOptions + set_mode(0o600) on Unix), write the values into that temp file, then atomically persist or move it to values_path if needed; also change the remove_file call that currently ignores errors to check the Result and log or return the error (e.g., using processLogger/tracing::error) instead of discarding it so deletion failures are surfaced.mcp-server/src/helm.rs-59-62 (1)
59-62:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAdd a process-level timeout around Helm execution.
Line 59 can block indefinitely if the Helm process hangs;
--timeoutdoes not guarantee subprocess termination in all failure modes. Wrap command execution withtokio::time::timeoutand return a structured timeout error.Suggested change
+use tokio::time::{timeout, Duration}; ... - let output = Command::new(&helm_bin).args(&args).output().await; + let output = timeout( + Duration::from_secs(15 * 60), + Command::new(&helm_bin).args(&args).output(), + ) + .await + .map_err(|_| HelmInstallError::Execute(std::io::Error::new( + std::io::ErrorKind::TimedOut, + "helm command timed out", + )))?;🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@mcp-server/src/helm.rs` around lines 59 - 62, The Helm command execution can hang indefinitely; wrap the await on Command::new(&helm_bin).args(&args).output().await in tokio::time::timeout with a reasonable Duration and map a timeout result to a new structured HelmInstallError variant (e.g., HelmInstallError::Timeout) so callers get a clear error; ensure you still remove the temp values file (values_path) and convert the successful output via output.map_err(HelmInstallError::Execute) as before, returning the timeout error when tokio::time::timeout returns Err.mcp-server/src/policy/scopes.rs-23-40 (1)
23-40: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick winPrevent maintenance errors when adding new scopes.
The
all()method manually lists every enum variant. When a newScopeis added to the enum (lines 7-20), the developer must remember to also add it to this array. The compiler won't catch omissions, creating a subtle bug where the new scope won't be included in policy checks that rely onall().🔧 Consider using a macro or const array
One approach is to use a declarative macro that generates both the enum and the
all()method from a single source of truth:macro_rules! define_scopes { ($($variant:ident),* $(,)?) => { #[derive(Debug, Clone, Copy, Eq, PartialEq, Ord, PartialOrd, Serialize)] #[serde(rename_all = "kebab-case")] pub enum Scope { $($variant,)* } impl Scope { pub fn all() -> BTreeSet<Self> { [$(Self::$variant,)*].into_iter().collect() } } }; } define_scopes! { Discover, Debug, WorkloadsInstall, WorkloadsUpgrade, WorkloadsDelete, VaultRuntimeMetadata, VaultRuntimeRead, VaultRuntimeWrite, VaultOperatorMetadata, VaultOperatorWrite, VaultOperatorRead, Admin, }Alternatively, use the
strumcrate withEnumIterto generate iteration automatically.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@mcp-server/src/policy/scopes.rs` around lines 23 - 40, The current Scope::all() method manually lists every Scope variant so newly added variants can be omitted; change the design to generate the enum and its all() collection from a single source of truth (e.g., introduce a define_scopes! macro or use strum::EnumIter) so adding a variant updates both enum and all() automatically; update the Scope declaration and the impl Scope::all() to be produced by that macro (or derive EnumIter and implement all() by iterating over Variant::iter()) and remove the hand-maintained array of variants in all().mcp-server/src/tools/router.rs-772-777 (1)
772-777:⚠️ Potential issue | 🟠 Major | ⚡ Quick winClamp user-provided
limitto a bounded maximum.
list_paramsaccepts anyu32limit, so callers can request very large lists and create avoidable API pressure.Proposed fix
fn list_params(arguments: Option<&JsonObject>, default_limit: Option<u32>) -> ResourceListParams { + const MAX_LIMIT: u32 = 200; + let limit = optional_u32(arguments, "limit") + .map(|v| v.min(MAX_LIMIT)) + .or(default_limit); + ResourceListParams { label_selector: optional_string(arguments, "labelSelector"), field_selector: optional_string(arguments, "fieldSelector"), - limit: optional_u32(arguments, "limit").or(default_limit), + limit, } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@mcp-server/src/tools/router.rs` around lines 772 - 777, list_params currently accepts any u32 for limit which can cause unbounded requests; change its limit construction to clamp user-provided values to a safe maximum: introduce a MAX_LIST_LIMIT constant (e.g., 1000) and compute limit as optional_u32(arguments, "limit").map(|l| l.min(MAX_LIST_LIMIT)).or(default_limit) (or equivalent), keeping the rest of ResourceListParams fields the same; reference the function name list_params, the types ResourceListParams and optional_u32, and the default_limit parameter when making this change.mcp-server/src/tools/router.rs-869-877 (1)
869-877:⚠️ Potential issue | 🟠 Major | ⚡ Quick winSchema type validation is incomplete for JSON Schema primitive types.
Unknown
typevalues currently pass validation (_ => true), so declaredarray/nullconstraints are silently ignored.Proposed fix
let matches = match expected_type { "boolean" => value.is_boolean(), "integer" => value.as_i64().is_some(), "number" => value.as_f64().is_some(), "object" => value.is_object(), "string" => value.is_string(), - _ => true, + "array" => value.is_array(), + "null" => value.is_null(), + _ => false, };🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@mcp-server/src/tools/router.rs` around lines 869 - 877, The type-check branch around schema.get("type").and_then(Value::as_str) currently treats unknown types as valid (the _ => true arm) and omits "array" and "null"; update the match for expected_type in the same block to include "array" => value.is_array() and "null" => value.is_null(), and change the fallback arm from _ => true to _ => false so unknown/unsupported type strings fail validation; keep the rest of the arms ("boolean", "integer", "number", "object", "string") as-is and ensure you update the variable referenced as matches to use the new logic.mcp-server/src/tools/dynamic.rs-33-48 (1)
33-48:⚠️ Potential issue | 🟠 Major | ⚡ Quick winOn error, keep existing definitions instead of clearing them.
When
discover_definitions()fails (e.g., Kubernetes temporarily unavailable), the current code returns an emptyVec, which causes all dynamic tools to disappear from the tool list. This degrades user experience—during transient failures, users see tools vanish and may think they're permanently gone. The function should preserve the existing definitions when refresh fails.🔧 Proposed fix to preserve definitions on error
pub(crate) async fn refresh(&self) -> bool { let definitions = match discover_definitions().await { Ok(definitions) => definitions, Err(error) => { tracing::debug!(%error, "failed to refresh dynamic MCP tools"); - Vec::new() + return false; } }; let mut current = self.definitions.write().await; let changed = tool_signature(current.as_ref()) != tool_signature(&definitions); if changed { *current = Arc::new(definitions); } changed }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@mcp-server/src/tools/dynamic.rs` around lines 33 - 48, The refresh function currently replaces definitions with an empty Vec on discover_definitions() error; instead, when discover_definitions() returns Err, log the error and return false without modifying self.definitions so existing dynamic tools are preserved. Update the logic in refresh: call discover_definitions().await, and on Err(error) do tracing::debug!(%error, "failed to refresh dynamic MCP tools") (or similar) and return false; only compute tool_signature and update *current = Arc::new(definitions) when discover_definitions() returns Ok. Reference: refresh, discover_definitions, self.definitions, and tool_signature.
🟡 Minor comments (7)
mcp-server/src/resources/uri.rs-29-31 (1)
29-31:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAdd validation to ensure consistency with parser.
The function creates URIs without validating that
extension_idis non-empty and contains no/characters. The parser at line 23 rejects such URIs, creating an inconsistency where this function can produce URIs thatparse()would reject.🛡️ Proposed fix to add validation
pub fn extension_catalog_entry_uri(extension_id: &str) -> String { + assert!(!extension_id.is_empty() && !extension_id.contains('/'), + "extension_id must be non-empty and cannot contain '/'"); format!("{EXTENSION_CATALOG_ENTRY_PREFIX}{extension_id}") }Alternatively, return
Option<String>to handle invalid input gracefully:-pub fn extension_catalog_entry_uri(extension_id: &str) -> String { +pub fn extension_catalog_entry_uri(extension_id: &str) -> Option<String> { + if extension_id.is_empty() || extension_id.contains('/') { + return None; + } - format!("{EXTENSION_CATALOG_ENTRY_PREFIX}{extension_id}") + Some(format!("{EXTENSION_CATALOG_ENTRY_PREFIX}{extension_id}")) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@mcp-server/src/resources/uri.rs` around lines 29 - 31, The extension_catalog_entry_uri function can produce URIs that the parser rejects; change its signature to return Option<String>, validate that extension_id is non-empty and does not contain '/' (e.g., if extension_id.is_empty() || extension_id.contains('/') return None), and otherwise return Some(format!("{EXTENSION_CATALOG_ENTRY_PREFIX}{extension_id}")); also update any callers to handle the Option and keep behavior consistent with the parse() function used elsewhere.extensions/control-plane/scripts/post_install.sh-263-271 (1)
263-271:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winPrint the correct controller kind in the restart instructions.
Line 271 tells users to restart a Deployment, but this PR deploys
supernode-mcpas a StatefulSet. The command shown here will fail as written.Suggested fix
-Wire the Secret into the Helm-managed MCP Deployment with: +Wire the Secret into the Helm-managed MCP StatefulSet with: helm upgrade ${RELEASE_NAME} . \\ --namespace ${NAMESPACE} \\ --reuse-values \\ --set supernodeMcp.vault.tokenSecretRef.name=${MCP_VAULT_TOKEN_SECRET_NAME} \\ --set supernodeMcp.vault.tokenSecretRef.key=${MCP_VAULT_TOKEN_SECRET_KEY} -If the MCP Deployment is already running, restart it after the Helm upgrade: - kubectl ${KUBECTL_CONTEXT_TEXT}-n ${NAMESPACE} rollout restart deployment/supernode-mcp +If the MCP StatefulSet is already running, restart it after the Helm upgrade: + kubectl ${KUBECTL_CONTEXT_TEXT}-n ${NAMESPACE} rollout restart statefulset/supernode-mcp🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@extensions/control-plane/scripts/post_install.sh` around lines 263 - 271, The restart instruction currently references a Deployment but the controller is a StatefulSet; update the kubectl rollback/restart command shown in the post-install instructions so it targets the correct kind by replacing "rollout restart deployment/supernode-mcp" with "rollout restart statefulset/supernode-mcp" (the line that contains the kubectl ... rollout restart command for supernode-mcp in post_install.sh)..github/workflows/test_mcp_server.yml-18-18 (1)
18-18:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winRename job from
linttotestfor clarity.The job is named
lintbut actually runs tests viacargo test. This is misleading and inconsistent with the workflow name.📝 Proposed fix
jobs: - lint: + test: runs-on: ubuntu-latest🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/test_mcp_server.yml at line 18, The job currently declared as lint should be renamed to test to reflect that it runs cargo test: update the job identifier and any job name fields from "lint" to "test" in the workflow (the job block starting with "lint:"), and adjust any references or dependent job names accordingly so the job id and display name match the actual command (cargo test).bootstrap/kind/README.md-112-112 (1)
112-112:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winFix workload type references from Deployment to StatefulSet.
The rollout status commands reference
deployment/supernode-mcp(lines 112 and 126), but the control-plane Helm chart deploys supernode-mcp as a StatefulSet (defined instage4-04-statefulset-supernode-mcp.yaml). These commands should usestatefulset/supernode-mcpinstead.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@bootstrap/kind/README.md` at line 112, The README currently references deployment/supernode-mcp but the chart actually creates a StatefulSet; update both occurrences that mention deployment/supernode-mcp to statefulset/supernode-mcp so commands like kubectl -n control-plane rollout status target the correct resource (supernode-mcp) and ensure any other rollout/status examples use statefulset/... instead of deployment/....mcp-server/src/config.rs-72-77 (1)
72-77:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winReject non-positive session TTL values.
Line 76 currently accepts
0and negative TTLs. Validate TTL as strictly positive during config parsing to avoid silent misconfiguration.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@mcp-server/src/config.rs` around lines 72 - 77, The session_ttl_seconds() parser currently accepts zero and negative values; after parsing the env var in session_ttl_seconds(), validate that the parsed i64 is strictly > 0 and return an Err(ConfigError::InvalidSessionTtl(value)) (or appropriate ConfigError variant) when value <= 0 instead of returning Ok(Some(value)); implement this by mapping the parsed result to check the positivity (e.g., replace the direct parse mapping with a closure that parses then checks value > 0 and returns an Err on invalid values) while keeping the existing transpose() flow.mcp-server/src/tools/router.rs-81-82 (1)
81-82:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winUse the invoked tool name when mapping vault write/patch errors.
vault.runtime.patchcurrently reports failures under"vault.runtime.write", which mislabels audit/error details and complicates debugging.Proposed fix
- "vault.runtime.write" => vault_runtime_write(arguments, WriteMode::Replace).await, - "vault.runtime.patch" => vault_runtime_write(arguments, WriteMode::Patch).await, + "vault.runtime.write" => { + vault_runtime_write("vault.runtime.write", arguments, WriteMode::Replace).await + } + "vault.runtime.patch" => { + vault_runtime_write("vault.runtime.patch", arguments, WriteMode::Patch).await + } async fn vault_runtime_write( + tool_name: &str, arguments: Option<&JsonObject>, default_mode: WriteMode, ) -> CallToolResult { @@ - Err(error) => return vault_error("vault.runtime.write", error), + Err(error) => return vault_error(tool_name, error), @@ - Err(error) => return vault_error("vault.runtime.write", error), + Err(error) => return vault_error(tool_name, error), @@ - Err(error) => vault_error("vault.runtime.write", error), + Err(error) => vault_error(tool_name, error), } }Also applies to: 116-145
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@mcp-server/src/tools/router.rs` around lines 81 - 82, The error/audit entries for the vault write/patch branches are mislabelled because both match arms call vault_runtime_write(arguments, WriteMode::...) but still record the tool as "vault.runtime.write"; update the router match arms for "vault.runtime.write" and "vault.runtime.patch" so the invoked tool name is passed into or set on the call site (e.g., pass an extra parameter like tool_name or set a context) before calling vault_runtime_write, or make vault_runtime_write accept and use the tool name when composing errors/audit logs; ensure both branches ("vault.runtime.write" and "vault.runtime.patch") supply their correct string so failures are mapped to the actual invoked tool name.mcp-server/src/tools/workloads/mod.rs-24-25 (1)
24-25:⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
workloads.listinput schema is missing supported filter fields.The handler accepts
namespace,labelSelector,fieldSelector, andlimit, but schema currently rejects/omits them. This creates avoidable client-contract drift.Proposed fix
- input_schema: r#"{"type":"object","properties":{"includeControlPlane":{"type":"boolean"}},"additionalProperties":false}"#, + input_schema: r#"{"type":"object","properties":{"namespace":{"type":"string"},"includeControlPlane":{"type":"boolean"},"labelSelector":{"type":"string"},"fieldSelector":{"type":"string"},"limit":{"type":"integer","minimum":1,"maximum":200}},"additionalProperties":false}"#,🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@mcp-server/src/tools/workloads/mod.rs` around lines 24 - 25, The input_schema for the workloads.list handler currently omits supported filter fields, causing client/schema drift; update the input_schema JSON (the value assigned to input_schema in the workloads.list handler) to include properties for "namespace" (string), "labelSelector" (string), "fieldSelector" (string) and "limit" (integer), keep additionalProperties as false and ensure the root type remains "object" so the handler's accepted params (namespace, labelSelector, fieldSelector, limit) validate against the schema.
🧹 Nitpick comments (8)
mcp-server/src/catalog/mod.rs (1)
42-44: ⚡ Quick winAdd
is_empty()method to complementlen().Rust convention recommends providing an
is_empty()method wheneverlen()is exposed, which avoids theclippy::len_without_is_emptylint and improves API ergonomics.✨ Proposed addition
pub fn len(&self) -> usize { self.extensions.len() } + + pub fn is_empty(&self) -> bool { + self.extensions.is_empty() + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@mcp-server/src/catalog/mod.rs` around lines 42 - 44, Add a complementary is_empty() method to the same impl that exposes len(): implement pub fn is_empty(&self) -> bool that forwards to self.extensions.is_empty() (or returns self.extensions.len() == 0) so callers can check emptiness and to satisfy the clippy::len_without_is_empty lint; add it alongside the existing pub fn len(&self) -> usize.mcp-server/Dockerfile (1)
3-7: ⚡ Quick winOptimize Docker layer caching by separating dependency and source builds.
The current approach copies all source files before building, which invalidates the cache on every source change. Building dependencies in a separate layer first would significantly speed up rebuilds.
⚡ Proposed optimization
WORKDIR /app COPY Cargo.toml Cargo.lock ./ -COPY src ./src +# Build dependencies first (creates a cache layer) +RUN mkdir src && echo "fn main() {}" > src/main.rs \ + && cargo build --release \ + && rm -rf src + +# Now copy real source and build +COPY src ./src RUN cargo build --releaseThis technique creates a dummy
main.rs, builds dependencies, then replaces with real source. Dependencies are cached untilCargo.tomlorCargo.lockchange.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@mcp-server/Dockerfile` around lines 3 - 7, The Dockerfile copies the entire src before building which busts the layer cache on any source change; to fix, implement a multi-step build: first COPY Cargo.toml and Cargo.lock and add a minimal/dummy src/main.rs (or touch a placeholder) then RUN cargo build --release to cache dependency compilation, then COPY the real src/ (replacing the dummy) and RUN cargo build --release again to produce the final artifact; update the Dockerfile steps that reference COPY Cargo.toml Cargo.lock ./, COPY src ./src and RUN cargo build --release to follow this dependency-first pattern so rebuilding only invalidates when Cargo.toml or Cargo.lock change..github/workflows/build_mcp_server.yml (1)
94-95: ⚡ Quick winConsider removing the unmaintained Docker layer caching action.
The
satackey/action-docker-layer-cachingaction is deprecated and unmaintained. Docker Buildx (already in use at line 78) provides built-in caching via BuildKit that is more reliable and actively maintained.♻️ Proposed fix
- uses: satackey/action-docker-layer-caching@v0.0.11 - continue-on-error: trueRemove these lines entirely. The
docker/build-push-action@v5already uses BuildKit caching.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/build_mcp_server.yml around lines 94 - 95, Remove the deprecated satackey/action-docker-layer-caching step: delete the action entry "uses: satackey/action-docker-layer-caching@v0.0.11" and its "continue-on-error: true" configuration from the workflow; rely on the existing docker/build-push-action@v5 / Docker Buildx caching already configured (the buildx step referenced in the workflow) instead of adding this unmaintained action..github/workflows/test_mcp_server.yml (1)
23-28: ⚡ Quick winConsider adding Rust caching to speed up CI.
The build workflow uses
Swatinem/rust-cache@v2to cache dependencies. Adding the same caching here would speed up test runs.⚡ Proposed addition
steps: - name: Checkout uses: actions/checkout@v4 + + - uses: Swatinem/rust-cache@v2 + with: + shared-key: "test" + workspaces: mcp-server -> target - name: Test run: cargo test🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/test_mcp_server.yml around lines 23 - 28, Add a Rust dependency cache step before the existing "Test" step to speed up CI: insert a step (e.g., name: "Setup Rust cache") that uses Swatinem/rust-cache@v2 and configures caching for cargo (registry, index, and target/build artifacts) keyed on Cargo.lock (or equivalent) so subsequent "Test" (cargo test) runs reuse dependencies/build outputs; reference the new step name "Setup Rust cache" and the existing "Test" step when placing it in the workflow..github/workflows/check_mcp_server.yml (1)
23-28: ⚡ Quick winConsider adding Rust caching to speed up CI.
The build workflow uses
Swatinem/rust-cache@v2to cache dependencies. Adding the same caching here would speed up the clippy checks.⚡ Proposed addition
steps: - name: Checkout uses: actions/checkout@v4 + + - uses: Swatinem/rust-cache@v2 + with: + shared-key: "lint" + workspaces: mcp-server -> target - name: Clippy check lints run: cargo clippy -- -D warnings🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/check_mcp_server.yml around lines 23 - 28, The workflow currently runs the "Clippy check lints" step without restoring a Rust/Cargo cache, which slows CI; add a step using Swatinem/rust-cache@v2 before the clippy run to restore and save Cargo cache (registry, git index, and target) so subsequent runs are faster — insert a new step named e.g. "Restore Rust cache" that uses Swatinem/rust-cache@v2 with appropriate inputs (rust-version or default) and ensure the cache step runs before the existing "Clippy check lints" step and will also save the cache after the job completes.mcp-server/src/audit/event.rs (1)
60-60: ⚡ Quick winConsider whether
secret_values_includedshould be parameterized.The field is currently hardcoded to
falsein the factory method. If there are scenarios where secret values should be included in audit events (e.g., in non-production environments or with special audit flags), you may want to add a parameter tofrom_policy_outcomerather than hardcoding this value.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@mcp-server/src/audit/event.rs` at line 60, The factory currently hardcodes secret_values_included: false in from_policy_outcome; make it configurable by adding a parameter (e.g., secret_values_included: bool) to the from_policy_outcome signature, use that parameter when constructing the Event (replace the hardcoded false), and update all call sites to pass the appropriate flag (or provide an overloaded/optional default wrapper that supplies false for existing callers); ensure the field and any docs/tests reflect the new parameter.mcp-server/src/session/sqlite.rs (1)
77-94: ⚖️ Poor tradeoffExpired session cleanup runs on every load.
The
delete_expiredcall on line 80 runs for everyload()invocation. If expired sessions accumulate, this could add latency to session lookups. Consider whether periodic background cleanup (via a separate task or cron job) would be more efficient for production workloads with high session churn.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@mcp-server/src/session/sqlite.rs` around lines 77 - 94, Expired-session cleanup is being executed on every load (delete_expired inside load), which can add latency; remove the delete_expired call from async fn load(&self, ...) and instead perform cleanup from a dedicated background mechanism that invokes delete_expired via the existing with_connection(...) helper: either spawn a periodic async task (e.g., tokio::spawn) at store initialization to call delete_expired on an interval, or implement a lightweight probabilistic/interval guard inside the session store (track last_cleanup timestamp in the store struct and only run delete_expired when enough time has passed) so load() only performs the SELECT logic and returns state_json without running delete_expired each call.mcp-server/src/tools/workloads/dolos.rs (1)
150-155: ⚡ Quick winConsider logging scale-back failures during error recovery.
When
wait_for_dolos_pods_absenttimes out or PVC deletion fails, the code attempts to restore the original replica count but silently ignores errors withlet _ = .... If this recovery fails, the StatefulSet remains scaled to zero, and there's no indication in the logs. While returning the original error to the user is correct, logging recovery failures would aid debugging.📊 Add recovery failure logging
if let Err(error) = wait_for_dolos_pods_absent(&client, &namespace, &release_name).await { - let _ = client + if let Err(recovery_error) = client .scale_stateful_set(&namespace, &stateful_set_name, original_replicas) - .await; + .await + { + tracing::warn!( + %recovery_error, + namespace, + stateful_set = stateful_set_name, + "failed to restore StatefulSet replicas after timeout" + ); + } return error; }Also applies to: 161-165
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@mcp-server/src/tools/workloads/dolos.rs` around lines 150 - 155, The recovery path that calls client.scale_stateful_set(&namespace, &stateful_set_name, original_replicas).await currently ignores errors; change it to capture the result and, on Err(e), log a clear recovery-failure message including the namespace, stateful_set_name and original_replicas along with the recovery error before returning the original error from wait_for_dolos_pods_absent; apply the same change to the other identical block (lines 161-165) so scale-back failures are recorded for debugging.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b5f3be41-6308-4a52-b115-be6598276fb5
⛔ Files ignored due to path filters (1)
mcp-server/Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (70)
.github/image/Dockerfile.github/workflows/build_mcp_server.yml.github/workflows/check_mcp_server.yml.github/workflows/test_mcp_server.yml.gitignorebootstrap/kind/README.mdextensions/cardano-node/templates/configmap-metrics.yamlextensions/control-plane/README.mdextensions/control-plane/scripts/post_install.shextensions/control-plane/templates/stage4-00-pvc-supernode-mcp.yamlextensions/control-plane/templates/stage4-01-serviceaccount-supernode-mcp.yamlextensions/control-plane/templates/stage4-02-clusterrole-supernode-mcp.yamlextensions/control-plane/templates/stage4-03-clusterrolebinding-supernode-mcp.yamlextensions/control-plane/templates/stage4-04-statefulset-supernode-mcp.yamlextensions/control-plane/templates/stage4-05-service-supernode-mcp.yamlextensions/control-plane/values.yamlmcp-server/.dockerignoremcp-server/Cargo.tomlmcp-server/Dockerfilemcp-server/src/audit/event.rsmcp-server/src/audit/mod.rsmcp-server/src/audit/sink.rsmcp-server/src/auth/mod.rsmcp-server/src/catalog/cardano_node_relay.rsmcp-server/src/catalog/dolos.rsmcp-server/src/catalog/extension.rsmcp-server/src/catalog/mod.rsmcp-server/src/catalog/schema.rsmcp-server/src/config.rsmcp-server/src/errors.rsmcp-server/src/helm.rsmcp-server/src/k8s/client.rsmcp-server/src/k8s/helm_releases.rsmcp-server/src/k8s/mod.rsmcp-server/src/main.rsmcp-server/src/mcp.rsmcp-server/src/policy/approvals.rsmcp-server/src/policy/mod.rsmcp-server/src/policy/roles.rsmcp-server/src/policy/scopes.rsmcp-server/src/prompts/catalog.rsmcp-server/src/prompts/mod.rsmcp-server/src/resources/mod.rsmcp-server/src/resources/router.rsmcp-server/src/resources/uri.rsmcp-server/src/server.rsmcp-server/src/session/mod.rsmcp-server/src/session/sqlite.rsmcp-server/src/tools/common.rsmcp-server/src/tools/dynamic.rsmcp-server/src/tools/k8s_summaries.rsmcp-server/src/tools/mod.rsmcp-server/src/tools/router.rsmcp-server/src/tools/supernode.rsmcp-server/src/tools/vault.rsmcp-server/src/tools/workloads/dolos.rsmcp-server/src/tools/workloads/install.rsmcp-server/src/tools/workloads/logs.rsmcp-server/src/tools/workloads/metrics.rsmcp-server/src/tools/workloads/mod.rsmcp-server/src/tools/workloads/outputs.rsmcp-server/src/tools/workloads/registry.rsmcp-server/src/vault/client.rsmcp-server/src/vault/mod.rsmcp-server/src/vault/paths.rsmcp-server/src/vault/redaction.rsskills/README.mdskills/cardano-relay-setup.mdskills/dolos-supernode-deployment.mdskills/workload-output-port-forward.md
| include: | ||
| - release_for: Linux-x86_64 | ||
| build_on: ubuntu-22.04 | ||
| target: x86_64-unknown-linux-gnu | ||
| args: "--locked --release" | ||
|
|
||
| - release_for: Linux-arm64 | ||
| build_on: ubuntu-22.04-arm | ||
| target: "aarch64-unknown-linux-gnu" | ||
| args: "--locked --release" |
There was a problem hiding this comment.
Add missing ext property to matrix entries.
Lines 50-51 and 57 reference matrix.ext, but the matrix entries do not define this property, causing a workflow syntax error.
🐛 Proposed fix
include:
- release_for: Linux-x86_64
build_on: ubuntu-22.04
target: x86_64-unknown-linux-gnu
args: "--locked --release"
+ ext: ""
- release_for: Linux-arm64
build_on: ubuntu-22.04-arm
target: "aarch64-unknown-linux-gnu"
args: "--locked --release"
+ ext: ""📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| include: | |
| - release_for: Linux-x86_64 | |
| build_on: ubuntu-22.04 | |
| target: x86_64-unknown-linux-gnu | |
| args: "--locked --release" | |
| - release_for: Linux-arm64 | |
| build_on: ubuntu-22.04-arm | |
| target: "aarch64-unknown-linux-gnu" | |
| args: "--locked --release" | |
| include: | |
| - release_for: Linux-x86_64 | |
| build_on: ubuntu-22.04 | |
| target: x86_64-unknown-linux-gnu | |
| args: "--locked --release" | |
| ext: "" | |
| - release_for: Linux-arm64 | |
| build_on: ubuntu-22.04-arm | |
| target: "aarch64-unknown-linux-gnu" | |
| args: "--locked --release" | |
| ext: "" |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.github/workflows/build_mcp_server.yml around lines 18 - 27, The matrix
entries under include (the objects with keys release_for, build_on, target,
args) are missing the ext property referenced elsewhere as matrix.ext; add an
ext field to each included matrix entry (e.g., add ext: "<appropriate
extension>" to the Linux-x86_64 and Linux-arm64 include objects) so matrix.ext
resolves correctly where used on lines referencing matrix.ext.
| strategy: | ||
| matrix: | ||
| include: | ||
| - tags: ghcr.io/txpipe/metis-supernode-mcp,ghcr.io/txpipe/metis-supernode-mcp:${{ github.sha }} | ||
| binary: supernode-mcp | ||
|
|
There was a problem hiding this comment.
Move dynamic tag generation out of the matrix definition.
GitHub Actions does not interpolate expressions like ${{ github.sha }} inside matrix definitions. The SHA tag will be literal text, not the actual commit SHA.
🔧 Proposed fix
Move tag generation to a step that sets an output or environment variable, then reference it in the build step:
strategy:
matrix:
include:
- - tags: ghcr.io/txpipe/metis-supernode-mcp,ghcr.io/txpipe/metis-supernode-mcp:${{ github.sha }}
- binary: supernode-mcp
+ - binary: supernode-mcp
permissions:
contents: read
packages: write
steps:
- name: Checkout repository
uses: actions/checkout@v4
+
+ - name: Prepare tags
+ id: tags
+ run: |
+ echo "tags=ghcr.io/txpipe/metis-supernode-mcp,ghcr.io/txpipe/metis-supernode-mcp:${{ github.sha }}" >> $GITHUB_OUTPUT
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3Then update line 107:
- tags: ${{ matrix.tags }}
+ tags: ${{ steps.tags.outputs.tags }}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.github/workflows/build_mcp_server.yml around lines 63 - 68, The matrix
currently hardcodes tags including the expression `${{ github.sha }}` which
won’t be interpolated; move dynamic tag generation out of the matrix by
replacing the matrix "tags" entry with a static placeholder (or remove the
SHA-specific tag) and add a prior step that computes the tags string using `${{
github.sha }}` and exposes it as an output or env var (e.g., set-output/tag or
echo to GITHUB_ENV). Then update the build/publish step that uses the matrix
`tags` (the step that references `tags`/`binary`) to instead consume the
computed output/env variable so the actual commit SHA is included in the runtime
tag list.
| let auth_context = match auth_mode { | ||
| AuthMode::Trusted => AuthContext::trusted(), | ||
| AuthMode::OAuth => anyhow::bail!("MCP_AUTH_MODE=oauth is not implemented yet"), | ||
| }; |
There was a problem hiding this comment.
Fail closed when secure auth is unavailable.
Line 54 rejects OAuth, so the server effectively runs only in trusted mode. Combined with non-loopback binds, this creates a privileged unauthenticated control surface. Add a startup guard that forbids trusted mode on non-loopback addresses (or require explicit insecure opt-in).
Suggested mitigation
fn router(config: Config, cancellation_token: CancellationToken) -> anyhow::Result<Router> {
let auth_mode = config.auth_mode;
+ if matches!(auth_mode, AuthMode::Trusted) && !config.bind_addr.ip().is_loopback() {
+ anyhow::bail!(
+ "trusted auth mode is only allowed on loopback interfaces; configure secure auth before exposing the service"
+ );
+ }
let auth_context = match auth_mode {
AuthMode::Trusted => AuthContext::trusted(),
AuthMode::OAuth => anyhow::bail!("MCP_AUTH_MODE=oauth is not implemented yet"),
};🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@mcp-server/src/server.rs` around lines 52 - 55, The code allows
AuthMode::Trusted unconditionally, which creates an unauthenticated control
surface when the server is bound to non-loopback addresses; update startup to
detect when auth_mode == AuthMode::Trusted and the server bind/listen address
(e.g., bind_addr or listen_addr used when starting the server) is not a loopback
IP, and then fail-fast with an error (or require an explicit insecure opt-in
flag/env var such as INSECURE_ALLOW_TRUSTED) instead of proceeding; ensure this
guard lives alongside the existing auth_mode match where AuthContext::trusted()
is created so the process exits on unsafe configuration.
| pub async fn runtime_metadata( | ||
| &self, | ||
| path: &VaultPath, | ||
| ) -> Result<VaultSecretMetadata, VaultError> { | ||
| let metadata_response = self.get_metadata(path).await?; |
There was a problem hiding this comment.
Enforce Runtime path kind in runtime-scoped Vault APIs.
These methods currently accept operator paths too, which breaks the runtime/operator boundary and can allow unintended secret domain access.
Proposed fix
use crate::vault::paths::VaultPath;
+use crate::vault::paths::VaultPathKind;
@@
pub async fn runtime_metadata(
&self,
path: &VaultPath,
) -> Result<VaultSecretMetadata, VaultError> {
+ ensure_runtime_path(path)?;
let metadata_response = self.get_metadata(path).await?;
let secret_response = self.get_secret(path).await?;
@@
pub async fn write_runtime_secret(
&self,
path: &VaultPath,
secret: &SecretObject,
mode: WriteMode,
) -> Result<VaultWriteReceipt, VaultError> {
+ ensure_runtime_path(path)?;
let data = match mode {
@@
}
+
+fn ensure_runtime_path(path: &VaultPath) -> Result<(), VaultError> {
+ if path.kind() != VaultPathKind::Runtime {
+ return Err(VaultError::Path(crate::vault::paths::VaultPathError::NotAllowed));
+ }
+ Ok(())
+}Also applies to: 87-92
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@mcp-server/src/vault/client.rs` around lines 66 - 70, The runtime-scoped
Vault APIs (e.g., runtime_metadata) currently accept any VaultPath; update these
functions to validate that the provided VaultPath.kind == VaultPathKind::Runtime
(or equivalent enum) and return a VaultError (or Err) if not Runtime to prevent
operator paths from being used; add the same validation at the start of other
runtime methods referenced (e.g., runtime_get, runtime_put around lines 87–92)
so each function checks the path kind before calling
get_metadata/get_secret/put_secret and aborts with a clear error when the path
is not Runtime.
Summary by CodeRabbit
Release Notes
New Features
Documentation
Chores