Webhook Middleware Phase 2: Validating webhook middleware #4314
Sanskarzz wants to merge 39 commits into stacklok:main
Conversation
Large PR Detected
This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.
How to unblock this PR:
Add a section to your PR description with the following format:
## Large PR Justification
[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformation

Alternative:
Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.
See our Contributing Guidelines for more details.
This review will be automatically dismissed once you add the justification section.
Codecov Report: ❌ Patch coverage is
Additional details and impacted files:
@@ Coverage Diff @@
## main #4314 +/- ##
==========================================
+ Coverage 68.87% 69.25% +0.37%
==========================================
Files 478 480 +2
Lines 48306 48320 +14
==========================================
+ Hits 33272 33463 +191
+ Misses 12320 12267 -53
+ Partials 2714 2590 -124
Signed-off-by: Sanskarzz <sanskar.gur@gmail.com>
Workloads already default to the "default" group when no group is specified (pkg/api/v1/workload_service.go), but skills did not. When `thv skill install <name>` was called without --group, the group field stayed empty and registerSkillInGroup was a no-op, meaning skills were never added to any group unless explicitly requested. Add the same defaulting logic in the skill service Install method so that skills are automatically registered in the "default" group, matching workload behavior. Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
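The defaulting described above is a one-line fallback. A minimal sketch, with hypothetical type and function names (the real code lives in the skill service's Install method):

```go
package main

import "fmt"

// installRequest is a hypothetical stand-in for the inputs to the
// skill service Install method: a skill name and an optional group.
type installRequest struct {
	Name  string
	Group string
}

// applyDefaultGroup returns the group to register the skill in,
// falling back to "default" when none was specified, matching the
// workload service behavior described above.
func applyDefaultGroup(req installRequest) string {
	if req.Group == "" {
		return "default"
	}
	return req.Group
}

func main() {
	fmt.Println(applyDefaultGroup(installRequest{Name: "my-skill"}))           // default
	fmt.Println(applyDefaultGroup(installRequest{Name: "s", Group: "team-a"})) // team-a
}
```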
* feat(vmcp): add support for ResourceLink content type Signed-off-by: Sanskarzz <sanskar.gur@gmail.com> * fix: added ContentType as a string alias in types.go Signed-off-by: Sanskarzz <sanskar.gur@gmail.com> * fix: lint Signed-off-by: Sanskarzz <sanskar.gur@gmail.com> --------- Signed-off-by: Sanskarzz <sanskar.gur@gmail.com>
RuntimeConfig values (AdditionalPackages and BuilderImage) are interpolated into Dockerfile templates. Add input validation to harden these fields against malformed or unexpected values. Add a Validate() method on RuntimeConfig that checks package names against a strict allowlist regex and validates builder images using go-containerregistry's ParseReference. Call Validate() at all entry points: CLI flags, API requests, config file loading, and the runtime config loader in the runner package. Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
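A sketch of the validation approach: an allowlist regex over package names plus a builder-image check. The regex here is illustrative, not the one in the PR, and the real implementation validates BuilderImage with go-containerregistry's name.ParseReference; this dependency-free sketch only does a non-empty check there.

```go
package main

import (
	"fmt"
	"regexp"
)

// packageNameRe is an illustrative strict allowlist: a lowercase
// alphanumeric first character, then alphanumerics and a few
// separator characters. The actual regex in the PR may differ.
var packageNameRe = regexp.MustCompile(`^[a-z0-9][a-z0-9+._-]*$`)

// RuntimeConfig is a minimal stand-in for the real struct.
type RuntimeConfig struct {
	AdditionalPackages []string
	BuilderImage       string
}

// Validate rejects values that could smuggle shell metacharacters or
// Dockerfile syntax into the interpolated template. The real code
// additionally parses BuilderImage with name.ParseReference.
func (c *RuntimeConfig) Validate() error {
	for _, p := range c.AdditionalPackages {
		if !packageNameRe.MatchString(p) {
			return fmt.Errorf("invalid package name %q", p)
		}
	}
	if c.BuilderImage == "" {
		return fmt.Errorf("builder image must not be empty")
	}
	return nil
}

func main() {
	ok := RuntimeConfig{AdditionalPackages: []string{"git", "ca-certificates"}, BuilderImage: "golang:1.22"}
	fmt.Println(ok.Validate()) // <nil>
	bad := RuntimeConfig{AdditionalPackages: []string{"git; rm -rf /"}, BuilderImage: "golang:1.22"}
	fmt.Println(bad.Validate() != nil) // true
}
```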
…#4197) Delete write_timeout_integration_test.go; TestIntegration_SSEGetConnectionSurvivesWriteTimeout is covered by TestWriteTimeout_SSEStreamSurvivesTimeout in the middleware package. Move TestIntegration_NonSSEGetRejectedWithNotAcceptable to session_management_realbackend_integration_test.go where it belongs alongside other real-backend server tests. Co-authored-by: taskbot <taskbot@users.noreply.github.com>
* Bring composite tools into session abstraction

Composite tool workflow engines were previously relying on the discovery middleware to inject DiscoveredCapabilities into the request context so that the shared stateless router could route backend tool calls within workflows. This created an implicit coupling between the middleware and composite tool execution that made unit testing harder and was a source of integration bugs.

Affected components: pkg/vmcp/router, pkg/vmcp/composer, pkg/vmcp/server, pkg/vmcp/discovery
Related-to: stacklok#3872

* refactor(vmcp): unify composite tools and optimizer as session decorators

Both composite tools and the optimizer now implement the MultiSession decorator pattern (same as hijackPreventionDecorator) rather than having bespoke SDK wiring in handleSessionRegistrationImpl.

New session decorators:
- session/compositetools: appends composite tools to Tools(), routes their CallTool invocations to per-session workflow executors
- session/optimizerdec: replaces Tools() with [find_tool, call_tool]; find_tool routes through the optimizer, call_tool delegates to the underlying session for normal backend routing

sessionmanager.Manager gains DecorateSession() to swap in a wrapped session after creation. handleSessionRegistrationImpl becomes a flat decoration sequence (apply compositetools → apply optimizer → register tools) with no branching on optimizer vs non-optimizer paths.

adapter.WorkflowExecutor/WorkflowResult become type aliases for the compositetools package types so the two layers share a single definition. adapter.CreateOptimizerTools is deleted; its logic lives in optimizerdec.

---------

Co-authored-by: taskbot <taskbot@users.noreply.github.com>
Bumps [google.golang.org/grpc](https://github.com/grpc/grpc-go) from 1.79.1 to 1.79.3. - [Release notes](https://github.com/grpc/grpc-go/releases) - [Commits](grpc/grpc-go@v1.79.1...v1.79.3) --- updated-dependencies: - dependency-name: google.golang.org/grpc dependency-version: 1.79.3 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…klok#4232) * Update module github.com/google/go-containerregistry to v0.21.3 * Fix daemon.Client compatibility with go-containerregistry v0.21.3 The go-containerregistry update changed the daemon.Client interface to use moby/moby/client types instead of docker/docker/client. Create a separate moby client for daemon operations while keeping the docker client for image builds. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com> Co-authored-by: Alejandro Ponce <aponcedeleonch@stacklok.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Update VERSION for release (Release-Triggered-By: aponcedeleonch)
* Update Chart.yaml for release (Release-Triggered-By: aponcedeleonch)
* Update Chart.yaml for release (Release-Triggered-By: aponcedeleonch)
* Update Chart.yaml for release (Release-Triggered-By: aponcedeleonch)
* Update Chart.yaml for release (Release-Triggered-By: aponcedeleonch)
* Update values.yaml for release (Release-Triggered-By: aponcedeleonch)
* Update values.yaml for release (Release-Triggered-By: aponcedeleonch)
* Update values.yaml for release (Release-Triggered-By: aponcedeleonch)
* Update VERSION for release (Release-Triggered-By: aponcedeleonch)
* Update Chart.yaml for release (Release-Triggered-By: aponcedeleonch)
* Update README.md for release (Release-Triggered-By: aponcedeleonch)
* Update Chart.yaml for release (Release-Triggered-By: aponcedeleonch)
* Update README.md for release (Release-Triggered-By: aponcedeleonch)
* Update values.yaml for release (Release-Triggered-By: aponcedeleonch)
…lok#4238) Preserve tool annotations and output schema in SessionManager.GetAdaptedTools GetAdaptedTools was constructing mcp.Tool structs without copying Annotations or OutputSchema from the domain tool, causing Cedar authorization policies that rely on tool hints (readOnlyHint, destructiveHint, etc.) to deny every request. Fixes stacklok#4235 Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
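The fix above is a field-copy omission. A minimal sketch with simplified stand-in structs (the real domain tool and mcp.Tool have more fields): the adapter must carry Annotations and OutputSchema across, or policy engines matching on hints see empty values.

```go
package main

import "fmt"

// domainTool and mcpTool are simplified stand-ins for the internal
// domain tool and the mcp.Tool struct discussed above.
type domainTool struct {
	Name         string
	Annotations  map[string]bool // e.g. readOnlyHint, destructiveHint
	OutputSchema string
}

type mcpTool struct {
	Name         string
	Annotations  map[string]bool
	OutputSchema string
}

// adaptTool shows the corrected behavior: copy Annotations and
// OutputSchema instead of constructing the tool with only a name,
// which left hint-based authorization with nothing to match on.
func adaptTool(t domainTool) mcpTool {
	return mcpTool{
		Name:         t.Name,
		Annotations:  t.Annotations,
		OutputSchema: t.OutputSchema,
	}
}

func main() {
	d := domainTool{
		Name:         "rm",
		Annotations:  map[string]bool{"destructiveHint": true},
		OutputSchema: "{}",
	}
	fmt.Println(adaptTool(d).Annotations["destructiveHint"]) // true
}
```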
* Update VERSION for release (Release-Triggered-By: JAORMX)
* Update Chart.yaml for release (Release-Triggered-By: JAORMX)
* Update Chart.yaml for release (Release-Triggered-By: JAORMX)
* Update Chart.yaml for release (Release-Triggered-By: JAORMX)
* Update Chart.yaml for release (Release-Triggered-By: JAORMX)
* Update values.yaml for release (Release-Triggered-By: JAORMX)
* Update values.yaml for release (Release-Triggered-By: JAORMX)
* Update values.yaml for release (Release-Triggered-By: JAORMX)
* Update VERSION for release (Release-Triggered-By: JAORMX)
* Update Chart.yaml for release (Release-Triggered-By: JAORMX)
* Update README.md for release (Release-Triggered-By: JAORMX)
* Update Chart.yaml for release (Release-Triggered-By: JAORMX)
* Update README.md for release (Release-Triggered-By: JAORMX)
* Update values.yaml for release (Release-Triggered-By: JAORMX)
…lok#4237) * Fix MCPRegistry operator to update deployments on spec changes Signed-off-by: Radoslav Dimitrov <radoslav@stacklok.com> * Add integration tests Signed-off-by: Radoslav Dimitrov <radoslav@stacklok.com> * Add more integration tests Signed-off-by: Radoslav Dimitrov <radoslav@stacklok.com> --------- Signed-off-by: Radoslav Dimitrov <radoslav@stacklok.com>
…tacklok#4230) Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
* Add serve-mode registry auth to thv serve API

Wire non-interactive registry auth for serve mode so that browser-based OAuth flows are never triggered from the API server. When registry auth is configured but tokens are missing or expired, the API returns a structured 503 response with code "registry_auth_required" instead of hanging on a browser redirect.

Key changes:
- Add WithInteractive(false) provider option for headless contexts
- Add GetNonInteractiveProviderWithConfig for serve mode
- Add auth status fields (auth_status, auth_type) to registry API responses
- Add POST /auth/login and POST /auth/logout API endpoints
- Add auth fields to PUT registry endpoint with offline_access scope
- Return structured 503 errors when registry auth is required
- Wrap validation probe 401 with ErrRegistryAuthRequired

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs

Signed-off-by: Chris Burns <29541485+ChrisJBurns@users.noreply.github.com>

* Return OAuth public config in registry GET responses

Clients like Studio need to display the configured issuer and client_id without reading the config file directly. Add an auth_config field to the GET /registry and GET /registry/default responses that surfaces the non-secret OAuth configuration (issuer, client_id, audience, scopes). The field is omitted when no auth is configured.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* updates docs

Signed-off-by: Chris Burns <29541485+ChrisJBurns@users.noreply.github.com>

---------

Signed-off-by: Chris Burns <29541485+ChrisJBurns@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Add Swagger annotations for registry auth login/logout endpoints The registryAuthLogin and registryAuthLogout handlers were missing swag annotations, so they did not appear in the generated OpenAPI spec. Add the standard @summary, @description, @tags, @router, etc. comment blocks and regenerate the swagger docs. Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The static registry-api ClusterRole, ClusterRoleBinding, and
ServiceAccount templates were never used by the operator. The operator
dynamically creates namespace-scoped RBAC per MCPRegistry using
{name}-registry-api naming, while these templates created resources
bound to a hardcoded toolhive-registry-api ServiceAccount that no pod
ever references.
Also removes the now-unused registryAPI.serviceAccount values from
both values.yaml and values-openshift.yaml.
Refs stacklok#4245
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tacklok#4198)

* Key upstream token storage on (sessionID, providerName)

Restructure UpstreamTokenStorage so tokens are keyed by (sessionID, providerName) instead of just sessionID, enabling multiple upstream providers' tokens to coexist per session. This is the foundation for multi-upstream IDP support (RFC-0052).

Storage layer:
- UpstreamTokenStorage interface gains providerName param on Store/Get and new GetAllUpstreamTokens bulk-read method
- Memory backend uses nested map (sessionID -> providerName -> entry) with empty inner map cleanup during eviction
- Redis backend uses per-provider keys (upstream:{sid}:{provider}) with a session index SET for enumeration and atomic deletion
- PendingAuthorization gains UpstreamProviderName and SessionID fields for future multi-leg authorization chains

Service + middleware:
- upstreamtoken.Service.GetValidTokens takes providerName
- upstreamswap.Config gains required ProviderName field
- runner/middleware derives ProviderName from upstream config

Bug fixes from review:
- Fix provider name key mismatch: callback handler now uses the upstream's logical name (UpstreamConfig.Name) instead of the protocol type (upstream.Type()) for both ProviderID and the storage key, matching what the middleware uses to retrieve tokens
- Fix singleflight key collision: include providerName in the singleflight dedup key to prevent cross-provider result leaking
- Fix Redis GetAllUpstreamTokens excluding expired tokens (violates interface contract that includes expired tokens at bulk-read level)
- Add warn logging for corrupt entries in Redis GetAllUpstreamTokens
- Defensive slice clone in Redis DeleteUpstreamTokens to prevent append mutation of the original providerKeys slice
- Add sessionID validation in GetUpstreamTokens for consistency with StoreUpstreamTokens
- Populate PendingAuthorization.UpstreamProviderName in authorize handler for multi-upstream forward compatibility

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add backwards-compatible migration for provider ID rename

Existing deployments store upstream tokens under the legacy key format (upstream:{sessionID}) with ProviderID set to the protocol type ("oidc"/"oauth2"). The multi-upstream restructuring changed both the key format and ProviderID to use logical names. Without migration, upgrading users would lose sessions and get new internal user IDs.

Upstream token migration (RedisStorage.GetUpstreamTokens):
- On ErrNotFound, fall back to legacy key format
- Patch ProviderID to the logical name before returning so the refresher writes to the correct new key (prevents refresh loop)
- Re-store under new key format and delete legacy key
- Preserve ErrExpired semantics for expired legacy tokens

ProviderIdentity migration (UserResolver.ResolveUser):
- Before creating a new user, check legacy provider IDs ("oidc", "oauth2") for existing identity records
- If found, create new identity under current provider ID pointing to the same internal user, preserving sub claim continuity
- Keep legacy records as defensive fallback (not deleted)
- Handle concurrent migration via ErrAlreadyExists

Both migration paths are idempotent and marked with TODO for removal after all deployments have migrated.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add integration tests for upstream token session binding

Verify end-to-end that the callback handler stores upstream tokens under the correct (sessionID, providerName) key and that the tsid JWT claim maps to retrievable tokens.

TestIntegration_UpstreamTokenStorage:
- Tokens retrievable by provider name ("default") after full flow
- ProviderID is the logical name, not the protocol type
- Binding fields (UserID, UpstreamSubject, ClientID) populated
- UserID matches JWT sub claim

TestIntegration_RefreshPreservesUpstreamTokenBinding:
- tsid claim preserved across refresh token grant
- Upstream tokens still retrievable after refresh
- ProviderID unchanged after refresh

Infrastructure: expose UpstreamTokenStorage on testServer struct via srv.IDPTokenStorage() for storage assertions in tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add legacy migration tests and fix expired token migration

Add miniredis unit tests for the legacy upstream token migration path, verifying:
- Legacy key format tokens are migrated on read (ProviderID patched, old key deleted, new key created)
- Expired legacy tokens are migrated with ErrExpired preserved
- New-format keys take priority over legacy keys

The expired token test caught a bug: the migration fallback skipped expired legacy tokens because getUpstreamTokensFromKey returns ErrExpired (non-nil), which the guard condition treated as "not found". Fixed by allowing ErrExpired through the legacy fallback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Enforce pending.UpstreamProviderName in CallbackHandler

AuthorizeHandler records which upstream initiated the authorization, but CallbackHandler ignored it and always used h.upstreamName. Add a consistency check: if the pending state names a different provider than the handler serving the callback, reject with a server_error redirect. This prevents misrouted callbacks from associating sessions with the wrong provider namespace.

Harden legacy migration against cross-provider attacks.

Identity migration:
- Scope findLegacyProviderIdentity to a single legacy ID derived from UpstreamConfig.Type, not a global ["oidc","oauth2"] scan. This prevents cross-provider account merge when two upstreams share a subject value.
- Propagate transient storage errors instead of swallowing them (fail closed rather than creating duplicate users on Redis blip).
- Inject UserResolver into NewHandler as a dependency so Handler carries no migration-specific knowledge.

Redis token migration:
- Add isLegacyUpstreamProviderID guard: refuse to migrate tokens whose ProviderID is already a logical name, preventing any provider from claiming another provider's legacy tokens.
- In practice, the remaining window (protocol-typed tokens claimed by first caller) is safe because validateUpstreams() enforces a single upstream, and new multi-upstream deployments start fresh without legacy keys.

* Fix lint

* Regenerate docs

* Replace inline migration with one-shot bulk startup migration

Move legacy data migration from the read path (GetUpstreamTokens, ResolveUser) to a single bulk migration that runs at authserver startup before handlers are constructed. This eliminates ~100 lines of inline fallback logic, security guards, and legacy-aware fields from the hot path. The read path is now a straight key lookup with no branching. The migration is idempotent, crash-safe (each key independent), and fails startup if any keys cannot be migrated, since the request path no longer has inline fallbacks for unmigrated data.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Flatten nested upstream token map in memory storage

Replace map[sessionID]map[providerName]*timedEntry with a flat map[upstreamKey]*timedEntry using a composite struct key. This eliminates the empty-inner-map cleanup pattern that was duplicated in cleanupExpired and DeleteUser, a known footgun when adding new deletion paths. The trade-off is O(N) scans in GetAllUpstreamTokens and DeleteUpstreamTokens instead of O(1) session lookup, which is acceptable for this dev/test-only storage backend.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Return error from NewHandler instead of panicking on nil resolver

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Clarify Redis key comments for session index set and key prefix

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add integration tests for legacy data migration against real Redis

Test the one-shot bulk migration against a real Redis Sentinel cluster via testcontainers, covering: full lifecycle (token key migration, identity duplication, session index, user:upstream reverse index, DeleteUser cascade), idempotency, TTL preservation, empty-store no-op, and new-format key non-interference.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…sion factory (stacklok#4231) Move optimizer and composite tool construction from server-level post-session decoration into the MultiSessionFactory via a new DecoratingFactory wrapper. Sessions are now fully formed at creation time — the server unconditionally calls the factory and always gets a decorated MultiSession. - Add session.Decorator type and NewDecoratingFactory to apply decorators at session construction time - Add adaptToolsForFactory helper for optimizer tool indexing without session-manager-specific terminate callbacks - Move composite tool converter functions (FilterWorkflowDefsForSession, ConvertWorkflowDefsToTools, ValidateNoToolConflicts) out of server/ into the compositetools package where they belong - Remove DecorateSession and GetMultiSession from SessionManager interface and implementation — no longer needed - Delete adapter/optimizer_adapter.go (re-export of constants already in optimizerdec); delete its trivial test - Apply early-return guard clause in compositetools.CallTool to reduce nesting Closes: stacklok#3872 Co-authored-by: taskbot <taskbot@users.noreply.github.com>
Signed-off-by: Sanskarzz <sanskar.gur@gmail.com>
✅ PR size has been reduced below the XL threshold. The size review has been dismissed and this PR can now proceed with normal review. Thank you for splitting this up!
Signed-off-by: Sanskarzz <sanskar.gur@gmail.com>
@JAORMX This PR is ready for review.
Overview
This PR implements Phase 2 of the Dynamic Webhook Middleware feature by introducing the Validating Webhook Middleware. Validating webhooks let ToolHive call external HTTP services (such as policy engines, bespoke approval workflows, or rate limiters) to evaluate each MCP request and approve or deny it before it reaches backend tools.
Fixes #3397
Key Changes
1. `pkg/webhook/validating` Package
- `config.go`: Added `MiddlewareParams` struct supporting a chain of `webhook.Config` elements. Includes setup validation requiring more than zero webhooks to be explicitly declared.
- `middleware.go`: Implements the `types.Middleware` interface factory. Builds the `MCPRequest`, extracting user principal attributes directly from the `auth.Identity` context and recording the request origin context (`SourceIP`, `Transport`, `ServerName`). Denies the request when the webhook responds with `allowed: false`. Supports `FailurePolicyFail` (fail-closed, blocks the request on network/server errors) and `FailurePolicyIgnore` (fail-open, logs a warning on error but continues the pipeline).
- `middleware_test.go`: Complete parallelized test suite covering `Allowed=true` paths, denial paths, both failure policies, connection errors, and safe bypass for non-MCP calls. Test coverage sits above 88%.

2. Runner Integration (`pkg/runner`)
- `middleware.go`: Registered `validating.CreateMiddleware` inside `GetSupportedMiddlewareFactories`. Added `addValidatingWebhookMiddleware`, positioning the validating evaluation sequentially after `mcp-parser` but before auditing (telemetry, `authz`), which blocks unverified telemetry pollution and unauthorized execution.
- `config.go`: `RunConfig` exposes the `ValidatingWebhooks []webhook.Config` slice.

Testing Performed
- `go test ./pkg/webhook/validating/... ./pkg/runner/...` (all unit tests passing)
- `task lint` / `task lint-fix` against the overall project (clean)

Type of change
Test plan
- `task test`
- `task test-e2e`
- `task lint-fix`