Migrate agent tools from runtime to individual Lambda functions behind AgentCore Gateway#139
Migrate agent tools from runtime to individual Lambda functions behind AgentCore Gateway#139
Conversation
Coverage reportClick to see where and how coverage changed
This report was generated by python-coverage-comment-action |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
matteofigus
left a comment
There was a problem hiding this comment.
Thanks for the PR. I've gone through the diff and have some concerns about the overall approach.
The core premise doesn't hold up
The PR is framed as moving tools from the AgentCore runtime to Lambda for better permission scoping, but AgentCore runtime is Lambda under the hood. So this effectively goes from 1 Lambda with in-process tools to 8 Lambdas + Cognito + AgentCore Gateway + OAuth2 — without a meaningful security improvement to justify that complexity.
The permission improvements are marginal and achievable without this change
Comparing old vs new IAM policies side by side:
| Tool | Old (runtime role) | New (per-Lambda role) |
|---|---|---|
| Lambda discovery | lambda:ListFunctions, GetFunction, etc → * |
Same actions → * |
| CloudWatch metrics | cloudwatch:GetMetricStatistics, ListMetrics → * |
Same actions → * |
| CloudWatch logs | logs:StartQuery → /aws/lambda/*, StopQuery/GetQueryResults → * |
Same |
| Pricing | pricing:GetProducts → * |
Same |
| S3 storage | GetObject/PutObject on bucket |
grantReadWrite (broader — adds Abort*, DeleteObject*, GetBucket*, List*) |
| DynamoDB journal | Full table access on runtime | PutItem only ✅ |
The only real win is scoping journal to PutItem. The S3 change is actually a regression. The monitoring permissions are identical because those APIs don't support resource-level scoping. Both improvements could be achieved by tightening the existing runtime role — no new infrastructure needed.
New costs and tradeoffs
- Latency: Every tool call now goes runtime → HTTP → Gateway → OAuth2 check → Lambda invoke → response, vs a direct in-process function call
- Reliability: More failure modes — gateway outages, OAuth token endpoint issues, cold starts on tool Lambdas. If the gateway is down, the agent has zero tools
- Cost: 7 additional Lambdas, Cognito user pool, AgentCore Gateway for what was previously free (in-process calls)
- Complexity: 752 lines of CDK for
gateway.tsalone, plus MCP client, OAuth token caching, etc. - Debugging: Tool call failures now span multiple Lambda invocations and log groups
Other issues
gateway.clientSecret.unsafeUnwrap()passes the Cognito client secret as a plaintext environment variable. Should use Secrets Manager or SSM SecureString.- The interceptor's regex-based prompt injection blocking is trivially bypassable and gives a false sense of security.
use_aws(generic AWS API tool) was removed without replacement — potential functional regression if the agent used it for anything beyond the 5 specific APIs now covered.- Token cache (
_token_cachemodule-level dict) has no thread safety.
Suggestion
If the goal is tighter permissions, I'd suggest just scoping the existing runtime role more tightly (e.g., DynamoDB PutItem only, remove any excess S3 actions). That achieves the same or better security posture without the architectural overhead.
This review was done with the help of Claude Opus 4.6
Issue number: closes #
Summary
Moves all agent tools (storage, journal, and the generic use_aws) out of the agent runtime into standalone Lambda functions behind the Bedrock AgentCore Gateway. Each Lambda has its own scoped IAM role, replacing the previous "overlord permissions" pattern where the runtime held broad access to S3, DynamoDB, Lambda, CloudWatch, and Pricing APIs.
Changes
New Lambda tools (7 handlers): storage_tool, journal_tool, interceptor, lambda_discovery_tool, cloudwatch_metrics_tool, cloudwatch_logs_tool, pricing_tool — each with tool-name routing, structured JSON responses, and scoped IAM.
CDK construct:
gateway.ts — Cognito OAuth2 (client credentials), Lambda targets with tool schemas, interceptor escape hatch, cdk-nag suppressions.
MCP client: src/mcp_client/ — OAuth2 token caching, MCPClient factory for the agent runtime to call tools via the gateway.
Agent runtime (main.py): Replaced storage, journal, and use_aws imports with a single mcp_client.
Infrastructure (infra-stack.ts, agent.ts): Removed direct S3 bucket policy, S3_BUCKET_NAME env var, and monitoringPolicy from the runtime.
Prompt updates: analysis_prompt.md references new tool names. report_prompt.md adds ASCII-only output constraint to prevent UTF-8 mojibake.
Deleted:
journal.py, storage.py, old tests, agent-iam-policies.spec.ts.
Security
Path traversal validation on storage filenames (allowlist regex)
Interceptor blocks prompt injection and redacts PII
Tests
210 Python tests (99% coverage), 41 TypeScript tests (100% statement coverage).
User experience
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Disclaimer: We value your time and bandwidth. As such, any pull requests created on non-triaged issues might not be successful.