feat(observability-on-aws): Add AWS Observability & FinOps plugin #68
theagenticguy wants to merge 19 commits into awslabs:main
Adds a comprehensive AWS observability plugin combining CloudWatch Logs, Metrics, Alarms, Application Signals (APM), CloudTrail security auditing, and automated codebase observability gap analysis. Includes 4 MCP servers (CloudWatch, Application Signals, CloudTrail, AWS Documentation) and 8 reference files covering incident response, log analysis, alerting, performance monitoring, security auditing, observability gap analysis, and Application Signals setup. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Adds a new aws-observability plugin to the Agent Plugins for AWS repo, providing an operational/observability-focused skill that integrates CloudWatch, Application Signals, CloudTrail, and AWS documentation via MCP servers, with supporting steering/reference docs.
Changes:
- Introduces the `aws-observability` plugin manifest and MCP server configuration (CloudWatch, Application Signals, CloudTrail, AWS docs).
- Adds an `aws-observability` skill with progressive-disclosure reference files for incident response, log analysis, alerting, APM, security auditing, and codebase gap analysis.
- Registers the new plugin in the marketplace registry under the `observability` category.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| plugins/aws-observability/.claude-plugin/plugin.json | New plugin manifest (metadata, keywords, version, license). |
| plugins/aws-observability/.mcp.json | Defines MCP servers used by the plugin (stdio via uvx). |
| plugins/aws-observability/skills/aws-observability/SKILL.md | Main skill entrypoint: prerequisites, configuration, capability overview, and reference index. |
| plugins/aws-observability/skills/aws-observability/references/incident-response.md | Incident response workflows and cross-signal correlation guidance. |
| plugins/aws-observability/skills/aws-observability/references/log-analysis.md | CloudWatch Logs Insights syntax/patterns and tool parameter guidance. |
| plugins/aws-observability/skills/aws-observability/references/alerting-setup.md | CloudWatch alarm configuration patterns and best practices. |
| plugins/aws-observability/skills/aws-observability/references/performance-monitoring.md | Application Signals concepts, tool entrypoints, and troubleshooting workflows. |
| plugins/aws-observability/skills/aws-observability/references/security-auditing.md | CloudTrail data-source priority and security/compliance query patterns. |
| plugins/aws-observability/skills/aws-observability/references/observability-gap-analysis.md | Multi-language codebase observability gap analysis framework and templates. |
| plugins/aws-observability/skills/aws-observability/references/application-signals-setup.md | Application Signals enablement guidance using the MCP server enablement tool. |
| plugins/aws-observability/skills/aws-observability/references/cloudtrail-data-source-selection.md | Utility guide describing CloudTrail Lake/Logs/LookupEvents priority strategy. |
| .claude-plugin/marketplace.json | Registers aws-observability in the marketplace. |
- Replace wildcard IAM permissions with least-privilege read-only actions in SKILL.md (Copilot review comment awslabs#5)
- Add missing `| limit 100` to Performance Analysis query example in SKILL.md (Copilot review comment awslabs#4)
- Fix DynamoDB Throttles alarm pattern to use ReadThrottleEvents / WriteThrottleEvents instead of UserErrors (Copilot review comment awslabs#3)
- Fix lookup_events example to use 90-day window matching API limits (Copilot review comment awslabs#1)
- Remove orphaned pattern numbering ("Pattern 2/3/4" with no Pattern 1) in security-auditing.md (Copilot review comment awslabs#2)
- Replace all "steering file" terminology with "reference" across all 8 reference files for consistency with plugin conventions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace awslabs.aws-documentation-mcp-server (local stdio via uvx) with awsknowledge (remote HTTP at knowledge-mcp.global.api.aws), matching the pattern used by deploy-on-aws plugin. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
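For readers unfamiliar with the two transport styles, the switch described in this commit might look roughly like the sketch below. It is not the PR's actual `.mcp.json`: the `mcpServers`, `command`, `args`, `env`, `type`, and `url` keys follow a common MCP client config layout and are assumed here, as are the `https://` scheme and the `@latest` suffix. Only the server names, `uvx`, the `AWS_PROFILE` / `AWS_REGION` defaults, and the `knowledge-mcp.global.api.aws` host come from this conversation.

```python
import json


def build_mcp_config(profile: str = "default", region: str = "us-east-1") -> dict:
    """Return a hypothetical .mcp.json payload with one stdio and one HTTP server."""
    return {
        "mcpServers": {
            # Local stdio server launched through uvx (key names are assumed).
            "awslabs.cloudwatch-mcp-server": {
                "command": "uvx",
                "args": ["awslabs.cloudwatch-mcp-server@latest"],
                "env": {"AWS_PROFILE": profile, "AWS_REGION": region},
            },
            # Remote HTTP server: no local process, so no command/args/env.
            "awsknowledge": {
                "type": "http",
                "url": "https://knowledge-mcp.global.api.aws",
            },
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_mcp_config(), indent=2))
```

The practical difference is that the remote entry needs no Python or `uv` on the user's machine, while the stdio entries still depend on local AWS credentials.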
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 10 comments.
- Add awslabs.billing-cost-management-mcp-server (stdio) to .mcp.json for cost analysis, forecasting, Compute Optimizer, Budgets, and Billing Conductor capabilities
- Update SKILL.md: add Billing & Cost Management capability section, MCP server table entry, IAM permissions, and clarify configuration applies to all 4 stdio servers (not just cloudwatch-mcp-server)
- Add missing `| limit` clauses to log-analysis.md patterns 3, 4, 5, 9
- Reduce incident-response.md quick error snapshot from limit 1000 to limit 100 to avoid context overflow
- Update Cost Explorer references in incident-response.md to use the billing-cost-management MCP server

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
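The `| limit` fixes in this commit could also be enforced mechanically. The following is a minimal sketch of such a guard, assuming queries are plain strings whose last stage is the limit; `ensure_limit` and the default of 100 rows are illustrative names and values, not part of the plugin.

```python
import re

# Matches a trailing "| limit N" stage at the very end of the query.
_LIMIT_RE = re.compile(r"\|\s*limit\s+(\d+)\s*$", re.IGNORECASE)


def ensure_limit(query: str, max_rows: int = 100) -> str:
    """Append `| limit N` if missing, or clamp a larger trailing limit to max_rows."""
    stripped = query.rstrip()
    match = _LIMIT_RE.search(stripped)
    if match is None:
        return f"{stripped}\n| limit {max_rows}"
    if int(match.group(1)) > max_rows:
        # Keep everything before the old limit stage and rewrite only that stage.
        return f"{stripped[:match.start()]}| limit {max_rows}"
    return stripped
```

Clamping rather than rejecting keeps existing query examples usable while still bounding how many rows reach the agent's context.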
Restructure SKILL.md from a human-facing README style to an agent-optimized skill format:
- Replace verbose Prerequisites/Configuration sections (20+ lines of IAM permissions, JSON examples, quick test) with a single-line config note
- Move IAM permissions and setup details to new references/prerequisites.md for on-demand loading
- Merge Capabilities and MCP Servers into a single decision table
- Replace flat "Reference Files" list with a "Workflow Decision Tree" that tells the agent exactly when to load each reference
- Rename "Best Practices" to "Key Tool Entry Points" with actionable tool-selection guidance
- Add billing/cost trigger keywords to description frontmatter

Result: SKILL.md drops from ~172 lines to ~97 lines. Initial agent context is sharper and more actionable.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Alain Krok <alkrok@amazon.com>
Signed-off-by: Alain Krok <alkrok@amazon.com>
| **Default:** Uses `default` AWS profile and `us-east-1` region. |
| ## Required IAM Permissions (read-only, least-privilege) |
The header says "Required IAM Permissions (read-only, least-privilege)" but the list includes `ce:*`, `billingconductor:*`, etc. These wildcards include write actions such as `ce:CreateAnomalyMonitor` and `ce:DeleteCostCategoryDefinition`. We should either replace them with actual read-only actions (e.g., `ce:GetCostAndUsage`, `ce:GetCostForecast`) or remove the "read-only, least-privilege" claim.
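A check along these lines could catch the issue before review. This is only a name-based heuristic sketched for illustration: the prefix list and the function name are assumptions, and it will also flag actions such as `logs:StartQuery` that do not mutate resources but lack a read-style verb, so flagged entries still need a manual look.

```python
# Verb prefixes treated as read-style; this list is an assumption, not an AWS rule.
READ_ONLY_PREFIXES = ("Get", "List", "Describe", "View", "Lookup", "BatchGet")


def find_non_read_only(actions):
    """Return actions that contain a wildcard or do not start with a read-style verb."""
    flagged = []
    for action in actions:
        _, _, name = action.partition(":")
        if "*" in name or not name.startswith(READ_ONLY_PREFIXES):
            flagged.append(action)
    return flagged


if __name__ == "__main__":
    sample = ["ce:*", "ce:GetCostAndUsage", "budgets:ViewBudget", "billingconductor:*"]
    print(find_non_read_only(sample))
```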
| ## Required IAM Permissions (read-only, least-privilege) |
| - **CloudWatch Metrics & Alarms**: `cloudwatch:GetMetricData`, `cloudwatch:GetMetricStatistics`, `cloudwatch:ListMetrics`, `cloudwatch:DescribeAlarms`, `cloudwatch:DescribeAlarmsForMetric`, `cloudwatch:DescribeAlarmHistory`, `cloudwatch:DescribeAnomalyDetectors` |
Also, the plugin's Priority 1 data source is CloudTrail Lake. Aren't we missing the permissions to query it, such as `cloudtrail:ListEventDataStores`, `cloudtrail:StartQuery`, ...?
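As a starting point for that discussion, a policy statement for CloudTrail Lake queries might look like the sketch below. The first two actions are the ones named in the comment above; the remaining three are an assumption about what a query workflow typically also needs, and the `Sid` and the `*` resource are placeholders that a real policy would likely scope to specific event data store ARNs.

```python
import json

CLOUDTRAIL_LAKE_ACTIONS = [
    "cloudtrail:ListEventDataStores",  # named in the review comment
    "cloudtrail:StartQuery",           # named in the review comment
    "cloudtrail:GetEventDataStore",    # assumption: inspect a store before querying
    "cloudtrail:DescribeQuery",        # assumption: poll query status
    "cloudtrail:GetQueryResults",      # assumption: fetch results
]


def lake_query_statement(resource: str = "*") -> dict:
    """Build one Allow statement covering the Lake query actions listed above."""
    return {
        "Sid": "CloudTrailLakeQuery",
        "Effect": "Allow",
        "Action": CLOUDTRAIL_LAKE_ACTIONS,
        "Resource": resource,
    }


def policy_document() -> str:
    """Serialize a minimal IAM policy document containing that statement."""
    policy = {"Version": "2012-10-17", "Statement": [lake_query_statement()]}
    return json.dumps(policy, indent=2)
```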
| ### Pattern 10: Anomaly Detection |
| ``` |
| anomaly @message |
Shouldn't the `anomaly` keyword follow a `pattern` keyword (e.g., `pattern @message | anomaly`)? See https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_QuerySyntax-Anomaly.html
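To make the ordering concrete, here is a small sketch that builds a query in the `pattern @message | anomaly` form and checks that ordering on arbitrary query strings. Both function names are hypothetical, and splitting stages on `|` is a simplification that would misread a query containing a literal pipe inside a regex or string.

```python
def build_anomaly_query(field: str = "@message") -> str:
    """Build a Logs Insights query where `anomaly` is piped from `pattern`."""
    return "\n".join([f"fields @timestamp, {field}", f"| pattern {field}", "| anomaly"])


def anomaly_follows_pattern(query: str) -> bool:
    """Return True if every `anomaly` stage is directly preceded by a `pattern` stage."""
    stages = [stage.strip() for stage in query.split("|")]
    commands = [stage.split()[0] if stage else "" for stage in stages]
    for index, command in enumerate(commands):
        if command == "anomaly" and (index == 0 or commands[index - 1] != "pattern"):
            return False
    return True
```

With this check, the snippet under review (`anomaly @message` on its own) is rejected, while the form suggested in the comment passes.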
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 14 comments.
| - Full SQL support with JOINs, aggregations, and window functions |
| - 7-year retention by default |
| - Cross-account and cross-region queries |
| - Cost-effective for large-scale analysis |
| # Prerequisites and Configuration |
| ## Requirements |
| 1. **AWS CLI configured** with credentials (`aws configure` or `~/.aws/credentials`) |
| 2. **Python 3.10+** and `uv` installed |
| 3. **Application Signals enabled** in your AWS account when applicable |
| # Application Signals Setup and Enablement Guide |
| This reference provides comprehensive guidance for setting up AWS Application Signals using the plugin's enablement guide feature. |
| ## Quick Start: Get Enablement Guide |
| - Use `list_event_data_stores` to check for enabled event data stores |
| - If available, use `lake_query` for SQL-based analysis |
| - Best for complex queries, long-term retention (7 years), and cost efficiency |
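The priority order these bullets describe (CloudTrail Lake first, then CloudTrail logs, then LookupEvents) can be sketched as a small selection function. The tool names `lake_query` and `lookup_events` appear in this PR; the `cloudwatch_logs_insights` label, the function names, and the argument shapes are assumptions made for illustration. The 90-day clamp reflects the LookupEvents window mentioned in an earlier commit on this PR.

```python
from datetime import datetime, timedelta, timezone

LOOKUP_EVENTS_MAX_DAYS = 90  # LookupEvents only covers roughly the last 90 days


def choose_cloudtrail_source(event_data_stores, log_group=None) -> str:
    """Pick a data source in priority order: Lake, then CloudWatch Logs, then LookupEvents."""
    if event_data_stores:
        return "lake_query"
    if log_group:
        return "cloudwatch_logs_insights"  # hypothetical label for the Logs path
    return "lookup_events"


def clamp_lookup_window(start: datetime, end: datetime):
    """Trim a requested time range so it fits inside the LookupEvents window."""
    earliest = end - timedelta(days=LOOKUP_EVENTS_MAX_DAYS)
    return max(start, earliest), end


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(choose_cloudtrail_source([]), clamp_lookup_window(now - timedelta(days=365), now))
```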
| ## Required IAM Permissions (read-only, least-privilege) |
| - **CloudWatch Metrics & Alarms**: `cloudwatch:GetMetricData`, `cloudwatch:GetMetricStatistics`, `cloudwatch:ListMetrics`, `cloudwatch:DescribeAlarms`, `cloudwatch:DescribeAlarmsForMetric`, `cloudwatch:DescribeAlarmHistory`, `cloudwatch:DescribeAnomalyDetectors` |
| - **CloudWatch Logs**: `logs:DescribeLogGroups`, `logs:DescribeLogStreams`, `logs:GetLogEvents`, `logs:FilterLogEvents`, `logs:StartQuery`, `logs:StopQuery`, `logs:GetQueryResults`, `logs:DescribeQueries` |
| - **X-Ray**: `xray:BatchGetTraces`, `xray:GetTraceSummaries`, `xray:GetTraceGraph`, `xray:GetServiceGraph`, `xray:GetTimeSeriesServiceStatistics` |
| - **CloudTrail**: `cloudtrail:LookupEvents`, `cloudtrail:DescribeTrails`, `cloudtrail:GetTrail`, `cloudtrail:ListTrails`, `cloudtrail:GetEventSelectors` |
| - **Application Signals**: `application-signals:GetService`, `application-signals:ListServices`, `application-signals:ListServiceOperations`, `application-signals:GetServiceLevelObjective`, `application-signals:ListServiceLevelObjectives`, `application-signals:BatchGetServiceLevelObjectiveBudgetReport` |
| - **Billing & Cost Management**: `ce:*`, `cost-optimization-hub:*`, `compute-optimizer:*`, `budgets:ViewBudget`, `pricing:*`, `freetier:GetFreeTierUsage`, `bcm-pricing-calculator:*`, `billingconductor:*` |
| - `synthetics:GetCanary`, `synthetics:GetCanaryRuns` for canary analysis |
| This reference provides guidance for accessing and analyzing CloudTrail audit data for security auditing, compliance monitoring, and governance analysis. |
| # CloudWatch Logs Insights Analysis |
| ## Purpose |
| This reference provides guidance for using CloudWatch Logs Insights QL syntax for log analysis, troubleshooting, and data extraction via the CloudWatch MCP server. |
… IAM permissions

Split 6 oversized reference files (530-804 lines each) into focused sub-references under ~100 lines per the design guidelines, improving progressive disclosure and reducing agent context pressure.

Changes:
- Split security-auditing.md → security-investigations.md, security-monitoring.md
- Split performance-monitoring.md → performance-traces.md, performance-slos.md
- Split incident-response.md → incident-patterns.md, incident-postmortem.md
- Split alerting-setup.md → alerting-advanced.md
- Split observability-gap-analysis.md → observability-language-patterns.md
- Trim cloudtrail-data-source-selection.md from 346 to 92 lines
- Fix prerequisites.md: replace billing wildcards (ce:*, billingconductor:*) with specific read-only actions, add CloudTrail Lake query permissions
- Fix SKILL.md: replace direct cloudtrail-data-source-selection.md link with prerequisites.md link (fixes orphaned file + utility-not-directly-loaded)
- Fix CloudTrail Lake retention wording: "configurable retention" not "7-year default"
- Verify anomaly syntax: pattern @message | anomaly (already correct)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ix tool names

- Rename plugin from aws-observability to observability-on-aws (directories, plugin.json, marketplace.json, SKILL.md frontmatter)
- Fix SKILL.md: replace incorrect tool names (cost-explorer, compute-optimizer) with actual MCP server tool names (get_cost_and_usage, get_cost_forecast, list_recommendations)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| "description": "Comprehensive AWS observability platform combining CloudWatch Logs, Metrics, Alarms, Application Signals (APM), CloudTrail security auditing, and automated codebase observability gap analysis for complete monitoring, troubleshooting, and optimization.", |
| "homepage": "https://github.com/awslabs/agent-plugins", |
The plugin manifest description omits Billing/Cost Management, but this plugin config includes the Billing & Cost Management MCP server and SKILL.md advertises cost workflows. Consider updating the manifest description (and possibly keywords) to reflect the full capability set so discovery in the marketplace matches what the plugin actually provides.
| "category": "observability", |
| "description": "Comprehensive AWS observability platform combining CloudWatch Logs, Metrics, Alarms, Application Signals (APM), CloudTrail security auditing, and automated codebase observability gap analysis.", |
| "keywords": [ |
| "aws", |
| "observability", |
| "cloudwatch", |
| "monitoring", |
| "logs", |
| "metrics", |
| "alarms", |
| "application-signals", |
| "apm", |
| "cloudtrail", |
| "security", |
| "tracing" |
| ], |
This marketplace entry description omits Billing/Cost Management, but the plugin ships with the billing-cost-management MCP server and the skill docs describe cost analysis workflows. Updating the marketplace description/keywords would make the listing accurately reflect the plugin's capabilities.
… correct

The billing MCP server exposes single dispatcher tools (cost-explorer, compute-optimizer) that route via an operation parameter, not individual tools per API call. Verified against actual source code at awslabs/mcp/src/billing-cost-management-mcp-server/tools/.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
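The dispatcher shape described in this commit message can be illustrated with a generic sketch. It is not the billing MCP server's code: the operation names, the stub handlers, and the `cost_explorer` function signature are all hypothetical, and the real tool is named `cost-explorer` with its own operation vocabulary.

```python
from typing import Any, Callable, Dict

Handler = Callable[[Dict[str, Any]], Dict[str, Any]]


def _stub(name: str) -> Handler:
    """Create a placeholder handler that echoes its inputs (stands in for an API call)."""
    def handler(params: Dict[str, Any]) -> Dict[str, Any]:
        return {"operation": name, "params": params}
    return handler


# Hypothetical operation names; the real server's names may differ.
_OPERATIONS: Dict[str, Handler] = {
    name: _stub(name) for name in ("get_cost_and_usage", "get_cost_forecast")
}


def cost_explorer(operation: str, **params: Any) -> Dict[str, Any]:
    """Single tool entry point that routes on the `operation` parameter."""
    handler = _OPERATIONS.get(operation)
    if handler is None:
        raise ValueError(
            f"Unsupported operation {operation!r}; expected one of {sorted(_OPERATIONS)}"
        )
    return handler(params)
```

This is why the skill text has to name the dispatcher tool plus an operation value rather than a per-API tool name.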
…words

Update plugin.json and marketplace.json descriptions to mention Billing & Cost Management and FinOps. Add keywords: billing, cost-management, finops, incident-response. Add finops tag.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| "description": "Comprehensive AWS observability and FinOps platform combining CloudWatch Logs, Metrics, Alarms, Application Signals (APM), CloudTrail security auditing, Billing & Cost Management, and automated codebase observability gap analysis for monitoring, troubleshooting, cost optimization, and incident response.", |
| "homepage": "https://github.com/awslabs/agent-plugins", |
| "keywords": [ |
| "aws", |
| "observability", |
| "cloudwatch", |
| "monitoring", |
| "logs", |
| "metrics", |
| "alarms", |
| "application-signals", |
| "apm", |
| "cloudtrail", |
| "security", |
| "tracing", |
The plugin manifest description/keywords don’t mention Billing & Cost Management, but the skill and .mcp.json include the billing-cost-management MCP server. Consider updating the manifest metadata to reflect that capability for consistency and discoverability.
| ## Critical Alerting Patterns |
| Use the Root Account and Security Group queries above as CloudWatch Alarms. Additional critical alerts: |
CloudWatch Alarms can’t be created directly from Logs Insights queries; they alarm on metrics. This section should clarify the needed translation step (e.g., metric filters on the CloudTrail log group, or scheduled Logs Insights queries that publish metrics) before creating alarms.
| Use the Root Account and Security Group queries above as CloudWatch Alarms. Additional critical alerts: | |
| Use the Root Account and Security Group queries above as the basis for CloudWatch Alarms by first turning these patterns into metrics (for example, with CloudWatch metric filters on the CloudTrail log group or scheduled Logs Insights queries that publish metrics), then creating alarms on those metrics. Additional critical alerts (as queries you can similarly translate into metrics and alarms): |
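The translation step in this suggestion could look roughly like the sketch below, which pairs a CloudWatch Logs metric filter with an alarm on the resulting metric. The filter pattern is the widely used root-account-usage pattern; the filter, metric, namespace, and alarm names are placeholders, and the threshold and period are assumptions. Because `apply` creates resources, it belongs in a user-run provisioning workflow rather than in the read-only plugin itself.

```python
ROOT_USAGE_PATTERN = (
    '{ $.userIdentity.type = "Root" && $.userIdentity.invokedBy NOT EXISTS '
    '&& $.eventType != "AwsServiceEvent" }'
)
NAMESPACE = "Security/CloudTrail"      # placeholder namespace
METRIC_NAME = "RootAccountUsageCount"  # placeholder metric name


def metric_filter_args(log_group: str) -> dict:
    """Arguments for logs.put_metric_filter: emit 1 per matching CloudTrail event."""
    return {
        "logGroupName": log_group,
        "filterName": "RootAccountUsage",
        "filterPattern": ROOT_USAGE_PATTERN,
        "metricTransformations": [
            {"metricName": METRIC_NAME, "metricNamespace": NAMESPACE, "metricValue": "1"}
        ],
    }


def alarm_args() -> dict:
    """Arguments for cloudwatch.put_metric_alarm on the metric emitted by the filter."""
    return {
        "AlarmName": "root-account-usage",
        "Namespace": NAMESPACE,
        "MetricName": METRIC_NAME,
        "Statistic": "Sum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",
    }


def apply(log_group: str) -> None:
    """Create the filter and alarm. This mutates AWS resources."""
    import boto3  # imported lazily so the builders above work without AWS access

    boto3.client("logs").put_metric_filter(**metric_filter_args(log_group))
    boto3.client("cloudwatch").put_metric_alarm(**alarm_args())
```

Setting `TreatMissingData` to `notBreaching` keeps the alarm in OK state during the long stretches when no root activity is logged.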
| # AWS Observability |
| Requires AWS CLI credentials. All stdio MCP servers use `AWS_PROFILE` and `AWS_REGION` from their env config (defaults: `default` profile, `us-east-1`). |
The PR description/RFC scope says this plugin is intended to be read-only (no AWS resource provisioning/modification), but SKILL.md doesn’t state that explicitly and the incident-response guidance includes mitigation actions that could be interpreted as “do this now”. Add an explicit note here that the plugin should only query/inspect and provide recommendations unless the user explicitly asks to make changes (and ideally direct provisioning changes to an appropriate workflow/plugin).
| Note: This plugin is read-only. It should only query and inspect AWS resources and provide recommendations. It must not provision, modify, or delete AWS resources unless the user explicitly asks for a change, and such changes should preferably be executed via a dedicated deployment or provisioning workflow/plugin. |
Adds a new observability-on-aws plugin — a comprehensive AWS observability and FinOps platform combining CloudWatch Logs, Metrics, Alarms, Application Signals (APM), CloudTrail security auditing, Billing & Cost Management, and automated codebase observability gap analysis.
Related
Changes
New plugin: `observability-on-aws` (20 files)
- `.claude-plugin/plugin.json`: name, description, keywords (including billing, cost-management, finops, incident-response), Apache-2.0 license
- `.mcp.json`: 5 servers from AWS Labs:
  - `awslabs.cloudwatch-mcp-server` (stdio): CloudWatch Logs, Metrics, Alarms
  - `awslabs.cloudwatch-applicationsignals-mcp-server` (stdio): Application Signals APM, SLOs, tracing
  - `awslabs.cloudtrail-mcp-server` (stdio): CloudTrail security auditing
  - `awslabs.billing-cost-management-mcp-server` (stdio): Cost Explorer, Compute Optimizer, Budgets, FinOps analysis
  - `awsknowledge` (HTTP): AWS documentation, architecture guidance
- `skills/observability-on-aws/SKILL.md`: Auto-triggers on observability intent (monitoring, troubleshooting, incident response, log analysis, security audit, cost analysis, billing, FinOps, etc.) with progressive disclosure via reference file links
- `skills/observability-on-aws/references/`:
  - `prerequisites.md`: IAM permissions (least-privilege, including CloudTrail Lake), MCP server configuration
  - `incident-response.md`: Incident management framework (detection, investigation, mitigation)
  - `incident-patterns.md`: Common incident patterns (deployment, resource exhaustion, dependency failure)
  - `incident-postmortem.md`: Root cause analysis and postmortem templates
  - `log-analysis.md`: CloudWatch Logs Insights query patterns and syntax reference
  - `alerting-setup.md`: Core alarm concepts, recommended configurations, alarm patterns
  - `alerting-advanced.md`: Composite alarms, anomaly detection, SLO-based alerting
  - `performance-monitoring.md`: Application Signals APM overview and tool catalog
  - `performance-traces.md`: Transaction search patterns, X-Ray filters, trace analysis
  - `performance-slos.md`: SLO configuration, troubleshooting workflows, best practices
  - `security-auditing.md`: CloudTrail core concepts, data source priority, query examples
  - `security-investigations.md`: Security incident investigation and compliance audit queries
  - `security-monitoring.md`: Security monitoring use cases and critical alerting patterns
  - `observability-gap-analysis.md`: Codebase observability audit framework
  - `observability-language-patterns.md`: Language-specific observability patterns (Python, Java, JS, Go)
  - `application-signals-setup.md`: Application Signals enablement guide
  - `cloudtrail-data-source-selection.md`: CloudTrail data source priority decision tree
- `.claude-plugin/marketplace.json`: Added `observability-on-aws` entry with category `observability`, tags include `finops`

Build validation: All checks pass (markdown lint, manifest validation, cross-reference validation, formatting, security scans).
Acknowledgment
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the project license.