feat(observability-on-aws): Add AWS Observability & FinOps plugin #68

Open
theagenticguy wants to merge 19 commits into awslabs:main from theagenticguy:aws-observability

Conversation

@theagenticguy
Contributor

@theagenticguy theagenticguy commented Feb 26, 2026

Adds a new observability-on-aws plugin — a comprehensive AWS observability and FinOps platform combining CloudWatch Logs, Metrics, Alarms, Application Signals (APM), CloudTrail security auditing, Billing & Cost Management, and automated codebase observability gap analysis.

Related

Changes

New plugin: observability-on-aws (20 files)

  • Plugin manifest (.claude-plugin/plugin.json): name, description, keywords (including billing, cost-management, finops, incident-response), Apache-2.0 license
  • MCP servers (.mcp.json): 5 servers from AWS Labs:
    • awslabs.cloudwatch-mcp-server (stdio) — CloudWatch Logs, Metrics, Alarms
    • awslabs.cloudwatch-applicationsignals-mcp-server (stdio) — Application Signals APM, SLOs, tracing
    • awslabs.cloudtrail-mcp-server (stdio) — CloudTrail security auditing
    • awslabs.billing-cost-management-mcp-server (stdio) — Cost Explorer, Compute Optimizer, Budgets, FinOps analysis
    • awsknowledge (HTTP) — AWS documentation, architecture guidance
  • Skill (skills/observability-on-aws/SKILL.md): Auto-triggers on observability intent (monitoring, troubleshooting, incident response, log analysis, security audit, cost analysis, billing, FinOps, etc.) with progressive disclosure via reference file links
  • Reference files (17 files in skills/observability-on-aws/references/):
    • prerequisites.md — IAM permissions (least-privilege, including CloudTrail Lake), MCP server configuration
    • incident-response.md — Incident management framework (detection, investigation, mitigation)
    • incident-patterns.md — Common incident patterns (deployment, resource exhaustion, dependency failure)
    • incident-postmortem.md — Root cause analysis and postmortem templates
    • log-analysis.md — CloudWatch Logs Insights query patterns and syntax reference
    • alerting-setup.md — Core alarm concepts, recommended configurations, alarm patterns
    • alerting-advanced.md — Composite alarms, anomaly detection, SLO-based alerting
    • performance-monitoring.md — Application Signals APM overview and tool catalog
    • performance-traces.md — Transaction search patterns, X-Ray filters, trace analysis
    • performance-slos.md — SLO configuration, troubleshooting workflows, best practices
    • security-auditing.md — CloudTrail core concepts, data source priority, query examples
    • security-investigations.md — Security incident investigation and compliance audit queries
    • security-monitoring.md — Security monitoring use cases and critical alerting patterns
    • observability-gap-analysis.md — Codebase observability audit framework
    • observability-language-patterns.md — Language-specific observability patterns (Python, Java, JS, Go)
    • application-signals-setup.md — Application Signals enablement guide
    • cloudtrail-data-source-selection.md — CloudTrail data source priority decision tree
  • Marketplace registry (.claude-plugin/marketplace.json): Added observability-on-aws entry with category observability, tags include finops

Build validation: All checks pass (markdown lint, manifest validation, cross-reference validation, formatting, security scans).

Acknowledgment

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the project license.

Adds a comprehensive AWS observability plugin combining CloudWatch Logs,
Metrics, Alarms, Application Signals (APM), CloudTrail security auditing,
and automated codebase observability gap analysis.

Includes 4 MCP servers (CloudWatch, Application Signals, CloudTrail,
AWS Documentation) and 8 reference files covering incident response,
log analysis, alerting, performance monitoring, security auditing,
observability gap analysis, and Application Signals setup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI left a comment

Pull request overview

Adds a new aws-observability plugin to the Agent Plugins for AWS repo, providing an operational/observability-focused skill that integrates CloudWatch, Application Signals, CloudTrail, and AWS documentation via MCP servers, with supporting steering/reference docs.

Changes:

  • Introduces the aws-observability plugin manifest and MCP server configuration (CloudWatch, Application Signals, CloudTrail, AWS docs).
  • Adds an aws-observability skill with progressive-disclosure reference files for incident response, log analysis, alerting, APM, security auditing, and codebase gap analysis.
  • Registers the new plugin in the marketplace registry under the observability category.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| plugins/aws-observability/.claude-plugin/plugin.json | New plugin manifest (metadata, keywords, version, license). |
| plugins/aws-observability/.mcp.json | Defines MCP servers used by the plugin (stdio via uvx). |
| plugins/aws-observability/skills/aws-observability/SKILL.md | Main skill entrypoint: prerequisites, configuration, capability overview, and reference index. |
| plugins/aws-observability/skills/aws-observability/references/incident-response.md | Incident response workflows and cross-signal correlation guidance. |
| plugins/aws-observability/skills/aws-observability/references/log-analysis.md | CloudWatch Logs Insights syntax/patterns and tool parameter guidance. |
| plugins/aws-observability/skills/aws-observability/references/alerting-setup.md | CloudWatch alarm configuration patterns and best practices. |
| plugins/aws-observability/skills/aws-observability/references/performance-monitoring.md | Application Signals concepts, tool entrypoints, and troubleshooting workflows. |
| plugins/aws-observability/skills/aws-observability/references/security-auditing.md | CloudTrail data-source priority and security/compliance query patterns. |
| plugins/aws-observability/skills/aws-observability/references/observability-gap-analysis.md | Multi-language codebase observability gap analysis framework and templates. |
| plugins/aws-observability/skills/aws-observability/references/application-signals-setup.md | Application Signals enablement guidance using the MCP server enablement tool. |
| plugins/aws-observability/skills/aws-observability/references/cloudtrail-data-source-selection.md | Utility guide describing CloudTrail Lake/Logs/LookupEvents priority strategy. |
| .claude-plugin/marketplace.json | Registers aws-observability in the marketplace. |

krokoko previously approved these changes Feb 27, 2026
@theagenticguy theagenticguy added the do-not-merge label Feb 27, 2026
Copilot AI review requested due to automatic review settings March 3, 2026 17:27
Copilot AI left a comment

Copilot was unable to review this pull request because the user who requested the review is ineligible. To be eligible to request a review, you need a paid Copilot license, or your organization must enable Copilot code review.

@krokoko krokoko self-requested a review March 3, 2026 17:28
- Replace wildcard IAM permissions with least-privilege read-only actions
  in SKILL.md (Copilot review comment awslabs#5)
- Add missing `| limit 100` to Performance Analysis query example in
  SKILL.md (Copilot review comment awslabs#4)
- Fix DynamoDB Throttles alarm pattern to use ReadThrottleEvents /
  WriteThrottleEvents instead of UserErrors (Copilot review comment awslabs#3)
- Fix lookup_events example to use 90-day window matching API limits
  (Copilot review comment awslabs#1)
- Remove orphaned pattern numbering ("Pattern 2/3/4" with no Pattern 1)
  in security-auditing.md (Copilot review comment awslabs#2)
- Replace all "steering file" terminology with "reference" across all 8
  reference files for consistency with plugin conventions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
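
To illustrate the 90-day fix above: CloudTrail's LookupEvents API only covers roughly the last 90 days of management events, so a requested time range can be clipped before the call is made. The following is a minimal sketch, not code from this PR; the helper name `clamp_lookup_window` and the constant are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# LookupEvents only returns events from about the last 90 days.
LOOKUP_EVENTS_MAX_DAYS = 90


def clamp_lookup_window(start, end, now=None):
    """Clip (start, end) to the window LookupEvents can serve.

    Hypothetical helper for illustration; all datetimes must be timezone-aware.
    """
    now = now or datetime.now(timezone.utc)
    earliest = now - timedelta(days=LOOKUP_EVENTS_MAX_DAYS)
    end = min(end, now)            # never query into the future
    start = max(start, earliest)   # never query beyond the 90-day history
    if start > end:
        raise ValueError("requested window lies entirely outside the last 90 days")
    return start, end
```

Ranges older than the window would need a different data source, such as CloudTrail Lake or CloudWatch Logs, which is the priority order the reference files describe.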
Replace awslabs.aws-documentation-mcp-server (local stdio via uvx)
with awsknowledge (remote HTTP at knowledge-mcp.global.api.aws),
matching the pattern used by deploy-on-aws plugin.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 5, 2026 14:27
@theagenticguy theagenticguy removed the do-not-merge label Mar 5, 2026
Copilot AI left a comment

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 10 comments.


theagenticguy and others added 2 commits March 5, 2026 11:11
- Add awslabs.billing-cost-management-mcp-server (stdio) to .mcp.json
  for cost analysis, forecasting, Compute Optimizer, Budgets, and
  Billing Conductor capabilities
- Update SKILL.md: add Billing & Cost Management capability section,
  MCP server table entry, IAM permissions, and clarify configuration
  applies to all 4 stdio servers (not just cloudwatch-mcp-server)
- Add missing `| limit` clauses to log-analysis.md patterns 3, 4, 5, 9
- Reduce incident-response.md quick error snapshot from limit 1000 to
  limit 100 to avoid context overflow
- Update Cost Explorer references in incident-response.md to use the
  billing-cost-management MCP server

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
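
The `| limit` additions in this commit keep Logs Insights result sets small enough for an agent's context. As a rough sketch of the same idea applied programmatically (the helper `ensure_limit` is hypothetical and not part of the plugin), a query can be given a default limit only when it has none:

```python
import re

# Matches an existing `| limit N` stage. This is a naive check: it would also
# match the text "| limit 5" inside a quoted string or regex literal.
_LIMIT_STAGE = re.compile(r"\|\s*limit\s+\d+\b", re.IGNORECASE)


def ensure_limit(query, max_rows=100):
    """Append `| limit max_rows` unless the query already has a limit stage."""
    if _LIMIT_STAGE.search(query):
        return query
    return f"{query.rstrip()}\n| limit {max_rows}"
```

The default of 100 mirrors the value the commit chose for the quick error snapshot; other callers may want a different cap.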
Restructure SKILL.md from a human-facing README style to an
agent-optimized skill format:

- Replace verbose Prerequisites/Configuration sections (20+ lines of
  IAM permissions, JSON examples, quick test) with a single-line config
  note
- Move IAM permissions and setup details to new
  references/prerequisites.md for on-demand loading
- Merge Capabilities and MCP Servers into a single decision table
- Replace flat "Reference Files" list with a "Workflow Decision Tree"
  that tells the agent exactly when to load each reference
- Rename "Best Practices" to "Key Tool Entry Points" with actionable
  tool-selection guidance
- Add billing/cost trigger keywords to description frontmatter

Result: SKILL.md drops from ~172 lines to ~97 lines. Initial agent
context is sharper and more actionable.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 5, 2026 19:02
Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@theagenticguy theagenticguy enabled auto-merge March 5, 2026 19:22
Signed-off-by: Alain Krok <alkrok@amazon.com>
@krokoko krokoko self-assigned this Mar 9, 2026
Signed-off-by: Alain Krok <alkrok@amazon.com>
Copilot AI review requested due to automatic review settings March 9, 2026 18:03
Copilot AI left a comment

Copilot was unable to review this pull request because the user who requested the review is ineligible. To be eligible to request a review, you need a paid Copilot license, or your organization must enable Copilot code review.


**Default:** Uses `default` AWS profile and `us-east-1` region.

## Required IAM Permissions (read-only, least-privilege)
Contributor

The header says "Required IAM Permissions (read-only, least-privilege)" but lists `ce:*`, `billingconductor:*`, etc. These wildcards include write actions like ce:CreateAnomalyMonitor and ce:DeleteCostCategoryDefinition. We should either replace them with actual read-only actions (e.g., ce:GetCostAndUsage, ce:GetCostForecast) or remove the "read-only, least-privilege" claim.
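
A check along these lines could catch the problem before review. This is only a sketch under stated assumptions: the function names are hypothetical, and the verb-prefix list is a heuristic rather than an authoritative classification (read-style actions such as `logs:StartQuery` or `logs:FilterLogEvents` do not match it and would need manual review).

```python
# Heuristic verb prefixes that usually indicate read-only IAM actions (assumed list).
READ_ONLY_PREFIXES = ("Get", "List", "Describe", "View", "Lookup", "BatchGet")


def find_wildcard_actions(actions):
    """Return the IAM actions that contain a `*` wildcard."""
    return [action for action in actions if "*" in action]


def looks_read_only(action):
    """True if the action has no wildcard and starts with a read-style verb."""
    _service, _, name = action.partition(":")
    return "*" not in name and name.startswith(READ_ONLY_PREFIXES)
```

For example, `find_wildcard_actions(["ce:*", "ce:GetCostAndUsage"])` flags only `ce:*`, and `looks_read_only("ce:CreateAnomalyMonitor")` is false.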


## Required IAM Permissions (read-only, least-privilege)

- **CloudWatch Metrics & Alarms**: `cloudwatch:GetMetricData`, `cloudwatch:GetMetricStatistics`, `cloudwatch:ListMetrics`, `cloudwatch:DescribeAlarms`, `cloudwatch:DescribeAlarmsForMetric`, `cloudwatch:DescribeAlarmHistory`, `cloudwatch:DescribeAnomalyDetectors`
Contributor

Also, the plugin's Priority 1 data source is CloudTrail Lake. Aren't we missing the permissions to query it, like cloudtrail:ListEventDataStores, cloudtrail:StartQuery, ...?

### Pattern 10: Anomaly Detection

```
anomaly @message
```
Contributor

Shouldn't the anomaly keyword follow a pattern keyword (e.g., pattern @message | anomaly)? See https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_QuerySyntax-Anomaly.html
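
The ordering rule raised here could be linted across the reference files. The sketch below assumes the reviewer's reading of the documentation (an `anomaly` stage is only valid after a `pattern` stage); the function name is hypothetical, and the pipe split is naive, so a `|` inside a regex or string literal would confuse it.

```python
def anomaly_follows_pattern(query):
    """True if every `anomaly` stage appears after a `pattern` stage."""
    seen_pattern = False
    for stage in query.split("|"):
        words = stage.split()
        command = words[0].lower() if words else ""
        if command == "pattern":
            seen_pattern = True
        elif command == "anomaly" and not seen_pattern:
            return False
    return True
```

With this check, `pattern @message | anomaly` passes and the quoted `anomaly @message` fails.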

@krokoko
Contributor

krokoko commented Mar 9, 2026

  • need new entry in CODEOWNERS file
  • not a blocker for now, but should we gather shared content in separate files and cross-reference it (to avoid duplicates)? For example, log query patterns overlap across 5 files, and alerting configuration is duplicated between alerting-setup.md and performance-monitoring.md

Copilot AI review requested due to automatic review settings March 11, 2026 18:49
Copilot AI left a comment

Copilot was unable to review this pull request because the user who requested the review is ineligible. To be eligible to request a review, you need a paid Copilot license, or your organization must enable Copilot code review.

@MichaelWalker-git MichaelWalker-git left a comment

LGTM

Copilot AI review requested due to automatic review settings March 19, 2026 00:04
Copilot AI left a comment

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 14 comments.



Comment on lines +143 to +146
- Full SQL support with JOINs, aggregations, and window functions
- 7-year retention by default
- Cross-account and cross-region queries
- Cost-effective for large-scale analysis
Comment on lines +1 to +7
# Prerequisites and Configuration

## Requirements

1. **AWS CLI configured** with credentials (`aws configure` or `~/.aws/credentials`)
2. **Python 3.10+** and `uv` installed
3. **Application Signals enabled** in your AWS account when applicable
Comment on lines +1 to +5
# Application Signals Setup and Enablement Guide

This reference provides comprehensive guidance for setting up AWS Application Signals using the plugin's enablement guide feature.

## Quick Start: Get Enablement Guide

- Use `list_event_data_stores` to check for enabled event data stores
- If available, use `lake_query` for SQL-based analysis
- Best for complex queries, long-term retention (7 years), and cost efficiency
Comment on lines +25 to +33
## Required IAM Permissions (read-only, least-privilege)

- **CloudWatch Metrics & Alarms**: `cloudwatch:GetMetricData`, `cloudwatch:GetMetricStatistics`, `cloudwatch:ListMetrics`, `cloudwatch:DescribeAlarms`, `cloudwatch:DescribeAlarmsForMetric`, `cloudwatch:DescribeAlarmHistory`, `cloudwatch:DescribeAnomalyDetectors`
- **CloudWatch Logs**: `logs:DescribeLogGroups`, `logs:DescribeLogStreams`, `logs:GetLogEvents`, `logs:FilterLogEvents`, `logs:StartQuery`, `logs:StopQuery`, `logs:GetQueryResults`, `logs:DescribeQueries`
- **X-Ray**: `xray:BatchGetTraces`, `xray:GetTraceSummaries`, `xray:GetTraceGraph`, `xray:GetServiceGraph`, `xray:GetTimeSeriesServiceStatistics`
- **CloudTrail**: `cloudtrail:LookupEvents`, `cloudtrail:DescribeTrails`, `cloudtrail:GetTrail`, `cloudtrail:ListTrails`, `cloudtrail:GetEventSelectors`
- **Application Signals**: `application-signals:GetService`, `application-signals:ListServices`, `application-signals:ListServiceOperations`, `application-signals:GetServiceLevelObjective`, `application-signals:ListServiceLevelObjectives`, `application-signals:BatchGetServiceLevelObjectiveBudgetReport`
- **Billing & Cost Management**: `ce:*`, `cost-optimization-hub:*`, `compute-optimizer:*`, `budgets:ViewBudget`, `pricing:*`, `freetier:GetFreeTierUsage`, `bcm-pricing-calculator:*`, `billingconductor:*`
- `synthetics:GetCanary`, `synthetics:GetCanaryRuns` for canary analysis
Comment on lines +5 to +6
This reference provides guidance for accessing and analyzing CloudTrail audit data for security auditing, compliance monitoring, and governance analysis.

Comment on lines +1 to +5
# CloudWatch Logs Insights Analysis

## Purpose

This reference provides guidance for using CloudWatch Logs Insights QL syntax for log analysis, troubleshooting, and data extraction via the CloudWatch MCP server.
… IAM permissions

Split 6 oversized reference files (530-804 lines each) into focused
sub-references under ~100 lines per the design guidelines, improving
progressive disclosure and reducing agent context pressure.

Changes:
- Split security-auditing.md → security-investigations.md, security-monitoring.md
- Split performance-monitoring.md → performance-traces.md, performance-slos.md
- Split incident-response.md → incident-patterns.md, incident-postmortem.md
- Split alerting-setup.md → alerting-advanced.md
- Split observability-gap-analysis.md → observability-language-patterns.md
- Trim cloudtrail-data-source-selection.md from 346 to 92 lines
- Fix prerequisites.md: replace billing wildcards (ce:*, billingconductor:*)
  with specific read-only actions, add CloudTrail Lake query permissions
- Fix SKILL.md: replace direct cloudtrail-data-source-selection.md link with
  prerequisites.md link (fixes orphaned file + utility-not-directly-loaded)
- Fix CloudTrail Lake retention wording: "configurable retention" not "7-year default"
- Verify anomaly syntax: pattern @message | anomaly (already correct)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ix tool names

- Rename plugin from aws-observability to observability-on-aws
  (directories, plugin.json, marketplace.json, SKILL.md frontmatter)
- Fix SKILL.md: replace incorrect tool names (cost-explorer, compute-optimizer)
  with actual MCP server tool names (get_cost_and_usage, get_cost_forecast,
  list_recommendations)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 19, 2026 19:21
Copilot AI left a comment

Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated 3 comments.

Comment on lines +5 to +6
"description": "Comprehensive AWS observability platform combining CloudWatch Logs, Metrics, Alarms, Application Signals (APM), CloudTrail security auditing, and automated codebase observability gap analysis for complete monitoring, troubleshooting, and optimization.",
"homepage": "https://github.com/awslabs/agent-plugins",
Copilot AI Mar 19, 2026

The plugin manifest description omits Billing/Cost Management, but this plugin config includes the Billing & Cost Management MCP server and SKILL.md advertises cost workflows. Consider updating the manifest description (and possibly keywords) to reflect the full capability set so discovery in the marketplace matches what the plugin actually provides.

Comment on lines +62 to +77
"category": "observability",
"description": "Comprehensive AWS observability platform combining CloudWatch Logs, Metrics, Alarms, Application Signals (APM), CloudTrail security auditing, and automated codebase observability gap analysis.",
"keywords": [
"aws",
"observability",
"cloudwatch",
"monitoring",
"logs",
"metrics",
"alarms",
"application-signals",
"apm",
"cloudtrail",
"security",
"tracing"
],
Copilot AI Mar 19, 2026

This marketplace entry description omits Billing/Cost Management, but the plugin ships with the billing-cost-management MCP server and the skill docs describe cost analysis workflows. Updating the marketplace description/keywords would make the listing accurately reflect the plugin's capabilities.

theagenticguy and others added 2 commits March 19, 2026 19:28
… correct

The billing MCP server exposes single dispatcher tools (cost-explorer,
compute-optimizer) that route via an operation parameter, not individual
tools per API call. Verified against actual source code at
awslabs/mcp/src/billing-cost-management-mcp-server/tools/.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
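
The dispatcher shape described in this commit (one tool that routes on an `operation` parameter) can be sketched as follows. This is an illustration only, not the server's source: the handler functions, the operation names, and the echoed return values are all hypothetical placeholders.

```python
from typing import Any, Callable, Dict


# Hypothetical handlers: they echo their input instead of calling AWS.
def _get_cost_and_usage(params: Dict[str, Any]) -> Dict[str, Any]:
    return {"operation": "getCostAndUsage", "params": params}


def _get_cost_forecast(params: Dict[str, Any]) -> Dict[str, Any]:
    return {"operation": "getCostForecast", "params": params}


# Routing table from operation name to handler (operation names are assumed).
_OPERATIONS: Dict[str, Callable[[Dict[str, Any]], Dict[str, Any]]] = {
    "getCostAndUsage": _get_cost_and_usage,
    "getCostForecast": _get_cost_forecast,
}


def cost_explorer(operation: str, **params: Any) -> Dict[str, Any]:
    """Single tool entry point that dispatches on `operation`."""
    handler = _OPERATIONS.get(operation)
    if handler is None:
        raise ValueError(f"unsupported operation: {operation!r}")
    return handler(params)
```

The practical consequence for SKILL.md is that the agent is told to call one tool name and pass the operation as an argument, rather than look for a separate tool per API call.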
Copilot AI review requested due to automatic review settings March 19, 2026 19:30
…words

Update plugin.json and marketplace.json descriptions to mention
Billing & Cost Management and FinOps. Add keywords: billing,
cost-management, finops, incident-response. Add finops tag.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@theagenticguy theagenticguy changed the title from "feat(aws-observability): Add AWS Observability plugin" to "feat(observability-on-aws): Add AWS Observability & FinOps plugin" Mar 19, 2026
Copilot AI left a comment

Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated 3 comments.

Comment on lines +5 to +19
"description": "Comprehensive AWS observability and FinOps platform combining CloudWatch Logs, Metrics, Alarms, Application Signals (APM), CloudTrail security auditing, Billing & Cost Management, and automated codebase observability gap analysis for monitoring, troubleshooting, cost optimization, and incident response.",
"homepage": "https://github.com/awslabs/agent-plugins",
"keywords": [
"aws",
"observability",
"cloudwatch",
"monitoring",
"logs",
"metrics",
"alarms",
"application-signals",
"apm",
"cloudtrail",
"security",
"tracing",
Copilot AI Mar 19, 2026

The plugin manifest description/keywords don’t mention Billing & Cost Management, but the skill and .mcp.json include the billing-cost-management MCP server. Consider updating the manifest metadata to reflect that capability for consistency and discoverability.


## Critical Alerting Patterns

Use the Root Account and Security Group queries above as CloudWatch Alarms. Additional critical alerts:
Copilot AI Mar 19, 2026

CloudWatch Alarms can’t be created directly from Logs Insights queries; they alarm on metrics. This section should clarify the needed translation step (e.g., metric filters on the CloudTrail log group, or scheduled Logs Insights queries that publish metrics) before creating alarms.

Suggested change
Use the Root Account and Security Group queries above as CloudWatch Alarms. Additional critical alerts:
Use the Root Account and Security Group queries above as the basis for CloudWatch Alarms by first turning these patterns into metrics (for example, with CloudWatch metric filters on the CloudTrail log group or scheduled Logs Insights queries that publish metrics), then creating alarms on those metrics. Additional critical alerts (as queries you can similarly translate into metrics and alarms):
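
As one possible shape of that translation step, the sketch below builds the keyword arguments for the CloudWatch Logs `put_metric_filter` API for root account usage. The filter name, metric name, and namespace are assumptions chosen for illustration, and the filter pattern is the commonly documented root-usage pattern rather than something taken from this PR. Because the plugin is read-only, the function only builds parameters; creating the filter and the alarm would belong to a separate provisioning workflow.

```python
def root_usage_metric_filter(log_group_name):
    """Build keyword arguments for CloudWatch Logs `put_metric_filter`.

    Hypothetical helper: it does not call AWS, it only returns the parameters.
    """
    return {
        "logGroupName": log_group_name,
        "filterName": "RootAccountUsage",  # assumed name
        "filterPattern": (
            '{ $.userIdentity.type = "Root" '
            "&& $.userIdentity.invokedBy NOT EXISTS "
            '&& $.eventType != "AwsServiceEvent" }'
        ),
        "metricTransformations": [
            {
                "metricName": "RootAccountUsageCount",     # assumed name
                "metricNamespace": "Security/CloudTrail",  # assumed namespace
                "metricValue": "1",  # emit 1 per matching log event
            }
        ],
    }
```

An alarm on the resulting metric (for example, Sum >= 1 over a 5-minute period) would then provide the alert that the Logs Insights query alone cannot.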

# AWS Observability

Requires AWS CLI credentials. All stdio MCP servers use `AWS_PROFILE` and `AWS_REGION` from their env config (defaults: `default` profile, `us-east-1`).

Copilot AI Mar 19, 2026

The PR description/RFC scope says this plugin is intended to be read-only (no AWS resource provisioning/modification), but SKILL.md doesn’t state that explicitly and the incident-response guidance includes mitigation actions that could be interpreted as “do this now”. Add an explicit note here that the plugin should only query/inspect and provide recommendations unless the user explicitly asks to make changes (and ideally direct provisioning changes to an appropriate workflow/plugin).

Suggested change
Note: This plugin is read-only. It should only query and inspect AWS resources and provide recommendations. It must not provision, modify, or delete AWS resources unless the user explicitly asks for a change, and such changes should preferably be executed via a dedicated deployment or provisioning workflow/plugin.
