Merged
37 changes: 37 additions & 0 deletions .github/prompts/aks-check-nodes.prompt.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,37 @@
---
model: Claude Sonnet 4
description: 'This prompt is used to check the health status of nodes in an Azure Kubernetes Service (AKS) cluster.'
---

# Check for AKS Nodes Health Issues

Check the health status of all nodes in an Azure Kubernetes Service (AKS) cluster and identify any nodes that are not in a 'Ready' state. Provide a summary of the issues found and suggest possible remediation steps.

### Run these Commands

```bash
kubectl get nodes
kubectl describe node <node-name>
kubectl top nodes
kubectl cluster-info
```


### Output
Output a report in a readable format (e.g., plain text, JSON) that includes:
- Cluster Name
- Node Name
- Node Status
- Issues Found (if any)
- Suggested Remediation Steps

### Remediation Suggestions
For nodes that are not in the 'Ready' state, suggest possible remediation steps such as:
- Checking for resource constraints (CPU, memory)
- Reviewing node logs for errors
- Scaling the cluster if resource limits are being hit
- Contacting Azure support if the issue persists

### Note
Ensure that you have the necessary permissions to access the AKS clusters and perform the required operations.
Do not generate any scripts.
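
A minimal sketch of the check this prompt describes, assuming the node list has been fetched with `kubectl get nodes -o json` and parsed into an object. The helper name `findNotReadyNodes` and the sample node names are hypothetical, not part of this PR; the sketch only illustrates that a node counts as healthy when its `Ready` condition has status `True`.

```javascript
// Hypothetical helper: `nodeList` is the parsed output of `kubectl get nodes -o json`.
function findNotReadyNodes(nodeList) {
  const items = (nodeList && nodeList.items) || [];
  return items
    .map((node) => {
      const conditions = (node.status && node.status.conditions) || [];
      const ready = conditions.find((c) => c.type === 'Ready');
      return {
        name: node.metadata.name,
        // A node with no Ready condition at all is reported as Unknown.
        ready: ready ? ready.status : 'Unknown',
        reason: ready ? ready.reason || '' : 'NoReadyCondition',
        message: ready ? ready.message || '' : '',
      };
    })
    .filter((n) => n.ready !== 'True');
}

// Trimmed, made-up sample data for illustration only.
const sampleNodes = {
  items: [
    {
      metadata: { name: 'aks-nodepool1-000000' },
      status: { conditions: [{ type: 'Ready', status: 'True', reason: 'KubeletReady' }] },
    },
    {
      metadata: { name: 'aks-nodepool1-000001' },
      status: {
        conditions: [
          { type: 'MemoryPressure', status: 'True', reason: 'KubeletHasInsufficientMemory' },
          { type: 'Ready', status: 'False', reason: 'KubeletNotReady', message: 'container runtime is down' },
        ],
      },
    },
  ],
};

console.log(JSON.stringify(findNotReadyNodes(sampleNodes), null, 2));
```

The other conditions on a flagged node (for example `MemoryPressure` or `DiskPressure`) are a natural next place to look when filling in the "Issues Found" field of the report.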
35 changes: 35 additions & 0 deletions .github/prompts/aks-check-pods.prompt.md
@@ -0,0 +1,35 @@
---
model: Claude Sonnet 4.5
description: 'This prompt is used to check the health status of pods in an Azure Kubernetes Service (AKS) cluster.'
---

# Check for Pod Health Issues

Check the health status of all pods in an Azure Kubernetes Service (AKS) cluster and identify any pods that are not in a 'Running' state. Provide a summary of the issues found and suggest possible remediation steps.

### Run these Commands

```bash
kubectl get pods -n <namespace>
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace>
```

### Output
Output a report in a readable format (e.g., plain text, JSON) that includes:
- Cluster Name
- Pod Name
- Pod Status
- Issues Found (if any)
- Suggested Remediation Steps

### Remediation Suggestions
For pods that are not in the 'Running' state, suggest possible remediation steps such as:
- Checking for resource constraints (CPU, memory)
- Reviewing pod logs for errors
- Scaling the cluster if resource limits are being hit
- Redeploying the pod if it is in a crash loop

### Note
Do not generate any scripts.
Do not directly fix the issues; only provide analysis and suggestions.
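
A minimal sketch of the pod check, assuming the pod list comes from `kubectl get pods -n <namespace> -o json`. The helper name `findProblemPods` and the sample pods are hypothetical. It flags pods whose phase is not `Running`, and also pods with a waiting container, because a pod in a crash loop can still report the `Running` phase while `kubectl get pods` shows `CrashLoopBackOff`.

```javascript
// Hypothetical helper: `podList` is the parsed output of `kubectl get pods -o json`.
function findProblemPods(podList) {
  const items = (podList && podList.items) || [];
  const problems = [];
  for (const pod of items) {
    const status = pod.status || {};
    const phase = status.phase || 'Unknown';
    // Collect containers stuck in a waiting state (e.g. CrashLoopBackOff, ImagePullBackOff).
    const waitingReasons = (status.containerStatuses || [])
      .filter((cs) => cs.state && cs.state.waiting)
      .map((cs) => `${cs.name}: ${cs.state.waiting.reason || 'Waiting'}`);
    if (phase !== 'Running' || waitingReasons.length > 0) {
      problems.push({
        name: pod.metadata.name,
        namespace: pod.metadata.namespace,
        phase,
        waitingReasons,
      });
    }
  }
  return problems;
}

// Trimmed, made-up sample data for illustration only.
const samplePods = {
  items: [
    {
      metadata: { name: 'web-1', namespace: 'demo' },
      status: { phase: 'Running', containerStatuses: [{ name: 'web', state: { running: {} } }] },
    },
    {
      metadata: { name: 'api-1', namespace: 'demo' },
      status: {
        phase: 'Running',
        containerStatuses: [{ name: 'api', state: { waiting: { reason: 'CrashLoopBackOff' } } }],
      },
    },
    {
      metadata: { name: 'worker-1', namespace: 'demo' },
      status: {
        phase: 'Pending',
        containerStatuses: [{ name: 'worker', state: { waiting: { reason: 'ImagePullBackOff' } } }],
      },
    },
  ],
};

console.log(JSON.stringify(findProblemPods(samplePods), null, 2));
```

Pods in the `Succeeded` phase (for example completed Jobs) are also flagged by this sketch, since they are not `Running`; a caller may want to exclude them.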
15 changes: 15 additions & 0 deletions .github/prompts/aks-remediation.prompt.md
@@ -0,0 +1,15 @@
---
model: Claude Sonnet 4.5
description: 'This prompt is used to provide remediation suggestions for pods in an Azure Kubernetes Service (AKS) cluster.'
---

# AKS Remediation for cluster issues

Provide remediation based on analysis and suggestions from the previous steps.

### Proposed Remediation Steps
Be specific in your remediation suggestions, including commands to run, configuration changes to make, or resources to consult. Tailor the suggestions based on the identified issues.

# Notes
- Do not generate any scripts.
- Always ask for confirmation before applying any remediation steps.
125 changes: 55 additions & 70 deletions .github/workflows/argocd-deployment-failure.yml
@@ -13,102 +13,87 @@ jobs:
runs-on: ubuntu-latest

steps:
- name: Verify webhook signature
id: verify
env:
PAYLOAD: ${{ toJson(github.event.client_payload) }}
WEBHOOK_SECRET: ${{ secrets.ARGOCD_WEBHOOK_SECRET }}
run: |
# This is a placeholder - GitHub repository_dispatch doesn't include signatures
# The security comes from the GitHub token scope limitation
echo "Webhook received from ArgoCD"
echo "App: ${{ github.event.client_payload.app_name }}"
echo "Status: ${{ github.event.client_payload.operation_phase }}"

- name: Extract deployment info
id: deployment_info
run: |
APP_NAME="${{ github.event.client_payload.app_name }}"
HEALTH_STATUS="${{ github.event.client_payload.health_status }}"
SYNC_STATUS="${{ github.event.client_payload.sync_status }}"
REVISION="${{ github.event.client_payload.revision }}"
MESSAGE="${{ github.event.client_payload.message }}"
REPO_URL="${{ github.event.client_payload.repo_url }}"
TIMESTAMP="${{ github.event.client_payload.timestamp }}"

echo "app_name=${APP_NAME}" >> $GITHUB_OUTPUT
echo "health_status=${HEALTH_STATUS}" >> $GITHUB_OUTPUT
echo "sync_status=${SYNC_STATUS}" >> $GITHUB_OUTPUT
echo "revision=${REVISION}" >> $GITHUB_OUTPUT

- name: Create GitHub Issue
uses: actions/github-script@v7
with:
script: |
const payload = context.payload.client_payload || {};
const appName = payload.app_name || 'unknown';
const clusterName = payload.cluster || 'in-cluster';
const namespace = payload.namespace || 'default';
const healthStatus = payload.health_status || 'unknown';
const syncStatus = payload.sync_status || 'unknown';
const message = payload.message || 'No error message available';
const revision = payload.revision || 'unknown';
const repoUrl = payload.repo_url || '';
const timestamp = payload.timestamp || new Date().toISOString();
const resources = payload.resources || [];

// Build degraded resources section
let degradedDetails = '';
const degradedResources = resources.filter(r =>
r.health && (r.health.status === 'Degraded' || r.health.status === 'Missing' || r.health.status === 'Unknown')
);

if (degradedResources.length > 0) {
degradedDetails = '\n### 🔴 Degraded Resources\n\n';

for (const resource of degradedResources) {
const kind = resource.kind || 'Unknown';
const name = resource.name || 'unknown';
const resourceNamespace = resource.namespace || namespace;
const healthStatus = resource.health?.status || 'Unknown';
const healthMessage = resource.health?.message || 'No message';
const syncStatus = resource.status || 'Unknown';

degradedDetails += `#### ${kind}: \`${name}\`\n\n`;
degradedDetails += `- **Namespace:** ${resourceNamespace}\n`;
degradedDetails += `- **Health Status:** ${healthStatus}\n`;
degradedDetails += `- **Sync Status:** ${syncStatus}\n`;
degradedDetails += `- **Message:** ${healthMessage}\n\n`;

// Add kubectl command for this specific resource
degradedDetails += `**Troubleshoot:**\n\`\`\`bash\n`;
degradedDetails += `kubectl describe ${kind.toLowerCase()} ${name} -n ${resourceNamespace}\n`;
if (kind === 'Pod' || kind === 'Deployment' || kind === 'StatefulSet' || kind === 'DaemonSet') {
degradedDetails += `kubectl logs ${kind.toLowerCase()}/${name} -n ${resourceNamespace}\n`;
}
degradedDetails += `\`\`\`\n\n`;
}
}
const appName = '${{ github.event.client_payload.app_name }}';
const healthStatus = '${{ github.event.client_payload.health_status }}';
const syncStatus = '${{ github.event.client_payload.sync_status }}';
const operationPhase = '${{ github.event.client_payload.operation_phase }}';
const message = '${{ github.event.client_payload.message }}';
const revision = '${{ github.event.client_payload.revision }}';
const repoUrl = '${{ github.event.client_payload.repo_url }}';
const timestamp = '${{ github.event.client_payload.timestamp }}';
const clusterName = '${{ github.event.client_payload.cluster_name }}';
const clusterServer = '${{ github.event.client_payload.cluster_server }}';
const destNamespace = '${{ github.event.client_payload.destination_namespace }}';

const issueTitle = `🚨 ArgoCD Deployment Failed: ${appName}`;

const issueBody = `## ArgoCD Deployment Failure

**Application:** \`${appName}\`
**Status:** ${operationPhase}
**Timestamp:** ${timestamp}

### Cluster Information

| Field | Value |
|-------|-------|
| Cluster Name | \`${clusterName}\` |
| Namespace | \`${namespace}\` |

### Application Status
### Details

| Field | Value |
|-------|-------|
| Cluster | \`${clusterName || clusterServer}\` |
| Namespace | \`${destNamespace}\` |
| Health Status | \`${healthStatus}\` |
| Sync Status | \`${syncStatus}\` |
| Revision | \`${revision}\` |
| Repository | ${repoUrl} |

### Raw payload
\`\`\`json
${JSON.stringify(github.event.client_payload, null, 2)}
\`\`\`

### Error Message

\`\`\`
${message}
${message || 'No error message available'}
\`\`\`
${degradedDetails}
### Troubleshooting Commands

\`\`\`bash
# Check application status in ArgoCD
argocd app get ${appName}

# Check pods in namespace
kubectl get pods -n ${namespace}
### Recommended Actions

# Describe failed pods
kubectl describe pods -n ${namespace}

# Get pod logs
kubectl logs -n ${namespace} <pod-name>

# Check events
kubectl get events -n ${namespace} --sort-by='.lastTimestamp'
\`\`\`
1. Check the ArgoCD UI for detailed error logs
2. Review the application manifest for syntax errors
3. Verify resource quotas and limits
4. Check for image pull errors or missing secrets
5. Review recent commits to the source repository

### Quick Links

35 changes: 35 additions & 0 deletions Act-3/README.md
@@ -0,0 +1,35 @@
# Act-3: Kubernetes Operations Don’t Scale Linearly

Problem:
Kubernetes becomes the operational choke point, and your team is having a hard time dealing with misconfigurations, failed deployments and runtime issues.
Your team, platform engineering, is busy firefighting instead of improving the platform. The deep Kubernetes expertise on your team doesn't scale across teams.

Answer:
Let agents give your team a hand, turning siloed operational knowledge into a shared capability.

## Crawl

A senior member of the team (Steve) has created reusable prompts that can be run on demand when someone needs to troubleshoot a container workload on an AKS cluster. Steve made these available in the repo, and they can be used in GitHub Copilot in VS Code via "Slash Commands" if you follow the folder/naming convention set out by GitHub/VS Code (i.e. `<repo-root>/.github/prompts/<prompt-name>.prompt.md`).
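
As a rough illustration of that convention, the sketch below maps a prompt file path to the slash command it would typically expose. The helper name `promptCommandName` is hypothetical, and the assumption that the command is simply the file name without the `.prompt.md` suffix should be checked against the VS Code documentation for your version.

```javascript
// Hypothetical helper: derive the slash command from a prompt file path.
// Assumes the command name is the file name minus the `.prompt.md` suffix.
function promptCommandName(filePath) {
  // Normalize Windows-style separators so the pattern below works everywhere.
  const normalized = filePath.replace(/\\/g, '/');
  // Only files directly inside `.github/prompts/` are considered.
  const match = normalized.match(/(?:^|\/)\.github\/prompts\/([^/]+)\.prompt\.md$/);
  return match ? `/${match[1]}` : null;
}

console.log(promptCommandName('.github/prompts/aks-check-nodes.prompt.md'));
console.log(promptCommandName('.github/prompts/aks-check-pods.prompt.md'));
console.log(promptCommandName('docs/notes.md'));
```

A file outside `.github/prompts/`, or one without the `.prompt.md` suffix, yields `null` in this sketch.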

Execute this prompt locally:

![write-prompt](images/write-prompt.png)

## Walk/Run

Create a GitHub Actions workflow that runs on each push to the repo. In this example it runs only for the main branch, but you can set up your own triggers/rules for when the workflow runs. See the docs about [Events That Trigger Workflows](https://docs.github.com/en/actions/reference/workflows-and-actions/events-that-trigger-workflows).

> [!NOTE]
> We will use the GitHub Copilot CLI to automate the execution of our custom prompt in a scripted CI Runner - GitHub Actions.

We have an example of this in [Act-2 .github/workflows](../.github/workflows/copilot.generate-docs.yml).

### What does this do?

- The GitHub Actions workflow triggers on each push to the main branch, so documentation is created when needed, whether or not anyone remembered to run the `/write-docs` prompt manually before committing their changes. It can also be run manually in GitHub Actions because the `workflow_dispatch` trigger is enabled; this is optional, but it is included here as an example.
- It installs the GitHub Copilot CLI
- It ensures that we provide it credentials to call GitHub Copilot
> [!NOTE]
> Currently, calling GitHub Copilot is a user-only ability: GitHub Copilot is licensed to, and therefore only callable by, a human user account. In this example we have stored a fine-grained GitHub Personal Access Token (PAT, a user-bound API key) that has been scoped with the `Copilot-Requests: Read-only` permission. As a result, this consumes GitHub Copilot PRUs (Premium Request Units) from the associated user account. Today this is the only billing model for consuming GitHub Copilot.
- Store the required prompt file contents as an environment variable
- Pass in the prompt and call GitHub Copilot CLI to generate docs
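
For the step that stores the prompt file contents, a workflow may want to separate the YAML front matter (`model`, `description`) from the prompt body before passing the text on. The sketch below shows one way to do that, assuming the simple `key: value` front matter used by the prompt files in this repo. The helper name `splitPromptFile` is hypothetical, and whether the front matter needs to be stripped at all before calling the CLI is an assumption to verify.

```javascript
// Hypothetical helper: split a `.prompt.md` file into front matter and body.
// Handles only flat `key: value` front matter, not full YAML.
function splitPromptFile(text) {
  const lines = text.split(/\r?\n/);
  if (lines[0].trim() !== '---') {
    return { frontMatter: {}, body: text.trim() };
  }
  // Find the closing `---` delimiter after the opening one.
  const end = lines.findIndex((line, i) => i > 0 && line.trim() === '---');
  if (end === -1) {
    return { frontMatter: {}, body: text.trim() };
  }
  const frontMatter = {};
  for (const line of lines.slice(1, end)) {
    const sep = line.indexOf(':');
    if (sep === -1) continue;
    const key = line.slice(0, sep).trim();
    // Drop surrounding single quotes, as used in the `description` fields.
    const value = line.slice(sep + 1).trim().replace(/^'(.*)'$/, '$1');
    frontMatter[key] = value;
  }
  return { frontMatter, body: lines.slice(end + 1).join('\n').trim() };
}

// Made-up sample in the same shape as the prompt files in `.github/prompts/`.
const samplePrompt = [
  '---',
  'model: Claude Sonnet 4.5',
  "description: 'Check pod health.'",
  '---',
  '',
  '# Check for Pod Health Issues',
  'Do not generate any scripts.',
].join('\n');

const parsed = splitPromptFile(samplePrompt);
console.log(parsed.frontMatter);
console.log(parsed.body);
```

The `body` string is what a workflow step would place in an environment variable; the `frontMatter.model` value could be used to select a model if the calling tool supports that.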