Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
18 changes: 18 additions & 0 deletions skills/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -56,6 +56,24 @@ Some skills are distilled from the legacy `apecloud/kubeblocks-addon-skills` rep

These skills were adapted to include standard `name` / `description` frontmatter. Their old `docs/...` and `scripts/...` references are bundled under `skills/kubeblocks-addon-source-docs/`; when installed as user skills, copy that support directory beside the individual skill directories.

## Methodology companion skills

Some skills are distilled directly from the methodology guides in `docs/`. These are used when the guide contains a repeatable decision tree, preflight, review checklist, or closeout protocol:

- `bounded-eventual-convergence` — companion to `docs/addon-bounded-eventual-convergence-guide.md`
- `evidence-discipline` — companion to `docs/addon-evidence-discipline-guide.md`
- `first-blocker-classification` — companion to `docs/addon-test-acceptance-and-first-blocker-guide.md`
- `github-submission-discipline` — companion to `docs/addon-github-submission-discipline-guide.md`
- `paramdef-range-validation` — companion to `docs/addon-paramdef-cue-range-validation-guide.md`
- `plain-language-with-westonnnn` — communication rule for westonnnn-facing messages
- `probe-soft-failure` — companion to `docs/addon-probe-timeout-and-soft-failure-guide.md`
- `runner-portability` — companion to `docs/addon-test-runner-portability-guide.md`
- `soak-test-classification` — companion to `docs/addon-soak-test-result-classification-guide.md`
- `team-skill-rollout` — companion to `docs/addon-agent-skill-rollout-guide.md`
- `xp-design-contract-review` — companion to `docs/addon-design-contract-review-during-xp-guide.md`

Treat this list as the first conversion batch. Future conversions should keep the same rule: only turn a document into a skill when it changes an agent's behavior at the moment of coding, testing, reviewing, debugging, or reporting.

## Distribution

Each addon team member loads skills via one of:
Expand Down
105 changes: 105 additions & 0 deletions skills/github-submission-discipline/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,105 @@
---
name: github-submission-discipline
description: Use before committing, amending, pushing, force-pushing, or merging public GitHub PRs from AI agents. Prevents AI attribution leakage and multi-agent branch overwrite incidents by enforcing clean commit bodies, trailer grep checks, fetch/log comparison, and single-owner recovery.
allowed-tools: Bash(git *) Bash(gh *) Bash(rg *) Bash(grep *)
---

# GitHub Submission Discipline

## Hard Rules

1. **No AI attribution in public commits or PR bodies.** Remove `Co-Authored-By`, generated-by banners, model names, and vendor noreply addresses.
2. **Check every commit body, not only the latest.** Squash and amend can keep old body text.
3. **Fetch before force-push.** `--force-with-lease` protects only against the last fetched remote tip, not a teammate push you never fetched.
4. **Compare local and remote commit chains before pushing shared PR branches.**
5. **One recovery owner at a time.** When a branch was overwritten, stop parallel fixes and let one person rebuild the chain.
6. **Dropped commit owner verifies content.** The owner of the dropped work checks the recovered diff, not only the person who rebuilt the branch.

## When To Invoke

Use this skill when:

- Creating a commit for a public GitHub repo.
- Amending or squashing a PR.
- Pushing or force-pushing a branch other agents may touch.
- Reviewing a PR for merge readiness.
- Recovering from a missing commit, unexpected rebase, or force-push race.

## Pre-Push Checks

Run before any push:

```bash
git fetch origin
git log --oneline --decorate --graph --max-count=12 HEAD
git log --oneline --decorate --graph --max-count=12 origin/$(git branch --show-current) 2>/dev/null || true

git log origin/main..HEAD --format='%H' | while read c; do
git show -s --format='%B' "$c" | grep -Ei 'Co-Authored-By|Generated with|Anthropic|Claude|OpenAI|🤖|noreply@anthropic' && {
echo "AI attribution found in $c" >&2
exit 1
}
done
```

Also inspect the PR body before requesting review:

```bash
gh pr view --json body --jq .body | grep -Ei 'Co-Authored-By|Generated with|Anthropic|Claude|OpenAI|🤖|noreply@anthropic' && exit 1
```

## Commit Message Pattern

Use a normal human commit:

```text
docs: add probe soft-failure skill

Summarize what changed and why.
```

Do not add model signatures, generated-by sections, or AI co-author trailers.

## Shared Branch Push Protocol

Before force-pushing:

1. `git fetch origin`
2. Compare `HEAD` vs remote branch.
3. Confirm the commits you expect are present.
4. Use `git push --force-with-lease`.
5. Immediately re-read remote tip:
```bash
git ls-remote origin "$(git branch --show-current)"
```

If the remote contains work you did not expect, stop and coordinate. Do not overwrite.

## Recovery Protocol

If a commit disappeared:

1. Freeze branch writes.
2. Identify remote tip before and after the bad push.
3. Reconstruct the intended chain in one local checkout.
4. Ask each dropped-commit owner to verify their diff is present.
5. Push once with `--force-with-lease`.
6. Run the attribution grep again across the final chain.

## Closeout Language

Good:

```text
Submission check passed: PR body clean; commits origin/main..HEAD clean for AI attribution; local chain matches remote tip after fetch; pushed with force-with-lease.
```

Bad:

```text
Looks clean.
```

## Related Docs

- `docs/addon-github-submission-discipline-guide.md`
88 changes: 88 additions & 0 deletions skills/paramdef-range-validation/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,88 @@
---
name: paramdef-range-validation
description: Use when adding or reviewing KubeBlocks ParametersDefinition cue/tpl numeric ranges, or when a reconfigure OpsRequest passes validation but the database cannot restart. Forces practical_min/practical_max validation instead of copying vendor hard limits.
allowed-tools: Bash(kubectl *) Bash(helm *) Bash(rg *) Read
---

# Parameter Range Validation

## Hard Rules

1. **Schema guards startup, not documentation theory.** A range in ParametersDefinition must block values that make the database fail to start.
2. **Use practical_min / practical_max, not doc_hard_min / doc_hard_max.** Vendor docs often describe parser limits, not values that can run a real instance.
3. **Every numeric lower bound needs evidence.** If you write `>=N`, be able to say how `N` was tested.
4. **Always set an upper bound.** A lower-only range still allows values that can exhaust PID, memory, file descriptor, or engine resource limits.
5. **Version upgrades reset the evidence.** Do not reuse a 12c / 5.7 / old-major range for a new major engine version without a boundary test.
6. **Separate default, recommendation, and guardrail.** The schema rejects unsafe values; defaults and docs recommend good values.

## When To Invoke

Use this skill when:

- Writing a new `ParametersDefinition` / cue / tpl file.
- Reviewing any `int & >=...`, `int & <=...`, min/max, or enum-like numeric range.
- Debugging reconfigure where validation succeeded but the engine failed to restart.
- Upgrading an addon to a new major engine version or KubeBlocks parameter API version.

## Review Checklist

For each numeric parameter:

1. Identify the source of each bound:
- `doc_hard_min/max`
- observed startup boundary
- resource cap
- arbitrary default copied from examples
2. Require a startup proof for `practical_min`:
- value at `practical_min` starts successfully
- value at `practical_min - 1` is rejected by schema or demonstrably unsafe
3. Require a resource proof for `practical_max`:
- container resources can tolerate the value
- engine hard max is not exceeded
4. Add a negative validation case:
- unsafe low value must fail during ValidatePhase
- unsafe high value must fail during ValidatePhase
5. Confirm runtime readback after a valid reconfigure:
- OpsRequest success alone is not enough
- read the parameter from the running engine

## Incident Triage

If reconfigure is stuck after a numeric change:

1. Check whether the value passed schema validation.
2. Check engine logs for startup failure after writing that value.
3. If the value is below practical startup requirements, classify the bug as **schema guard too weak**.
4. Fix the schema first. Do not paper over it by only changing the test.
5. Add a boundary-negative test so the same value is rejected before engine restart.

## Bad Patterns

| Pattern | Why It Fails | Fix |
|---|---|---|
| Copy vendor minimum directly | Parser accepts it, engine may not start | Use tested practical minimum |
| Lower bound only | Huge values can exhaust resources | Add practical upper bound |
| Default as schema minimum | Rejects values that are safe but not recommended | Keep default separate |
| OpsRequest Succeed as proof | Runtime may not apply or engine may restart later | Perform engine readback |
| Shared range across major versions | Startup requirements change | Split by engine version |

## Closeout Language

Good:

```text
Checked parameter range for `processes`: schema now uses practical_min=100, not vendor hard_min=6. Boundary negative `99` is rejected in ValidatePhase; `100` starts and runtime readback confirms applied value.
```

Bad:

```text
Range looks fine; copied from docs.
```

## Related Docs

- `docs/addon-paramdef-cue-range-validation-guide.md`
- `docs/addon-reconfigure-guide.md`
- `docs/addon-reconfigure-version-skew-guide.md`
- `docs/cases/oracle/oracle-12c-processes-cue-paramdef-range-case.md`
94 changes: 94 additions & 0 deletions skills/probe-soft-failure/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,94 @@
---
name: probe-soft-failure
description: Use when writing or reviewing addon livenessProbe/readinessProbe scripts, or diagnosing unexpected restarts/readiness flaps. Forces channel errors such as client timeout or connection refusal to be treated as transient unless the client succeeds and output proves a bad database state.
allowed-tools: Bash(kubectl *) Bash(rg *) Read
---

# Probe Soft Failure

## Hard Rules

1. **Probe exit code must represent product state, not client transport state.**
2. **Every external client call must be wrapped in a shorter timeout than the probe timeout.**
3. **Client `rc != 0` is transient / unknown, so probe should usually `exit 0`.**
4. **Only `client succeeded + output proves bad state` may `exit 1`.**
5. **Known legal slow windows must short-circuit before the client call.** Use a narrow process guard when startup, DBCA, backup, restore, switchover, or role setup is known to make the control plane slow.
6. **Probe timing must fit the slowest legal operation.** Size `initialDelaySeconds`, `periodSeconds`, `timeoutSeconds`, and `failureThreshold` from real slow windows.
7. **Readiness can be conservative; liveness must be careful.** A false liveness failure kills the container and can create a restart cascade.

## When To Invoke

Use this skill when:

- Adding or reviewing `livenessProbe` / `readinessProbe` in `ComponentDefinition`.
- Writing shell scripts called by K8s probes.
- Seeing restarts increase while engine logs do not show a real crash.
- Seeing readiness flap after switchover, reconfigure, backup/restore, role rebuild, or startup.
- Reviewing PRs that use `mysql`, `redis-cli`, `sqlplus`, `dgmgrl`, `psql`, or similar clients in probes.

## Probe Script Shape

Use this shape unless the engine has a stronger local health API:

```bash
# 1. Known legal slow window guard
if pgrep -f '<init-or-role-change-marker>' >/dev/null 2>&1; then
echo "probe: legal slow window, soft pass"
exit 0
fi

# 2. Client call with explicit timeout
out=$(timeout 25 <engine-client-command> 2>&1)
rc=$?

# 3. Transport/client failure is unknown, not product failure
if [ "$rc" -ne 0 ]; then
echo "probe: client unavailable rc=$rc, soft pass"
exit 0
fi

# 4. Only parsed bad state fails
case "$out" in
*'<known-good-state>'*) exit 0 ;;
*'<known-bad-state>'*) echo "probe: bad state: $out"; exit 1 ;;
*) echo "probe: unknown output, soft pass"; exit 0 ;;
esac
```

## Review Checklist

Before approving a probe:

1. Find every command substitution and pipeline that calls an engine client.
2. Confirm each client call has an explicit timeout.
3. Confirm `rc != 0` does not directly become `exit 1`.
4. Confirm `exit 1` paths are based on parsed output from a successful client call.
5. Confirm legal slow windows are guarded before the client call.
6. Confirm probe timing leaves enough time for the slowest legal operation.
7. Confirm liveness and readiness semantics are not copied blindly from each other.

## Incident Triage

If a pod is repeatedly killed by liveness:

1. Compare restart timestamps with DB startup / role-change / backup / restore activity.
2. Read previous container logs and probe script output.
3. If probe failed on timeout, connection refusal, or transient client error, classify as **channel error promoted to product failure**.
4. Patch the probe script semantics first.
5. Re-run the same lifecycle path and verify restarts stop.

## Bad Patterns

| Pattern | Risk | Fix |
|---|---|---|
| `client || exit 1` | Kills during legal slow window | Treat client failure as unknown |
| `stderr contains ORA/ERR => exit 1` | Transient startup errors become product failure | Require successful readback + parsed state |
| No explicit `timeout` | Probe SIGKILLs client mid-call | Wrap client with `timeout N` |
| Broad `pgrep` guard | Probe always soft-passes | Use narrow marker strings |
| Liveness checks role readiness | Role changes kill otherwise healthy DB | Keep liveness minimal |

## Related Docs

- `docs/addon-probe-timeout-and-soft-failure-guide.md`
- `docs/addon-test-probe-classification-guide.md`
- `docs/cases/oracle/oracle-12c-post-switchover-probe-cascade-kill.md`
Loading