Binary file removed docs/assets/img/chatops/status.png
Binary file not shown.
Binary file removed docs/assets/img/chatops/status_stalk.png
Binary file not shown.
2 changes: 1 addition & 1 deletion docs/before/call_etiquette.md
Original file line number Diff line number Diff line change
@@ -45,7 +45,7 @@ The Incident Commander (IC) is the leader of the incident response process and i
## Problems?

#### There's no incident commander on the call! I don't know what to do!
Ask on the call if an IC is present. If you have no response, type `!ir page` in Slack. This will page the primary and backup IC to the call.
Ask on the call if an IC is present. If you have no response, type `/ir page` in Slack. This will page the primary and backup IC to the call.

#### I can join the call or Slack, but not both, what should I do?
You're welcome to join only one of the channels, however you should not actively participate in the incident response if so as it causes disjointed communication. Liaise with someone who is both in Slack and on the call to provide any input you may have so that they can raise it.
6 changes: 3 additions & 3 deletions docs/before/severity_levels.md
@@ -32,7 +32,7 @@ The first step in any incident response process is to determine what actually [c
<td>
<p class="response">Major incident response.</p>
<ul>
<li>Page an IC in Slack <code>!ir page</code>.</li>
<li>Page an IC in Slack <code>/ir page</code>.</li>
<li>See <a href="/during/during_an_incident">During an Incident</a>.</li>
<li>Notify internal stakeholders.</li>
<li>Public notification.</li>
@@ -54,7 +54,7 @@ The first step in any incident response process is to determine what actually [c
<td>
<p class="response">Major incident response.</p>
<ul>
<li>Page an IC in Slack <code>!ir page</code>.</li>
<li>Page an IC in Slack <code>/ir page</code>.</li>
<li>See <a href="/during/during_an_incident">During an Incident</a>.</li>
</ul>
</tr>
@@ -79,7 +79,7 @@ The first step in any incident response process is to determine what actually [c
<li>If related to recent deployment, rollback.</li>
<li>Monitor status and notice if/when it escalates.</li>
<li>Mention on Slack if you think it has the potential to escalate.</li>
<li>Trigger incident response if necessary (<code>!ir page</code>).</li>
<li>Trigger incident response if necessary (<code>/ir page</code>).</li>
</ul>
</td>
</tr>
2 changes: 1 addition & 1 deletion docs/before/what_is_an_incident.md
@@ -28,7 +28,7 @@ Automatic monitoring is only part of the process. We may have parts of our funct
We trigger on any unplanned disruption or degradation of service to which any Elimu Informatics employee deems necessary of requiring coordinated incident response.

!!! question "Is a response required?"
If you are unsure of whether response is required, trigger our incident response process. All you need to do to start the process is page an Incident Commander in Slack with `!ir page`.
If you are unsure of whether response is required, trigger our incident response process. All you need to do to start the process is page an Incident Commander in Slack with `/ir page`.

## Incident Severity
Our [severity definitions](../before/severity_levels.md) determine how severe we _think_ an incident is based on some predefined guidelines. The intent is to guide responders on the type of response they can provide. For example, the higher the severity, the riskier the decisions you can take to return the system to normal.
9 changes: 2 additions & 7 deletions docs/during/during_an_incident.md
@@ -17,7 +17,7 @@ Information on what to do during a major incident. See our [severity level descr
<td><a href="#">+1 555 BIG FIRE</a> (+1 555 244 3473) / PIN: 123456</td>
</tr>
<tr>
<td colspan="3" class="centered">Need an IC? Do <code>!ir page</code> in Slack</td>
<td colspan="3" class="centered">Need an IC? Do <code>/ir page</code> in Slack</td>
</tr>
<tr>
<td colspan="3"><em>For executive summary updates only, join <a href="#">#executive-summary-updates</a>.</em></td>
@@ -39,7 +39,7 @@ Information on what to do during a major incident. See our [severity level descr

1. Follow instructions from the Incident Commander.
* **Is there no IC on the call?**
* Manually page them via Slack, with `!ir page` in Slack. This will page the primary and backup IC's at the same time.
* Manually page them via Slack, with `/ir page` in Slack. This will page the primary and backup IC's at the same time.
* Never hesitate to page the IC. It's much better to have them and not need them than the other way around.

## Steps for Incident Commander
@@ -85,11 +85,6 @@ You are there to document the key information from the incident in Slack.
1. Update the Slack room with who the IC is, who the Deputy is, and that you're the scribe (if not already done).
* e.g. "IC: Bob Boberson, Deputy: Deputy Deputyson, Scribe: Writer McWriterson"

1. Start our status monitoring bot so that all responders can see the current state without needing to ask.
* OfficerURL can help you to monitor the status on Slack,
* `!status` - Will tell you the current status.
* `!status stalk` - Will continually monitor the status and report it to the room every 30s.

1. You should add notes to Slack when significant actions are taken, or findings are determined. You don't need to wait for the IC to direct this - use your own judgment.
* You should also add `TODO` notes to the Slack room that indicate follow-ups slated for later.

2 changes: 1 addition & 1 deletion docs/during/security_incident_response.md
@@ -8,7 +8,7 @@ description: Checklist of actions for responding to a security incident at Elimu
</div>

!!! warning "Incident Commander Required"
As with all major incidents at Elimu Informatics, security incidents will also involve an Incident Commander who will delegate the tasks to relevant responders. Tasks may be performed in parallel as assigned by the IC. Page one at the earliest possible opportunity `!ir page`.
As with all major incidents at Elimu Informatics, security incidents will also involve an Incident Commander who will delegate the tasks to relevant responders. Tasks may be performed in parallel as assigned by the IC. Page one at the earliest possible opportunity `/ir page`.

!!! question "Not Sure it's a Security Incident?"
Trigger the process anyway. It's better to be safe than sorry. The Incident Commander will make a determination on if response is needed.
38 changes: 12 additions & 26 deletions docs/resources/chatops.md
@@ -1,19 +1,19 @@
---
cover: assets/img/covers/chatops.png
description: Throughout this documentation references are made to various chat commands, all starting with an exclamation (e.g. 'ir page'). We have bots running in our Slack rooms which watch for these commands and execute various actions for us when they're detected. This page gives an overview of the commands we've referenced in this documentation, and what they do behind the scenes.
description: Throughout this documentation references are made to various chat commands. Incident response uses Slack slash commands (e.g. '/ir page'). This page gives an overview of the commands we've referenced in this documentation, and what they do behind the scenes.
---
Throughout this documentation, references are made to various chat commands, all starting with an exclamation (e.g. `!ir page`). We have bots running in our Slack rooms which watch for these commands and execute various actions for us when they're detected. This page gives an overview of the commands we've referenced in this documentation, and what they do behind the scenes.
Throughout this documentation, references are made to various chat commands. Incident response actions use Slack slash commands (e.g. `/ir page`), handled by our incident response bot. This page gives an overview of the commands we've referenced in this documentation, and what they do behind the scenes.

## Incident Response

Our `!ir` commands poll the OpsGenie API behind the scenes for various on-call schedules we specify. It caches the names and contact details for the current on-call users, so that if there's any issue in making API requests, the funtionality isn't impacted.
Our `/ir` commands poll the OpsGenie API behind the scenes for various on-call schedules we specify. It caches the names and contact details for the current on-call users, so that if there's any issue in making API requests, the funtionality isn't impacted.

### `!ir`
### `/ir`
This command lists out the current Incident Commander(s) on-call, their phone numbers, and a message telling users how to page them.

![Incident Commander List](../assets/img/chatops/ir.png)

### `!ir page`
### `/ir page`
This is the command we use to manually trigger our incident response process. It will page the current Incident Commander(s) on-call (the primary, the backup, and any trainees who are shadowing). It will also create a new incident in Jira that will be used for reporting purposes and a new Slack channel for discussions about the incident (aka an incident war room). Links to the Jira incident and new Slack channel will be displayed in the channel where the command was entered.

![Paging Incident Commanders](../assets/img/chatops/ir_page.png)
@@ -22,36 +22,22 @@ If for any reason we are unable to page the Incident Commanders automatically, t

![Testing for Failure](../assets/img/chatops/test_for_failure.png)

### `!ir responders`
This works similarly to the `!ir` command, but it uses all of the team schedules instead of just the Incident Commander schedules. It will list out all the current people who are on-call for each team. This is useful to also know who will likely be joining the incident call momentarily.
### `/ir responders`
This works similarly to the `/ir` command, but it uses all of the team schedules instead of just the Incident Commander schedules. It will list out all the current people who are on-call for each team. This is useful to also know who will likely be joining the incident call momentarily.

![Listing Responders](../assets/img/chatops/ir_responders.png)

### `!ir page responders`
This works similarly to `!ir page`, but it pages all of the team responders instead of the Incident Commander(s). This is rarely used, since generally only the relevant team will get paged. However, sometimes we require an "all hands on deck" response, and need the ability to quickly page all the current on-calls.
### `/ir page responders`
This works similarly to `/ir page`, but it pages all of the team responders instead of the Incident Commander(s). This is rarely used, since generally only the relevant team will get paged. However, sometimes we require an "all hands on deck" response, and need the ability to quickly page all the current on-calls.

![Paging Responders](../assets/img/chatops/ir_page_responders.png)

### `!ir who <user>`
Sometimes we may need to identify a specific individual to bring them onto a call. This command lists out the contact info for a specific user, and a message telling users how to page them.
### `/ir who <target>`
If `<target>` is a team name, this lists that team's current on-call. Otherwise it's treated as a user search — it lists the contact info for the matched person, along with a message telling users how to page them.

![Identifying Users](../assets/img/chatops/ir_who_rich.png)

### `!ir page <user>`
### `/ir page <user>`
This will page a specific person by username.

![Paging a User](../assets/img/chatops/ir_page_rich.png)

## Status

Our `!status` commands look at our internal monitoring systems to determine the current system state, as reported by the systems themselves. This is the status our alerting tooling uses to automatically notify us of issues.

### `!status`
This will tell us the current overview of our system state. It will also alert us if it is unable to check for the status, since that could also be an indication of an issue. Typically though, it will hopefully show a status of `NORMAL`.

![Displaying Status](../assets/img/chatops/status.png)

### `!status stalk`
This does the same as the above, only it polls every 30s until we stop it (with `!status unstalk`). It will only report the status into the chat room if it has changed since the last time it checked. We have this running during an incident so we can easily see if our system is getting worse or recovering without having to manually check our monitoring.

![Stalking Status](../assets/img/chatops/status_stalk.png)
2 changes: 1 addition & 1 deletion docs/training/internal_liaison.md
@@ -28,7 +28,7 @@ The [Steps for Internal Liaison](../during/during_an_incident.md/#steps-for-inte
Here are some examples of phrases and patterns you should use during incident calls.

### Keep Track of Responders
As you listen to the call, you should keep track of the responders to the call as you hear them speak. Make a note on a piece of paper, or use the `!ir responders` to see who they are. The IC may ask you who is on-call for a particular system, and you should know the answer, and be able to page them.
As you listen to the call, you should keep track of the responders to the call as you hear them speak. Make a note on a piece of paper, or use the `/ir responders` to see who they are. The IC may ask you who is on-call for a particular system, and you should know the answer, and be able to page them.

> Do we have a representative from [X] on the call?

12 changes: 0 additions & 12 deletions docs/training/scribe.md
@@ -11,7 +11,6 @@ It's important for the rest of the command staff to be able to focus on the prob

Your job as Scribe is to listen to the call and to watch the incident Slack room, keeping track of context and actions that need to be performed, documenting these in Slack as you go. **You should not be performing any remediations, checking graphs, or investigating logs.** Those tasks will be delegated to the subject matter experts (SME's) by the Incident Commander.


## Prerequisites
Before you can be a Scribe, it is expected that you meet the following criteria. Don't worry if you don't meet them all yet, you can still continue with training!

@@ -44,13 +43,6 @@ The [Steps for Scribe](../during/during_an_incident.md/#steps-for-scribe) provid

Here are some examples of phrases and patterns you should use during incident calls.

### Status Stalking
At the start of any major incident call, you should start our status stalking bot, so that it will post to the room an update automatically.

> !status stalk

This will provide the update and allow the IC to see the status without having to keep asking.

### Note Important Actions
During a call, you will hear lots of discussion happening, you should not be documenting all of this in the chat room. You only want to document things which will be important for the final timeline. It's not always obvious what this might be, and it's usually a matter of judgement. You generally want to note any actions the IC has asked someone to perform, along with the result of any polling decisions.

@@ -69,7 +61,3 @@ The postmortem owner will find these after and raise tasks for them.
When the IC ends the call, you should post a message into Slack to let everyone know the call is over, and that they should continue discussion elsewhere.

> Call is over, thanks everyone. Follow up in Slack.

Don't forget to also stop the status stalking.

> !status unstalk
Binary file added theme_overrides/assets/images/favicon.png