Originally published on Medium.

Emergency procedures used in incident response aim to stabilize the system in a degraded state. When used properly, they speed up incident response and become a foundation for further resiliency improvements in your system. In this post, we’ll also explore how emergency procedures differ from runbooks.

Runbooks

Imagine you get paged in the middle of the night. The situation you encounter looks familiar, and you’re sure you or a colleague has dealt with it before. However, as you were in deep sleep just seconds ago, you simply can’t recall what to do. Wouldn’t it be great if you had something to jog your memory? This is what runbooks are for.

Typically linked in the description of the alert that paged you, runbooks describe routine procedures aimed at restoring regular service of your application. Runbooks list the exact steps to take. It’s helpful to give a runbook a short summary of the executed procedure as its title (e.g. “scale up application”, “retrigger batch job”). Other types of runbooks document migrations, failovers, or DB upgrades to retain know-how within your teams, as this type of work is typically executed rather infrequently. Lastly, there are also runbooks aimed at helping triage unknown failure situations. When followed, they help inspect typical metrics of the system across layers and components in search of the culprit of the observed issues. A good example here is the sequence of commands used in Brendan Gregg’s Linux performance analysis in 60 seconds.

Emergency procedures

Let’s consider a different scenario: when paged, you observe that the system is overloaded and users experience increased latencies and error rates. You can’t scale up the system because your dependencies (or data stores) take too long to scale up. To restore service you need to reduce load by 25% as soon as possible. How will you proceed? Will you disable feature A or B? Will you degrade service for user/country X or Y? Do you even have the means to do so?

Enter emergency procedures. Unlike typical runbooks, the goal of an emergency procedure is to bring the system into a degraded, yet stable state. Such a state needs to be acceptable to users and stakeholders: it trades customer experience for availability.

Structure

Emergency procedures have defined trigger conditions and impact, both from the business and the operational side. It’s important that they’re agreed upon with business owners ahead of time so they don’t require active approval during incident response. Impact can be expressed as a change in customer behavior, a description of how a feature will work once the procedure has been carried out, or a change in the system’s load, for example.

Here’s an example for a food delivery application:

  • title: reduce search radius to 400m
  • trigger: increased latency or error rate for search queries
  • business impact: as all search queries will be limited to a max. 400m radius, customers will see 10–20% less search results, leading to a drop in conversion rate
  • operational impact: load on the datastore will be reduced by 20% within 2 minutes of activating the feature toggle
  • steps: …
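
Procedures like this can also be kept machine-readable alongside their documentation, which makes them easy to list and validate. A minimal sketch in Python; the field names mirror the bullet points above but are otherwise illustrative, not part of any standard:

```python
from dataclasses import dataclass, field


@dataclass
class EmergencyProcedure:
    """A documented, pre-approved intervention for incident response."""
    title: str
    trigger: str
    business_impact: str
    operational_impact: str
    steps: list[str] = field(default_factory=list)


# The food delivery example from above, expressed as data.
reduce_search_radius = EmergencyProcedure(
    title="reduce search radius to 400m",
    trigger="increased latency or error rate for search queries",
    business_impact="10-20% fewer search results, drop in conversion rate",
    operational_impact="datastore load reduced by ~20% within 2 minutes",
    steps=["..."],  # the concrete steps live in the real runbook
)
```

Keeping procedures as data also makes it straightforward to render them into the on-call documentation or link them from alert descriptions.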

It’s important that the on-call team regularly practices the emergency procedures. This verifies the correctness of the steps to be executed and their operational implications. Additionally, it ensures that the team (and stakeholders) are familiar with the degraded state of the system, which would rarely be observed otherwise. It’s highly recommended to include stakeholder contacts in the procedures in order to keep them informed about the interventions taken during incident response.

Designing for resilience

Systems need to be explicitly designed to support emergency procedures, be it through runtime toggles that control certain features (e.g. on/off switches, falling back to less expensive processing using cached values) or infrastructure mechanisms (e.g. short-circuiting processing for certain user groups or request types). This also requires annotating incoming requests with sufficient metadata to apply differing treatment per feature, traffic origin, etc.
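
A runtime toggle of this kind can be sketched as follows. The toggle store and function names here are hypothetical; in production the flag would live in a runtime-configurable config service rather than an in-process dict:

```python
import time

# Hypothetical in-memory toggle store; in production this would be a
# runtime-configurable flag backed by a config service.
TOGGLES = {"serve_from_cache_only": False}

# query -> (timestamp, results)
CACHE: dict[str, tuple[float, list[str]]] = {}
CACHE_TTL_SECONDS = 300


def expensive_datastore_query(query: str) -> list[str]:
    # Placeholder for the real datastore call.
    return [f"result for {query}"]


def search(query: str) -> list[str]:
    now = time.time()
    cached = CACHE.get(query)
    if TOGGLES["serve_from_cache_only"]:
        # Degraded mode: serve possibly stale data instead of hitting
        # the datastore, trading data freshness for availability.
        if cached is not None:
            return cached[1]
        return []  # accept incomplete results while degraded
    if cached is not None and now - cached[0] < CACHE_TTL_SECONDS:
        return cached[1]
    results = expensive_datastore_query(query)
    CACHE[query] = (now, results)
    return results
```

Flipping `serve_from_cache_only` at runtime immediately removes all datastore traffic for this code path, at the cost of stale or missing results.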

Here are a few example degradations that can be introduced to a system, with the affected system property in parentheses:

  • enforce serving data from a cache to reduce load on the datastore (data freshness)
  • always serve the first page of a result set to reduce load on the datastore (data completeness)
  • limit retrieved records to N reducing the working dataset of the DB (data completeness defined by amount, distance, or time)
  • switch HD video to SD (degrade quality to save bandwidth) or serve images instead (reduce load by preventing auto-play)
  • drop traffic from unauthenticated users (user coverage)
  • pause all asynchronous batch jobs (feature completeness, data freshness)
  • process only critical requests (feature degradation, data completeness)

Automation

It’s certainly advisable to automate frequently used emergency procedures over time by building them into the system as part of your resiliency patterns (fallbacks, retries on error with adjusted input, etc.). Manual execution of an emergency procedure ensures that a human assesses the situation before proceeding, which helps harden the defined preconditions. A few manual executions let you evaluate whether automation is really of value compared with other planned product features. It’s important to factor on-call health into that prioritization. In times of unexpectedly high growth, when system availability is a concern, automation becomes necessary to cope with overload scenarios efficiently.
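
As one example of building a procedure into the system, a “retry on error with adjusted input” can shrink the requested working set on failure instead of giving up. The failure simulation below is purely illustrative:

```python
def fetch_results(query: str, limit: int) -> list[str]:
    # Stand-in for a datastore call; simulates an overload failure
    # whenever the requested result set is too large.
    if limit > 50:
        raise TimeoutError("datastore overloaded")
    return [f"{query}-{i}" for i in range(limit)]


def fetch_with_degradation(query: str, limit: int = 100) -> list[str]:
    """Retry with progressively reduced input instead of failing outright."""
    while True:
        try:
            return fetch_results(query, limit)
        except TimeoutError:
            if limit <= 10:
                raise  # give up below a minimum useful result size
            limit //= 2  # automated degradation: shrink the working set
```

This encodes the same trade-off as the manual procedure (data completeness for availability), but without waiting for a human to flip a switch.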

Conclusion

Defining emergency procedures requires taking a different view on your system — one where some features are explicitly switched into a degraded mode, thus enabling the overall system to get healthy. The thought exercise of imagining such a degraded, yet usable state is time well spent and highly recommended during the design or production readiness stage of your applications. By building the necessary failure-handling mechanisms into the software, or designing it in a way that naturally accommodates the failure states, incident mitigation becomes simpler and less tedious. With the procedures at hand, you will thank yourself the next time you’re on-call in the middle of the night (or day).