Originally published on Medium.
Operating software in production offers great insight into software quality. Learning from production incidents is key to improving existing software and to designing reliable applications. The Production Readiness Review is an established Site Reliability Engineering practice that feeds post-incident experience and findings back into the software development process. This post provides an overview of a few common themes and pitfalls I’ve seen surface during production readiness and post-incident reviews.
Using defaults is just asking for trouble
Every framework, http client library or server, connection pool, database, and operating system assumes defaults for its configuration settings. Aside from the risk of publicly exposing sensitive data, default settings often impact an application’s performance and reliability. There are important settings that must be revisited before deploying an application into a production environment.
Overall, the most commonly missed defaults that impact reliability are timeouts: http client timeouts (connection and read timeouts), DNS cache timeouts, and database connection pool and statement timeouts. Framework authors bear great responsibility when setting default values, yet often fail to make the developer’s life easy. For example, Java’s DNS cache can be infinite; Apache HttpClient v4’s RequestConfig.Builder uses “-1” as its default timeout, with the documentation stating “A negative value is interpreted as undefined (system default)”, while the newer v5 uses 3 minutes instead; .NET’s default timeout is 100 seconds. Timeouts need to be set with care, so it’s important to understand the meaning of the different configuration options. For a great overview of the http request lifecycle and timeouts, check out the complete guide to Go net/http timeouts.
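To make this concrete, here is a minimal sketch of setting timeouts explicitly in Java using the standard-library java.net.http client; the timeout values and the URL are purely illustrative and must be tuned per service:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.security.Security;
import java.time.Duration;

public class TimeoutConfig {
    public static HttpClient newClient() {
        // Never rely on library defaults: set the connection timeout explicitly.
        return HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();
    }

    public static HttpRequest newRequest(String url) {
        // Per-request timeout: how long to wait for the response to arrive.
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(5))
                .GET()
                .build();
    }

    public static void main(String[] args) {
        // Java's DNS cache can otherwise be effectively infinite when a
        // security manager is set; cap successful lookups at 60 seconds.
        Security.setProperty("networkaddress.cache.ttl", "60");
        System.out.println(newRequest("https://example.com/").timeout().orElseThrow());
    }
}
```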
Misconfigured reliability patterns
Retries
Retrying a failed http request is one of the simplest reliability patterns to implement. Done right, retries use exponentially increasing wait times between attempts and add jitter to prevent retry storms that lead to the thundering herd problem. However, if the client’s timeouts are lower than those the dependency uses internally, subsequent retries will pile up and overload the dependency with useless work (the client will never wait long enough for the result it initially requested). Timeout values must therefore be carefully aligned with service providers, ideally based on their SLOs.
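A minimal sketch of the backoff-with-jitter part in Java (the “full jitter” strategy: sleep a random duration up to the capped exponential delay; the parameter values are illustrative):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

public class Retry {
    // Retries `task` with exponential backoff and full jitter: before each
    // retry, sleep a random duration in [0, min(maxDelayMs, base * 2^attempt)].
    public static <T> T withBackoff(Callable<T> task, int maxAttempts,
                                    long baseDelayMs, long maxDelayMs) throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                if (attempt + 1 >= maxAttempts) throw e; // budget exhausted
                long cap = Math.min(maxDelayMs, baseDelayMs << attempt);
                Thread.sleep(ThreadLocalRandom.current().nextLong(cap + 1));
            }
        }
    }
}
```

A real implementation would additionally retry only on retryable errors (e.g. 503, connection reset), never on client errors like 400.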
Circuit breakers
Circuit breakers enable an application to fail early when a dependency is overloaded or faulty and to serve a degraded experience via fallbacks. Dropping requests that could not be processed before their clients time out also helps the dependency recover from failure, thanks to the reduced load. The service provider can use the additional time to stabilise the system (e.g. by provisioning additional instances) and thus regain the ability to serve the required load.
Like every reliability pattern, a circuit breaker needs to be properly configured to function well. An execution timeout that is too high leads to a situation where the circuit breaker never opens, keeping the load on the dependency when its performance degrades and making recovery more difficult. Correctly configuring a circuit breaker requires careful planning based on peak load and p99 latencies; the original Hystrix documentation contains detailed guidance for this. Note that its default execution timeout is 1 second. Too bad that official tutorials for frameworks (e.g. Spring) fail to even mention the word “timeout” and do not link to the appropriate documentation.
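To illustrate the mechanics, here is a deliberately simplified circuit-breaker sketch in Java. It opens after a number of consecutive failures, whereas real libraries such as Hystrix or Resilience4j use rolling windows and failure-rate thresholds; all names and thresholds here are illustrative:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

public class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final Duration openDuration;
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private Instant openedAt = Instant.EPOCH;

    public CircuitBreaker(int failureThreshold, Duration openDuration) {
        this.failureThreshold = failureThreshold;
        this.openDuration = openDuration;
    }

    public synchronized <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (state == State.OPEN) {
            if (Instant.now().isAfter(openedAt.plus(openDuration))) {
                state = State.HALF_OPEN;  // let a single probe request through
            } else {
                return fallback.get();    // fail fast: shed load off the dependency
            }
        }
        try {
            T result = action.get();
            consecutiveFailures = 0;
            state = State.CLOSED;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;       // stop calling the degraded dependency
                openedAt = Instant.now();
            }
            return fallback.get();
        }
    }
}
```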
Circuit breakers require proper isolation
Even if timeouts are configured properly, the circuit breaker may not function as intended from a business perspective, because its degradation is too broad. Imagine an application A that fetches a risk score for shipping addresses by calling system B. Because countries may be served by different risk scoring providers, B has multiple connectors (one per provider) and internal logic that chooses a provider based on the received address. Service A has a single circuit breaker for calls to B. The failure rate of B, however, depends on the failure rates of the connected providers and the distribution of requests across them. In such a situation, the failure of a single provider can open the circuit breaker, preventing calls from being routed to the remaining healthy providers and thus degrading responses for all calls.
In the example below, B receives 300 rps and calls Provider 1 with 200 rps. When Provider 1 becomes unavailable, more than 50% of requests from A to B fail, causing the circuit breaker to open (per the default configuration of the popular Hystrix library), even though requests routed to the other providers would have been processed correctly.

Figure 1. Provider 1 becomes unavailable, triggering the circuit breaker for calls from A to B to open and reject requests to healthy Providers 2 … N.
A potential solution to this type of insufficient isolation is a custom strategy that counts the failure rate per provider, or distinct circuit breakers per country within service A. The two solutions provide different types of isolation: the former requires exposing additional information through the API (the provider), whereas the latter is steered purely by knowledge the caller already has from the processed addresses (the country). The right question to ask about this example is: why doesn’t B have circuit breakers for the individual providers? While these would be great to have, it’s often practically impossible, because B is a black box (e.g. a 3rd-party service, or a monolith that cannot be easily adjusted).
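The per-country variant boils down to a lazily populated registry of breaker instances keyed by country code. A sketch in Java, where `Breaker` stands for whichever circuit-breaker type is actually in use (the class and method names are hypothetical):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// One breaker per country code, so an unavailable provider only opens
// the breaker for the addresses it actually serves.
public class PerCountryBreakers<Breaker> {
    private final Map<String, Breaker> breakers = new ConcurrentHashMap<>();
    private final Function<String, Breaker> factory;

    public PerCountryBreakers(Function<String, Breaker> factory) {
        this.factory = factory;
    }

    // Lazily creates and caches the breaker for the given country.
    public Breaker forCountry(String countryCode) {
        return breakers.computeIfAbsent(countryCode, factory);
    }
}
```

The trade-off: more breakers mean fewer samples per breaker, so thresholds may need to be adjusted for low-traffic countries.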
Mixing synchronous and asynchronous workloads
Let’s imagine a service that has a spike in its p99 latency every x minutes. Sounds familiar? Frequently it’s a log rotation daemon running on the machine, whose gzip operation eats up resources; but more often than not, it is the service itself that causes the spike. It can be a scheduled job that fetches and processes information, for example periodically refreshing a cache or cleaning up old entries in the DB. If not carefully designed for, such asynchronous execution will impact the service’s p99 latency for synchronous calls.
If you really need to mix such workloads within one application, at least ensure that the http, database connection, and thread pools are properly isolated from one another. Otherwise, a long-running async task will impact the synchronous workloads and, in the worst case, prevent them from being processed at all.
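As an example of thread pool isolation in Java (the pool sizes are illustrative and must be tuned per service):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Dedicated pools so background work cannot starve request handling.
public class PoolIsolation {
    // Threads serving synchronous requests.
    public static final ExecutorService requestPool = Executors.newFixedThreadPool(32);

    // Small, bounded pool for async/batch work. The bounded queue plus
    // CallerRunsPolicy provides backpressure instead of unbounded memory
    // growth when the batch work falls behind.
    public static final ExecutorService batchPool = new ThreadPoolExecutor(
            2, 2, 60, TimeUnit.SECONDS,
            new ArrayBlockingQueue<>(100),
            new ThreadPoolExecutor.CallerRunsPolicy());
}
```

The same separation applies to database connection pools: a dedicated pool (or at least a reserved portion) for batch jobs keeps a slow cleanup query from exhausting the connections that request handling depends on.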
Missing protection from overload situations
Rate limiting as means to protect from incoming request load
A safe strategy for preventing overload of a service is applying a rate limit to incoming requests. Rate limits are set based on the resources that constrain the application’s ability to scale; these may be dictated by its dependencies (e.g. a database or a 3rd-party API) or simply by cost. Rate limiting can be applied within the application itself or outside of it, for example in API gateways or ingress controllers. A stricter form of rate limiting is load shedding, which rejects requests to signal overload. To implement load shedding, aside from the technical capability to execute this operation, it helps to understand the business impact of rejecting each client’s requests completely, as this allows for easy selection of which clients to block first until the service is stabilised.
Rate limits also ensure that clients are forced to negotiate a limit increase with the service provider, making scaling needs and capacity planning an explicit conversation. Lastly, rate limiting uncovers and helps deal with rogue clients of the service. Imagine a service hosting static configuration that can be cached for a long period of time (e.g. 4 hours). This service should see a request load that depends only on the number of clients it has; the load is expected to fluctuate only as those clients scale up or down to accommodate their own incoming traffic. It is not expected to follow the traffic patterns of its clients. If it does, those clients are not caching the retrieved data correctly and are instead fetching it while processing incoming requests.
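The classic in-process mechanism is a token bucket: the bucket refills at the allowed rate and each request consumes one token. A minimal sketch in Java (illustrative; production services typically rely on a gateway or a library such as Bucket4j or Guava’s RateLimiter):

```java
public class TokenBucket {
    private final long capacity;
    private final double refillPerNano;
    private double tokens;
    private long lastRefill;

    public TokenBucket(long capacity, double tokensPerSecond) {
        this.capacity = capacity;
        this.refillPerNano = tokensPerSecond / 1_000_000_000.0;
        this.tokens = capacity;            // start full: allow an initial burst
        this.lastRefill = System.nanoTime();
    }

    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        // Refill proportionally to the elapsed time, capped at capacity.
        tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerNano);
        lastRefill = now;
        if (tokens >= 1) {
            tokens -= 1;
            return true;
        }
        return false;  // over the limit: reject, e.g. with HTTP 429
    }
}
```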
Protection from unexpected request execution
While rate limits provide external protection, services must implement internal protection as well. As discussed before, services should define SLOs, on which clients will base their timeout configuration. However, the service itself must be designed to honour those SLOs across all operations. This is achieved using timeouts at various levels, starting from the persistence layer (e.g. via database statement timeouts), through http connection and request timeouts, up to TCP keepalives and similar operating system settings. Pure compute operations can leverage execution budgets, where a maximum execution time is defined after which the computation is aborted. Done correctly, this prevents overload in cases where the calculation time is unexpectedly influenced by the processed data, for example a regex causing catastrophic backtracking. As past incidents have shown, this is important even for calculations executed in the background: even though their results are not used for the response itself, the calculation consumes CPU cycles and will overload the application all the same.
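A sketch of such an execution budget in Java, using a bounded wait on a Future. One important caveat, which matches the regex example above: cancellation only interrupts the worker thread, so a computation that never checks its interrupt status keeps burning CPU; hot loops should check `Thread.currentThread().isInterrupted()` themselves. The method and parameter names are illustrative:

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ExecutionBudget {
    private static final ExecutorService pool = Executors.newCachedThreadPool();

    // Runs `task` with a maximum execution time; returns `fallback` if the
    // budget is exceeded or the task fails.
    public static <T> T withBudget(Callable<T> task, Duration budget, T fallback) {
        Future<T> future = pool.submit(task);
        try {
            return future.get(budget.toMillis(), TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // best effort: interrupts cooperative tasks
            return fallback;
        } catch (Exception e) {
            return fallback;
        }
    }
}
```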
Lack of control of the application
Having the ability to control the application’s inner workings is very helpful during incident response. Though it should not be required during normal operation of the service, where one relies on reliability patterns, it comes in extremely handy when mitigating incidents. This can be the capability to pause batch jobs, re-trigger processing, or disable an expensive computation in favor of a degraded but simpler one. Too often, such changes require a code change and a deployment of the service instead of a simple switch in a feature flag system or via a management API endpoint. The longer the CI/CD pipeline takes to execute, the worse the ability to react quickly during an incident in the absence of such controls.
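The simplest form of such a control is a runtime kill switch around the expensive code path. A sketch in Java, reusing the risk-scoring example from earlier; in a real service the flag would be backed by a feature-flag system or flipped via a management endpoint, and the scoring logic here is purely a stand-in:

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class DegradableScoring {
    // Toggled at runtime by an operator; no redeploy needed.
    static final AtomicBoolean expensiveScoringEnabled = new AtomicBoolean(true);

    static int riskScore(String address) {
        if (expensiveScoringEnabled.get()) {
            return expensiveScore(address); // full computation
        }
        return 50; // degraded, but cheap and predictable default
    }

    private static int expensiveScore(String address) {
        // Stand-in for a costly model evaluation.
        return Math.abs(address.hashCode() % 100);
    }
}
```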
Insufficient visibility
Aside from monitoring the four golden signals (latency, traffic, errors, saturation), further metrics help to quickly understand production incidents. This starts with collecting metrics on connection pools, the rate of incoming requests per client, outgoing requests per dependency, the duration of batch job executions, etc. It can also include metrics very specific to the service itself. For example, if a service offering a batch API collects statistics on batch sizes, such data can be used to verify whether clients use the API effectively or break it with unexpectedly big batches. Plotting response times per batch size provides insight into processing times and drives discussions on SLOs for the service. Detailed service instrumentation using standard formats like OpenTelemetry allows you to drill down even further, find causes of incidents across the service call chain, and identify areas that can be optimised (e.g. through parallelisation of calls), further improving performance and stabilising systems.
Summary
This post provided an overview of common service design flaws and pitfalls that impact reliability. I can only encourage you to check out the references provided, especially the SRE book, which is a great starting point for diving deep into reliability engineering. And if you have learned something from production issues, please share your failure stories, so that others can learn from your findings.