Reliable, Scalable and Maintainable Applications

Most modern systems are built by composing small components behind simple interfaces (APIs) that hide internal details. While this abstraction simplifies usage, the developers of a system must now provide guarantees so others can judge whether the system fits their needs. These guarantees address questions such as how data correctness is ensured, how failures are handled, and how the system behaves under load. Among these concerns, three are fundamental:

  • Reliability: Does the system continue to operate at the desired performance level despite failures?
  • Scalability: Can the system handle growth in data, traffic, or other demands?
  • Maintainability: Can the system be efficiently updated and adapted to new requirements?

A clear understanding of these three concepts is essential and will be discussed in this chapter.

Reliability

A reliable system is one that continues to work correctly even when internal components misbehave. Such misbehaviors are called faults, and a system that can handle or tolerate them is known as fault-tolerant. This does not mean the system can tolerate every possible fault—which is impractical—but only a defined set of faults based on system requirements.

A term often confused with fault is failure, which occurs when the system stops providing its intended service. In practice, fault-tolerance is designed around those faults that could lead to failures. Some commonly occurring faults include the following.

Chaos Monkey

Netflix introduced Chaos Monkey to deliberately inject faults into production systems. By forcing failures, teams identify and fix critical issues early, ensuring the system is continuously tested against real failure scenarios.

Hardware Faults: One of the most direct ways a system can fail is through hardware issues such as power outages, unplugged network cables, disk crashes, or faulty RAM. Traditionally, these are handled by adding hardware redundancy—for example, hot-swappable components, RAID disk configurations, dual power supplies, and backup power.

However, as applications and data have grown with the internet, hardware failures have become common at datacenter scale, where individual machines frequently become unavailable. As a result, modern systems increasingly tolerate hardware faults at the software level, in addition to hardware redundancy. This allows the system to survive the loss of entire machines by using techniques such as replication across multiple nodes or datacenters.
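Replication across nodes can be sketched as follows. This is a minimal, hypothetical illustration (the class names and the majority-quorum rule are assumptions, not any specific system's design): a write is sent to every replica and reported successful only if a majority acknowledge it, so the loss of individual machines does not lose the write.

```python
def replicated_write(key, value, replicas):
    """Write to all replicas; succeed only if a majority acknowledge."""
    acks = 0
    for replica in replicas:
        try:
            replica.write(key, value)   # may raise if the node is down
            acks += 1
        except ConnectionError:
            continue                    # tolerate individual node failures
    return acks > len(replicas) // 2    # majority quorum


class InMemoryReplica:
    """Stand-in for a storage node; real systems talk over the network."""
    def __init__(self):
        self.data = {}

    def write(self, key, value):
        self.data[key] = value


class DownReplica:
    """Simulates a machine that has become unavailable."""
    def write(self, key, value):
        raise ConnectionError("node unavailable")
```

With three replicas, the write survives one failed node but is rejected when two of the three are down.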


Software Errors: These are often more severe than hardware faults because they can propagate across all nodes, causing system-wide failure. For example, the leap second on June 30, 2012, triggered a Linux kernel bug that caused many applications to hang simultaneously. Such bugs often remain dormant until a rare event violates hidden assumptions, leading to failure. To reduce the impact of software errors:

  • Make explicit, well-considered assumptions about how the system interacts with its environment.
  • Test the system thoroughly.
  • Isolate processes so faults do not cascade across the system.
  • Allow processes to crash and restart automatically.
  • Continuously monitor and analyze system behavior in production.
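The "crash and restart automatically" point above can be sketched as a small supervisor loop. This is an illustrative sketch, not a production pattern (the function names and restart limit are assumptions): the supervisor reruns a worker that crashed, and escalates once a restart budget is exhausted.

```python
import time

def supervise(worker, max_restarts=3, backoff=0.0):
    """Run `worker`; if it crashes, restart it up to `max_restarts` times."""
    restarts = 0
    while True:
        try:
            return worker()
        except Exception as exc:
            restarts += 1
            if restarts > max_restarts:
                raise                   # give up: escalate the failure
            print(f"worker crashed ({exc}); restart {restarts}")
            time.sleep(backoff)         # optional delay before restarting

# A worker that fails twice before succeeding, to exercise the supervisor.
attempts = {"n": 0}

def flaky_worker():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("transient bug")
    return "ok"
```

Real deployments delegate this to process managers or orchestrators (systemd, Kubernetes), but the control flow is the same.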

Human Errors: Even with the best intentions, humans are a major source of faults, introducing errors during system design, development, or operation. To reduce this impact:

  • Minimize opportunities for mistakes by defining clear standards, good documentation, and concrete dos and don'ts.
  • Decouple high-risk components from areas where mistakes are common, for example by using sandbox environments that do not affect real users.
  • Test thoroughly using unit tests, integration tests, and manual testing, and automate coverage for edge cases.
  • Build mechanisms for rapid recovery, such as rollbacks and gradual or staged deployments.
  • Monitor system metrics and errors continuously, and set up alerts for warning and critical conditions.
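The gradual-deployment point above is often implemented by deterministically routing a small percentage of users to the new version. A minimal sketch, assuming hash-based bucketing (the function name and bucket scheme are illustrative):

```python
import hashlib

def in_canary(user_id, percent):
    """Deterministically place `percent`% of users in the canary group."""
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    bucket = int(digest, 16) % 100      # stable bucket in [0, 100)
    return bucket < percent

# Widen the rollout gradually (e.g. 1% -> 10% -> 100%); rolling back
# is simply setting the percentage to 0, since bucketing is stable.
```

Because the hash is deterministic, each user consistently sees the same version throughout a rollout stage.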

Reliable systems matter for both critical applications (such as medical devices or aircraft) and non-critical ones (such as a personal notes database). If a system fails to work as promised, users can suffer significant losses. Reliability is therefore a responsibility: it protects the trust users place in the system.

Scalability

A system that is reliable today may become unreliable as load increases. The ability of a system to cope with growth is called scalability. Designing for scalability requires deciding how the system will handle growth in certain parameters (such as data volume or traffic) and how additional resources will be added to manage the increased load.

Describing Load

Load is described using measurable values called load parameters. The choice of parameters depends on the system’s architecture. Common examples include requests per second to a server, the read-to-write ratio in a database, or cache hit rate. Clearly defining the right load parameters is crucial for scaling effectively—for example, the following Twitter case study shows how this clarity enabled Twitter to scale its system more efficiently.

Case Study: Twitter

Twitter’s system has two primary operations:

  • Post Tweet: allows a user to publish a new message to their followers (peak ~12k requests/sec).
  • Home Timeline: allows a user to view tweets from people they follow (peak ~300k requests/sec).

Twitter’s main scaling challenge came from fan-out(1), where a single tweet needs to be delivered to many followers when building home timelines. Initially, tweets were stored in a table, and a user’s timeline was generated on demand by fetching all followed users and then retrieving their tweets. This approach did not scale as the number of users (the initial load parameter) increased.

To address this, Twitter switched to a different design: each user’s home timeline was built from a cache, and new tweets from followed users were directly written into this cache. This worked well because reading timelines was far more frequent than posting tweets. The trade-off was that posting a tweet became more expensive, since it now required additional write operations.

A further challenge came from outlier accounts, such as celebrities, with extremely large follower counts. For these accounts, pushing tweets to every follower’s cache caused excessive writes. Twitter therefore reverted to the original pull-based approach for such accounts. By considering the distribution of the followers per user load parameter, Twitter was able to scale the system more effectively.

  1. Fan-out refers to a situation where one request to an upstream service triggers multiple requests to downstream services.
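The hybrid approach described above can be sketched in a few lines. This is a simplified illustration, not Twitter's actual implementation; the follower threshold and data structures are assumptions. Ordinary accounts fan out on write (push into follower caches), while high-follower accounts are merged in at read time (pull).

```python
FOLLOWER_LIMIT = 10_000   # assumed threshold separating "celebrity" accounts

timelines = {}            # user_id -> cached home timeline (fan-out on write)
tweets_by_user = {}       # user_id -> that user's own tweets

def post_tweet(author, followers, text):
    tweets_by_user.setdefault(author, []).append(text)
    if len(followers) < FOLLOWER_LIMIT:
        # Push: write the tweet into every follower's cached timeline.
        for follower in followers:
            timelines.setdefault(follower, []).append(text)
    # Celebrities skip fan-out; their tweets are merged in at read time.

def home_timeline(user, followed_celebrities):
    merged = list(timelines.get(user, []))
    for celeb in followed_celebrities:
        merged.extend(tweets_by_user.get(celeb, []))   # pull on read
    return merged
```

The trade-off from the text is visible here: posting costs one cache write per follower, while reading stays cheap except for the few celebrity merges.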

Describing Performance

Once load parameters are defined, the next step is to describe how system performance changes as those parameters grow, such as how many additional resources are required to maintain acceptable performance. This requires defining performance metrics, such as throughput(1) in batch-processing systems, or response time(2) and latency(3) in web services.

  1. number of items processed per second.
  2. time from sending a request to receiving a response, including processing, network, and queuing delays.
  3. time a request spends waiting to be handled—i.e., the duration it is latent, awaiting service.

When response time is used as a performance metric, a single value like the average does not capture the full behavior of the system. An average response time cannot show how many requests were significantly faster or slower than that value. Instead, analyzing the distribution of response times provides a clearer picture of system performance.

A commonly used metric is the median(1) response time, also known as the 50th percentile (p50). It represents the typical user experience, since 50% of requests are completed in at most this time, while the remaining 50% take longer.

  1. To compute the median, sort all response times within a given time window and select the midpoint value. This indicates that 50% of requests in that window were served at or below the median response time, while the remaining 50% experienced higher latency.

To evaluate worst-case behavior, higher percentiles such as p95, p99, and p999 (99.9th percentile) are used. For example, p99 represents the response-time threshold that 99% of requests meet. These higher percentiles—often called "tail latencies"—are especially important because the users affected by them are frequently the most valuable or most demanding users of the system. For instance, AWS tracks p999 latency to reflect the experience of its highest-usage customers.
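Percentiles can be computed from a window of response times with the nearest-rank method. A minimal sketch (the sample values are made up for illustration):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample >= p% of all values."""
    ranked = sorted(samples)
    rank = math.ceil(p / 100 * len(ranked))   # 1-based rank
    return ranked[max(rank, 1) - 1]

# Response times (ms) observed in one time window.
window = [30, 42, 45, 51, 55, 60, 75, 90, 120, 850]
p50 = percentile(window, 50)   # median: half of requests are at or below this
p99 = percentile(window, 99)   # tail latency, dominated by the slowest requests
```

Note how a single 850 ms outlier barely moves the median but defines p99—exactly why averages hide tail behavior.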

SLAs

Percentiles are commonly used in SLOs and SLAs(1), which define expected service performance and availability. Including such metrics in SLAs sets clear expectations for clients and allows customers to seek compensation if the agreed-upon guarantees are not met.

  1. Service Level Objectives and Service Level Agreements

Collecting performance metrics is as important as defining them, and measurements should reflect the real impact on end users.

  • For example, measuring response time only on the server may exclude delays caused by request queuing when the server is busy. To capture the true user-perceived latency, response times can instead be measured on the client side, for example via client-side instrumentation that reports timings back to the service.
  • During testing, load generators must keep sending requests independently of response times, rather than waiting for each response before sending the next. Sending requests serially can distort performance results and fail to represent real-world load.

Tail Latency Amplification

Tail latencies are especially significant in services that are invoked multiple times within a single request lifecycle. In such cases, the slowest sub-request determines the overall response time, amplifying latency even when most requests are fast.
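The amplification effect follows directly from independence: if each backend call is "fast" with probability p, a user request that fans out to n parallel calls is fast only when all n are. A quick numeric sketch (the 99% and 100-call figures are illustrative):

```python
def prob_all_fast(per_call_fast, n):
    """Probability that all n independent parallel calls are fast."""
    return per_call_fast ** n

# With 100 backend calls, each meeting its p99 independently,
# roughly 63% of user requests still hit at least one slow call.
slow_fraction = 1 - prob_all_fast(0.99, 100)
```

So even a 1% per-call tail becomes the majority experience at the user level once fan-out is wide enough.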


How, then, are load parameters and performance metrics used to build scalable systems? There is no single architecture that fits all scalability needs:

  • Applications designed for a single machine are simpler, but as load grows, upgrading to more powerful hardware becomes increasingly expensive (vertical scaling), limiting scalability. Distributed systems scale better but introduce additional complexity.
  • Within distributed systems, services with unpredictable load benefit from elastic architectures, where resources are added or removed automatically based on defined load metrics. Services with predictable load can be scaled manually, which is simpler and involves fewer operational surprises.
  • Scalability also depends on the nature of the data. Stateful systems are harder to distribute because data must be tracked and managed across machines, as in databases.
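The elastic-architecture point above is typically a control loop over a load metric. A minimal sketch of a target-tracking rule (the target utilization and bounds are assumed values, similar in spirit to—but not a copy of—real autoscalers):

```python
import math

def desired_replicas(current, cpu_utilization, target=0.6,
                     minimum=2, maximum=20):
    """Scale the replica count so average utilization approaches `target`."""
    wanted = math.ceil(current * cpu_utilization / target)
    return max(minimum, min(maximum, wanted))   # clamp to safe bounds
```

For example, 4 replicas running at 90% CPU scale out to 6, while lightly loaded replicas scale in only as far as the configured minimum.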

Systems that scale well are designed around assumptions about which operations are frequent and which are rare—that is, around their load parameters. If these assumptions are wrong, scaling efforts may be ineffective. Still, for some systems, such as early-stage startups, prioritizing product features over premature scalability design can be the more practical choice.

Maintainability

Most of a software system’s cost comes from ongoing maintenance—fixing bugs, keeping the system running, and adapting to new requirements. Without careful design during development, these activities become difficult, often resulting in legacy systems that are unpleasant to maintain. To avoid this, the book recommends three key design principles.

Operability: Running a system in production requires tasks such as setting up monitoring, deploying updates, diagnosing and restoring degraded performance, and keeping platforms up to date with security patches. These tasks are usually handled by an operations team, so making them easy reduces operational overhead and frees time for higher-value work. This can be achieved by providing:

  • Clear monitoring and health-check endpoints for visibility into system behavior.
  • Support for standard automation and deployment tools.
  • Clear, accessible documentation for operational tasks.
  • Sensible defaults with the ability to override them when needed.

Simplicity: As software systems grow, unmanaged complexity makes them harder to understand and modify. Sources of complexity include tight coupling between modules, excessive dependencies, inconsistent terminology, and short-term hacks. This directly impacts maintenance tasks such as adding features, fixing bugs, or improving performance. One of the most effective tools for managing complexity is "abstraction".

Good abstractions hide unnecessary details behind simple interfaces while enabling reuse and flexibility. For example, high-level programming languages abstract hardware details, and SQL abstracts storage and memory management. Learning to design good abstractions often comes from studying large, well-designed systems.


Evolvability: System requirements continuously change due to new technologies, features, regulations, or business priorities. Organizations often use "Agile" practices to adapt to change, employing techniques such as test-driven development and refactoring. While Agile methods work well at small scales, additional strategies are needed to maintain agility in large, multiservice systems—topics explored in later chapters.


In conclusion, there is no single solution for building reliable, scalable, and maintainable systems. However, certain design patterns and techniques consistently reappear across successful systems. The remainder of the book explores these patterns and how they help meet different system goals and constraints.