Designing for Failure

Treat failure as the normal condition of a complex system and build the reasoning that follows: partial failure, latent flaws, tolerance, graceful degradation, and resilience.

Depth: Deep
Steps: 8 existing graph entries
  1. Mental Model

    Failure Is Normal

    Failure is not an anomaly to be eliminated but a permanent, normal property of complex and distributed systems. Robust design assumes things are always partly broken and focuses on containing faults and degrading gracefully, rather than chasing a flawless state that does not e...

    Begin by accepting failure as the normal condition of any sufficiently complex system, not an exception.

    Related within this path

    • References: Degraded Mode Operation
    • References: Fault Tolerance
    • References: Latent Failures
    • References: Partial Failure
  2. Concept

    Partial Failure

    In a distributed system some parts can be broken while others keep working, and a node often cannot tell whether a remote node has failed, is merely slow, or whether the network dropped the message. Unlike a single machine that either works or crashes, distributed systems fail...

    In a distributed system some parts fail while others keep running; this is the default, not the edge case.

    Related within this path

    • References: Failure Is Normal
    • Related To: Fault Tolerance
    • Related To: Latent Failures
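
To make the ambiguity in the Partial Failure step concrete, here is a minimal sketch in Python. It is a simulation only, with no real network I/O, and the names `RemoteCondition` and `observe_call` as well as the delay values are hypothetical. It illustrates that a caller with a timeout sees the same thing whether the remote node crashed, is slow, or never received the message.

```python
from enum import Enum
from typing import Optional


class RemoteCondition(Enum):
    """Hypothetical ground truth that the caller has no direct access to."""
    HEALTHY = "healthy"
    CRASHED = "crashed"
    SLOW = "slow"
    MESSAGE_DROPPED = "message_dropped"


def observe_call(condition: RemoteCondition, timeout_s: float = 1.0) -> Optional[str]:
    """Return what the caller observes: a reply, or None when the timeout expires.

    Simulation only: the delays below are illustrative assumptions.
    """
    # Simulated time until a reply would arrive; infinity means it never arrives.
    reply_delay_s = {
        RemoteCondition.HEALTHY: 0.05,
        RemoteCondition.CRASHED: float("inf"),
        RemoteCondition.SLOW: 30.0,
        RemoteCondition.MESSAGE_DROPPED: float("inf"),
    }[condition]
    return "reply" if reply_delay_s <= timeout_s else None


# Three different underlying conditions produce the same observation (None).
observations = {condition: observe_call(condition) for condition in RemoteCondition}
```

Because the three unhealthy conditions all collapse to the same observation, the caller has to choose a policy (retry, fail over, give up) without knowing which one actually occurred.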
  3. Concept

    Latent Failures

    Complex systems always contain multiple flaws, each latent and individually insufficient to cause harm. Because the system keeps running, these flaws accumulate largely unnoticed, and you can never remove them all. Failures are not anomalies waiting to be eliminated; they are...

    Most failures are latent flaws waiting to combine, not a single root cause.

    Related within this path

    • References: Failure Is Normal
    • Related To: Change Introduces New Failure Modes
    • Related To: Defense in Depth
    • Related To: Degraded Mode Operation
    • Related To: Partial Failure
  4. Concept

    Change Introduces New Failure Modes

    Every change to a complex system, including changes that fix problems or add safety, creates new and often unforeseen paths to failure. Improvement and new risk arrive together. Changes alter the web of interactions and consume the margin that absorbed past variation, so the s...

    Every change, including fixes, can introduce new ways to fail.

    Related within this path

    • Related To: Latent Failures
  5. Concept

    Fault Tolerance

    A fault is one component deviating from its specification; a failure is the system as a whole ceasing to serve its users. A fault-tolerant system is designed so that faults do not escalate into failures. Some faults can even be triggered deliberately, killing nodes, injecting...

    Design so that faults do not become failures the user sees.

    Related within this path

    • References: Failure Is Normal
    • Related To: Partial Failure
    • Related To: Resilience
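
The fault versus failure distinction in the Fault Tolerance step can be sketched with simple failover across redundant replicas. This is one possible illustration, not the only way to tolerate faults; `call_with_failover`, `ServiceFailure`, and the two replica functions are hypothetical names introduced for the example.

```python
from typing import Callable, List, Sequence


class ServiceFailure(Exception):
    """Raised only when every replica has faulted: the failure a user would see."""


def call_with_failover(replicas: Sequence[Callable[[], str]]) -> str:
    """Try each replica in order and return the first successful answer."""
    faults: List[Exception] = []
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:
            # A fault in one component: record it and move on to the next replica.
            faults.append(exc)
    # Every replica faulted, so the fault has escalated into a failure.
    raise ServiceFailure(f"all {len(replicas)} replicas faulted: {faults!r}")


def broken_replica() -> str:
    raise ConnectionError("replica unavailable")


def healthy_replica() -> str:
    return "ok"


# One faulty replica is masked by the healthy one; the caller still gets an answer.
result = call_with_failover([broken_replica, healthy_replica])
```

The caller receives a normal result even though one component faulted. Only when all replicas fault does `ServiceFailure` surface.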
  6. Concept

    Degraded Mode Operation

    Complex systems run continuously in a partially broken state. They function not because they are flawless but because enough redundancy and human adjustment keep them working despite the flaws they always carry. The normal condition of a complex system is "degraded," not "perf...

    Prefer degrading gracefully over failing completely.

    Related within this path

    • References: Failure Is Normal
    • Related To: Defense in Depth
    • Related To: Latent Failures
    • Related To: Resilience
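
As a rough illustration of degrading gracefully instead of failing completely, the sketch below falls back from a fresh answer to a stale cached one, and then to a generic default. The recommendation scenario, `get_recommendations`, `failing_fetch`, and `DEFAULT_ITEMS` are all assumptions made for the example; they do not come from the path itself.

```python
from typing import Callable, Dict, List

# Hypothetical generic fallback used when nothing better is available.
DEFAULT_ITEMS: List[str] = ["bestsellers"]


def get_recommendations(
    user_id: str,
    fetch_personalized: Callable[[str], List[str]],
    cache: Dict[str, List[str]],
) -> Dict[str, object]:
    """Return recommendations, flagging when the answer is degraded."""
    try:
        items = fetch_personalized(user_id)
        cache[user_id] = items  # remember the last good answer
        return {"items": items, "degraded": False}
    except Exception:
        # Dependency is unavailable: serve stale or generic data instead of an error.
        items = cache.get(user_id, DEFAULT_ITEMS)
        return {"items": items, "degraded": True}


def failing_fetch(user_id: str) -> List[str]:
    raise TimeoutError("recommendation service did not answer")


cache: Dict[str, List[str]] = {"alice": ["book-1", "book-2"]}
stale = get_recommendations("alice", failing_fetch, cache)   # stale cached answer
generic = get_recommendations("bob", failing_fetch, cache)   # generic default
```

Returning an explicit `degraded` flag keeps the reduced quality visible to callers and to monitoring, so the degraded state does not become one more unnoticed flaw.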
  7. Concept

    Defense in Depth

    Complex systems are protected by multiple, overlapping layers of defense, so that no single failure produces catastrophe. Harm requires several defenses to fail at once. Because latent flaws are always present, robustness comes not from one perfect barrier but from layering im...

    Layer independent defenses so no single failure is catastrophic.

    Related within this path

    • Related To: Degraded Mode Operation
    • Related To: Latent Failures
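
The claim that harm requires several defenses to fail at once can be given a back-of-the-envelope form. The sketch below assumes the layers fail independently, which is an idealization: shared latent flaws can make real layers fail together, so the result is better read as an optimistic bound. The function name and the 10% miss rates are illustrative assumptions.

```python
from math import prod
from typing import Sequence


def probability_all_layers_fail(miss_probabilities: Sequence[float]) -> float:
    """Probability that a hazard passes every layer, assuming independent layers."""
    for p in miss_probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a probability: {p}")
    # Under independence, the joint probability is the product of the per-layer ones.
    return prod(miss_probabilities)


# Illustrative numbers: each imperfect layer misses 10% of hazards.
single_layer = probability_all_layers_fail([0.1])
three_layers = probability_all_layers_fail([0.1, 0.1, 0.1])
```

Under these assumptions, three layers that each miss one hazard in ten together miss roughly one in a thousand, which is why several imperfect barriers can outperform a single much better one.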
  8. Concept

    Resilience

    Resilience is a system's ability to recover its function and structure after a disturbance and to persist within a variable environment. It comes from rich, overlapping, redundant feedback loops, not from optimization toward a single target. Resilient systems absorb shocks and repair the...

    Close on resilience: the capacity to absorb disturbance and keep functioning.

    Related within this path

    • Related To: Degraded Mode Operation
    • Related To: Fault Tolerance