Failure Is Normal
- Categories
- Systems
Failure is not an anomaly to be eliminated but a permanent, normal property of complex and distributed systems. Robust design assumes things are always partly broken and focuses on containing faults and degrading gracefully, rather than chasing a flawless state that does not exist.
Reinforced By
- Latent Failures — complex systems always contain multiple flaws, and you can never remove them all.
- Partial Failure — in a distributed system some parts are always broken, and you often cannot tell which.
- Fault Tolerance — reliability comes from containing faults, not from preventing them.
- Degraded Mode Operation — systems normally run in a partially degraded state and keep working anyway.
- Blameless Postmortem — when failure is treated as normal and systemic, incidents become learning fed back into the system rather than blame assigned to a person.
Why it Matters
How Complex Systems Fail argues that complex systems always run with multiple latent flaws and in a degraded mode, so failure is the normal condition, not a deviation. Designing Data-Intensive Applications reaches the same place for distributed systems: the network, clocks, and nodes are unreliable, partial failure is the defining difficulty, and reliability means tolerating faults rather than preventing them. Site Reliability Engineering builds an entire operational practice on the same premise: 100% reliability is the wrong target, so failure is budgeted for, and when it happens the blameless postmortem turns it into learning rather than blame. Across live operations, distributed systems, and production engineering the lesson is identical, design for a world that is always partly broken, with detection, redundancy, and graceful degradation, instead of assuming a healthy steady state.
Tension
Accepting failure as normal must not slide into tolerating preventable faults; the goal is to contain and degrade well, not to stop caring. And the redundancy and slack that graceful degradation needs cost efficiency that optimization pressure will try to strip away (see Resilience).