Fault Tolerance

Categories
Systems
Sources
Designing Data-Intensive Applications

A fault is one component deviating from its specification; a failure is the system as a whole stopping serving its users. A fault-tolerant system is designed so that faults do not escalate into failures. Some faults can even be triggered deliberately, killing nodes, injecting errors, to keep the tolerance mechanisms exercised.

Why it Matters

You cannot prevent all faults, so reliability comes from containing them rather than eliminating them. Naming the boundary between fault and failure tells you where to invest: detection, redundancy, and recovery in the gap between a component's defect and a user-visible failure.

Signals

  • A single component's defect taking down the whole service.
  • "It works until any one thing breaks."
  • Recovery paths that are never exercised until a real incident, when they turn out broken.

Benefits

Predictable behavior under partial breakage and confidence that common faults are absorbed rather than propagated.

Risks

Tolerating faults so well that they hide and accumulate unnoticed; recovery code that is itself untested and fails the moment it is finally needed.

Tensions

Tolerance adds redundancy and complexity that can introduce new faults of its own. Prevention is impossible past a point, so effort must be split between preventing faults and tolerating them.

Examples

A service that keeps serving when one replica dies; deliberately terminating nodes in production to verify that failover actually works.