Tail Latency

Categories: Architecture
Sources: Designing Data-Intensive Applications

Response times at the high percentiles (p95, p99, p999) rather than the average or median. A few slow requests account for the worst experiences, and those requests often come from the most active, most valuable users, so the tail is often the number that matters most.
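To make the definition concrete, here is a minimal sketch that computes percentiles with the nearest-rank method and compares them with the mean. Nearest-rank is an assumed choice; other definitions interpolate between samples and give slightly different values. The function name `percentile` and the latency data are hypothetical.

```python
import math
from fractions import Fraction


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least p% of
    all samples are less than or equal to it. p is in (0, 100], e.g. 99.9 for p999."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    # Exact decimal arithmetic so float rounding cannot shift the rank.
    rank = math.ceil(Fraction(str(p)) * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical response times in ms: 980 fast requests and 20 slow ones.
latencies_ms = [10] * 980 + [1000] * 20

mean_ms = sum(latencies_ms) / len(latencies_ms)
print(f"mean: {mean_ms:.1f} ms")  # mean: 29.8 ms
for p in (50, 95, 99, 99.9):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")
# p50: 10 ms
# p95: 10 ms
# p99: 1000 ms
# p99.9: 1000 ms
```

A 29.8 ms mean looks healthy, and the median says 10 ms, yet one request in fifty takes a full second; only the p99 and p999 lines reveal it.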

Why It Matters

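Averages and medians conceal the slow requests that people actually notice. Optimizing p99 targets the cases that drive abandonment. In a system that fans out to many backends, the user's request has to wait for the slowest call, so one slow call makes the whole request slow (tail amplification). The more backends a request touches, the more likely it is to hit at least one slow call, so the tail grows with fan-out.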
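The growth of the tail with fan-out can be estimated with a short calculation. This sketch assumes each backend call is independently slow with the same probability; real backends often slow down together, so treat the result as an illustration rather than a prediction. The function name `prob_request_hits_slow_call` is hypothetical.

```python
def prob_request_hits_slow_call(per_call_slow: float, fan_out: int) -> float:
    """Probability that a request hits at least one slow backend call.

    Assumes each of the `fan_out` calls is independently slow with
    probability `per_call_slow` (an illustrative simplification).
    """
    if not 0.0 <= per_call_slow <= 1.0:
        raise ValueError("per_call_slow must be in [0, 1]")
    if fan_out < 0:
        raise ValueError("fan_out must be non-negative")
    # P(no slow call) = (1 - p) ** n, so P(at least one) is its complement.
    return 1.0 - (1.0 - per_call_slow) ** fan_out


# If 1% of calls to each backend are slow (slower than that backend's p99):
for n in (1, 10, 100):
    print(n, round(prob_request_hits_slow_call(0.01, n), 3))
# 1 0.01
# 10 0.096
# 100 0.634
```

A per-backend p99 problem that affects 1% of single calls touches roughly 63% of requests that fan out to a hundred backends, which is why the end-user tail is much heavier than any one backend's tail.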

Signals

Performance reported only as an average; SLOs defined on the mean.
Multi-service requests that are only as fast as their slowest call.
"It's fast on average" while users complain about occasional hangs.

Benefits

Targets the experience that loses users and exposes amplification in fan-out architectures.

Risks

Chasing ever-higher percentiles at disproportionate cost; treating a p999 that is dominated by rare, acceptable events as a defect.

Tensions

Tightening the tail is expensive and has diminishing returns; some tail latency is unavoidable, so the real choice is how far into the tail to optimize.

Examples

An endpoint with a 10 ms median but a 1 s p99; a page that aggregates twenty backend calls and is slow whenever any one of them is slow.
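The twenty-call page can be quantified with a small sketch. It assumes the calls run in parallel, are independent, and share one latency distribution, so the page takes as long as its slowest call and P(page <= t) = P(call <= t) ** 20. The hypothetical function `per_call_quantile_for_page` inverts that relation to find which per-call percentile a given page percentile lands on.

```python
def per_call_quantile_for_page(page_quantile: float, calls: int) -> float:
    """Per-call quantile whose latency equals the page's `page_quantile` latency.

    Assumes the page waits for `calls` parallel, independent, identically
    distributed backend calls, so P(page <= t) = P(call <= t) ** calls.
    """
    if not 0.0 < page_quantile < 1.0:
        raise ValueError("page_quantile must be in (0, 1)")
    if calls < 1:
        raise ValueError("calls must be at least 1")
    return page_quantile ** (1.0 / calls)


# The page's median is set by each backend's ~p96.6 latency...
print(round(per_call_quantile_for_page(0.5, 20), 4))   # 0.9659
# ...and a page-level p99 needs each backend to be fast out to ~p99.95.
print(round(per_call_quantile_for_page(0.99, 20), 4))  # 0.9995
# Share of page loads that include at least one call beyond the per-call p99.
print(round(1 - 0.99 ** 20, 3))                        # 0.182
```

Under these assumptions, even a typical (median) page load reflects the slowest few percent of individual backend calls, and about 18% of page loads wait on at least one call from a backend's worst 1%.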