Service Level Objectives

Categories
Systems
Sources
Site Reliability Engineering (Google)

A Service Level Objective (SLO) is an explicit target for a service's reliability, expressed over a service level indicator (SLI): a carefully chosen metric of user-visible health such as the fraction of requests served correctly and quickly. The SLO is the line the service is managed to. A Service Level Agreement (SLA) is the external, often contractual, promise; the SLO is the tighter internal target that keeps the SLA safe.

Why it Matters

It replaces "as reliable as possible" with a number the whole organization can reason about. Without a chosen target, reliability is argued by anecdote and every incident feels equally urgent. With one, reliability becomes a measurable property that can be traded against cost and velocity, and "good enough" is defined rather than assumed.

Signals

  • The team can state, in a number, how reliable the service is supposed to be, and measures whether it is meeting that.
  • Indicators track what users actually experience, not internal counters that look healthy while users suffer.
  • The objective is set just below the level users would notice, not at an unreachable 100%.

Benefits

A shared, quantified definition of "reliable enough," alerts and priorities anchored to user impact, and a foundation for spending risk deliberately through an error budget.

Risks

Choosing indicators that are easy to measure rather than meaningful to users; setting the target by aspiration instead of by what users need; chasing additional nines whose cost far exceeds their value. An SLO nobody enforces is decoration.

Tensions

Every nine of reliability costs more than the last, while the marginal value to users falls. The objective names the point where added reliability stops being worth it, which is a judgment about users and cost, not a maximization.

Examples

Targeting 99.9% of requests served under a latency threshold over a rolling month, measured from the user's edge rather than the server, and treating sustained breach as the trigger to slow change rather than ship features.