Toil

Categories: Systems
Sources: Site Reliability Engineering (Google)

Toil is operational work that is manual, repetitive, automatable, tactical, and devoid of enduring value, and that scales linearly with the size of the service. It is not the same as overhead or hard work: writing a one-off design is hard but not toil; clicking through the same recovery steps every week is. The defining trait is that doing it again produces nothing new.

Why It Matters

Toil that grows with the service eventually consumes a team's whole capacity, leaving no time for the engineering that would reduce it. The trap reinforces itself: the less time there is for automation, the more toil accumulates. Naming toil and capping the fraction of time spent on it protects the effort that compounds: automation and design that make the next unit of growth cheaper rather than costlier.
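A cap only works if the fraction is measured. The following is a minimal sketch of one way to do that, not a prescribed method: the `WorkLog` record, the function names, and the hour figures are hypothetical. The default cap of 0.5 is an assumption modeled on the goal stated in the Google SRE book of keeping toil below 50% of each engineer's time.

```python
from dataclasses import dataclass


@dataclass
class WorkLog:
    """Hours a team logged in one period (hypothetical record type)."""
    toil_hours: float
    engineering_hours: float


def toil_fraction(log: WorkLog) -> float:
    """Share of logged time spent on toil, between 0.0 and 1.0."""
    total = log.toil_hours + log.engineering_hours
    if total <= 0:
        # Nothing logged: report zero rather than dividing by zero.
        return 0.0
    return log.toil_hours / total


def over_cap(log: WorkLog, cap: float = 0.5) -> bool:
    """True when the toil share exceeds the cap (0.5 is an assumed default)."""
    return toil_fraction(log) > cap


# Illustrative figures only: 30 hours of toil against 10 of engineering.
quarter = WorkLog(toil_hours=30, engineering_hours=10)
print(toil_fraction(quarter), over_cap(quarter))
```

Tracking the fraction per period, rather than as a single snapshot, also exposes the trend: a share that rises as usage grows is the reinforcing trap in progress.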

Signals

  • The same manual procedure is run repeatedly, and more of it arrives as usage grows.
  • A team's time is increasingly spent reacting and decreasingly spent building.
  • Work that could be scripted is done by hand because there is never time to script it.

Benefits

Capping toil keeps capacity for engineering, so the team's leverage rises over time instead of being eaten by linear operational load. Work shifts from absorbing growth to enabling it.

Risks

Automating indiscriminately can cost more than the toil it removes, or hide the problem that generated the work. Some manual work is genuinely irreducible. The opposite failure is letting toil creep until firefighting is all the team does and there is no slack to escape it.

Tensions

Eliminating toil is an investment that competes with the very feature work whose growth produces the toil. Paying it down slows delivery today in exchange for not drowning later, which is the same deferred-payoff tension as design investment.

Examples

  • Replacing a weekly hand-run failover checklist with an automated, tested procedure.
  • Deciding not to automate a procedure run twice a year, because the automation would cost more to build and maintain than the toil it saves.
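The automate-or-not decision in these examples comes down to a break-even estimate. The sketch below is illustrative only: the function name, its parameters, and every hour figure are assumptions chosen to mirror the two cases, not measurements.

```python
def automation_net_savings(
    runs_per_year: float,
    hours_per_run: float,
    build_hours: float,
    maintenance_hours_per_year: float,
    horizon_years: float,
) -> float:
    """Hours of toil saved minus hours spent on automation over the horizon.

    A negative result means the automation costs more than the toil it removes.
    """
    toil_saved = runs_per_year * hours_per_run * horizon_years
    automation_cost = build_hours + maintenance_hours_per_year * horizon_years
    return toil_saved - automation_cost


# Assumed figures: a 1-hour procedure, 40 hours to build the automation,
# 5 hours per year to maintain it, evaluated over 2 years.
weekly_failover = automation_net_savings(52, 1.0, 40, 5, 2)
twice_yearly = automation_net_savings(2, 1.0, 40, 5, 2)

print(weekly_failover)  # 54.0 hours saved: worth automating
print(twice_yearly)     # -46.0 hours: the automation costs more than it saves
```

The estimate counts time only. It leaves out benefits such as fewer manual errors or faster recovery, which can justify automating a rarely run procedure even when the hours alone do not.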