Site Reliability Engineering

Betsy Beyer, Chris Jones, Jennifer Petoff & Niall Richard Murphy

Pages: 552
Year: 2016
Difficulty: Challenging
Themes: site reliability engineering, monitoring, incident response, capacity planning, production systems

The book that defined site reliability engineering as a discipline. Written by members of Google’s SRE team and edited by Beyer, it explains how Google builds, deploys, monitors, and maintains some of the largest software systems in the world.

Why Start Here

Site Reliability Engineering is Beyer’s most important editorial achievement and the work that established her reputation. She organized and shaped essays from dozens of Google engineers into a coherent book that covers everything from monitoring and alerting to incident response, capacity planning, and on-call rotations. The central argument is that reliability should be engineered with the same rigor as any product feature.

Google’s approach of setting error budgets, automating toil, and treating operations work as software development became the blueprint for how modern organizations think about running production systems. Before this book, these ideas lived inside Google. Beyer and her co-editors made them accessible to the entire industry.

What to Expect

The book is a 552-page collection of essays by Google engineers, organized into sections on principles, practices, and management. Writing quality varies across chapters, as is typical of a multi-author work, but the best chapters are exceptionally clear. This is not a book most people read cover to cover. Pick the chapters relevant to your situation and use the rest as reference. The sections on culture, incident management, and on-call practices are accessible to a broad audience, while other chapters require significant technical background.
