Foreword
	Preface
	
	Part Ⅰ. Introduction
	1. Introduction
	The Sysadmin Approach to Service Management
	Google's Approach to Service Management: Site Reliability Engineering
	Tenets of SRE
	The End of the Beginning
	2. The Production Environment at Google, from the Viewpoint of an SRE
	Hardware
	System Software That "Organizes" the Hardware
	Other System Software
	Our Software Infrastructure
	Our Development Environment
	Shakespeare: A Sample Service
	
	Part Ⅱ. Principles
	3. Embracing Risk
	Managing Risk
	Measuring Service Risk
	Risk Tolerance of Services
	Motivation for Error Budgets
	4. Service Level Objectives
	Service Level Terminology
	Indicators in Practice
	Objectives in Practice
	Agreements in Practice
	5. Eliminating Toil
	Toil Defined
	Why Less Toil Is Better
	What Qualifies as Engineering?
	Is Toil Always Bad?
	Conclusion
	6. Monitoring Distributed Systems
	Definitions
	Why Monitor?
	Setting Reasonable Expectations for Monitoring
	Symptoms Versus Causes
	Black-Box Versus White-Box
	The Four Golden Signals
	Worrying About Your Tail (or, Instrumentation and Performance)
	Choosing an Appropriate Resolution for Measurements
	As Simple as Possible, No Simpler
	Tying These Principles Together
	Monitoring for the Long Term
	Conclusion
	7. The Evolution of Automation at Google
	The Value of Automation
	The Value for Google SRE
	The Use Cases for Automation
	Automate Yourself Out of a Job: Automate ALL the Things!
	Soothing the Pain: Applying Automation to Cluster Turnups
	Borg: Birth of the Warehouse-Scale Computer
	Reliability Is the Fundamental Feature
	Recommendations
	8. Release Engineering
	The Role of a Release Engineer
	Philosophy
	……