SRE Principles and Culture¶

Site Reliability Engineering (SRE) is a set of engineering techniques for building and maintaining reliable systems. It originated at Google in 2003 under Ben Treynor, who described it this way: "SRE is what happens when a software engineer is tasked with what used to be called operations."

SRE vs DevOps¶

DevOps defines WHAT should be done (cultural movement, broad practices)
SRE defines HOW to implement reliability (prescriptive, engineering-focused)
SRE implements DevOps principles. SRE work includes coding: roughly 50% operations and 50% engineering projects, with 50% treated as an upper bound on operational work

Key Principles¶

Operations is a software problem - solve with engineering
Manage by SLOs - not "maximum reliability"
Error budgets - quantified acceptable unreliability
Toil reduction - automate repetitive manual work
Shared ownership - SREs co-own production with developers
Blame-free postmortems - focus on systemic fixes

Culture Pillars¶

Engineering Culture¶

Free exchange of ideas. No organizational consequences for problems as long as the SLO is maintained
"Everyone in the same boat" - shared responsibility
Blameless: the urge to blame is neurophysiological (a built-in human reaction) - build a culture that deliberately overcomes it

Measure Everything (Observability)¶

Know your system: where to look, what's happening, quickly
Simple-sounding but complex and expensive to implement

Blameless Postmortems¶

Finding blame is counterproductive
Open culture for discussing problems when they arise

Incident Management¶

Failures will always happen. Prepare with runbooks, escalation ladders, and incident classification
Well-classified incidents with known procedures can be automated, which leads toward self-healing systems
Simplest example: health checks with auto-restart
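
The health-check-with-auto-restart idea can be sketched in a few lines. This is a minimal illustration, not a production supervisor: `supervise`, `FakeService`, and the `failure_threshold` of 3 consecutive failures are hypothetical names and values chosen for the example, and the loop runs a fixed number of probes without sleeping so that it stays deterministic.

```python
from typing import Callable


def supervise(
    is_healthy: Callable[[], bool],
    restart: Callable[[], None],
    checks: int,
    failure_threshold: int = 3,
) -> int:
    """Probe the service `checks` times and restart it after
    `failure_threshold` consecutive failed probes.

    Returns the number of restarts performed. A real supervisor would
    sleep between probes and run indefinitely; both are omitted here.
    """
    consecutive_failures = 0
    restarts = 0
    for _ in range(checks):
        if is_healthy():
            consecutive_failures = 0  # any success resets the streak
            continue
        consecutive_failures += 1
        if consecutive_failures >= failure_threshold:
            restart()
            restarts += 1
            consecutive_failures = 0
    return restarts


class FakeService:
    """Hypothetical stand-in for a real process or container."""

    def __init__(self) -> None:
        self.healthy = False  # starts out broken

    def probe(self) -> bool:
        return self.healthy

    def restart(self) -> None:
        self.healthy = True  # assume a restart clears the fault


service = FakeService()
restarts = supervise(service.probe, service.restart, checks=10)
print(f"restarts={restarts}, healthy={service.healthy}")
```

Requiring several consecutive failures before restarting is one possible design choice; it avoids restarting on a single flaky probe at the cost of slower recovery.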

Key Metrics¶

Metric	Description
Deploy Frequency	How often new code ships. A decreasing frequency often signals a pipeline problem
MTTC	Mean Time To Commit - average time for a commit to reach production
Change Failure Rate	Proportion of changes causing failure
MTTA	Mean Time To Acknowledge (~5 min is a commonly cited target)
MTTR	Mean Time To Resolve (~1 hour is a commonly cited target)
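
As a rough illustration of how some of these metrics could be derived from incident records, the sketch below computes MTTA, MTTR, and change failure rate. The `Incident` record and its field names are assumptions made for the example, and both means are measured from the moment the incident was opened, which is one common convention rather than the only one.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Sequence


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; field names are illustrative."""
    opened: datetime        # alert fired
    acknowledged: datetime  # a responder accepted the page
    resolved: datetime      # service restored


def mtta(incidents: Sequence[Incident]) -> timedelta:
    """Mean time from alert to acknowledgement."""
    seconds = [(i.acknowledged - i.opened).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


def mttr(incidents: Sequence[Incident]) -> timedelta:
    """Mean time from alert to resolution."""
    seconds = [(i.resolved - i.opened).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


def change_failure_rate(failed_changes: int, total_changes: int) -> float:
    """Proportion of changes that caused a failure."""
    if total_changes <= 0:
        raise ValueError("total_changes must be positive")
    return failed_changes / total_changes


t0 = datetime(2024, 1, 1, 12, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=50)),
    Incident(t0, t0 + timedelta(minutes=6), t0 + timedelta(minutes=70)),
]
print(mtta(incidents), mttr(incidents), change_failure_rate(3, 20))
```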

Error Budget¶

100% - SLO = error budget

SLO 99.9% = 0.1% budget = 43.2 min of unavailability per 30-day month
Budget exhausted -> freeze feature releases, focus on reliability
Budget remaining -> deploy faster, take more risks
Error budget burn rate tracks consumption speed
Error budget policy defines actions at thresholds
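
The arithmetic behind the budget, the burn rate, and a threshold-based policy can be sketched as below. The function names are hypothetical, the 30-day window is an assumption, and the 75% warning threshold in `policy_action` is invented for illustration; only the "exhausted means freeze" rule comes from the notes above. Burn rate is taken here as the observed error rate divided by the budget fraction, so a value of 1.0 would consume the budget exactly over the window.

```python
def error_budget_fraction(slo_percent: float) -> float:
    """Error budget as a fraction: 100% - SLO."""
    return (100.0 - slo_percent) / 100.0


def budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed minutes of unavailability in the window (assumed 30 days)."""
    return error_budget_fraction(slo_percent) * window_days * 24 * 60


def burn_rate(observed_error_rate: float, slo_percent: float) -> float:
    """How fast the budget is being consumed relative to plan.

    1.0 means the budget lasts exactly one window; 2.0 means it is
    gone in half a window.
    """
    return observed_error_rate / error_budget_fraction(slo_percent)


def policy_action(budget_consumed: float) -> str:
    """Map the consumed share of the budget (0.0 to 1.0+) to an action.

    The 0.75 warning threshold is a hypothetical example value.
    """
    if budget_consumed >= 1.0:
        return "freeze feature releases, focus on reliability"
    if budget_consumed >= 0.75:
        return "slow down and review risky launches"
    return "ship normally"


print(round(budget_minutes(99.9), 1))     # 43.2
print(round(burn_rate(0.002, 99.9), 2))   # 2.0
print(policy_action(1.0))
```

A 0.2% error rate against a 99.9% SLO gives a burn rate of about 2, meaning the monthly budget would be spent in roughly 15 days if the rate held.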

Toil¶

Repetitive manual work that can be automated. Characteristics: manual, automatable, tactical, no enduring value, scales linearly with growth.

Toil ratio = time spent on toil / time spent on all tasks. Target: below 50%.
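
A minimal sketch of tracking the toil ratio from a task log follows. It assumes the ratio is measured in hours and that each task has already been tagged as toil or not; the `Task` record, the sample tasks, and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable

TOIL_TARGET = 0.5  # target from the text: keep toil below 50%


@dataclass(frozen=True)
class Task:
    """Hypothetical work-log entry."""
    name: str
    hours: float
    is_toil: bool


def toil_ratio(tasks: Iterable[Task]) -> float:
    """Hours spent on toil divided by hours spent on all tasks."""
    tasks = list(tasks)
    total = sum(t.hours for t in tasks)
    if total <= 0:
        raise ValueError("no hours logged")
    return sum(t.hours for t in tasks if t.is_toil) / total


def within_target(ratio: float) -> bool:
    return ratio < TOIL_TARGET


week = [
    Task("manual certificate rotation", 6, is_toil=True),
    Task("ticket-driven restarts", 8, is_toil=True),
    Task("build autoscaling controller", 26, is_toil=False),
]
ratio = toil_ratio(week)
print(f"toil ratio: {ratio:.0%}, within target: {within_target(ratio)}")
```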

Automation Maturity Levels¶

No automation - fully manual
System-specific scripts - written and kept by individual admins
Generic shared automation - tools shared across teams
Built-in automation - system self-manages
Autonomous systems - self-healing, self-scaling

SRE Team Models¶

Model	Description
Infrastructure SRE	Shared platform team. Unified approach. Risk: disconnected from product
Product SRE	Embedded with product team. Deep knowledge. Risk: siloed practices
Consulting SRE	Temporary engagement. Helps teams improve. Moves on once the SLO is met

When SRE Is NOT Needed¶

Service already works well
Service is unimportant
Service is legacy being decommissioned

Anti-patterns¶

"SRE solves all problems" - SRE is techniques; people solve problems
"SRE is a silver bullet" - works only with certain systems at certain lifecycle stages
"SRE is part of DevOps team" - goals differ, subordination is problematic
Launching SRE without commitment from ALL departments

Gotchas¶

Excessive reliability is wasteful - SRE ensures "reliable enough" for business
SRE requires commitment from all departments, not just engineering
Know-Do-Learn cycles: constant change is the new reality