Quality Attributes - Reliability and Fault Tolerance¶

Reliability = system operates correctly under specified conditions over time, including recovery from failures. Three pillars: availability, fault tolerance, recoverability.

Availability¶

Percentage of time system is accessible. Uptime / total observation time.

SLA	Downtime/Year	Classification
99.999%	~5 min	Five nines (ideal)
99.99%	~52 min	Four nines
99.9%	~8.76 hours	Standard
99%	~3.65 days	Basic
95%	~18.25 days	Poor

Availability Tools¶

Tool	Purpose
Auto-scaling	Dynamically add/remove instances based on metrics
Load balancer	Distribute traffic across instances
CDN	Geographic distribution for static content
Caching	Reduce backend load
Rate limiting	Prevent overload
Service discovery	Auto-detect services (Consul, etcd, ZooKeeper)
Health checks	Active probes or passive heartbeats
Connection pooling	Reuse pre-established connections
Hot standby	Backup runs in sync, immediate failover
Cold standby	Backup off, needs startup time

Fault Tolerance¶

System continues operating during partial failures.

Key Metrics¶

Metric	Measures	Ideal	Normal	Poor
MTTF	Time between failures	2 years	6 months	1 week
MTTR	Time to restore	5 min	30 min	3 days

Fault Tolerance Patterns¶

Failover: - Active-Passive - one active, standby waits - Active-Active - all handle traffic simultaneously

Bulkhead: Isolate resources so one failure doesn't cascade. Thread-level or process-level isolation.

Timeouts + Retries: Exponential backoff with jitter to prevent thundering herd.

Graceful Degradation: Disable non-critical features. If recommendations service down, purchases still work.

Chaos Engineering: Intentionally inject failures. Netflix Chaos Monkey randomly kills production services to verify self-healing.

Organizational Practices¶

Recovery plans - pre-prepared scenarios, regularly tested
Post-mortem analysis - timeline, root cause, actions taken, communication effectiveness

Recoverability¶

Ability to return to operational state after failure.

Key Metrics¶

Metric	Measures	Ideal	Normal	Poor
RTO	Max acceptable downtime	5 min	1 hour	3 days
RPO	Max acceptable data loss	1 min	15 min	1 day

Recovery Tools¶

Automated recovery - scripts + orchestration (Kubernetes auto-restarts)
Rollback - quickly revert to previous stable version
Database rollback - transaction rollback mechanisms
Configuration management - Ansible for config versioning

Key principle: preparation is everything. Pre-create scripts and scenarios for fast recovery.

Problem-to-Tool Mapping¶

Problem	Solution
Server crashes, manual restart needed	Health checks + auto-restart
Traffic spikes cause degradation	Auto-scaling
Microservice down, can't find standby	Service discovery
All requests hit single server	Load balancer
High latency for remote users	CDN + geographic distribution
Database overloaded by queries	Caching
Single client floods API	Rate limiting
Single server failure = total outage	Hot/cold standby

Architect's Role¶

The architect's value is deep understanding of problem domain and business needs, not knowing every technical detail. Core question: "what and why" - implementation details delegated to dev teams.

Technical decisions are interconnected with organizational culture, business strategy, regulatory requirements. All reliability metrics vary by use case, system criticality, and business requirements.

Gotchas¶

Five nines costs exponentially more than three nines - each additional nine roughly 10x the cost
MTTR matters more than MTTF in practice - failures are inevitable, fast recovery is the goal
Hot standby without testing - standby that was never tested may not work when needed. Test failover regularly
Chaos engineering without monitoring - injecting failures without observing results is just breaking things
SLA != SLO != SLI - SLI measures, SLO targets, SLA is the contract with consequences

Quality Attributes - Reliability and Fault Tolerance¶

Availability¶

Availability Tools¶

Fault Tolerance¶

Key Metrics¶

Fault Tolerance Patterns¶

Organizational Practices¶

Recoverability¶

Key Metrics¶

Recovery Tools¶

Problem-to-Tool Mapping¶

Architect's Role¶

Gotchas¶

See Also¶