Skip to content

Quality Attributes - Reliability and Fault Tolerance

Reliability = system operates correctly under specified conditions over time, including recovery from failures. Three pillars: availability, fault tolerance, recoverability.

Availability

Percentage of time system is accessible. Uptime / total observation time.

SLA Downtime/Year Classification
99.999% ~5 min Five nines (ideal)
99.99% ~52 min Four nines
99.9% ~8.76 hours Standard
99% ~3.65 days Basic
95% ~18.25 days Poor

Availability Tools

Tool Purpose
Auto-scaling Dynamically add/remove instances based on metrics
Load balancer Distribute traffic across instances
CDN Geographic distribution for static content
Caching Reduce backend load
Rate limiting Prevent overload
Service discovery Auto-detect services (Consul, etcd, ZooKeeper)
Health checks Active probes or passive heartbeats
Connection pooling Reuse pre-established connections
Hot standby Backup runs in sync, immediate failover
Cold standby Backup off, needs startup time

Fault Tolerance

System continues operating during partial failures.

Key Metrics

Metric Measures Ideal Normal Poor
MTTF Time between failures 2 years 6 months 1 week
MTTR Time to restore 5 min 30 min 3 days

Fault Tolerance Patterns

Failover: - Active-Passive - one active, standby waits - Active-Active - all handle traffic simultaneously

Bulkhead: Isolate resources so one failure doesn't cascade. Thread-level or process-level isolation.

Timeouts + Retries: Exponential backoff with jitter to prevent thundering herd.

Graceful Degradation: Disable non-critical features. If recommendations service down, purchases still work.

Chaos Engineering: Intentionally inject failures. Netflix Chaos Monkey randomly kills production services to verify self-healing.

Organizational Practices

  • Recovery plans - pre-prepared scenarios, regularly tested
  • Post-mortem analysis - timeline, root cause, actions taken, communication effectiveness

Recoverability

Ability to return to operational state after failure.

Key Metrics

Metric Measures Ideal Normal Poor
RTO Max acceptable downtime 5 min 1 hour 3 days
RPO Max acceptable data loss 1 min 15 min 1 day

Recovery Tools

  • Automated recovery - scripts + orchestration (Kubernetes auto-restarts)
  • Rollback - quickly revert to previous stable version
  • Database rollback - transaction rollback mechanisms
  • Configuration management - Ansible for config versioning

Key principle: preparation is everything. Pre-create scripts and scenarios for fast recovery.

Problem-to-Tool Mapping

Problem Solution
Server crashes, manual restart needed Health checks + auto-restart
Traffic spikes cause degradation Auto-scaling
Microservice down, can't find standby Service discovery
All requests hit single server Load balancer
High latency for remote users CDN + geographic distribution
Database overloaded by queries Caching
Single client floods API Rate limiting
Single server failure = total outage Hot/cold standby

Architect's Role

The architect's value is deep understanding of problem domain and business needs, not knowing every technical detail. Core question: "what and why" - implementation details delegated to dev teams.

Technical decisions are interconnected with organizational culture, business strategy, regulatory requirements. All reliability metrics vary by use case, system criticality, and business requirements.

Gotchas

  • Five nines costs exponentially more than three nines - each additional nine roughly 10x the cost
  • MTTR matters more than MTTF in practice - failures are inevitable, fast recovery is the goal
  • Hot standby without testing - standby that was never tested may not work when needed. Test failover regularly
  • Chaos engineering without monitoring - injecting failures without observing results is just breaking things
  • SLA != SLO != SLI - SLI measures, SLO targets, SLA is the contract with consequences

See Also