Chaos Engineering and Reliability Testing¶

Reliability testing validates system behavior under stress and failure. Chaos engineering purposefully injects failures to discover weaknesses before they cause outages.

Infrastructure Testing Pyramid¶

Config validation (unit) - syntax checks, schema validation
Infrastructure integration - service interaction tests
System-level - full stack validation
Chaos - failure injection

Reliability Testing Types¶

Type	Purpose	Tools
Load testing	Sustained expected load, verify SLOs	k6, Locust, JMeter
Stress testing	Beyond expected load, find breaking point	k6, Gatling
Soak testing	Extended duration, detect leaks/exhaustion	k6, custom
Chaos engineering	Inject failures, verify resilience	Chaos Monkey, Litmus, Gremlin

Chaos Engineering¶

Prerequisites¶

Working incident management process
Culture of learning from errors (blameless)
Trust within organization
Good observability/monitoring

Principles¶

Define steady state (normal SLI values)
Hypothesize that steady state continues during experiment
Introduce real-world events (kill pod, network partition, disk full)
Try to disprove hypothesis
Minimize blast radius (start small, automated kill switch)

Tools¶

Chaos Monkey (Netflix) - randomly terminates instances
Litmus - Kubernetes-native chaos
Gremlin - SaaS chaos platform
PowerfulSeal - K8s pod/node failure injection

Game Days¶

Planned reliability exercises simulating real incidents:

Define scenario and scope
Notify relevant teams
Execute simulated failure
Observe team response
Debrief and document lessons

Regular cadence (monthly/quarterly). Tests runbooks, alerting, communication.

Infrastructure Testing Tools¶

InSpec (Compliance)¶

control 'sshd-8' do
  impact 0.6
  title 'Server: Configure the service port'
  describe sshd_config do
    its('Port') { should cmp 22 }
  end
end

Testinfra (Python + Pytest)¶

def test_access(host):
    addr = host.addr("google.com")
    assert addr.is_resolvable
    assert addr.port(443).is_reachable

Configuration Validation¶

nginx -t -c nginx.conf
terraform validate
ansible --check
kubectl apply --dry-run=client -f deploy.yml
yamllint file.yaml
python -m json.tool sample.json

Resilience Patterns¶

Circuit Breaker¶

Three states: CLOSED (normal) -> OPEN (fail fast) -> HALF-OPEN (probe recovery).

@CircuitBreaker(name = "service", fallbackMethod = "fallback")
public Response callService() { ... }

Load Shedding¶

Reject requests when at capacity. Return 503. Better to serve 80% successfully than 100% slowly.

Rate Limiting¶

Per-client (token bucket, sliding window)
Global (distributed counter in Redis)
Adaptive (adjust based on current load)
Back-pressure propagation

Graceful Degradation¶

Return cached/stale data when backend unavailable
Disable non-critical features under load
Priority queues for request processing

Performance Metrics¶

APDEX (Application Performance Index)¶

Standardized user satisfaction: Satisfied (4T).

Score = (Satisfied + Tolerating/2) / Total. Range 0-1.

Percentiles over Averages¶

p50, p95, p99 for latency. Averages hide outliers and tail latency.

Gotchas¶

Chaos engineering without working incident response is irresponsible
Too large blast radius = real outage, not experiment
Load testing should include realistic data and access patterns
Soak tests are often skipped but catch memory leaks that short tests miss
siege -c 50 -t 30s http://localhost:80 for quick load testing