Site Reliability Engineering Foundations
Adopt SRE principles to build services that are measurably reliable and recoverable by any engineer.
CREATED BY
Ananya S. ★ 4.9
Senior Software Engineer at ConnectIn | 7+ years of experience
About this Path
For DevOps engineers, platform engineers, and senior software engineers transitioning into SRE roles. You will learn the Google SRE model, from SLOs and error budgets through incident command and chaos engineering to reducing operational toil with automation. You will leave with the vocabulary, tooling patterns, and decision frameworks expected in SRE interviews at product-scale companies.
Path Overview
Intermediate Level
Certificate of Completion
About 44 hours to complete
English language
20+ curated videos
Learn online at your own pace
6 modules with resources
Gamified & interactive
Path Curriculum
The Google SRE Model
Error budget philosophy, SRE vs DevOps, and the operational contract between Eng and SRE.
Toil: Definition & Measurement
Quantifying toil as % of sprint time and the 50% ceiling rule from the SRE Book.
Reliability as a Feature
Framing reliability work in product roadmaps, scoring risk with failure mode analysis.
SRE Team Models
Embedded, consulting, and platform SRE structures — trade-offs at different org sizes.
Defining Meaningful SLIs
Request success rate, latency percentiles, freshness, and durability SLI templates.
Setting Realistic SLO Targets
Starting from user pain thresholds, not nines; iterating SLOs quarterly from real data.
Error Budget Policies
Freeze deploys, slow down features, run chaos — the tiered response to budget exhaustion.
Multi-Window Burn Rate Alerting
5m/1h/6h alert windows that catch both fast burns and slow bleeds while keeping false positives low.
On-Call Design & Rotation Health
Alert volume targets, escalation policies, and PagerDuty/OpsGenie scheduling patterns.
Incident Command System for Tech Teams
IC, Comms Lead, and Operations Lead roles — keeping the room calm during SEV-1.
Blameless Postmortems
The five-why structure, contributing factor mapping, and action-item ownership tracking.
Incident Metrics: MTTD, MTTR, MTBF
Tracking and trending detection, response, and recurrence rates over rolling quarters.
Runbook Automation with Ansible & Scripts
Converting manual runbook steps into idempotent scripts triggered by alert webhooks.
Self-Healing with Kubernetes Operators
Custom controllers that detect and remediate drift, restart unhealthy pods, and scale proactively.
CI/CD Reliability Gates
Automated canary analysis, rollback triggers on SLO regression, and deployment freeze policies.
Toil Tracking Dashboard
Measuring engineering hours on reactive ops work week-over-week as a team health metric.
Chaos Engineering Principles
Hypothesis-driven experiments: steady state, variable injection, blast radius control.
Chaos Mesh on Kubernetes
Injecting pod failures, network delays, CPU stress, and DNS errors in staging environments.
Game Days
Structured drills simulating datacenter failure, dependency outage, and data corruption scenarios.
Dependency & Fallback Analysis
Circuit breakers, bulkheads, and graceful degradation patterns validated through chaos tests.
Load Testing with k6 & Locust
Scripting realistic traffic profiles, ramping strategies, and interpreting p50/p95/p99 results.
Traffic Forecasting
Seasonal decomposition, linear extrapolation, and when to provision 3× headroom vs 1.5×.
Cost-Reliability Trade-offs
Right-sizing Kubernetes requests/limits, spot instance risk modeling, and multi-AZ overhead.
Performance Regression Pipelines
Automated benchmark jobs in CI that fail PRs when p99 latency degrades beyond threshold.
What you'll learn
- ✓ Define SLIs, SLOs, and error budgets for any service and use them to make data-driven reliability investment decisions.
- ✓ Design and rehearse an incident response process including on-call rotations, escalation paths, and blameless postmortems.
- ✓ Reduce operational toil by 50% through automation playbooks, runbooks, and self-healing Kubernetes controllers.
- ✓ Implement chaos engineering experiments with Chaos Monkey or Chaos Mesh to validate failure assumptions before incidents do.
- ✓ Apply capacity planning techniques (load testing, traffic forecasting, and headroom analysis) to avoid surprise saturation.
- ✓ Instrument services with the four golden signals and build SLO-based alerting that eliminates noisy threshold alerts.
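
To make the error budget and burn-rate vocabulary used on this page concrete, here is a minimal Python sketch of the underlying arithmetic. The function names are hypothetical, and the example figures (a 99.9% SLO, a 30-day SLO period, 2% of the budget spent within one hour) are illustrative assumptions, not values taken from this Path.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests that may fail while still meeting the SLO."""
    return 1.0 - slo_target


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error budget.

    A value of 1.0 means the budget would be used up exactly at the
    end of the SLO period; higher values mean it runs out sooner.
    """
    if total == 0:
        return 0.0  # no traffic in the window, so nothing was burned
    return (failed / total) / error_budget(slo_target)


def burn_rate_threshold(budget_fraction: float,
                        period_hours: float,
                        window_hours: float) -> float:
    """Burn rate that spends `budget_fraction` of the whole budget
    within `window_hours` of a `period_hours` SLO period."""
    return budget_fraction * period_hours / window_hours


def should_page(short_rate: float, long_rate: float, threshold: float) -> bool:
    """Multi-window check: page only if BOTH windows exceed the threshold.

    The long window confirms the burn is sustained; the short window
    confirms it is still happening right now.
    """
    return short_rate >= threshold and long_rate >= threshold


# Assumed example: 99.9% SLO over 30 days, page when 2% of the
# budget would be spent in one hour.
threshold = burn_rate_threshold(0.02, period_hours=30 * 24, window_hours=1)
short = burn_rate(failed=18, total=1_000, slo_target=0.999)    # e.g. last 5 minutes
long = burn_rate(failed=160, total=10_000, slo_target=0.999)   # e.g. last hour
print(round(threshold, 1), should_page(short, long, threshold))
```

Requiring both windows to agree is what keeps a brief spike from paging anyone: a short window alone fires on transient blips, while a long window alone keeps firing well after the problem has stopped.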