Switch from Support to SRE
Translate your incident expertise into SRE engineering skills and land your first reliability engineering role.
CREATED BY
Dev R. ★ 4.8
Senior Data Engineer at StreamBase | 10+ years of experience
About this Path
Purpose-built for L1/L2/L3 support engineers who live in incident queues and want to move into SRE or Platform Engineering. You already understand production pain; this path teaches you to fix it systematically with SLOs, automation in Python and Bash, infrastructure-as-code, container observability, and postmortem culture. You finish with a GitHub portfolio and the vocabulary to pass a technical SRE screen.
Path Overview
Intermediate Level
Certificate of Completion
About 48 hours to complete
English language
20+ curated videos
Learn online at your own pace
6 modules with resources
Gamified & interactive
Path Curriculum
Support Engineer to SRE: what actually changes
Map your existing incident skills to the SRE job ladder and identify gaps.
SLIs, SLOs, and Error Budgets from Scratch
Define availability, latency, and throughput SLIs and set realistic SLO targets.
Error Budget Policy: when to freeze releases
Write a policy that halts feature work when less than 10% of the error budget remains.
Blameless Postmortems and the Five Whys
Facilitate a postmortem and produce action items with owners and due dates.
Identifying and Measuring Toil in Your Queue
Categorize tickets by type and calculate hours per week lost to repetitive tasks.
Python Scripting for Ops: boto3, requests, and subprocess
Automate AWS resource queries, webhook calls, and shell command execution.
Writing Robust Bash Scripts with Error Handling
Use set -euo pipefail, traps, and logging to make scripts safe in production.
Building a Simple CLI Tool and Publishing it Internally
Package a Python script as a CLI with argparse and document it in a README.
Terraform Core Workflow: init, plan, apply, destroy
Provision an EC2 instance and VPC from scratch and inspect the state file.
Modules, Variables, and Remote State with S3
Refactor inline config into reusable modules and store state in a shared backend.
Terraform in CI: plan on PR, apply on merge
Wire a GitHub Actions workflow to show a plan diff before any infrastructure change.
Prometheus Data Model and PromQL Fundamentals
Write queries with rate() and histogram_quantile(), and define recording rules for SLO dashboards.
Grafana Dashboard Design for SRE
Build a USE method (Utilization, Saturation, Errors) dashboard from real metrics.
Structured Logging with Python and the ELK Stack
Emit JSON logs, ship them to Elasticsearch, and build a Kibana error-rate view.
Distributed Tracing with OpenTelemetry and Jaeger
Instrument a Python service and trace a request across three microservices.
Docker Essentials Refresher for SREs
Build a container image and debug it with exec, logs, and inspect in 30 minutes.
Kubernetes Architecture from an Operator Perspective
Understand control plane, kubelet, etcd, and how they relate to incident diagnosis.
kubectl Survival Guide: get, describe, logs, exec, top
Diagnose CrashLoopBackOff, OOMKilled, and Pending pods with five commands.
Liveness, Readiness, and Startup Probes
Configure probes to prevent bad deploys from receiving traffic prematurely.
Building Your SRE Portfolio: three GitHub projects to have
Build three projects: a monitored app deployed on Kubernetes, an IaC repo, and a toil-reduction script.
SRE Interview Formats: design, troubleshooting, and coding screens
Walk through sample questions on error budgets, postmortems, and Python scripting.
Negotiating the SRE Title and Compensation Jump
Position your support background as production depth and anchor your number correctly.
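To give a feel for the arithmetic behind the SLO and error budget lessons in the curriculum above, here is a minimal Python sketch. It assumes a request-based availability SLI and a policy that freezes releases when less than 10% of the budget remains. The function names and the sample numbers are hypothetical and are not taken from the course material.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent).

    slo_target is a fraction such as 0.999 for a 99.9% availability SLO.
    """
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures <= 0:
        raise ValueError("no error budget: SLO target is 100% or there was no traffic")
    return 1.0 - failed_requests / allowed_failures


def should_freeze_releases(remaining: float, threshold: float = 0.10) -> bool:
    """Hypothetical policy check: freeze when less than `threshold` of the budget is left."""
    return remaining < threshold


if __name__ == "__main__":
    # Illustrative numbers only: 1,000,000 requests in the window, 950 of them failed.
    remaining = error_budget_remaining(0.999, 1_000_000, 950)
    print(f"budget remaining: {remaining:.1%}, freeze: {should_freeze_releases(remaining)}")
```

With a 99.9% target, one million requests allow roughly 1,000 failures, so 950 failures leave about 5% of the budget and the sketch would signal a freeze.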
What you'll learn
- ✓ Define SLIs, SLOs, and error budgets for a real service and use them to prioritize reliability work over feature requests.
- ✓ Write production-grade Python and Bash automation scripts that replace repetitive toil with reliable, tested tooling.
- ✓ Provision and manage cloud infrastructure using Terraform, applying the same change-control discipline you use for incidents.
- ✓ Instrument applications with Prometheus metrics and build Grafana dashboards that surface error rate, latency, and saturation.
- ✓ Lead blameless postmortems and write action items that durably fix root causes instead of patching symptoms.
- ✓ Containerize a workload with Docker and deploy it to a Kubernetes cluster, interpreting pod events and logs to diagnose failures.
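As a small taste of the toil-measurement and Python automation outcomes listed above, the sketch below totals time spent per ticket category and converts it to hours per week. The ticket format (a category string plus minutes spent) and the function name are assumptions made for illustration; a real queue export will look different.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def toil_hours_per_week(tickets: Iterable[Tuple[str, float]], weeks: float) -> Dict[str, float]:
    """Average hours per week spent on each ticket category.

    tickets: (category, minutes_spent) pairs; this shape is a hypothetical export format.
    weeks: length of the sampled period in weeks.
    """
    if weeks <= 0:
        raise ValueError("weeks must be positive")
    minutes_by_category = defaultdict(float)
    for category, minutes_spent in tickets:
        minutes_by_category[category] += minutes_spent
    # Convert total minutes to hours, then average over the sampled weeks.
    return {category: total / 60 / weeks for category, total in minutes_by_category.items()}


if __name__ == "__main__":
    # Made-up sample data covering one week.
    sample = [("password_reset", 15), ("password_reset", 15), ("disk_cleanup", 60)]
    for category, hours in toil_hours_per_week(sample, weeks=1).items():
        print(f"{category}: {hours:.1f} h/week")
```

Sorting the result by hours is one simple way to pick the first automation target.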