Observability: Metrics, Logs & Traces
Instrument, correlate, and alert on production systems so on-call means fixing, not guessing.
CREATED BY
Arnav K. ★ 5.0
SSE at MegaMart Tech | 8+ years of experience
About this Path
For engineers who have used Grafana dashboards or Kibana but have never designed an observability stack from scratch. You will learn the three pillars (metrics, logs, and traces) and how to wire them together with OpenTelemetry, Prometheus, and distributed tracing backends. By the end, you will be able to pinpoint the root cause of a p95 latency spike in under five minutes.
Path Overview
Intermediate Level
Certificate of Completion
About 32 hours to complete
English language
14+ curated videos
Learn online at your own pace
5 modules with resources
Gamified & interactive
Path Curriculum
Pillars: Metrics, Logs, Traces
What each pillar answers, their cost trade-offs, and why all three together matter.
The OpenTelemetry Data Model
Signals, resources, attributes, and the OTLP wire format that replaces vendor agents.
RED & USE Methodology
Mapping Rate/Errors/Duration to service health and Utilization/Saturation/Errors to infrastructure.
Cardinality & Cost Control
Why high-cardinality labels break Prometheus and how to design labels you can afford.
Instrumentation with prometheus-client
Counter, Gauge, Histogram, Summary — choosing the right type and recording correct units.
PromQL Fundamentals
rate(), increase(), histogram_quantile(), label_replace() — the queries every SRE must know.
Alerting with Alertmanager
Routing trees, inhibition, silences, and grouping to avoid alert storms on cascading failures.
Recording Rules & Federation
Pre-computing expensive queries for dashboards, and federating across multiple Prometheus shards.
JSON Log Schema Design
Mandatory fields: timestamp, level, service, trace_id, span_id, request_id, error.
Log Aggregation with Loki or Elasticsearch
Ship logs via Promtail / Fluent Bit, indexing strategies, retention policies, and cost.
Log-Based Alerting & Anomaly Detection
LogQL metric queries, Elasticsearch Watcher, and pattern-change detection for error spikes.
Trace Context Propagation
The W3C Trace Context traceparent header, B3 propagation, and why context must cross every async boundary.
Auto-Instrumentation vs Manual Spans
OpenTelemetry auto-instrumentation for HTTP/DB vs adding custom spans for business logic.
Sampling Strategies
Head-based, tail-based, and adaptive sampling — balancing coverage against storage cost.
Jaeger & Tempo Deep Dive
Service maps, dependency graphs, span attribute search, and exemplar linking to Prometheus.
Grafana Dashboard Design
Service overview → resource saturation → trace drill-down — a three-layer drill pattern.
Defining SLIs and SLOs
Availability, latency, and error SLIs expressed as PromQL with 28-day rolling windows.
Error Budgets & Burn Rate Alerts
Multi-window, multi-burn-rate alerting from the Google SRE Workbook, wired to PagerDuty.
Runbook-Driven Incident Response
Linking alerts to runbooks, blameless postmortem templates, and MTTR tracking dashboards.
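To give a flavor of the Error Budgets & Burn Rate Alerts lesson, here is a minimal sketch of the multi-window, multi-burn-rate check in plain Python. It is an illustration, not course material: the `BurnRule` type, the `firing_rules` helper, and the input dictionary of per-window error ratios are hypothetical names, and the thresholds (14.4 for 1h/5m, 6 for 6h/30m) are the example values the Google SRE Workbook gives for a 30-day window, so they would need rescaling for a 28-day window. In practice the error ratios would come from PromQL queries rather than a dictionary.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRule:
    """One long/short window pair and the burn rate that should page."""
    long_window: str
    short_window: str
    threshold: float


# Example thresholds from the SRE Workbook (assumes a 30-day SLO window).
RULES = (
    BurnRule(long_window="1h", short_window="5m", threshold=14.4),
    BurnRule(long_window="6h", short_window="30m", threshold=6.0),
)


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def firing_rules(error_ratios: dict[str, float], slo_target: float) -> list[BurnRule]:
    """Return rules where BOTH the long and the short window exceed the threshold.

    Requiring the short window too lets the alert clear soon after the
    problem stops, instead of waiting for the long window to drain.
    """
    return [
        rule
        for rule in RULES
        if burn_rate(error_ratios[rule.long_window], slo_target) > rule.threshold
        and burn_rate(error_ratios[rule.short_window], slo_target) > rule.threshold
    ]


if __name__ == "__main__":
    # Hypothetical measurements: fraction of failed requests per window.
    observed = {"1h": 0.02, "5m": 0.03, "6h": 0.004, "30m": 0.001}
    for rule in firing_rules(observed, slo_target=0.999):
        print(f"page: burn rate above {rule.threshold} over {rule.long_window}")
```

With a 99.9% target the error budget is 0.1%, so a 2% error ratio over one hour is a burn rate of about 20 and trips the fast-burn rule, while 0.4% over six hours (about 4) stays under the slow-burn threshold.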
What you'll learn
- ✓ Instrument any service with the OpenTelemetry SDK to emit traces, metrics, and logs from a single agent.
- ✓ Design Prometheus metric schemas using counters, gauges, histograms, and summaries without cardinality explosions.
- ✓ Write PromQL queries that expose RED (Rate, Errors, Duration) and USE (Utilization, Saturation, Errors) signals.
- ✓ Correlate a distributed trace in Jaeger or Tempo back to the exact log lines and metric anomalies that explain it.
- ✓ Build Grafana dashboards with drill-down from service SLO overview to individual request span waterfalls.
- ✓ Define SLOs, error budgets, and alert routing policies that minimize alert fatigue while catching real incidents.
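As a taste of how trace-to-log correlation works, the sketch below parses a W3C Trace Context traceparent header and stamps its IDs onto a JSON log line that carries the mandatory fields listed in the curriculum. It uses only the Python standard library and is an illustrative assumption, not the OpenTelemetry SDK: the function names `new_traceparent`, `parse_traceparent`, and `log_line` are hypothetical, the `message` field is an addition for readability, and the parser deliberately accepts only version 00 of the header.

```python
import json
import re
import secrets
from datetime import datetime, timezone

# Version 00 layout: version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" flags (2 hex)
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Build a fresh traceparent value with random trace and span IDs."""
    trace_id = secrets.token_hex(16)  # 16 bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 bytes -> 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header):
    """Return the IDs from a version-00 traceparent, or None if it is invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are defined as invalid by the specification.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def log_line(level, service, message, traceparent, request_id, error=None):
    """Emit one JSON log record that can be joined to a trace by trace_id."""
    ctx = parse_traceparent(traceparent) or {}
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "service": service,
        "message": message,
        "trace_id": ctx.get("trace_id"),
        "span_id": ctx.get("span_id"),
        "request_id": request_id,
        "error": error,
    }
    return json.dumps(record)


if __name__ == "__main__":
    header = new_traceparent()
    print(log_line("error", "checkout", "payment declined", header, "req-123",
                   error="CardDeclined"))
```

Because every log record carries the same trace_id that the tracing backend stores, a search for that ID in Loki or Elasticsearch returns the log lines belonging to one request.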