Become a Data Engineer
Build production data pipelines and lakehouse platforms that reliably feed analysts, ML models, and executives.
CREATED BY
Punit D. [PLACEHOLDER] ★ 5.0
Senior Product Designer at [BRAND] | 7+ years of experience
About this Path
For software engineers or analysts moving into data engineering roles at data-driven product companies. You will design and operate ELT pipelines using dbt and Airflow, architect lakehouses on Iceberg and Delta Lake, master Spark for large-scale processing, and build data-quality frameworks, all in preparation for mid-to-senior data engineering interviews.
Path Overview
Intermediate Level
Certificate of Completion
About 48 hours to complete
English language
16+ curated videos
Learn online at your own pace
5 modules with resources
Gamified & interactive
Path Curriculum
Airflow 2: DAG Design, TaskFlow API & Dynamic DAGs
XCom patterns, Dataset-based scheduling, executor types for scale.
dbt Core: Models, Tests & Incremental Strategies
Merge vs append incremental, snapshot SCD2, dbt exposures for lineage.
Ingestion Patterns: CDC with Debezium & Fivetran Connectors
Kafka CDC pipeline, log-based vs query-based CDC trade-offs, watermarks.
Pipeline Reliability: Idempotency, Retries & Dead-Letter Queues
At-least-once vs exactly-once semantics, backfill strategies, failure alerting.
Apache Iceberg: Table Format, Metadata & Hidden Partitioning
Snapshot isolation, partition evolution, compaction strategies for small files.
Delta Lake: ACID Transactions, Z-Ordering & Liquid Clustering
Optimistic concurrency, MERGE patterns, Delta Change Data Feed.
Table Governance: Unity Catalog, AWS Glue & Apache Polaris
Fine-grained access control, cross-engine catalog federation, lineage graphs.
Spark Internals: DAG Execution, Stages & Shuffle Optimisation
Adaptive Query Execution, skew handling with salting, broadcast join thresholds.
PySpark DataFrames, Pandas UDFs & Arrow Optimisation
Vectorised UDFs, predicate pushdown, Catalyst optimizer hints.
Spark Structured Streaming & Kafka Integration
Trigger modes, watermarking for late data, stateful aggregations with RocksDB.
Spark on Kubernetes & Cost Tuning
Dynamic resource allocation, spot node tolerance, Graviton Spark benchmarks.
Dimensional Modeling: Star Schema, Slowly Changing Dimensions
SCD Type 1/2/4 patterns in dbt, conformed dimensions across subject areas.
Data Vault 2.0: Hubs, Links & Satellites
Hash key strategy, load dating, PIT (point-in-time) and bridge tables for performance.
Query Performance in Snowflake, BigQuery & Redshift
Clustering keys, partition elimination, materialized views, query profiling.
Great Expectations & Soda Core: Expectations & Data Contracts
Suite design, checkpoint pipelines, contract enforcement between producer and consumer.
Data Observability: Lineage, Freshness & Volume Anomalies
Monte Carlo field-level lineage, anomaly thresholds, on-call alerting playbook.
DE Interview Scenarios: Pipeline Design & SQL Deep Dives
Window functions, recursive CTEs, pipeline debugging questions with model answers.
Data Engineering System Design: End-to-End Walkthrough
Real-time analytics platform design covering ingestion, storage, serving, and SLAs.
What you'll learn
- ✓ Design and operate production ELT pipelines with Apache Airflow, dbt, and Fivetran ingesting millions of daily events.
- ✓ Build a lakehouse on Apache Iceberg or Delta Lake with schema evolution, time-travel queries, and partition pruning.
- ✓ Process large-scale datasets with PySpark, optimising shuffles, joins, and memory spills for sub-10-minute SLAs.
- ✓ Implement data quality frameworks using Great Expectations or Soda Core with automated alerting and quarantine patterns.
- ✓ Apply dimensional modeling, slowly changing dimensions, and Data Vault 2.0 for analytics-ready warehouse layers.
- ✓ Instrument pipelines with data observability using Monte Carlo or Bigeye and build SLA dashboards for stakeholders.