EXPERT ROADMAP
Apache Spark for Data Engineering
Process petabyte-scale datasets reliably and ship them to production pipelines on time.
CREATED BY
Kiran N. ★ 5.0
Site Reliability Engineer at NetSwitch | 5+ years of experience
About this Path
For data engineers who know Spark basics but need to master the internals. Covers the Catalyst optimizer, shuffle mechanics, Delta Lake ACID semantics, Structured Streaming, and Databricks deployment patterns. You will build a Lambda-architecture-free lakehouse pipeline with incremental processing and automated data quality checks.
Path Overview
Advanced Level | Certificate of Completion | About 60 hours to complete | English language | 22+ curated videos | Learn online at your own pace | 6 modules with resources | Gamified & interactive
Path Curriculum
DAG, Stages, Tasks, and Shuffle Boundaries
Understand how wide transformations create stage boundaries and network I/O cost.
Catalyst Optimizer and Tungsten Execution Engine
Logical to physical plan transformations; whole-stage code generation internals.
Adaptive Query Execution — AQE Deep Dive
Runtime coalescing of shuffle partitions, skew join handling, and join strategy switching.
Reading and Interpreting Spark UI Plans
Identify slow stages, spill events, and skewed tasks from the Spark web interface.
Partition Sizing — spark.sql.files.maxPartitionBytes and repartition
Target 128–256 MB per partition; avoid small-file and large-task anti-patterns.
Broadcast Joins and Skew Join Mitigation
Force broadcast hints; salt skewed keys; use AQE skew join to split hot partitions.
Bucketing for Repeated Joins on the Same Key
Pre-partition data at write time to eliminate shuffle on frequently joined tables.
Z-Ordering and Data Skipping with Delta Lake
Co-locate related data on disk; use file-level statistics for data skipping and predicate pushdown.
ACID Transactions and Write Conflict Resolution
Optimistic concurrency control; understand isolation levels and retry semantics.
Schema Enforcement, Evolution, and Merge Schema
Prevent schema drift; safely add nullable columns without rewriting historical data.
MERGE INTO for SCD Type 1 and Type 2 Patterns
Implement upserts and slowly changing dimension logic efficiently in Delta.
Time Travel, Vacuum, and Retention Configuration
Query historical snapshots; clean up expired data and log files without breaking open transactions.
Micro-Batch vs Continuous Processing Mode
Trade-offs in trigger intervals, latency, and delivery guarantees (exactly-once versus at-least-once) for each mode.
Watermarks and Late Data Handling
Define event-time watermarks; understand how late records affect window finalisation.
Stateful Aggregations and mapGroupsWithState
Maintain arbitrary state per key with timeout policies for session and custom windows.
Kafka Source and Delta Lake Sink with Exactly-Once
Configure checkpointing, idempotent writes, and foreachBatch for reliable delivery.
Great Expectations and Delta Live Tables Expectations
Define column-level constraints; fail, quarantine, or warn based on severity.
Unit Testing PySpark Transformations with pytest
Use SparkSession fixtures, schema assertions, and in-memory DataFrames in CI.
Data Contracts and Schema Registry Integration
Enforce upstream schema agreements before ingestion reaches the bronze layer.
Cluster Types — All-Purpose, Jobs, and SQL Warehouse
Choose the right compute for interactive, batch, and BI workloads; avoid idle cost.
Auto-Scaling, Spot Instances, and Photon Engine
Configure min/max workers; mix on-demand and spot for cost-stable SLA delivery.
Databricks Asset Bundles and CI/CD Pipelines
Version-control notebooks and jobs; promote across dev/staging/prod with DAB deploy.
Monitoring with Ganglia, Spark UI, and Datadog
Alert on driver OOM, shuffle spill, and job duration regressions in production.
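As a rough illustration of the partition-sizing lesson in the curriculum above (target 128–256 MB per partition), the arithmetic can be sketched in plain Python. The helper name `target_partition_count` and the 50 GiB dataset size are hypothetical, chosen only for this example; in a real job the result would typically be passed to `df.repartition(n)` or inform `spark.sql.files.maxPartitionBytes`.

```python
import math

MB = 1024 * 1024


def target_partition_count(total_bytes: int,
                           target_partition_bytes: int = 128 * MB) -> int:
    """Return how many partitions keep each one near the target size.

    Hypothetical helper for illustration; not part of any Spark API.
    """
    if total_bytes < 0 or target_partition_bytes <= 0:
        raise ValueError("sizes must be non-negative and target must be positive")
    # Always return at least one partition, even for an empty dataset.
    return max(1, math.ceil(total_bytes / target_partition_bytes))


# Assumed example: a 50 GiB dataset (51,200 MiB).
dataset_bytes = 50 * 1024 * MB

n_parts_128 = target_partition_count(dataset_bytes)            # 128 MB target
n_parts_256 = target_partition_count(dataset_bytes, 256 * MB)  # 256 MB target

print(n_parts_128, n_parts_256)  # 400 200
```

Uncompressed in-memory size often differs from on-disk size, so a number like this is a starting point to validate in the Spark UI rather than a fixed rule.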
What you'll learn
- ✓ Read Spark physical plans to identify shuffle bottlenecks, broadcast join opportunities, and partition skew.
- ✓ Tune executor memory, garbage collection, and adaptive query execution parameters for stable production jobs.
- ✓ Design a medallion lakehouse with Delta Lake enforcing schema evolution, ACID transactions, and time travel.
- ✓ Build low-latency Structured Streaming pipelines with watermarks, stateful aggregations, and exactly-once sinks.
- ✓ Write and schedule reliable data pipelines using PySpark with unit tests, expectations, and CI integration.
- ✓ Operate Spark on Databricks: cluster sizing, auto-scaling policies, spot instance strategies, and cost controls.
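
To give a feel for the watermark outcome listed above, here is a minimal plain-Python simulation of how an event-time watermark decides which late records are dropped. It is a simplification under stated assumptions: the watermark is advanced after every record, whereas Structured Streaming advances it between micro-batches and applies it to windows and state rather than to raw records. The function name `apply_watermark` and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def apply_watermark(event_times, delay):
    """Split events into accepted and dropped using a per-record watermark.

    Simplified model: watermark = max event time seen so far minus delay.
    Records with an event time older than the watermark count as too late.
    """
    max_seen = None
    accepted, dropped = [], []
    for ts in event_times:
        watermark = max_seen - delay if max_seen is not None else None
        if watermark is not None and ts < watermark:
            dropped.append(ts)  # arrived after the watermark passed it
        else:
            accepted.append(ts)
            if max_seen is None or ts > max_seen:
                max_seen = ts   # only newer events advance the watermark
    return accepted, dropped


base = datetime(2024, 1, 1, 12, 0)
# Assumed arrival order, as minute offsets from 12:00.
offsets = [0, 10, 5, -5, 20, 8]
events = [base + timedelta(minutes=m) for m in offsets]

accepted, dropped = apply_watermark(events, delay=timedelta(minutes=10))

print(len(accepted), "accepted;", len(dropped), "dropped")
for ts in dropped:
    print("dropped:", ts.strftime("%H:%M"))
```

With a 10-minute delay, the 11:55 and 12:08 records are dropped because the watermark had already moved to 12:00 and 12:10 respectively when they arrived. A longer delay keeps more late data at the cost of holding state for longer.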