Data Engineer Interview Questions

200 scenario-based questions with detailed model answers, organized by skill and tool. Filter by topic, level, or keyword, reveal the answer, then pressure-test yourself in a real mock interview.


Q001 | SQL & Query Optimization | Mid

Your nightly revenue rollup that used to finish in 12 minutes now takes 3 hours after a product launch tripled row counts. Walk me through how you'd read the query plan and decide whether the fix is indexing, partitioning, or rewriting the join.

Q002 | SQL & Query Optimization | Senior

An analyst's dashboard query joins a 2-billion-row events table to a dimension on a non-unique key and silently fans out rows, inflating revenue by 40%. How do you detect this class of bug systematically across hundreds of queries, not just fix this one?
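
As a rough illustration of the fan-out described in this question, the sketch below runs two generic checks against an in-memory SQLite database: a uniqueness check on the dimension's join key, and a before/after comparison of row count and revenue across the join. The table and column names (`events`, `dim_account`, `account_id`) are hypothetical, and this is only one possible starting point for a systematic audit, not a complete one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables for illustration only.
    CREATE TABLE events (event_id INTEGER, account_id INTEGER, revenue REAL);
    CREATE TABLE dim_account (account_id INTEGER, segment TEXT);
    INSERT INTO events VALUES (1, 10, 100.0), (2, 10, 50.0), (3, 20, 25.0);
    -- account 10 appears twice: the join key is not unique in the dimension
    INSERT INTO dim_account VALUES (10, 'smb'), (10, 'enterprise'), (20, 'smb');
""")

# Check 1: join keys that occur more than once in the dimension.
duplicate_keys = conn.execute("""
    SELECT account_id, COUNT(*) AS n
    FROM dim_account
    GROUP BY account_id
    HAVING COUNT(*) > 1
""").fetchall()

# Check 2: fact row count and revenue before vs. after the join.
rows_before, revenue_before = conn.execute(
    "SELECT COUNT(*), SUM(revenue) FROM events"
).fetchone()
rows_after, revenue_after = conn.execute("""
    SELECT COUNT(*), SUM(e.revenue)
    FROM events e
    JOIN dim_account d ON d.account_id = e.account_id
""").fetchone()

# A join to a dimension should never increase the number of fact rows.
fanned_out = rows_after > rows_before
print(duplicate_keys, rows_before, rows_after, revenue_before, revenue_after, fanned_out)
```

Both checks are generic enough to be templated per join, which is what makes them candidates for running across many queries rather than one.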

Q003 | SQL & Query Optimization | Mid

You inherit a 900-line SQL query with seven nested subqueries and no comments, and it returns wrong numbers for exactly one region. Describe your approach to decomposing and debugging it without rewriting the whole thing blind.

Q004 | SQL & Query Optimization | Senior

Your window-function dedup using ROW_NUMBER over a 500-million-row table spills to disk and dominates warehouse spend. What alternatives do you weigh — QUALIFY, pre-aggregation, incremental dedup at ingestion — and how do you prove the winner with numbers?

Q005 | SQL & Query Optimization | Mid

A LEFT JOIN that should preserve all customers is dropping 3% of them, and you discover a WHERE clause filtering on the right table's column. Explain how you'd fix it and what review habits prevent this recurring across the team.
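
The effect in this question can be reproduced in a few lines. The sketch below uses an in-memory SQLite database with hypothetical `customers` and `orders` tables and compares the same filter placed in the WHERE clause versus the ON clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables for illustration only.
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, status TEXT);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
    INSERT INTO orders VALUES (100, 1, 'paid'), (101, 2, 'cancelled');
""")

# Filter in WHERE: unmatched customers carry a NULL status, the comparison
# is not true for them, and the LEFT JOIN behaves like an INNER JOIN.
filtered_in_where = conn.execute("""
    SELECT c.customer_id
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.customer_id
    WHERE o.status = 'paid'
    ORDER BY c.customer_id
""").fetchall()

# Filter in ON: restricts which orders can match but keeps every customer.
filtered_in_on = conn.execute("""
    SELECT c.customer_id, o.order_id
    FROM customers c
    LEFT JOIN orders o
      ON o.customer_id = c.customer_id AND o.status = 'paid'
    ORDER BY c.customer_id
""").fetchall()

print(filtered_in_where)
print(filtered_in_on)
```

Note that customer 2, who has only a cancelled order, is dropped by the WHERE version as well as customer 3, who has no orders at all.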

Q006 | SQL & Query Optimization | Senior

Two queries compute monthly churn and disagree by 2%. One uses DISTINCT counts, the other a GROUP BY with late filtering of test accounts. How do you reconcile them, decide the canonical definition, and stop the drift permanently?

Q007 | SQL & Query Optimization | Mid

The optimizer suddenly chose a nested-loop join on your fact-to-dimension query after stats went stale, and runtime exploded overnight. What's your immediate mitigation, and what do you automate so stale statistics never page you again?

Q008 | SQL & Query Optimization | Senior

Finance needs a point-in-time report: order values as they appeared on the last day of each quarter, with later corrections excluded. The orders table is mutable and overwritten daily. How do you make this query even possible, and what schema change would you push for?

Q009 | SQL & Query Optimization | Mid

Your query filters on DATE(created_at) = CURRENT_DATE and scans the full table every run despite a partition on created_at. Explain why partition pruning failed here and how you'd rewrite the predicate without changing results.
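
A small stand-in for the predicate rewrite in this question: SQLite has no partitions, so the sketch below uses an index on `created_at` as an analogy (an assumption, not how any particular warehouse prunes). Wrapping the column in a function hides it from the planner, while a half-open range on the raw column returns the same rows and lets the planner seek. The `events` table and the fixed date standing in for CURRENT_DATE are hypothetical.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table; the index plays the role of the partition column.
    CREATE TABLE events (event_id INTEGER, created_at TEXT);
    CREATE INDEX idx_events_created_at ON events (created_at);
    INSERT INTO events VALUES
        (1, '2024-03-09 23:59:59'),
        (2, '2024-03-10 00:00:00'),
        (3, '2024-03-10 18:30:00'),
        (4, '2024-03-11 00:00:00');
""")

day = date(2024, 3, 10)  # stand-in for CURRENT_DATE
start = day.isoformat() + " 00:00:00"
end = (day + timedelta(days=1)).isoformat() + " 00:00:00"

# Original shape: the column is wrapped in a function.
wrapped_sql = "SELECT COUNT(*) FROM events WHERE date(created_at) = ?"
# Rewrite: half-open range [start, end) on the bare column.
range_sql = "SELECT COUNT(*) FROM events WHERE created_at >= ? AND created_at < ?"

wrapped_count = conn.execute(wrapped_sql, (day.isoformat(),)).fetchone()[0]
range_count = conn.execute(range_sql, (start, end)).fetchone()[0]

def plan(sql, params):
    """Return the planner's description of how it will run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " ".join(row[3] for row in rows)

wrapped_plan = plan(wrapped_sql, (day.isoformat(),))
range_plan = plan(range_sql, (start, end))

print(wrapped_count, wrapped_plan)
print(range_count, range_plan)
```

The half-open upper bound avoids the off-by-one that an inclusive `23:59:59` bound introduces when timestamps carry fractional seconds.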

Q010 | SQL & Query Optimization | Senior

A recursive CTE building an org hierarchy works in staging but hits recursion limits in production where the data contains a cycle from a bad HR import. How do you make the query cycle-safe and surface the bad rows to the source team?
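
One common way to make a recursive walk cycle-safe is to carry the visited path along and stop when a node repeats. The sketch below shows that idea in SQLite with a hypothetical `org` table in which employees 3, 4, and 5 manage each other in a loop. Engines differ in syntax and some offer a built-in cycle clause, so treat this as an illustration of the technique rather than production SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical data: 3 -> 4 -> 5 -> 3 is the bad import.
    CREATE TABLE org (emp_id INTEGER PRIMARY KEY, manager_id INTEGER);
    INSERT INTO org VALUES (1, NULL), (2, 1), (3, 4), (4, 5), (5, 3);
""")

# Walk upward from every employee, recording the path as ',id,id,...,'.
# A row is flagged when the next employee is already on the path, and
# flagged rows are never expanded, so the recursion terminates.
cyclic = conn.execute("""
    WITH RECURSIVE chain(start_id, next_id, path, is_cycle) AS (
        SELECT emp_id, manager_id, ',' || emp_id || ',', 0
        FROM org
        UNION ALL
        SELECT c.start_id,
               o.manager_id,
               c.path || o.emp_id || ',',
               instr(c.path, ',' || o.emp_id || ',') > 0
        FROM chain c
        JOIN org o ON o.emp_id = c.next_id
        WHERE c.is_cycle = 0
    )
    SELECT DISTINCT start_id FROM chain WHERE is_cycle = 1 ORDER BY start_id
""").fetchall()

print(cyclic)  # employees whose management chain loops back on itself
```

The flagged employee IDs are the rows to hand back to the source team; the same query with `is_cycle = 0` yields the safe portion of the hierarchy.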

Q011 | SQL & Query Optimization | Mid

You're asked why SELECT COUNT(DISTINCT user_id) on a day's events takes 20x longer than COUNT(*). Walk through what the engine does differently and when you'd offer an approximate distinct count instead, including how you'd sell the accuracy tradeoff.

Q012 | SQL & Query Optimization | Senior

After a migration, identical SQL returns different row counts on the old and new warehouse, and you trace it to NULL handling in a NOT IN subquery. How do you audit the rest of the migrated codebase for similar semantic landmines?
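
The NULL trap behind this question is easy to demonstrate. In the sketch below (in-memory SQLite, hypothetical `users` and `blocked` tables), a single NULL in the subquery makes `NOT IN` return no rows at all, while the `NOT EXISTS` form returns the rows most authors intended.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables for illustration only.
    CREATE TABLE users (user_id INTEGER);
    CREATE TABLE blocked (user_id INTEGER);
    INSERT INTO users VALUES (1), (2), (3);
    INSERT INTO blocked VALUES (2), (NULL);
""")

# `x NOT IN (2, NULL)` evaluates to NULL (not TRUE) for x = 1 and x = 3,
# so the WHERE clause rejects every row.
not_in = conn.execute("""
    SELECT user_id FROM users
    WHERE user_id NOT IN (SELECT user_id FROM blocked)
    ORDER BY user_id
""").fetchall()

# NOT EXISTS only asks whether a matching row exists, so NULLs are harmless.
not_exists = conn.execute("""
    SELECT u.user_id FROM users u
    WHERE NOT EXISTS (SELECT 1 FROM blocked b WHERE b.user_id = u.user_id)
    ORDER BY u.user_id
""").fetchall()

print(not_in)
print(not_exists)
```

An audit can start by searching the codebase for `NOT IN (SELECT` and checking whether the selected column is nullable.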

Q013 | SQL & Query Optimization | Mid

A self-join computing session gaps runs fine daily but times out on the monthly backfill range. The logic is correct. What rewriting strategies — window functions, range bucketing, incremental chunks — do you try first, and why that order?

Q014 | SQL & Query Optimization | Senior

Your team's most expensive query computes a 90-day rolling retention matrix from raw events every morning. Nobody questions it because it's 'always been there.' How do you decide between materializing intermediates, restructuring the calculation, or killing it entirely?

Q015 | SQL & Query Optimization | Senior

An UPDATE joining a staging table into a 4-billion-row target locks the table for an hour and blocks downstream readers during business hours. How do you restructure this mutation — batching, merge strategies, swap tables — to make it invisible to consumers?

Q016 | Spark & Distributed Processing | Mid

Your Spark job's last stage shows 199 tasks finishing in seconds and one task running for 40 minutes. Walk me through confirming it's key skew, identifying the hot key, and the mitigations you'd try before touching cluster size.
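
The mechanics of key skew and salting can be shown without a cluster. The sketch below is a plain-Python simulation, not Spark code: it counts rows per key to find the hot key, then compares the largest hash partition with and without a salt appended to that key. The partition count, salt bucket count, and key names are all hypothetical.

```python
from collections import Counter
import zlib

NUM_PARTITIONS = 8   # assumed shuffle partition count
SALT_BUCKETS = 16    # assumed number of salt values for the hot key

def partition_of(key: str) -> int:
    # crc32 is used instead of hash() so the result is stable across runs.
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

# Hypothetical workload: one hot key plus a long tail of small keys.
keys = ["hot_customer"] * 9000 + [f"cust_{i}" for i in range(1000)]

# Step 1: confirm skew by counting rows per key.
key_counts = Counter(keys)
hot_key, hot_count = key_counts.most_common(1)[0]

# Step 2: rows per partition when partitioning on the raw key.
unsalted = Counter(partition_of(k) for k in keys)

# Step 3: rows per partition when the hot key gets a rotating salt suffix.
salted = Counter(
    partition_of(f"{k}#{i % SALT_BUCKETS}") if k == hot_key else partition_of(k)
    for i, k in enumerate(keys)
)

print(hot_key, hot_count)
print("largest partition, unsalted:", max(unsalted.values()))
print("largest partition, salted:  ", max(salted.values()))
```

In a real join, the other side has to be replicated once per salt value for the hot key so the salted rows still find their match; that replication cost is part of what you would weigh against the alternatives.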

Q017 | Spark & Distributed Processing | Senior

A join between a 2TB fact and a 60GB dimension keeps failing with executor OOM, and broadcast is off the table at that size. How do you choose between salting, bucketing, AQE skew handling, or pre-aggregating, and validate the choice?

Q018 | Spark & Distributed Processing | Mid

Your pipeline writes 80,000 tiny Parquet files per day to the lake, and downstream Spark reads are getting slower every week. Explain where the small-file problem comes from in your write path and how you'd fix output sizing without breaking consumers.

Q019 | Spark & Distributed Processing | Senior

A Spark job that ran fine for months started intermittently failing with shuffle fetch failures after the data grew past a threshold. The retry usually succeeds, so the team ignores it. Why is this a ticking bomb, and what's your remediation plan?

Q020 | Spark & Distributed Processing | Mid

You see a 3am failure: a Spark job died on a corrupt Parquet footer in one input file out of 12,000. How do you get the pipeline green before morning, quarantine the bad file, and prevent corrupt inputs from killing whole runs?

Q021 | Spark & Distributed Processing | Senior

Two runs of the same Spark job over identical input produced different aggregate results, and you suspect a non-deterministic UDF combined with stage retries. How do you hunt down nondeterminism in a 5,000-line codebase and enforce determinism going forward?

Q022 | Spark & Distributed Processing | Mid

Your driver keeps OOMing on a job that 'only' aggregates, and you find a collect() pulling 30 million rows for a Python-side lookup. What patterns do you replace it with, and how do you catch driver-side anti-patterns in code review?

Q023 | Spark & Distributed Processing | Senior

Finance flags that the Spark cluster bill doubled in a quarter while data volume grew 20%. You have a hundred jobs and no per-job attribution. How do you build cost visibility and pick the first three optimization targets?

Q024 | Spark & Distributed Processing | Mid

A teammate cached a large DataFrame to 'speed things up' and now executors are evicting blocks and recomputing lineage repeatedly, making the job slower. How do you diagnose bad caching from the Spark UI and decide what actually deserves persistence?

Q025 | Spark & Distributed Processing | Senior

You're migrating 200 Hive-era Spark jobs to a newer Spark version and a handful return different results due to changed timestamp parsing and ANSI behavior. Describe your migration harness: how do you certify equivalence at scale instead of eyeballing outputs?

Q026 | Spark & Distributed Processing | Mid

Your structured streaming job's batch duration creeps from 30 seconds to 5 minutes over a day, then recovers after restart. What state-store and watermark issues would you investigate, and what metrics confirm the diagnosis before you blame 'memory leaks'?

Q027 | Spark & Distributed Processing | Senior

A nightly Spark job reads from a table another team compacts concurrently, and once a month it fails with missing-file errors at 3am. Neither team wants to change schedules. How do you make the read path resilient and broker a durable contract?

Q028 | Spark & Distributed Processing | Mid

Explain how you'd debug a Spark job where the logical plan looks fine but the physical plan shows a cartesian product the author never intended. What join-condition mistakes typically cause this, and how do you guard against them in CI?

Q029 | Spark & Distributed Processing | Senior

Your team runs Spark on spot instances to save money, but a critical 4-hour job loses executors near completion twice a week and reruns from scratch. How do you redesign for graceful degradation — checkpointing, stage isolation, fallback capacity — without losing the savings?

Q030 | Spark & Distributed Processing | Senior

A PySpark job spends 70% of its runtime in Python UDFs doing string normalization. Walk through your optimization ladder — built-in functions, pandas UDFs, moving logic to SQL — and how you measure each rung's gain on production data.

Q031 | Spark & Distributed Processing | Mid

After enabling dynamic allocation, your job sometimes grabs 400 executors and starves every other team on the shared cluster. How do you tune allocation bounds and queue isolation so one greedy job can't take down the platform's morning SLAs?

Q032 | Kafka & Streaming | Mid

Consumer lag on your payments topic has been climbing for six hours and the on-call dashboard shows processing rate flat while produce rate doubled. Walk through how you'd scale consumption safely when partition count is the ceiling you hit.

Q033 | Kafka & Streaming | Senior

A downstream team reports duplicate orders in the warehouse, and you trace it to consumers committing offsets before processing completes, then crashing. How do you redesign the consume-process-commit loop, and where do you place the dedup if exactly-once isn't achievable end-to-end?
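
The ordering issue in this question can be sketched without a Kafka client. The simulation below is plain Python with a list standing in for a partition; `IdempotentSink` and `consume` are hypothetical names. It processes each record before committing its offset (at-least-once delivery) and relies on a sink keyed by `order_id` to absorb the replay that follows a crash.

```python
from dataclasses import dataclass, field

@dataclass
class IdempotentSink:
    """Hypothetical sink that ignores order IDs it has already written."""
    rows: dict = field(default_factory=dict)

    def write(self, event: dict) -> bool:
        if event["order_id"] in self.rows:
            return False  # duplicate delivery, nothing to do
        self.rows[event["order_id"]] = event
        return True

def consume(log, committed_offset, sink, crash_before_commit_at=None):
    """Process each record, then commit its offset (at-least-once)."""
    for offset in range(committed_offset, len(log)):
        sink.write(log[offset])            # 1. process
        if offset == crash_before_commit_at:
            return committed_offset        # simulated crash: commit never happens
        committed_offset = offset + 1      # 2. commit only after processing
    return committed_offset

# A list stands in for one topic partition.
log = [{"order_id": f"o{i}", "amount": 10 * i} for i in range(5)]
sink = IdempotentSink()

# First run crashes after writing o2 but before committing it.
offset_after_crash = consume(log, 0, sink, crash_before_commit_at=2)
# Restart resumes from the last committed offset and replays o2.
final_offset = consume(log, offset_after_crash, sink)

print(offset_after_crash, final_offset, len(sink.rows))
```

Committing after processing trades lost records for possible duplicates, which is why the dedup sits at the sink, where a natural key is available.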

Q034 | Kafka & Streaming | Mid

At 3am a producer deployment started writing events with a new field as a different type, and your streaming job is crash-looping on deserialization. What's your immediate containment, and how does a schema registry with compatibility rules prevent the repeat?

Q035 | Kafka & Streaming | Senior

Your clickstream topic is keyed by user_id, and one bot account generates 30% of traffic, hot-spotting a single partition and stalling that consumer. How do you handle hot keys without breaking the per-user ordering your sessionization logic depends on?

Q036 | Kafka & Streaming | Mid

You discover events arriving up to 6 hours late from mobile clients, and your hourly streaming aggregates silently dropped them past the watermark. How do you quantify what was lost, choose a new lateness policy, and repair historical aggregates?

Q037 | Kafka & Streaming | Senior

A topic with 7-day retention had a consumer down for 9 days due to a forgotten paused deployment, and the data is gone from Kafka. Walk through your recovery options, what this incident says about your lag alerting, and the retention-vs-cost conversation that follows.

Q038 | Kafka & Streaming | Mid

Your consumer group rebalances every few minutes under load, repeatedly pausing all processing. What causes rebalance storms — long processing inside poll loops, session timeouts, flapping members — and how do you stabilize the group without masking real failures?

Q039 | Kafka & Streaming | Senior

Product wants a real-time fraud score within 2 seconds of a transaction, joining the transaction stream against a customer profile table updated hourly in the warehouse. Sketch your architecture for the stream-table join and explain where staleness is acceptable.

Q040 | Kafka & Streaming | Mid

A poison message with malformed JSON is crash-looping your consumer and blocking the partition behind it. Design your dead-letter strategy: when do you skip, when do you halt, and how does the DLQ get monitored rather than becoming a write-only graveyard?

Q041 | Kafka & Streaming | Senior

Your Flink job's checkpoint duration grew from 5 seconds to 4 minutes over a month, and now checkpoints time out under peak load, causing restart loops. How do you investigate state growth, and what's your plan when state legitimately must be that large?

Q042 | Kafka & Streaming | Mid

Two services publish 'order_created' events with subtly different field semantics, and analytics built on the merged topic shows inconsistent totals. How do you untangle ownership, define one canonical event, and migrate consumers without a big-bang cutover?

Q043 | Kafka & Streaming | Senior

During a region failover your mirrored Kafka cluster came up, but consumers resumed from wrong offsets and reprocessed 4 hours of events into downstream systems. What about offset translation failed, and how do you design downstream sinks to absorb such replays harmlessly?

Q044 | Kafka & Streaming | Senior

You need to add 5 years of historical orders into a streaming pipeline that currently only sees new events, without doubling infrastructure. Compare replaying history through the topic versus a batch bootstrap with a stream cutover, and the ordering hazards of each.

Q045 | Kafka & Streaming | Senior

Your streaming sessionization emits sessions on a 30-minute inactivity gap, but a clock-skewed fleet of devices sends timestamps 15 minutes in the future, splitting sessions and inflating session counts by 12%. How do you detect, correct, and backfill this?

Q046 | Kafka & Streaming | Senior

Leadership wants to know why you'd keep a nightly batch pipeline when a streaming version exists. Using a concrete pipeline you've owned, argue when streaming's operational cost — state, replays, on-call complexity — outweighs freshness, and where the line flips.

Q047 | Airflow & Orchestration | Mid

It's 3am and the daily warehouse load failed at task 14 of 40. The on-call runbook just says 'rerun the DAG.' Which tasks are safe to rerun blindly, which aren't, and how do you make this decision obvious for the next on-call?

Q048 | Airflow & Orchestration | Senior

Your Airflow instance runs 600 DAGs, and the scheduler delays task starts by 20 minutes during the morning rush, blowing SLAs. How do you diagnose scheduler contention versus worker capacity, and what structural changes — pools, priorities, DAG consolidation — do you consider?

Q049 | Airflow & Orchestration | Mid

A DAG with catchup enabled was paused for two weeks during an incident, and unpausing it instantly queued 14 backfill runs that hammered the warehouse. How should this DAG have been configured, and how do you recover right now?

Q050 | Airflow & Orchestration | Senior

Your finance pipeline depends on an upstream team's DAG finishing, currently coordinated by a 6am time-based guess that breaks whenever they run long. Compare sensors, datasets, and event-driven triggers for this handshake, and the failure modes each introduces.

Q051 | Airflow & Orchestration | Mid

A task succeeded but wrote zero rows because the upstream file was empty, and Airflow happily marked everything green while dashboards went stale. Where do you add data-aware checks so 'task success' actually means 'data arrived'?

Q052 | Airflow & Orchestration | Senior

You discover a teammate's DAG does heavy pandas processing inside the Airflow worker, periodically OOMing workers and killing unrelated tasks. What's your principle on orchestrator-as-compute, and how do you migrate this DAG without changing its outputs?

Q053 | Airflow & Orchestration | Mid

Your DAG uses execution_date logic written before a timezone migration, and after a daylight saving time change, runs began double-processing one hour of data. Walk through debugging schedule semantics and writing interval logic that survives timezone and calendar quirks.

Q054 | Airflow & Orchestration | Senior

An incident review reveals five teams have built five retry-and-alert conventions, so identical failures page differently every night. Design the paved-road DAG standards you'd enforce — retries, SLAs, callbacks, ownership tags — and how you'd roll them out without halting delivery.

Q055 | Airflow & Orchestration | Senior

A dynamic DAG factory reads a config table and generates 300 near-identical DAGs, and now parse times throttle the scheduler. When do you collapse them into a single parameterized DAG with mapped tasks, and what observability do you lose or gain?

Q056 | Airflow & Orchestration | Senior

Your nightly DAG failed because a third-party API returned 200 OK with an HTML error page in the body, and the task 'succeeded' then poisoned downstream tables. How do you harden external-source tasks, and where does validation belong in the DAG?

Q057 | Airflow & Orchestration | Mid

Two DAG runs of the same pipeline overlapped after a long-running backfill collided with the scheduled run, and both wrote to the same partition. Which concurrency controls — max_active_runs, task locks, partition-level guards — do you apply, and in what order?

Q058 | Airflow & Orchestration | Senior

You're asked to cut the data platform's 6am SLA to 4am. Critical-path analysis shows the bottleneck is a chain of sequential tasks with hidden parallelism. How do you systematically find and exploit parallelism across DAGs without creating dependency spaghetti?

Q059 | Airflow & Orchestration | Mid

After an Airflow upgrade, a previously stable DAG's XCom-passed DataFrame now intermittently fails serialization for large payloads. Why is passing data through XCom an anti-pattern at this size, and what handoff mechanism do you move to?

Q060 | Airflow & Orchestration | Senior

A retried task re-sent 40,000 marketing emails because the operator wrapped a non-idempotent API call. Beyond fixing this task, how do you audit your DAG estate for retry-unsafe side effects and build guardrails into your operator library?

Q061 | Airflow & Orchestration | Senior

Your team debates moving from cron-scheduled DAGs to fully event-driven orchestration triggered by data arrival. Using a pipeline that suffers from late upstream data, argue the migration's payoff and the new failure modes — missing triggers, event storms — you must now own.

Q062 | dbt & Transformation | Mid

Your dbt project's full run went from 20 minutes to 3 hours over a year, and nobody knows which models matter anymore. Walk through how you'd map the DAG, find dead models, and introduce selective builds without breaking BI dependencies.

Q063 | dbt & Transformation | Senior

An incremental dbt model has silently diverged from its full-refresh result by 1.8% over six months due to late-arriving updates outside the lookback window. How do you detect this drift class automatically and design lookbacks with evidence instead of folklore?

Q064 | dbt & Transformation | Mid

A teammate changed a shared staging model's column logic and broke 14 downstream models in production because nothing ran in CI. Design the pull-request pipeline you'd implement — state-based selective testing, deferred runs — and its cost-speed tradeoffs.

Q065 | dbt & Transformation | Senior

Marketing and finance each maintain their own revenue model in dbt and the numbers disagree in every QBR. How do you drive consolidation to one governed mart — technically and politically — and what stops the fork from reappearing?

Q066 | dbt & Transformation | Mid

Your dbt tests pass but a stakeholder finds orders with negative quantities in the mart. The not_null and unique tests were never going to catch this. How do you decide which business-rule assertions deserve tests, and where do they live?

Q067 | dbt & Transformation | Senior

A 4,000-model dbt monorepo now takes 15 minutes just to parse, and ten teams trample each other's deployments. Lay out your path to domain-split projects with cross-project references, and what contracts must exist at the seams.

Q068 | dbt & Transformation | Mid

An incremental model keyed on updated_at missed rows because the source system back-dates corrections. You're paged when finance reconciliation fails. What's your immediate repair, and how do you choose a more trustworthy incremental strategy for this source?

Q069 | dbt & Transformation | Senior

Your dbt snapshots table on a high-churn source has ballooned to billions of rows and the snapshot run dominates the nightly window. How do you rethink change capture for this entity — CDC upstream, partitioned snapshots, retention policies — without losing point-in-time history?

Q070 | dbt & Transformation | Mid

A jinja macro used in 60 models has a subtle bug in fiscal-quarter logic affecting only January. Walk through how you assess blast radius, fix and test the macro, and decide which historical tables must be rebuilt.

Q071 | dbt & Transformation | Senior

Your team materializes everything as tables 'to be safe,' and the warehouse bill reflects it. For a pipeline of staging, intermediate, and mart layers, explain how you'd choose between views, tables, and incremental models with actual usage data.

Q072 | dbt & Transformation | Senior

You enabled dbt model contracts on a core mart, and a routine column type widening now blocks deployment for a downstream team mid-sprint. How do you operate contracts so they catch real breakage without becoming a bureaucratic veto on every change?

Q073 | dbt & Transformation | Senior

A dbt Cloud job retried a model that had partially committed via a non-atomic pre-hook, leaving a mart half-updated for 3 hours during trading hours. What does this reveal about hooks and transactionality, and how do you redesign for atomic publishes?

Q074 | dbt & Transformation | Mid

A new analyst keeps putting business logic in BI-tool calculated fields because 'dbt PRs take too long,' and metric definitions are drifting again. How do you fix the contribution workflow so the governed path is also the fastest path?

Q075 | Data Modeling (star schema/SCD) | Mid

Sales reports show customers attributed to their current region instead of the region at time of sale, overstating one territory's growth. Diagnose this as an SCD problem and walk through converting the dimension from Type 1 to Type 2 safely.

Q076 | Data Modeling (star schema/SCD) | Senior

Your Type 2 customer dimension has exploded to 40 versions per customer because a sync rewrites unchanged rows with new timestamps, bloating joins. How do you fix the change-detection logic and collapse the false history without corrupting true history?

Q077 | Data Modeling (star schema/SCD) | Mid

An analyst joins the fact table to a Type 2 dimension without filtering current rows and double-counts revenue in every report. What modeling or tooling changes — current-row views, effective-date join helpers — stop this whole class of mistakes?
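
To make the double-count concrete, the sketch below builds a two-version Type 2 dimension in SQLite and compares a join on the business key alone with an effective-dated join. The schema (`dim_customer` with `valid_from` and `valid_to`, `fact_sales` with `sale_date`) is a hypothetical layout; real dimensions often join on a surrogate key resolved at load time instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical Type 2 dimension: customer 42 moved region on 2024-01-01.
    CREATE TABLE dim_customer (
        customer_sk INTEGER, customer_id INTEGER, region TEXT,
        valid_from TEXT, valid_to TEXT
    );
    CREATE TABLE fact_sales (
        sale_id INTEGER, customer_id INTEGER, sale_date TEXT, amount REAL
    );
    INSERT INTO dim_customer VALUES
        (1, 42, 'EMEA', '2023-01-01', '2024-01-01'),
        (2, 42, 'APAC', '2024-01-01', '9999-12-31');
    INSERT INTO fact_sales VALUES
        (1, 42, '2023-06-15', 100.0),
        (2, 42, '2024-02-01', 200.0);
""")

# Naive join on the business key: every sale matches every version.
naive_total = conn.execute("""
    SELECT SUM(f.amount)
    FROM fact_sales f
    JOIN dim_customer d ON d.customer_id = f.customer_id
""").fetchone()[0]

# Effective-dated join: each sale matches only the version valid on its date.
by_region = conn.execute("""
    SELECT d.region, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_customer d
      ON d.customer_id = f.customer_id
     AND f.sale_date >= d.valid_from
     AND f.sale_date <  d.valid_to
    GROUP BY d.region
    ORDER BY d.region
""").fetchall()

print(naive_total)
print(by_region)
```

Wrapping the effective-dated condition in a view or helper is one way to keep analysts from having to remember it.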

Q078 | Data Modeling (star schema/SCD) | Senior

Product orders can be amended, partially refunded, and split across shipments, and your single orders fact table now has nullable columns for every case. Argue how you'd remodel this — transaction facts, accumulating snapshots, or both — and migrate consumers.

Q079 | Data Modeling (star schema/SCD) | Mid

Your events land with user attributes embedded in each record, and the team queries them directly instead of using dimensions. Reports disagree depending on which event's attributes a query happened to pick. How do you introduce conformed dimensions into this mess?

Q080 | Data Modeling (star schema/SCD) | Senior

Two acquired companies' order systems must feed one revenue mart, with different order lifecycles, currencies, and customer identifiers. Walk through your conforming strategy: shared dimensions, mapping tables, identity resolution, and what you refuse to merge until the business decides.

Q081 | Data Modeling (star schema/SCD) | Senior

A fact table was built at order-line grain, but ops asks questions at shipment grain and analysts keep writing aggregation bridges that disagree. How do you decide whether to add a second fact table versus a periodic snapshot, and who arbitrates grain?

Q082 | Data Modeling (star schema/SCD) | Senior

Your wide denormalized 'one big table' served the startup well, but at 300 columns, every schema change is a fire drill and storage costs are absurd. Lay out your incremental path back to a dimensional model while dashboards stay live.

Q083 | Data Modeling (star schema/SCD) | Mid

Late-arriving dimension members — orders referencing products not yet in the product dimension — are being dropped by your inner join, undercounting daily sales. Explain the unknown-member pattern you'd implement and how the rows get reconciled when the dimension catches up.
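
A minimal sketch of the unknown-member pattern mentioned in this question, using SQLite and hypothetical tables: the dimension carries a reserved row with surrogate key -1, and the load uses a LEFT JOIN with COALESCE so unmatched orders are kept and pointed at that row instead of being dropped. The choice of -1 is a convention assumed here, not a requirement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables; product_sk = -1 is the reserved unknown member.
    CREATE TABLE dim_product (product_sk INTEGER PRIMARY KEY, product_code TEXT, name TEXT);
    CREATE TABLE stg_orders (order_id INTEGER, product_code TEXT, amount REAL);
    INSERT INTO dim_product VALUES (-1, NULL, 'Unknown'), (1, 'A', 'Widget');
    -- Product 'B' has not reached the dimension yet.
    INSERT INTO stg_orders VALUES (100, 'A', 10.0), (101, 'B', 15.0);
""")

# Inner join: the order for product 'B' disappears from the total.
inner_total = conn.execute("""
    SELECT SUM(o.amount)
    FROM stg_orders o
    JOIN dim_product p ON p.product_code = o.product_code
""").fetchone()[0]

# Unknown-member load: keep every order, default the key to -1.
loaded = conn.execute("""
    SELECT o.order_id, COALESCE(p.product_sk, -1) AS product_sk, o.amount
    FROM stg_orders o
    LEFT JOIN dim_product p ON p.product_code = o.product_code
    ORDER BY o.order_id
""").fetchall()

loaded_total = sum(amount for _, _, amount in loaded)
print(inner_total, loaded, loaded_total)
```

Keeping the natural key (`product_code`) on the fact or in a side table is what makes it possible to re-point the -1 rows once the dimension catches up.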

Q084 | Data Modeling (star schema/SCD) | Senior

A GDPR deletion request requires erasing a customer who appears in a Type 2 dimension and five years of facts. How do you honor erasure while preserving aggregate integrity and audit trails, and what does this mean for your surrogate key design?

Q085 | Data Modeling (star schema/SCD) | Mid

Your date dimension lacks fiscal calendars, and three teams have each hand-rolled fiscal-period CASE statements that disagree on quarter boundaries. Walk through fixing the dimension and migrating consumers off their homegrown logic without breaking historical reports.

Q086 | Data Modeling (star schema/SCD) | Senior

Leadership wants metrics by 'team,' but the org chart changes monthly and history is rewritten retroactively in the HR system. Design how you'd model organizational hierarchy over time so last quarter's numbers don't shift every reorg.

Q087 | Data Modeling (star schema/SCD) | Senior

Your bridge table for many-to-many patient-to-insurance relationships double-counts claims when analysts aggregate without weighting. How do you implement allocation factors, and how do you make the safe aggregation path the default rather than tribal knowledge?
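
The weighting idea in this question can be illustrated with a small example. In the sketch below (SQLite, hypothetical `fact_claim` and `bridge_patient_insurer` tables), each bridge row carries an `allocation` factor that is assumed to sum to 1 per patient. Summing raw amounts through the bridge double-counts, while summing `amount * allocation` preserves the total.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables; allocation factors sum to 1.0 per patient.
    CREATE TABLE fact_claim (claim_id INTEGER, patient_id INTEGER, amount REAL);
    CREATE TABLE bridge_patient_insurer (patient_id INTEGER, insurer TEXT, allocation REAL);
    INSERT INTO fact_claim VALUES (1, 7, 200.0), (2, 8, 80.0);
    INSERT INTO bridge_patient_insurer VALUES
        (7, 'Acme', 0.75), (7, 'Beta', 0.25), (8, 'Acme', 1.0);
""")

# Unweighted: patient 7's claim is counted once per insurer.
unweighted_total = conn.execute("""
    SELECT SUM(c.amount)
    FROM fact_claim c
    JOIN bridge_patient_insurer b ON b.patient_id = c.patient_id
""").fetchone()[0]

# Weighted: each insurer receives its share of the claim.
weighted = conn.execute("""
    SELECT b.insurer, SUM(c.amount * b.allocation)
    FROM fact_claim c
    JOIN bridge_patient_insurer b ON b.patient_id = c.patient_id
    GROUP BY b.insurer
    ORDER BY b.insurer
""").fetchall()

# Guardrail: patients whose allocation factors do not sum to 1.
bad_allocations = conn.execute("""
    SELECT patient_id
    FROM bridge_patient_insurer
    GROUP BY patient_id
    HAVING ABS(SUM(allocation) - 1.0) > 1e-9
""").fetchall()

print(unweighted_total, weighted, bad_allocations)
```

Exposing the weighted join as a view, with the guardrail query as a test, is one way to make the safe path the default one.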

Q088 | Data Modeling (star schema/SCD) | Mid

A new data source must join on email, but emails change and get reused across customers in your dimension. What identity-resolution approach do you take, and how do you model uncertain matches so downstream teams know the confidence level?

Q089 | Data Quality & Contracts | Mid

At 3am, your volume anomaly check fires: today's orders file is 60% smaller than usual. It might be a real demand drop or an upstream export bug. Walk through how you triage, who you wake up, and what the pipeline should do meanwhile.

Q090 | Data Quality & Contracts | Senior

A backend team renamed a column in their service database, and your pipeline carried NULLs into finance reports for nine days before anyone noticed. Design the data contract and enforcement mechanism that makes this break loudly at deploy time instead.

Q091 | Data Quality & Contracts | Mid

Your dashboard shows 4% of transactions with NULL store_id starting Tuesday. Trace your debugging path across ingestion, transformation, and source, and explain how you separate 'newly broken' from 'always broken but newly visible'.

Q092 | Data Quality & Contracts | Senior

Your team has 2,000 data quality checks and on-call ignores most alerts because 95% are noise. How do you triage this estate — severity tiers, ownership, auto-resolution, deletion — so a firing alert means something again?

Q093 | Data Quality & Contracts | Mid

A currency field arrives as 'USD', 'usd', 'US Dollar', and occasionally NULL depending on which upstream service wrote it. Where do you standardize — ingestion, staging, or contracts pushed upstream — and how do you handle the historical mess already in the lake?

Q094 | Data Quality & Contracts | Senior

You're asked to define SLAs for data products: freshness, completeness, accuracy. For a revenue mart consumed by finance, set concrete SLOs, explain how you'd measure each, and what happens operationally when one is breached at month-end close.

Q095 | Data Quality & Contracts | Mid

Duplicate rows appear in a fact table only on days when the upstream retried a partial file drop. Your unique-key test catches it post-load, after dashboards refresh. How do you move detection earlier and make the load tolerate redelivery?

Q096 | Data Quality & Contracts | Senior

An ML team trains on your tables and discovers that a 'cleaned' field has had its cleaning logic changed three times in two years, making historical data inconsistent for training. How should transformations be versioned, and what do you owe consumers when logic changes?

Q097 | Data Quality & Contracts | Mid

A stakeholder insists the data is wrong; your checks all pass. You discover their definition of 'active user' differs from the mart's. How do you turn this recurring definitional conflict into documented, testable metric contracts?

Q098 | Data Quality & Contracts | Senior

Your anomaly detection flags a 15% metric shift, and after a day of investigation it's a genuine business change from a marketing campaign nobody told you about. How do you wire business context — launches, campaigns, holidays — into your quality system to cut false positives?

Q099 | Data Quality & Contracts | Mid

Reconciliation between your warehouse order totals and the payment processor's settlement report is off by 0.3% every day, in different directions. Walk through how you'd build a row-level reconciliation harness and classify the mismatch categories you'd expect to find.

Q100 | Data Quality & Contracts | Senior

A producing team agrees to a schema contract but says they can't promise semantics — the meaning of status codes changes with business rules. How do you contract semantics, version enumerations, and test for semantic drift you can't enumerate upfront?

Q101 | Data Quality & Contracts | Senior

After a bad load corrupted a week of metrics, leadership demands 'no bad data ever reaches dashboards.' Design your staging-validation-publish architecture with quarantine and circuit breakers, and explain the freshness cost you'll make them sign off on.

Q102 | Data Quality & Contracts | Mid

Your null-rate check on a key field passes at table level, but the nulls are concentrated entirely in one country's data, breaking that region's reports. How do you design dimension-sliced quality checks without combinatorial alert explosion?

Q103 | Warehouses (Snowflake/BigQuery/Redshift) | Senior

Your Snowflake bill doubled this month, and the usage view shows one warehouse running near 24/7. Walk through how you'd attribute the spend to queries and teams, and the first three levers — auto-suspend, sizing, query fixes — you'd evaluate.

Q104 | Warehouses (Snowflake/BigQuery/Redshift) | Senior

A BigQuery query that scans 8TB runs every 15 minutes to refresh a dashboard viewed twice a day. The team says 'storage is cheap, queries are fine.' Build the cost case and the redesign — partitioning, clustering, materialized views, scheduled refresh.

Q105 | Warehouses (Snowflake/BigQuery/Redshift) | Mid

Redshift queries queue for 10 minutes every morning when BI dashboards refresh simultaneously with your ETL. Explain how you'd diagnose WLM queue contention and separate workloads so the ETL window and the analysts stop fighting.

Q106 | Warehouses (Snowflake/BigQuery/Redshift) | Senior

You're migrating from Redshift to Snowflake, and a validation run shows 40 of 500 queries return different results — mostly timestamp arithmetic, division semantics, and collation. Describe your equivalence-testing framework and how you decide what to fix versus formally re-baseline.

Q107 | Warehouses (Snowflake/BigQuery/Redshift) | Mid

A Snowflake MERGE into a 2-billion-row table got slower every week until it blew the nightly window, and you find the table's micro-partitions badly clustered on the merge key. How do you confirm clustering decay and decide if auto-clustering is worth its cost?

Q108 | Warehouses (Snowflake/BigQuery/Redshift) | Senior

An intern ran a cross join that consumed your entire BigQuery on-demand budget for the month in one afternoon. What guardrails — custom quotas, reservations, maximum bytes billed, dry-run gates in CI — do you implement, and how do you keep them from blocking legitimate work?

Q109 | Warehouses (Snowflake/BigQuery/Redshift) | Mid

Your dashboards intermittently show stale data, and you trace it to a transformation job and BI extract racing each other on the same tables. How do you implement publish signals or swap patterns so consumers never read mid-load state?

Q110 | Warehouses (Snowflake/BigQuery/Redshift) | Senior

Someone dropped a production schema at 4pm Friday, and downstream jobs start failing within minutes. Walk through your recovery using time travel or snapshots, what your RPO actually was versus what you promised, and the access-control review that follows.

Q111 | Warehouses (Snowflake/BigQuery/Redshift) | Mid

A query against your clustered Snowflake table still scans nearly all partitions, and you notice the filter wraps the cluster key in a function. Explain how pruning works on micro-partitions and how you'd restructure the query and verify improvement from the profile.

Q112Warehouses (Snowflake/BigQuery/Redshift)Senior

Marketing wants to share live customer-segment tables with an external agency. Compare secure data sharing, replicated datasets, and API extracts for this, covering revocation, cost attribution, PII controls, and what you'd refuse outright.

Q113Warehouses (Snowflake/BigQuery/Redshift)Mid

In some edge cases, rows from your BigQuery streaming inserts land minutes before they become queryable, so a reconciliation job that runs immediately after ingestion intermittently reports missing data. How do you make the job buffer-aware and define 'data is ready' precisely?

Q114Warehouses (Snowflake/BigQuery/Redshift)Senior

Your company runs Snowflake and BigQuery after a merger, and teams duplicate datasets across both 'temporarily.' Lay out your consolidation decision framework — workload fit, egress, contracts, team skills — and the interim governance to stop the duplication bleeding.

Q115Warehouses (Snowflake/BigQuery/Redshift)Mid

An hourly job leaves thousands of small commits on a Redshift table, and VACUUM can't keep up; scans degrade through the day. How do you restructure the load pattern — micro-batch consolidation, staging-and-append, sort key review — and verify scan health?

Q116Warehouses (Snowflake/BigQuery/Redshift)Senior

Row-level security must restrict 4,000 sales users to their own territories across 50 tables, and the first implementation made every dashboard 8x slower. Diagnose why RLS predicates wreck performance and design a faster enforcement model.

Q117Warehouses (Snowflake/BigQuery/Redshift)Senior

A month before contract renewal, you're asked to cut warehouse spend 30% without breaking SLAs. Present your prioritized plan across storage tiering, compute right-sizing, query optimization, and workload scheduling, including what you'd measure weekly to prove progress.

Q118Lakehouse (Delta/Iceberg/Parquet)Mid

Queries on your Delta table slow down every week, and DESCRIBE DETAIL shows millions of small files from streaming writes. Explain your compaction strategy — OPTIMIZE cadence, file size targets, auto-compaction — and how you measure read improvement afterward.

Q119Lakehouse (Delta/Iceberg/Parquet)Senior

A 3am job failed mid-write to a plain Parquet directory, leaving partial files a downstream job then read, double-counting half a day's data. Make the case for migrating this path to a table format, and outline the migration with consumers live.

Q120Lakehouse (Delta/Iceberg/Parquet)Mid

Two writers hit your Iceberg table concurrently and one fails with a commit conflict every night. Walk through how optimistic concurrency works here, which operations actually conflict, and how you'd reschedule or partition writes to coexist.

Q121Lakehouse (Delta/Iceberg/Parquet)Senior

Your Delta table's storage tripled in two months. Investigation shows VACUUM never ran and a high-frequency MERGE rewrites most files daily. Explain the cost mechanics, your retention-versus-time-travel tradeoff, and the safe vacuum rollout when downstream readers might use old snapshots.

Q122Lakehouse (Delta/Iceberg/Parquet)Mid

An upstream producer added a nested field to incoming JSON, and you discover that mergeSchema was silently turned on for your lakehouse table, which has been accumulating typo-variant columns like 'userId' and 'user_id'. How do you clean up and put schema evolution under control?

Q123Lakehouse (Delta/Iceberg/Parquet)Senior

You must choose between Delta and Iceberg as the org standard, with Spark, Trino, and Flink all in play plus a managed warehouse reading external tables. Walk through your evaluation criteria and where catalog interoperability actually bites in practice.

Q124Lakehouse (Delta/Iceberg/Parquet)Mid

A query reading yesterday's partition returns rows from today because writers used processing time while readers assume event time partitioning. How do you untangle the partition semantics, repair misplaced data, and document the contract so this stops recurring?

Q125Lakehouse (Delta/Iceberg/Parquet)Senior

A bad deployment wrote corrupted values into your Iceberg table for six hours before detection. Walk through using snapshot rollback versus selective overwrite for repair, what downstream consumers who already read bad snapshots must do, and the postmortem prevention items.

Q126Lakehouse (Delta/Iceberg/Parquet)Mid

Your analysts' ad-hoc queries on the lakehouse filter heavily on customer_id, but the table is partitioned by date only and every query scans everything. Compare repartitioning, Z-ordering or sort-based clustering, and bloom filters for fixing point-lookup pain.

Q127Lakehouse (Delta/Iceberg/Parquet)Senior

Your GDPR deletion job rewrites huge portions of the lake nightly because deletes are scattered across all partitions. Explain how deletion vectors or merge-on-read change the cost equation, and the compaction discipline needed so reads don't degrade.

Q128Lakehouse (Delta/Iceberg/Parquet)Mid

A column was written as INT for a year, then the producer started sending values exceeding INT range, and now some readers crash while others silently overflow. How do you execute the type widening across the table, history, and every consumer?

Q129Lakehouse (Delta/Iceberg/Parquet)Senior

Trino reads your Delta tables through a catalog that lags table commits by minutes, so dashboards occasionally miss the latest partition after loads. Diagnose where metadata caching breaks freshness guarantees and design the publish-notify pattern that fixes it.

Q130Lakehouse (Delta/Iceberg/Parquet)Senior

Parquet files from one producer are 10x larger than expected, and you find high-cardinality strings with dictionary encoding disabled and no compression tuning. Walk through how you'd audit encoding and compression choices and quantify the savings before rolling changes out.

Q131Lakehouse (Delta/Iceberg/Parquet)Senior

Leadership wants to drop the warehouse entirely and serve BI straight off the lakehouse. Using your current workload mix, argue where lakehouse query engines genuinely replace the warehouse and where concurrency, latency, or caching gaps still justify keeping it.

Q132CDC & IngestionMid

Your Debezium connector crashed over the weekend, and on restart the Postgres replication slot had been dropped, breaking the change stream's continuity. Walk through how you detect the gap, resynchronize the table, and prevent silent slot loss.

Q133CDC & IngestionSenior

The DBA reports your CDC replication slot is holding WAL and the production database disk is filling at 2am. Walk through the conversation: what you check on the consumer side, what you can safely drop, and the architectural fix preventing recurrence.

Q134CDC & IngestionMid

Your warehouse copy of a CDC-replicated table disagrees with the source by a few hundred rows that were deleted upstream but still live downstream. How do you determine whether deletes are being missed, tombstones mishandled, or compaction misconfigured?

Q135CDC & IngestionSenior

An upstream team ran a bulk UPDATE touching 80 million rows for a backfill, and your CDC pipeline is now 14 hours behind with downstream SLAs broken. How do you triage, and what contract do you establish with source teams about bulk operations?

Q136CDC & IngestionMid

Applying CDC events out of order after a connector restart left a customer's address showing a stale value despite a newer update existing. Explain how you order and deduplicate change events on apply — LSNs, sequence numbers, last-write-wins keys.
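
One way to reason about the apply step is a last-write-wins rule keyed on the source log position, sketched below. The `ChangeEvent` shape and the in-memory `state` dict are assumptions for illustration; a real consumer would apply the same comparison inside a MERGE or upsert against the target table.

```python
from typing import Dict, Iterable, NamedTuple, Optional


class ChangeEvent(NamedTuple):
    key: str                # primary key of the source row
    lsn: int                # log sequence number; higher means later in the source log
    op: str                 # "upsert" or "delete"
    value: Optional[dict]   # row image, None for deletes


def apply_events(state: Dict[str, ChangeEvent], events: Iterable[ChangeEvent]) -> None:
    """Apply events in any arrival order; the highest LSN per key wins."""
    for ev in events:
        current = state.get(ev.key)
        if current is not None and ev.lsn <= current.lsn:
            continue  # stale or duplicate event: ignore it
        # Deletes are stored as tombstones so an older upsert that arrives
        # late cannot resurrect the row.
        state[ev.key] = ev


def live_rows(state: Dict[str, ChangeEvent]) -> Dict[str, dict]:
    """Current view of the table, with tombstoned keys hidden."""
    return {k: ev.value for k, ev in state.items() if ev.op != "delete"}
```

Because the comparison uses `<=`, replaying the same event after a connector restart is a no-op, which is what makes the apply step idempotent.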

Q137CDC & IngestionSenior

You must onboard 200 source tables to CDC, but only 30 have primary keys and several have frequent schema migrations. Design your onboarding playbook: key strategies for keyless tables, schema-change handling, and the criteria for refusing a table.

Q138CDC & IngestionMid

A daily API ingestion silently started returning paginated results with a changed page size, and you've been losing the tail of every response for a week. How do you detect truncation-style losses generically across your API ingestions?

Q139CDC & IngestionSenior

Your snapshot-plus-stream CDC initialization of a 3TB table keeps failing partway, and each retry re-snapshots from scratch while the change stream backs up. Design a resumable, chunked initial load that converges with the stream without duplicates or gaps.

Q140CDC & IngestionMid

Files from a partner SFTP arrive with inconsistent names, occasional duplicates, and sometimes a day late. Your loader assumes one clean file per day. Redesign the ingestion to be manifest-driven and explain how you'd reconcile completeness with the partner.

Q141CDC & IngestionSenior

Compliance demands you prove that every source-system transaction landed in the warehouse exactly once for an audit. Your stack is CDC through Kafka into a lakehouse. Design the end-to-end completeness evidence — counts, checksums, watermark ledgers — you'd present.

Q142CDC & IngestionMid

An upstream MySQL migration changed a column from DATETIME to TIMESTAMP, and your CDC consumer now writes times shifted by the server timezone. How do you spot timezone corruption quickly, fix the decoder, and repair the rows written during the bad window?

Q143CDC & IngestionSenior

Your CDC feed captures every transient state of an order — sometimes ten updates in a second — and downstream consumers only need meaningful business transitions. Where do you collapse churn into business events, and what do you lose by doing it too early?

Q144CDC & IngestionMid

Your ingestion marks a batch complete when the file count matches the manifest, but a partner started splitting files unpredictably, and completeness checks now block loads daily. How do you renegotiate the handshake — control files, checksums, row counts — pragmatically?

Q145CDC & IngestionSenior

A source team wants to expose data via 'just query our replica' instead of CDC, arguing simplicity. Lay out the failure modes you've seen with replica-scraping — load coupling, missed deletes, schema drift — and when you'd actually accept it.

Q146Python for Data EngineeringMid

A pandas job that processes daily files started OOMing when one customer's file grew to 12GB. Walk through your options — chunked processing, dtype tuning, switching to Polars or DuckDB — and how you choose without rewriting the whole pipeline.

Q147Python for Data EngineeringSenior

Your team's shared Python ETL library has no tests and no pinned dependencies, and an innocent dependency bump broke datetime parsing in six pipelines overnight. Lay out how you'd retrofit packaging discipline, lockfiles, and a test harness without freezing feature work.

Q148Python for Data EngineeringMid

A script loading API data into the warehouse fails halfway through its loop, and reruns duplicate everything loaded before the crash. Refactor the design for checkpointed progress and idempotent writes, and explain how you'd test the failure path deliberately.
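
A minimal sketch of the checkpoint-plus-upsert idea follows. It assumes every row carries a stable `id` and that the sink can be written by key; `load_pages`, the dict-based `sink`, and the `checkpoint` structure are hypothetical stand-ins for a warehouse MERGE and a durable checkpoint store.

```python
from typing import Callable, Dict, Iterable


def load_pages(pages: Iterable[int],
               fetch: Callable[[int], Iterable[dict]],
               sink: Dict[int, dict],
               checkpoint: Dict[str, set]) -> None:
    """Load pages one by one, skipping pages already recorded as done."""
    for page in pages:
        if page in checkpoint["done"]:
            continue  # finished in an earlier run
        for row in fetch(page):
            # Keyed write: re-loading the same row overwrites instead of duplicating.
            sink[row["id"]] = row
        # Record progress only after the whole page has been written.
        checkpoint["done"].add(page)
```

Testing the failure path deliberately means injecting a fetch that crashes partway through a page, rerunning, and asserting that the sink ends up with each row exactly once.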

Q149Python for Data EngineeringSenior

Profiling shows your ingestion service spends 80% of its time serializing dicts to JSON and back between internal steps. How do you redesign the data path — Arrow, dataframes end-to-end, schema-typed records — and quantify the win before committing?

Q150Python for Data EngineeringMid

Your float-based revenue aggregation in Python disagrees with the warehouse SQL by a few paise per million rows, and finance notices. Explain where float arithmetic diverges, when Decimal is mandatory, and how you align Python and SQL computation rules.
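
The divergence is easy to reproduce with the standard library. The sketch below uses made-up amounts; which rounding mode your warehouse applies to its decimal type is an assumption to verify, and `ROUND_HALF_UP` is shown only as one common choice.

```python
from decimal import Decimal, ROUND_HALF_UP

amounts = ["0.10"] * 10

# Binary floats cannot represent 0.10 exactly, so the error accumulates.
float_total = sum(float(a) for a in amounts)

# Decimal keeps exact base-10 values, like a NUMERIC/DECIMAL column.
decimal_total = sum(Decimal(a) for a in amounts)

print(float_total)     # 0.9999999999999999
print(decimal_total)   # 1.00

# Rounding diverges too: the float nearest to 2.675 is slightly below it.
float_rounded = round(2.675, 2)
decimal_rounded = Decimal("2.675").quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

print(float_rounded)    # 2.67
print(decimal_rounded)  # 2.68
```

Aligning the two sides usually means parsing money as `Decimal` from strings (never from floats) and fixing the scale and rounding mode in one shared place.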

Q151Python for Data EngineeringSenior

A 3am page: your Python consumer is stuck, not crashed — no errors, no progress. You suspect a hung HTTP call with no timeout deep in a vendor SDK. How do you confirm with a live thread dump, mitigate, and audit the codebase for missing timeouts?

Q152Python for Data EngineeringMid

You parallelized an API extraction with threads and got no speedup, then with processes and blew memory. Explain how you'd reason about IO-bound versus CPU-bound stages here, and where async would actually pay off in this extractor.

Q153Python for Data EngineeringSenior

Your team writes one-off scripts for every new source, and there are now 80 one-of-a-kind scripts with copy-pasted retry logic. Design the internal ingestion framework you'd extract (config-driven sources, shared error handling) and how you migrate scripts incrementally.

Q154Python for Data EngineeringMid

A unit-tested transformation passes in CI but corrupts accented customer names in production. You trace it to encoding assumptions reading vendor files. How do you handle encoding detection and normalization robustly, and add regression coverage with real-world samples?

Q155Python for Data EngineeringSenior

A memory leak in a long-running Python ingestion service forces restarts every two days, and the heap growth is in C-extension objects invisible to basic profilers. Walk through your diagnosis toolkit and the mitigation you'd ship while hunting the root cause.

Q156Python for Data EngineeringMid

Your data validation logic is scattered across try/except blocks that swallow errors and log nothing useful. A silent except just ate a week of malformed rows. How do you restructure error handling so bad records are quarantined, counted, and visible?
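
The quarantine pattern can be sketched in a few lines. `parse_order` and its field names are hypothetical; the point is that only expected exception types are caught, and each rejected row is kept along with its reason so it can be counted and inspected.

```python
from typing import Callable, Iterable, List, Tuple


def parse_order(row: dict) -> dict:
    """Hypothetical validator: raises KeyError or ValueError on bad input."""
    return {"order_id": int(row["order_id"]), "amount": float(row["amount"])}


def split_valid(rows: Iterable[dict],
                validate: Callable[[dict], dict]) -> Tuple[List[dict], List[dict]]:
    """Return (good rows, quarantined rows with the reason they failed)."""
    good, quarantined = [], []
    for index, row in enumerate(rows):
        try:
            good.append(validate(row))
        except (KeyError, ValueError) as exc:
            # Narrow except: unexpected errors still crash loudly.
            quarantined.append({
                "index": index,
                "row": row,
                "error": f"{type(exc).__name__}: {exc}",
            })
    return good, quarantined
```

Emitting `len(quarantined)` as a metric, and failing the run above an agreed threshold, is what turns a silent loss into a visible one.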

Q157Python for Data EngineeringMid

A scheduled Python job reads config from environment variables, and a typo'd variable name in one environment made it silently use a default pointing at production from staging. Redesign configuration handling so misconfiguration fails fast and loudly.
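
A fail-fast loader might look like the sketch below. The `ETL_` prefix and the variable names are assumptions for illustration: the idea is to have no silent defaults for required settings and to treat unrecognized prefixed variables as probable typos.

```python
from typing import Dict, Mapping, Sequence


class ConfigError(RuntimeError):
    """Raised at startup, before any data is read or written."""


def load_config(env: Mapping[str, str],
                required: Sequence[str],
                prefix: str = "ETL_") -> Dict[str, str]:
    """Return required settings, or raise listing every problem found."""
    missing = [name for name in required if not env.get(name)]
    # A prefixed variable that is not recognized is most likely a typo.
    unknown = [name for name in env
               if name.startswith(prefix) and name not in required]
    if missing or unknown:
        raise ConfigError(
            f"missing: {sorted(missing)}; unrecognized: {sorted(unknown)}"
        )
    return {name: env[name] for name in required}
```

In a job this would be called once at startup with `os.environ`, so a typo'd name stops the run instead of falling back to a default.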

Q158Python for Data EngineeringSenior

You're reviewing a PR where a teammate implements a custom date-windowing iterator with off-by-one risks for a billing pipeline. How do you evaluate boundary correctness, what property-based tests would you require, and when do you insist on a library instead?

Q159Performance & CostMid

Your nightly pipeline's runtime crept from 2 to 6 hours over a year with no single regression to blame. Walk through how you'd instrument per-stage timing, identify the top contributors, and decide which fixes buy back the SLA margin cheapest.

Q160Performance & CostSenior

The CFO asks why the data platform costs ₹40 lakh a month and what they get for it. You have no showback today. Design cost attribution across warehouse, lake, streaming, and orchestration, and the unit metrics you'd report: cost per pipeline, per dashboard, and per team.

Q161Performance & CostSenior

You find a dashboard auto-refreshing a heavy query every 5 minutes for an audience that checks it twice daily, costing more than its team's entire allocation. How do you find more of these systematically and fix the incentive that created them?

Q162Performance & CostSenior

A 30% cloud cost cut is mandated, and your analysis shows the savings are split across hundreds of small inefficiencies, not one big win. Describe your campaign: tooling, team-by-team targets, automated guardrails, and how you avoid the costs creeping back.

Q163Performance & CostMid

Storage costs surprise you: 60% of lake spend is on tables nobody has queried in six months, but every owner claims theirs 'might be needed.' Design your archival and deletion policy, including the access-audit evidence that settles the arguments.

Q164Performance & CostSenior

Your streaming infrastructure costs 5x the batch equivalent and serves three use cases, only one of which truly needs sub-minute latency. How do you tier the latency requirements with stakeholders and re-platform the other two without political fallout?

Q165Performance & CostMid

A pipeline processes the full 3-year history every night because 'incremental was too risky' for a previous team. Outline how you'd derisk the move to incremental processing — reconciliation runs, parallel operation, fallback — and the spend you'd reclaim.

Q166Performance & CostSenior

After right-sizing, your warehouse spend is dominated by a handful of legitimately heavy queries from the ML team's feature generation. Walk through the optimization conversation: precomputation, sampling strategies, dedicated capacity, or chargeback — and how you'd decide.

Q167Performance & CostMid

Egress charges spiked because a new integration copies the same dataset from cloud storage to an external vendor four times daily, mostly unchanged. How do you detect redundant data movement across the platform and implement delta-based sharing?

Q168Performance & CostSenior

You're designing the platform's capacity for a festival-season traffic spike projected at 8x normal volume. Walk through load modeling, which components scale linearly versus break nonlinearly, and the pre-warming and degradation plans you'd have ready.

Q169Performance & CostMid

A team duplicated a 50TB dataset into their own project 'for performance,' and now two copies drift and both incur storage. How do you make the shared, governed copy fast enough that duplication stops being rational?

Q170Performance & CostSenior

Your platform's P95 dashboard latency is fine, but the CEO's Monday 9am dashboard is reliably slow because it cold-starts everything weekly. How do you reason about worst-case versus average performance economics, and what targeted fixes avoid overprovisioning all weekend?

Q171Performance & CostSenior

An engineer proposes caching every intermediate dataset to make reruns instant; storage projections triple. Construct the framework you'd use to decide which intermediates earn materialization — recompute cost, hit rate, staleness risk — and the eviction policy.

Q172Performance & CostMid

Compression looks like free money: switching a hot table's codec saves 70% storage but benchmark reads slow 20%. Walk through how you'd profile the workload's read-write ratio and choose codecs per table tier instead of one global default.

Q173Backfills & ReprocessingMid

A logic bug shipped three weeks ago corrupted one derived column across 21 daily partitions. Walk through scoping the blast radius, choosing between partition rewrites and full rebuild, and sequencing the fix so dashboards don't show mixed states mid-repair.

Q174Backfills & ReprocessingSenior

You must backfill two years of a metric whose upstream sources have changed schema twice and one source no longer exists for the first six months. How do you negotiate what 'historically accurate' means, and document the epochs honestly?

Q175Backfills & ReprocessingMid

Your backfill of 400 partitions is hammering the same warehouse the morning SLA pipelines need. Design the throttling and scheduling approach — off-peak windows, capacity isolation, chunk sizing — and how you'd checkpoint so a failure doesn't restart everything.

Q176Backfills & ReprocessingSenior

A backfill ran with today's code against historical data, but the transformation reads a reference table that has since changed, producing anachronistic results — current tax rates applied to old orders. How do you architect transformations to be replayable with time-correct context?

Q177Backfills & ReprocessingMid

Mid-backfill, you realize the fix itself has a bug and 60 of 150 repaired partitions are wrong in a new way. How do you track repair state per partition, and what would you have put in place before starting that you now wish existed?

Q178Backfills & ReprocessingSenior

Reprocessing a quarter of events through your streaming pipeline would take 11 days at live speed. Design the batch-parallel replay path — same logic, different engine — and how you guarantee both paths produce identical results before trusting it.

Q179Backfills & ReprocessingMid

After a backfill completed, downstream aggregates still show old numbers because their incremental logic only picks up 'new' partitions by date and skipped the rewritten history. How do you propagate reprocessing through a chain of incremental consumers reliably?

Q180Backfills & ReprocessingSenior

Finance closed the books on numbers you must now correct via backfill, and restated figures will differ from reported ones. How do you coordinate the restatement — versioned snapshots of the pre-fix state, audit annotations, stakeholder comms — like an engineer, not just a messenger?

Q181Backfills & ReprocessingSenior

Your team treats every backfill as a bespoke, scary, hand-run event. Describe the self-service backfill tooling you'd build — date-range parameters, dry-run mode, progress tracking, automatic reconciliation — and the guardrails that make it safe for any engineer to use.

Q182Backfills & ReprocessingSenior

A backfill must rewrite a table that's also receiving live writes every few minutes, and pausing ingestion for two days isn't acceptable. Walk through your concurrent strategy — shadow tables, partition swaps, dual writes — and the cutover validation.

Q183Backfills & ReprocessingMid

A source vendor re-delivered six months of 'corrected' files with no changelog of which records differ. How do you diff 200 million rows against what you originally loaded, classify the changes, and decide what merits reprocessing downstream?
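
The classification logic can be sketched with per-row hashes. This in-memory version assumes each record has a stable `id` key and JSON-serializable values; at 200 million rows the same comparison would more likely run as a hash join inside the warehouse.

```python
import hashlib
import json
from typing import Dict, Iterable, List


def row_hash(row: dict) -> str:
    """Stable fingerprint of a row, independent of key order."""
    canonical = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def classify_changes(original: Iterable[dict],
                     redelivered: Iterable[dict],
                     key: str = "id") -> Dict[str, List]:
    """Bucket keys into added, removed, and changed."""
    old = {r[key]: row_hash(r) for r in original}
    new = {r[key]: row_hash(r) for r in redelivered}
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }
```

Hashing only the columns that feed downstream metrics, rather than the whole row, helps separate cosmetic corrections from changes that merit reprocessing.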

Q184Backfills & ReprocessingSenior

Your event schema dropped a field eight months ago, and a new requirement needs history for a metric depending on it. Raw events are archived but expensive to rescan. How do you scope the cheapest sufficient reprocessing and push back on scope creep?

Q185Backfills & ReprocessingMid

You backfilled a fact table successfully, but belatedly realize the SCD dimension it joins to was also wrong during the affected window and was fixed separately last week. How do you reason about repair ordering across interdependent tables?

Q186Backfills & ReprocessingSenior

Post-incident, leadership asks for a guaranteed 24-hour repair SLA for any data corruption up to 90 days old. Assess what that promise requires — raw retention, replay capacity, lineage completeness, rehearsals — and what you'd quote as its real cost.

Q187Pipeline Reliability & IdempotencyMid

A pipeline retried after a timeout, but the first attempt had actually succeeded slowly, and the partition now holds doubled data. Redesign the write path so retries are harmless — deterministic overwrite, write-audit-publish — and explain the testing that proves it.
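
The difference between an append and a deterministic overwrite shows up clearly in a toy model. Here the table is a plain dict of partition to rows, and the function names are illustrative; a real implementation would use partition overwrite or a staged swap in the warehouse or table format.

```python
from typing import Callable, Dict, List


def append_write(table: Dict[str, List[dict]], partition: str, rows: List[dict]) -> None:
    """Not retry-safe: a second attempt adds the rows again."""
    table.setdefault(partition, []).extend(rows)


def overwrite_write(table: Dict[str, List[dict]], partition: str, rows: List[dict]) -> None:
    """Retry-safe: the partition's content depends only on the input."""
    table[partition] = list(rows)


def write_audit_publish(table: Dict[str, List[dict]], partition: str,
                        rows: List[dict], audit: Callable[[List[dict]], bool]) -> None:
    """Stage the rows, check them, and only then replace the live partition."""
    staged = list(rows)
    if not audit(staged):
        raise ValueError(f"audit failed for partition {partition}; nothing published")
    table[partition] = staged
```

A test that proves the property runs the write twice with the same input and asserts the partition is unchanged after the second run.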

Q188Pipeline Reliability & IdempotencySenior

Your platform's worst pipelines fail silently: they succeed operationally but write subtly wrong data. Reliability dashboards show all green. Design the semantic monitoring layer — output invariants, cross-source reconciliation, distribution drift — that catches what task status can't.

Q189Pipeline Reliability & IdempotencyMid

At 3am the warehouse loader and its retry fired simultaneously due to an alerting-triggered manual rerun, and both INSERT-appended the same batch. What combination of natural keys, batch IDs, and merge semantics makes this load safe under concurrent execution?

Q190Pipeline Reliability & IdempotencySenior

A nine-step pipeline failed at step seven, and the on-call restarted from step one, recomputing four expensive hours and re-sending a webhook to a partner. Redesign for resumability: where do you place commit points, and which steps must be transactional or idempotent?

Q191Pipeline Reliability & IdempotencySenior

Your pipeline depends on five upstream sources with different arrival reliabilities, and one flaky source delays everything daily. Design the partial-readiness strategy — process what's arrived, late-merge stragglers, flag completeness to consumers — and its complexity costs.

Q192Pipeline Reliability & IdempotencySenior

Your team's MTTR for data incidents is two days, mostly spent figuring out what broke and what it touched. Walk through the incident-response capability you'd build for data — lineage-driven blast radius, freshness SLO dashboards, runbooks per failure class — and how you'd measure improvement.

Q193Pipeline Reliability & IdempotencyMid

A transformation reads its own output table from the previous run as input for the next, and one bad run has now poisoned every run since. How do you break the self-referential corruption, restore a good baseline, and redesign to limit feedback contamination?

Q194Pipeline Reliability & IdempotencySenior

You inherit a business-critical pipeline with no tests, no staging environment, and a release process of 'edit in production.' Sequence your first 90 days of hardening — what you stabilize first, what you refuse to touch yet, and how you keep changes shippable throughout.

Q195Pipeline Reliability & IdempotencyMid

An external API your pipeline calls has started rate-limiting aggressively at month-end when everyone's pipelines hit it. Your naive retry storm makes it worse. Design backoff, request budgeting, and caching so your pipeline degrades gracefully instead of amplifying the problem.
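
One common shape for the backoff part is capped exponential delay with full jitter, sketched below. `RateLimited` is a hypothetical exception standing in for whatever your HTTP client raises on a 429, and the base and cap values are illustrative defaults.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error raised by the caller's client on HTTP 429."""


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0, rng=random) -> float:
    """Random delay in [0, min(cap, base * 2**attempt)], known as full jitter."""
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retry(fn, max_attempts: int = 5, sleep=time.sleep, rng=random):
    """Call fn, backing off on RateLimited; re-raise once the budget is spent."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the failure
            sleep(backoff_delay(attempt, rng=rng))
```

The jitter spreads clients out so they do not retry in lockstep, and the bounded `max_attempts` acts as a simple request budget. Injecting `sleep` and `rng` keeps the retry path testable without real waiting.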

Q196Pipeline Reliability & IdempotencySenior

Leadership wants 'five nines for data.' You know compute availability isn't the issue — correctness and freshness are. Reframe the reliability conversation: define data SLOs that actually map to business harm, and the error-budget policy you'd run against them.

Q197Pipeline Reliability & IdempotencyMid

Your alerting fires when pipelines fail, but the worst incident this quarter was a pipeline that didn't run at all — the schedule was accidentally disabled for a week. Design absence-of-success monitoring and explain why it must live outside the orchestrator.
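
The core check is a staleness comparison run by something other than the orchestrator, as in this sketch. The pipeline names and the way `last_success` gets populated (for example, from a heartbeat table each job writes on completion) are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def overdue_pipelines(last_success: Dict[str, datetime],
                      max_age: Dict[str, timedelta],
                      now: datetime) -> List[str]:
    """Pipelines whose latest success is missing or older than allowed."""
    late = []
    for name, allowed in max_age.items():
        seen = last_success.get(name)
        # A pipeline that has never reported success is also overdue.
        if seen is None or now - seen > allowed:
            late.append(name)
    return sorted(late)
```

Because the check keys off expected pipelines in `max_age` rather than off scheduler events, a disabled schedule shows up as overdue instead of as silence.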

Q198Pipeline Reliability & IdempotencySenior

A disaster-recovery drill reveals your pipelines can't be restarted in the secondary region because of hardcoded endpoints, missing secrets, and stateful checkpoints stranded in the primary. Prioritize the portability fixes and define the recovery drill cadence you'd institutionalize.

Q199Pipeline Reliability & IdempotencyMid

Two pipelines owned by different teams write to the same target table on different schedules, and each occasionally clobbers the other's late-arriving corrections. How do you establish single-writer ownership or partition-level write boundaries without a months-long re-architecture?

Q200Pipeline Reliability & IdempotencySenior

After three quiet months, your team is paged for a cascading failure: one upstream delay caused 40 downstream SLA breaches and a flood of duplicate alerts. Design the dependency-aware alerting and automatic downstream suppression that turns this into one actionable page.
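
The suppression rule can be sketched as a graph query: page only for failing nodes that have no failing upstream. The `upstream` mapping is an assumed lineage structure, and the sketch checks direct parents only, so a healthy intermediate node would break the chain.

```python
from typing import Dict, Iterable, Set


def root_causes(failing: Set[str], upstream: Dict[str, Iterable[str]]) -> Set[str]:
    """Failing nodes none of whose direct upstreams are also failing."""
    return {
        node for node in failing
        if not any(parent in failing for parent in upstream.get(node, ()))
    }


def suppressed(failing: Set[str], upstream: Dict[str, Iterable[str]]) -> Set[str]:
    """Failing nodes whose alerts can be folded into a root-cause page."""
    return failing - root_causes(failing, upstream)
```

The single page would then list the root cause and attach the suppressed set as its blast radius.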

