Data Scientist / ML Engineer Interview Questions

200 scenario-based questions with detailed model answers, organized skill-wise and tool-wise. Filter by topic, level or keyword, reveal the answer — then pressure-test yourself in a real mock.

Q001 | Statistics & Inference | Senior

Your e-commerce team ran a 2-week A/B test on a new checkout flow and saw a 4% lift in conversion. The p-value is 0.03. Leadership wants to ship immediately, but you notice the test was stopped early on day 10 when results looked good. How do you handle this?

Q002 | Statistics & Inference | Mid

A product analyst says your churn prediction model is 92% accurate and therefore excellent. The dataset has 8% churners and 92% non-churners. How do you explain why this accuracy number is misleading and what metrics you'd use instead?
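
A minimal sketch of why the 92% figure carries little information, assuming a hypothetical dataset of 1,000 customers with the 8% / 92% split from the scenario. The "model" here is a made-up baseline that never predicts churn; it is not the analyst's model.

```python
# Hypothetical labels matching the scenario: 8% churners (1), 92% non-churners (0).
y_true = [1] * 80 + [0] * 920

# A baseline that always predicts "no churn".
y_pred = [0] * len(y_true)

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)
# Guard against division by zero when nothing is predicted positive.
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

The baseline matches the quoted accuracy while catching no churners, which is why precision, recall, and PR-style metrics on the churn class are more informative here.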

Q003 | Feature Engineering | Senior

You're building a real-time fraud detection model at a payments company. Transaction volume is 50k/second. Your data science team proposes computing 200 rolling aggregate features (e.g., spend in last 1h, 24h, 7d) at inference time. What are the risks and how would you architect this?

Q004 | Feature Engineering | Mid

You have a categorical feature 'city' with 3,000 unique values and 20% of them appear fewer than 5 times in your training data. One-hot encoding explodes your feature space. What encoding strategies would you evaluate and what's your decision process?

Q005 | Model Selection & Evaluation | Senior

You trained a gradient boosting model for loan default prediction that achieves AUC-ROC 0.89 on your hold-out set. Three months after deployment, the model's Gini coefficient in production dropped from 0.78 to 0.51. The loan book and economic conditions haven't changed dramatically. What do you investigate first?

Q006 | Model Selection & Evaluation | Mid

You're comparing a logistic regression baseline (AUC 0.81) against an XGBoost model (AUC 0.85) for a credit scoring use case. Your manager says 'just ship the XGBoost—it's better.' What questions do you raise before agreeing?

Q007 | Supervised Learning | Senior

You're training a customer lifetime value regression model at a subscription SaaS company. The target variable (12-month LTV) has a heavily right-skewed distribution—most customers have LTV of $50-200, but 2% have LTV above $5,000. A naive RMSE-optimized model mostly ignores the high-value tail. How do you fix this?

Q008 | Supervised Learning | Mid

You have a binary classification problem with 1 million training samples. You try a random forest with 500 trees (AUC 0.84) and XGBoost with default hyperparameters (AUC 0.83). Your colleague says random forest is better. You suspect XGBoost is undertuned. How do you proceed?

Q009 | Unsupervised & Clustering | Senior

You're tasked with segmenting 2 million B2B customers for a telecom company using behavioral usage data (call minutes, SMS, data consumption, roaming frequency). You run K-means with K=8. Your sales director says some segments are 'too broad' and wants finer segments. How do you evaluate and refine the segmentation without just increasing K?

Q010 | Unsupervised & Clustering | Mid

You run K-means on customer transaction data and get well-separated clusters in 2D UMAP plots. When you hand the segments to the marketing team, they say the clusters don't make business sense—high-value and low-value customers are mixed. What went wrong and how do you fix it?

Q011 | Deep Learning | Senior

You're training a ResNet-50 image classifier on a dataset of 50,000 medical X-rays to detect pneumonia. Training accuracy reaches 98% but validation accuracy plateaus at 71%. You've tried dropout and L2 regularization but the gap persists. What's your structured debugging approach?

Q012 | Deep Learning | Mid

You're training a text classification model using BERT fine-tuning on a 10,000-sample customer support ticket dataset. After 3 epochs, training loss is 0.05 but validation loss is 1.8. What are three concrete things you'd try?

Q013 | NLP | Senior

You've deployed a BERT-based product review classifier at an e-commerce company. Three months in, a new product category (smart home devices) launches. Accuracy on smart home reviews drops to 62% while overall accuracy stays above 85%. How do you handle this without full retraining every week?

Q014 | NLP | Mid

You're building a named entity recognition (NER) system for a legal document processing pipeline. Your spaCy NER model trained on general web text achieves an F1 of 0.55 on legal contracts. A lawyer suggests you just need more data. Is she right? What else could improve this significantly?

Q015 | Experimentation & A/B Testing | Senior

Your growth team runs 15 simultaneous A/B tests on your mobile app. When you analyze results, 4 tests show significant positive effects (p < 0.05). The team wants to ship all 4 winners. You flag that this is likely wrong. Explain why and what corrective analysis you'd run.
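
The multiple-comparisons arithmetic behind this scenario can be sketched as follows. It assumes the 15 tests are independent and, as a worst case, that none of the changes has a real effect; both are simplifying assumptions, not facts from the scenario.

```python
alpha = 0.05
n_tests = 15

# Chance of at least one false positive across all tests
# (family-wise error rate) under the assumptions above.
fwer = 1 - (1 - alpha) ** n_tests

# Expected number of false positives among the 15 tests.
expected_false_positives = alpha * n_tests

# Bonferroni-corrected per-test threshold that keeps the FWER at or below alpha.
bonferroni_alpha = alpha / n_tests

print(f"FWER: {fwer:.3f}")
print(f"Expected false positives: {expected_false_positives:.2f}")
print(f"Bonferroni threshold: {bonferroni_alpha:.4f}")
```

With roughly a one-in-two chance of at least one spurious "winner", the four significant results need a corrected threshold or a replication run before they are all treated as real.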

Q016 | Experimentation & A/B Testing | Mid

You're running an A/B test on a new recommendation algorithm for a music streaming app. Your primary metric is 7-day retention, but you notice that session length increased significantly in the treatment group during the first 3 days. Should you call the test early based on this secondary metric?

Q017 | ML System Design | Senior

Design a recommendation system for a news app with 5 million daily active users and 50,000 articles published per day. The system must serve personalized feeds with p99 latency under 100ms. Walk me through your architecture from data ingestion to serving.

Q018 | ML System Design | Senior

A fintech startup asks you to design an ML system to approve or reject personal loan applications in real time (decision within 2 seconds). They have 3 years of historical loan data, no existing ML infrastructure, and a team of 2 data scientists. What would you build first and what would you defer?

Q019 | Model Deployment | Senior

Your ML team deployed a churn model to production via a REST API. Six weeks later, DevOps reports the inference service is using 4x more memory than provisioned and occasionally OOM-killing. The model hasn't changed. What are the likely causes and how do you debug?

Q020 | Model Deployment | Mid

You've trained a scikit-learn pipeline that includes a TF-IDF vectorizer and a logistic regression model for text classification. You pickle it and deploy to production. Two weeks later the service throws a deserialization error. What went wrong and how do you prevent this going forward?

Q021 | Time Series | Senior

You're a data scientist at a retail chain. Your demand forecasting model (Prophet) was achieving MAPE of 12% before COVID. During the pandemic and recovery period (2020-2022), MAPE spiked to 45%. Now in 2024, leadership wants the model fixed. How do you approach rebuilding it?

Q022 | Time Series | Mid

You're forecasting daily electricity consumption for a utility company. Your ARIMA model has low training error but the residuals show clear autocorrelation in the ACF plot at lag 7. What does this tell you and how do you fix it?

Q023 | Bias & Fairness | Senior

Your hiring algorithm at an HR tech company shows that female candidates are 23% less likely to be shortlisted than male candidates with equivalent qualifications. Legal says this creates disparate impact liability. How do you diagnose the source of the bias and what mitigation strategies are available?

Q024 | Bias & Fairness | Mid

You're reviewing a credit scoring model and find that it performs well overall (AUC 0.82) but has an AUC of 0.68 for applicants from rural zip codes. A colleague says 'rural customers are a small segment, so overall performance is fine.' How do you respond?

Q025 | Python & Pandas | Mid

You have a Pandas DataFrame with 50 million rows representing transaction events. When you run `.groupby('customer_id').agg({'amount': 'sum', 'event_time': 'max'})`, the process runs for 25 minutes and eats 32GB of RAM. How do you speed this up?

Q026 | Python & Pandas | Senior

You're doing feature engineering on a training dataset with 200 features. You need to compute pairwise correlations, handle missing values, and run outlier detection—all inside a scikit-learn Pipeline to prevent train-val leakage. A junior engineer says 'just preprocess the whole DataFrame before splitting.' How do you explain the leakage risk and structure the pipeline correctly?

Q027 | SQL for Data Science | Mid

You're tasked with computing 30-day rolling average revenue per user from a raw events table with 2 billion rows in BigQuery. A colleague wrote a solution using a self-join. It's been running for 90 minutes. How do you rewrite it to be efficient?

Q028 | SQL for Data Science | Senior

Your data science team needs to compute a 'next purchase date' feature for each customer-product pair in a 500M row orders table for training a repurchase prediction model. Naive implementations keep timing out. Walk me through an efficient approach in SQL and explain the pitfall of the obvious first attempt.

Q029 | Statistics & Inference | Senior

You're analyzing whether a new pricing algorithm caused a revenue increase of $2M/month. The rollout was not randomized—it was deployed to high-traffic markets first. A stakeholder claims the revenue increase proves the algorithm works. What's wrong with this causal claim and how would you establish a more rigorous estimate?

Q030 | Feature Engineering | Senior

You're building a real-estate price prediction model. 'Neighborhood' is a key predictor with 800 unique values. Some neighborhoods have only 3 sales in your training data. You try target encoding and observe severe overfitting on rare neighborhoods. How do you address this rigorously?
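
One common way to tame target encoding on rare categories is to shrink each category mean toward the global mean. The sketch below uses an m-estimate with a hypothetical prior weight `m=20`; the function name, the toy prices, and the value of `m` are illustrative assumptions. In practice the encoding would be fit on training folds only so that a row's own target never leaks into its feature.

```python
from collections import defaultdict

def smoothed_target_encoding(categories, targets, m=20.0):
    """Return ({category: shrunken mean}, global mean).

    Each category mean is pulled toward the global mean with prior weight m,
    so categories with few rows stay close to the global mean.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    encoding = {
        category: (sums[category] + m * global_mean) / (counts[category] + m)
        for category in counts
    }
    return encoding, global_mean

# Toy data (prices in $k): one neighborhood with 3 sales, one with 300.
categories = ["rare"] * 3 + ["common"] * 300
targets = [900.0, 950.0, 1000.0] + [300.0] * 300

encoding, global_mean = smoothed_target_encoding(categories, targets)
print(f"global mean: {global_mean:.1f}")
print(f"rare: raw mean 950.0 -> encoded {encoding['rare']:.1f}")
print(f"common: raw mean 300.0 -> encoded {encoding['common']:.1f}")
```

The rare neighborhood's encoded value lands far below its raw mean of 950, while the well-populated one barely moves, which is the behavior that limits overfitting on three-sale neighborhoods.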

Q031 | Model Selection & Evaluation | Mid

You train five models with 10-fold cross-validation. The best model has mean CV AUC 0.854 ± 0.012. The second-best has 0.849 ± 0.009. Your stakeholder says pick the best CV score. What's the statistically correct approach and what would you actually recommend?

Q032 | Supervised Learning | Senior

You're building a multi-label text classifier to tag customer support tickets with one or more of 40 product categories. 15 of the 40 categories have fewer than 100 training examples. Standard binary cross-entropy fine-tuning on BERT gives micro-F1 of 0.71 but macro-F1 of 0.42. How do you improve the tail-category performance?

Q033 | Deep Learning | Senior

You're training a large language model fine-tune (7B parameter LLaMA variant) for a domain-specific task. The training job uses 8 A100 GPUs. After 3 days of training, loss is still not converging—it oscillates between 2.1 and 2.4. What do you diagnose and fix?

Q034 | NLP | Senior

You've deployed a RAG (retrieval-augmented generation) pipeline for a legal research assistant. Users report that the system confidently answers questions with incorrect case law citations that look plausible but don't exist. How do you diagnose and mitigate this hallucination problem?

Q035 | Experimentation & A/B Testing | Senior

You're running an A/B test at a ride-sharing company where treatment users get a surge pricing warning 10 minutes before a known demand spike. You see that treatment group users have 15% higher ride completion. But your data scientist flags that there may be network effects—treatment users booking ahead might displace control users during the spike. How do you handle interference in this experiment?

Q036 | ML System Design | Mid

A startup wants to build a 'similar products' feature for their e-commerce site with 100,000 SKUs. They have purchase history, product descriptions, and images. They have no ML infrastructure. What's the simplest working system you'd build in 2 weeks?

Q037 | Model Deployment | Senior

You need to deploy a PyTorch model that runs inference on 4K video frames in real time (30 fps) on edge devices (NVIDIA Jetson AGX) with no internet connectivity. The model currently takes 150ms per frame on a cloud GPU. How do you approach optimization and deployment?

Q038 | Time Series | Senior

You're building a predictive maintenance model for 500 industrial machines. Each machine generates sensor readings (temperature, vibration, pressure) at 1Hz for 3 years. Machine failures are rare—2% of machine-days have a failure within the next 7 days. How do you structure this as an ML problem and what architecture would you use?

Q039 | Bias & Fairness | Senior

Your NLP-based resume screening model shows that candidates with names common in certain ethnic groups are 18% less likely to pass the first screen, even controlling for education and experience. The model doesn't use name as a feature. How is this possible and what do you do?

Q040 | Python & Pandas | Senior

You need to build a feature engineering function that processes 10,000 individual customer DataFrames (each ~5,000 rows) sequentially in a production batch job. The current implementation takes 4 hours. A teammate suggests switching to Dask. Is Dask the right fix? What would you try?

Q041 | SQL for Data Science | Senior

You're building a feature for 'days since last purchase' for a churn model. Your SQL query uses `DATEDIFF(CURRENT_DATE, MAX(order_date))` grouped by customer_id. During model validation, you discover this feature has 0% importance in the tree model. A senior engineer says the feature is fine. What's likely wrong?

Q042 | Statistics & Inference | Mid

Your manager asks you to determine whether men or women have higher average order values on your platform. You run a t-test and get p=0.04. She asks 'so women spend 15% more—can we now target our campaigns accordingly?' What do you tell her?

Q043 | Unsupervised & Clustering | Senior

You're tasked with anomaly detection on server infrastructure logs at a cloud provider—500 million log entries per day across 10,000 servers. You need to flag unusual server behavior in near real time (within 5 minutes of occurrence). How do you design this system?

Q044 | Model Selection & Evaluation | Senior

You're building a model to predict which of 10,000 daily job applicants to advance to a phone screen. You have 3 years of historical data. Your head of recruiting says calibration is critical because the model score will be used directly to prioritize recruiter time. How do you ensure and validate calibration?

Q045 | Deep Learning | Mid

You're building a product image quality classifier to reject blurry or poorly lit product photos. You have 5,000 labeled images (3,500 good, 1,500 bad) and a team asking for 95% precision on the 'bad' class to avoid incorrectly rejecting seller photos. How do you set the decision threshold and validate it?
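
A minimal sketch of threshold selection for a precision target, assuming the classifier emits a score for the 'bad' class and that the scan runs on a held-out validation set. The function name and the seven toy scores are hypothetical. With only a few hundred 'bad' validation images, the measured precision would also need a confidence interval before anyone relies on it.

```python
def lowest_threshold_for_precision(scores, labels, target_precision=0.95):
    """Return (threshold, precision, recall) for the lowest score threshold
    whose precision on the positive ('bad') class meets the target,
    or None if no threshold does. Assumes at least one positive label."""
    total_positives = sum(labels)
    best = None
    # Scan candidate thresholds from strictest to most permissive.
    for threshold in sorted(set(scores), reverse=True):
        predicted = [score >= threshold for score in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        precision = tp / (tp + fp)
        recall = tp / total_positives
        if precision >= target_precision:
            best = (threshold, precision, recall)
    return best

# Toy validation scores for the 'bad' class (1 = bad photo, 0 = good photo).
scores = [0.95, 0.90, 0.85, 0.80, 0.60, 0.40, 0.30]
labels = [1, 1, 1, 0, 1, 0, 0]

result = lowest_threshold_for_precision(scores, labels)
print(result)
```

Choosing the lowest qualifying threshold maximizes recall on bad photos subject to the precision constraint; the chosen value should then be re-checked on a separate test split, since it was tuned on the validation data.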

Q046 | NLP | Mid

You need to detect duplicate customer support tickets submitted through multiple channels (email, chat, web form). Tickets are typically 50-200 words. You have no labeled pairs of duplicates. How do you build a duplicate detection system without labeled training data?

Q047 | Experimentation & A/B Testing | Mid

You're an analyst at a food delivery app. You run an A/B test where treatment users see restaurant photos on the search results page. After 2 weeks, there's no statistically significant difference in order conversion rate (p=0.41). Your PM says 'photos don't work—let's drop the idea.' What do you tell her?

Q048 | ML System Design | Senior

You're the lead data scientist at an insurance company designing a claims severity prediction model that will directly influence settlement offers. The model will process 500 new claims per day. How do you design for fairness, auditability, and human oversight without slowing down operations?

Q049 | Time Series | Mid

Your time series model for forecasting weekly store sales is trained on 3 years of data. When you validate with a random 20% holdout split, MAPE is 8%. When you validate with a temporal holdout (last 6 months), MAPE is 19%. Your manager asks why there's such a big discrepancy. What do you explain?
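
The difference between the two validation schemes can be sketched with week indices alone. The snippet assumes 156 weekly rows (3 years) and uses hypothetical helper names; it only counts how many validation weeks fall before the latest training week, which is the situation in which the model has effectively seen the future.

```python
import random

def temporal_split(weeks, holdout_fraction=0.2):
    """Split chronologically: the most recent weeks become the holdout."""
    ordered = sorted(weeks)
    cut = int(len(ordered) * (1 - holdout_fraction))
    return ordered[:cut], ordered[cut:]

def random_split(weeks, holdout_fraction=0.2, seed=0):
    """Split at random, ignoring time order."""
    shuffled = list(weeks)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cut], shuffled[cut:]

def weeks_with_future_in_train(train, valid):
    """Count validation weeks that are earlier than the latest training week."""
    latest_train = max(train)
    return sum(1 for week in valid if week < latest_train)

weeks = list(range(156))  # 3 years of weekly data

train_t, valid_t = temporal_split(weeks)
train_r, valid_r = random_split(weeks)

print("temporal:", weeks_with_future_in_train(train_t, valid_t))
print("random:  ", weeks_with_future_in_train(train_r, valid_r))
```

Under the random split, most validation weeks are surrounded by training weeks, so the model interpolates between neighbors; the temporal split forces it to extrapolate, which is closer to how the forecast is used.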

Q050 | Bias & Fairness | Mid

You've built a content recommendation model for a news platform. After deployment, editorial staff notice the model consistently recommends less content from women journalists and underrepresented ethnic groups compared to white male journalists, even when controlling for article engagement. What do you investigate?

Q051 | Python & Pandas | Mid

You're asked to find all customers who purchased from category A and category B but NOT category C, using a transactions table in Pandas. Write the logic and explain any performance considerations for a 10M row dataset.
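
The set logic the question asks for can be sketched in plain Python, independent of any DataFrame library. The transaction pairs below are hypothetical. The same idea carries over to Pandas by grouping on the customer column and collecting each customer's set of categories, ideally after dropping duplicate (customer, category) pairs to shrink the data first.

```python
from collections import defaultdict

# Hypothetical (customer_id, category) pairs from a transactions table.
transactions = [
    (1, "A"), (1, "B"),
    (2, "A"), (2, "B"), (2, "C"),
    (3, "A"),
    (4, "B"), (4, "A"), (4, "A"),
]

# Collect the distinct categories each customer has purchased from.
categories_by_customer = defaultdict(set)
for customer_id, category in transactions:
    categories_by_customer[customer_id].add(category)

# Keep customers whose category set contains both A and B but not C.
matches = sorted(
    customer_id
    for customer_id, categories in categories_by_customer.items()
    if {"A", "B"} <= categories and "C" not in categories
)

print(matches)
```

Customer 2 is excluded for buying from category C, and customer 3 for never buying from category B.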

Q052 | SQL for Data Science | Mid

You need to calculate 30-day cohort retention for a mobile app: for each cohort (users by sign-up week), what percentage returned on day 30? You have a user_events table with user_id, event_type, and event_date. Write the query logic and explain the key challenge.

Q053 | ML System Design | Mid

You're asked to build an email spam classifier that will process 10 million emails per day for a mid-size enterprise. The system must keep its false positive rate under 1% (legitimate emails incorrectly marked as spam). How do you design the model and evaluation strategy?

Q054 | Statistics & Inference | Senior

You're analyzing a clinical trial dataset where a new drug reduces hospitalization rate from 12% (control) to 8% (treatment). The p-value is 0.001. A physician argues this is clinically insignificant. You argue it might be meaningful. How do you reconcile statistical and clinical significance, and what metrics communicate this best to clinicians?
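
The effect-size measures usually offered to clinicians follow directly from the two rates in the scenario. This sketch only does the arithmetic; it does not attach confidence intervals, which would require the trial's sample sizes (not given here).

```python
control_rate = 0.12    # hospitalization rate, control arm
treatment_rate = 0.08  # hospitalization rate, treatment arm

# Absolute risk reduction: difference in event rates.
arr = control_rate - treatment_rate

# Relative risk reduction: ARR as a share of the control rate.
rrr = arr / control_rate

# Number needed to treat: patients treated to prevent one hospitalization.
nnt = 1 / arr

print(f"ARR: {arr * 100:.1f} percentage points")
print(f"RRR: {rrr * 100:.1f}%")
print(f"NNT: about {nnt:.0f}")
```

The same result reads as a 4-point absolute drop, a one-third relative drop, or roughly 25 patients treated per hospitalization avoided; which framing matters depends on cost, side effects, and baseline risk.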

Q055 | Supervised Learning | Mid

You're building a binary classifier using logistic regression. After training, your ROC-AUC on the validation set is 0.79. A data scientist suggests you should definitely try a random forest next. Do you agree? What additional analysis would you do first?

Q056 | Feature Engineering | Mid

You are building a churn model for a SaaS platform. One of your engineers added a feature 'days_to_churn' computed from the future cancellation event. Your validation AUC jumps from 0.78 to 0.97. You're suspicious. How do you detect and confirm this is data leakage?

Q057 | Statistics & Inference | Senior

Your fintech firm ran a 6-week A/B test on a new loan-approval model. The p-value is 0.048, but the ops team noticed the treatment arm had 15% more high-income applicants due to a routing bug. How do you salvage valid conclusions from this contaminated experiment?

Q058 | Statistics & Inference | Mid

You are a data scientist at a SaaS company. The marketing team asks whether the 8% increase in trial-to-paid conversion last quarter was statistically significant. All you have is a CSV of weekly conversion rates for 12 weeks before and 13 weeks after the campaign launched. Walk through your analysis.

Q059 | Feature Engineering | Senior

You are building a real-time churn-risk scorer for a B2B SaaS platform. The raw event stream arrives via Kafka. Account-level features like 'days since last login' or 'seats used / seats purchased ratio' must be precomputed and served in under 50 ms. Describe your feature pipeline architecture end-to-end.

Q060 | Feature Engineering | Mid

You join an e-commerce team building a product recommendation model. The item catalog has 2 million SKUs with high cardinality categorical fields: brand (12,000 unique), category (800 unique), and seller (150,000 unique). Raw one-hot encoding is infeasible. What feature encoding strategy do you use, and why?

Q061 | Model Selection & Evaluation | Senior

A healthcare startup asks you to pick between a gradient-boosted tree model (AUC 0.91) and a logistic regression (AUC 0.87) for predicting 30-day hospital readmission. The model output will be shown to clinicians to trigger care-coordination calls. Which model do you recommend, and what factors decide the answer?

Q062 | Model Selection & Evaluation | Mid

You trained a binary classifier on imbalanced data (2% positive class). Your colleague reports F1=0.82 and declares success. You are skeptical. What additional metrics and tests do you run to validate whether the model is actually useful?

Q063 | Supervised Learning | Senior

You are building a fraud detection model at a payment processor handling 50,000 transactions per second. The model must score each transaction in under 10 ms. Your current XGBoost model scores in 2 ms on a single core but the team wants to add 300 new features from graph topology (degree centrality, community label). How do you evaluate whether adding these features is feasible without breaking the latency SLA?

Q064 | Supervised Learning | Mid

You trained a gradient boosted model to predict employee attrition. During feature importance analysis, you notice 'employee ID' ranks in the top 5 features. The model's validation AUC is 0.93. What does this indicate and what do you do?

Q065 | Unsupervised & Clustering | Senior

Your retail client wants to segment 8 million customers for personalized marketing campaigns. K-means on the full dataset takes 40 minutes per run and the marketing team wants to re-segment weekly as new transaction data arrives. How do you redesign the clustering pipeline to meet operational requirements?

Q066 | Unsupervised & Clustering | Mid

You applied K-means with K=5 to cluster website users by behavioral features. A business stakeholder reviews the output and complains that two clusters look nearly identical. What steps do you take to diagnose and fix this?

Q067 | Deep Learning | Senior

You are training a ResNet-50 for defect detection on semiconductor wafer images at a chip manufacturer. Training loss decreases steadily but validation loss plateaus after epoch 12 and then diverges. The dataset has 50,000 images, 8 defect classes, and significant class imbalance (one rare defect class has only 200 examples). Diagnose and fix this.

Q068 | Deep Learning | Mid

You are training a multi-layer neural network for tabular credit scoring data and find that training is unstable — loss spikes randomly every few batches. Batch size is 64, learning rate is 0.01 with Adam. What do you investigate and how do you stabilize training?

Q069 | NLP | Senior

Your team deployed a BERT-based sentiment classifier for customer support tickets at a telecom company. In production, the model performs well on broadband-related tickets but poorly on new 5G plan complaints. You have 500 labeled 5G tickets and cannot afford full fine-tuning. What is your strategy to adapt the model?

Q070 | NLP | Mid

You are building a document classifier to route incoming legal contracts to the right team. You have 2,000 labeled contracts across 12 contract types. A colleague suggests fine-tuning GPT-4 via the OpenAI API. You suggest a different approach. What is it and why?

Q071 | Experimentation & A/B Testing | Senior

You run a pricing A/B test at a subscription streaming service. The test variant increases subscription price by $2/month. After 4 weeks you observe a 1.1% decrease in new subscriptions but a 5.8% increase in revenue per subscriber. Your CEO wants to ship immediately. You recommend waiting. Why?

Q072 | Experimentation & A/B Testing | Mid

You are setting up an A/B test for a new checkout flow. The product manager says 'run it for one week and stop when p < 0.05'. You push back. Explain the statistical problem and propose a correct stopping rule.
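
The statistical problem with "stop when p < 0.05" can be illustrated with a small A/A simulation: both arms share the same true conversion rate, so every significant result is a false positive. The sketch assumes a pooled two-proportion z-test, 10 interim looks, and a 10% base rate; all parameter values and function names are illustrative.

```python
import math
import random

def z_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_a / n_a - conv_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))

def simulate(n_sims=1000, looks=10, users_per_look=100,
             rate=0.10, alpha=0.05, seed=0):
    """Return (false positive rate with peeking, rate with one final test)."""
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _sim in range(n_sims):
        conv_a = conv_b = n = 0
        ever_significant = False
        p_value = 1.0
        for _look in range(looks):
            # Both arms draw from the same rate: there is no true effect.
            conv_a += sum(rng.random() < rate for _ in range(users_per_look))
            conv_b += sum(rng.random() < rate for _ in range(users_per_look))
            n += users_per_look
            p_value = z_test_p_value(conv_a, n, conv_b, n)
            if p_value < alpha:
                ever_significant = True
        peeking_hits += ever_significant      # would have stopped at some look
        fixed_hits += p_value < alpha         # single test at the planned end
    return peeking_hits / n_sims, fixed_hits / n_sims

peeking_rate, fixed_rate = simulate()
print(f"false positive rate with peeking: {peeking_rate:.3f}")
print(f"false positive rate at fixed horizon: {fixed_rate:.3f}")
```

Checking after every look inflates the false positive rate well above the nominal 5%, whereas a single test at a pre-computed sample size stays near it. Group-sequential boundaries or an alpha-spending plan are the usual ways to allow interim looks without that inflation.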

Q073 | ML System Design | Senior

Design an end-to-end ML system for a food delivery platform that predicts estimated delivery time (ETA) in real time at the moment a customer places an order. The system must handle 20,000 concurrent orders at peak, update ETAs every 30 seconds as driver location changes, and achieve a median absolute error of under 3 minutes.

Q074 | ML System Design | Senior

An autonomous vehicle company asks you to design the data flywheel for improving their lane-change maneuver model over time. The fleet generates 1 TB of sensor data per vehicle per day across 2,000 vehicles. How do you select, label, store, and retrain efficiently without drowning in data?

Q075 | Model Deployment | Senior

Your team shipped a new XGBoost model to production via a canary rollout (10% traffic). Within 6 hours the on-call alert fires: the canary model's positive prediction rate has dropped from 18% to 4%. The old model is still serving 90% of traffic at a 17-19% rate. What is your incident response playbook?

Q076 | Model Deployment | Mid

You need to deploy a trained scikit-learn pipeline (StandardScaler + RandomForestClassifier) for batch scoring of 5 million customer records nightly. The scoring job must complete in under 30 minutes. Walk through your deployment and optimization approach.

Q077 | Time Series | Senior

You manage demand forecasting for a grocery chain with 500 stores and 30,000 SKUs. A traditional ARIMA-based pipeline runs store-SKU combinations independently and takes 18 hours to generate next-week forecasts. Leadership wants forecasts refreshed daily. How do you redesign the system?

Q078 | Time Series | Mid

You trained a Prophet model to forecast weekly sales for a retail client. The model performed well for 3 months, then suddenly started under-predicting by 30% for 6 consecutive weeks. The client says no promotions or external events changed. What do you investigate?

Q079 | Bias & Fairness | Senior

Your credit risk model is deployed at a bank. An internal audit shows that Black applicants are approved at a 62% rate vs 79% for white applicants, even though both groups have similar median FICO scores. Leadership asks you to 'fix the model' while maintaining the same overall approval rate and default rate. Describe your investigation and the technical and ethical path forward.

Q080 | Bias & Fairness | Mid

You are asked to build a resume screening model to help an HR team shortlist candidates for software engineering roles. A colleague says 'just train it on the last 5 years of hiring decisions — those are the ground truth.' Why is this dangerous, and what alternative do you propose?

Q081 | Python & Pandas | Senior

You have a pandas DataFrame with 150 million rows representing user clickstream events. Reading the raw CSV takes 8 minutes and groupby operations are slow enough to time out your Jupyter session. Colleagues suggest Dask; your infrastructure has a 32-core machine with 256 GB RAM. What is your end-to-end performance strategy?

Q082 | Python & Pandas | Mid

A junior team member's data preprocessing script runs for 45 minutes on a 2-million-row DataFrame and uses a for-loop calling apply() row by row to compute a feature. You are asked to review and speed it up. Describe your approach and the vectorized alternative.

Q083 | SQL for Data Science | Senior

You are a senior data scientist at a ride-sharing company. Write a SQL query to compute the 7-day rolling retention rate: for each cohort of drivers who completed their first ride on a given day, what percentage were still active (completed at least one ride) in each of the 7 days following their first ride day?

Q084 | SQL for Data Science | Mid

You have a table of e-commerce orders with columns: order_id, user_id, order_date, revenue. Write a query to identify users who made purchases in three or more consecutive months, and return their user_id and the starting month of their longest consecutive streak.
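
The streak logic is easier to check in a few lines of Python before translating it to SQL, where it is usually written as a gaps-and-islands query (month index minus row number per user). The sketch below handles a single user's order dates; the function names and sample dates are hypothetical, and it assumes a non-empty list.

```python
from datetime import date

def longest_streak(order_dates):
    """Return (start_month_index, length) of the longest run of
    consecutive calendar months containing at least one order.
    Month index = year * 12 + (month - 1). Ties keep the earliest run."""
    months = sorted({d.year * 12 + (d.month - 1) for d in order_dates})
    best_start, best_len = months[0], 1
    run_start, run_len = months[0], 1
    for prev, cur in zip(months, months[1:]):
        if cur == prev + 1:
            run_len += 1                   # streak continues
        else:
            run_start, run_len = cur, 1    # gap: start a new streak
        if run_len > best_len:
            best_start, best_len = run_start, run_len
    return best_start, best_len

def month_label(month_index):
    """Format a month index back to YYYY-MM."""
    return f"{month_index // 12}-{month_index % 12 + 1:02d}"

# Hypothetical order dates for one user, spanning a year boundary.
orders = [
    date(2023, 11, 5), date(2023, 12, 20),
    date(2024, 1, 3), date(2024, 1, 28),
    date(2024, 3, 10), date(2024, 4, 2),
]

start, length = longest_streak(orders)
print(month_label(start), length)
```

Using a single month index avoids the December-to-January boundary bug that appears when only the month number is compared; users would then be kept when their streak length is 3 or more.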

Q085 | Statistics & Inference | Senior

Your company runs a multi-armed bandit for content recommendation. After 3 weeks, one arm has a 23% CTR and another has 21% CTR. The engineering lead says 'the bandit already chose the winner, just deploy the 23% arm.' You recommend a held-out A/B test instead. Defend your position with specific statistical reasoning.

Q086 | Feature Engineering | Senior

You are building a credit card transaction fraud model. The feature 'merchant category code' (MCC) has 750 unique values and is critical for fraud detection but highly imbalanced — 5 MCCs account for 60% of transactions while 400 MCCs have fewer than 100 transactions in training data. How do you encode MCC to maximize signal and handle the tail?

Q087 | Model Selection & Evaluation | Mid

You are building a document relevance ranker for an internal enterprise search engine. Your colleague used accuracy as the evaluation metric and reports 94% accuracy. You suspect this is misleading. What metrics would you actually use, and why?

Q088 | Supervised Learning | Senior

You are a principal ML engineer at an insurance company. The actuarial team built a linear regression to predict claim severity. Your task is to replace it with an ML model that outperforms it. However, the actuarial team requires the model's predictions to monotonically increase with 'years since last claim' and decrease with 'safety device installed.' How do you enforce these monotonic constraints while preserving model power?

Q089 | Unsupervised & Clustering | Senior

You are tasked with identifying patient sub-populations in a hospital EHR dataset with 200 clinical features (lab values, diagnoses, medications, vitals) for 80,000 patients. Many features are missing (15-60% missingness per feature) and the features are on very different scales. How do you approach this clustering problem?

Q090 | Deep Learning | Senior

Your team is training a large language model fine-tune (7B parameter LLaMA variant) for document summarization on 4 A100 GPUs (80GB VRAM each). Training crashes after 200 steps with a CUDA out-of-memory error. The batch size is 8 and sequence length is 2048. How do you diagnose and resolve this?
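
A back-of-the-envelope memory estimate is a reasonable first diagnostic. The sketch assumes full fine-tuning with mixed-precision Adam, using the commonly cited 16 bytes per parameter (2 for fp16/bf16 weights, 2 for gradients, 12 for fp32 master weights plus two Adam moments). The scenario does not state the optimizer, precision, or parallelism strategy, so these are assumptions, and activation memory is excluded.

```python
n_params = 7e9       # 7B-parameter model
n_gpus = 4
gpu_memory_gb = 80   # A100 80GB

# Assumed bytes per parameter for mixed-precision Adam (excludes activations):
# half-precision weights + half-precision gradients + fp32 optimizer state.
bytes_weights = 2
bytes_gradients = 2
bytes_optimizer = 12
bytes_per_param = bytes_weights + bytes_gradients + bytes_optimizer

model_state_gb = n_params * bytes_per_param / 1e9

# Plain data parallelism replicates the full state on every GPU.
per_gpu_replicated_gb = model_state_gb
# Fully sharding the state across GPUs divides it evenly.
per_gpu_sharded_gb = model_state_gb / n_gpus

print(f"model state: {model_state_gb:.0f} GB")
print(f"per GPU, replicated: {per_gpu_replicated_gb:.0f} GB (limit {gpu_memory_gb} GB)")
print(f"per GPU, fully sharded: {per_gpu_sharded_gb:.0f} GB before activations")
```

Under these assumptions the model state alone exceeds a single 80 GB card unless it is sharded or the trainable parameter count is reduced (for example with parameter-efficient fine-tuning), and the remaining headroom has to cover activations for batch size 8 at sequence length 2048.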

Q091 | NLP | Senior

You deployed a RAG (Retrieval-Augmented Generation) system for an internal knowledge base at a consulting firm. Users complain that the system sometimes returns confident but wrong answers — it pulls irrelevant chunks from the vector store and the LLM still generates a plausible-sounding response. How do you diagnose and improve retrieval precision?

Q092 | Experimentation & A/B Testing | Senior

You work at a social media platform. You want to measure the causal effect of showing users a mental-health resource banner on their engagement patterns. A traditional A/B test is not feasible because showing the banner to some users but not others on the same social graph creates network interference — a treated user's friends may change their behavior too. How do you design this study?

Q093 | ML System Design | Mid

You are asked to design a machine learning pipeline to classify incoming customer support emails into 8 intent categories and route them to the correct team in under 5 seconds. The system must handle 10,000 emails per hour at peak. Sketch the end-to-end architecture.

Q094 | Model Deployment | Senior

Your ML platform team is asked to support multiple data science teams deploying models with different requirements: sub-100ms latency for real-time fraud, batch scoring nightly for CRM, and GPU inference for an NLP model. Design a unified model serving platform that handles all three patterns without requiring separate bespoke infrastructure for each.

Q095 | Time Series | Senior

A logistics company asks you to build an anomaly detection system for shipment transit times. Normal transit time for ground shipping is 3-7 days depending on distance; delays beyond 10 days should trigger an alert. The data has strong day-of-week seasonality, occasional holiday spikes, and a secular upward trend since COVID. How do you approach this?

Q096 | Bias & Fairness | Senior

Your NLP model for automated job description scoring rates descriptions on 'inclusivity' (gender-neutral language, accessibility statements). An internal audit shows the model consistently rates descriptions from job families dominated by women (nursing, teaching, social work) as less inclusive than descriptions from male-dominated fields (software engineering, finance), even when language is identical. What is causing this and how do you fix it?

Q097 | Python & Pandas | Senior

You need to join a 10-million-row Spark DataFrame of user events with a 50-row lookup table of campaign metadata. A junior engineer wrote a join operation that triggers a full shuffle across the Spark cluster and takes 45 minutes. How do you fix this and explain the mechanism?

Q098 | SQL for Data Science | Senior

You are analyzing the impact of a product feature release at a SaaS company. Write a SQL query to compute a difference-in-differences estimate: compare the change in weekly active users (WAU) for accounts that adopted the feature in week 10 versus accounts that never adopted it, measuring 4 weeks before and 4 weeks after the adoption week.
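
The estimate the query has to produce is a simple difference of differences, shown here in Python on hypothetical weekly averages so the arithmetic is easy to verify before writing the SQL. The WAU figures and the function name are made up for illustration, and the result is only interpretable causally if the two groups would have followed parallel trends without the feature.

```python
def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Difference-in-differences on group means:
    (treated change) minus (control change)."""
    def mean(values):
        return sum(values) / len(values)
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change

# Hypothetical average WAU per account, weeks 6-9 (pre) and 11-14 (post).
adopters_pre = [100, 102, 101, 101]
adopters_post = [112, 115, 113, 116]
never_adopters_pre = [80, 81, 79, 80]
never_adopters_post = [83, 84, 82, 83]

estimate = diff_in_diff(
    adopters_pre, adopters_post, never_adopters_pre, never_adopters_post
)
print(estimate)
```

Adopters rose by 13 and never-adopters by 3 in this toy data, so 10 is the change attributed to adoption after netting out the shared trend.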

Q099Statistics & InferenceMid

Your team is testing 12 different UX changes simultaneously via separate A/B tests, all sharing the same user base. Your manager says each test has p < 0.05 so all 12 winners should ship. You disagree. Explain why and propose a correction.
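One way to ground the disagreement is to compute how quickly false positives accumulate across 12 tests. The sketch below is illustrative only: it assumes the tests are independent (an approximation, since they share one user base), and the function names and sample p-values are hypothetical.

```python
# Family-wise error rate (FWER) across m tests, each run at level alpha.
# Assumption: tests are independent, which is only approximately true here.
def family_wise_error_rate(alpha: float, m: int) -> float:
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_threshold(alpha: float, m: int) -> float:
    # Each individual test must clear alpha / m to keep FWER <= alpha.
    return alpha / m


def holm_rejections(p_values, alpha=0.05):
    """Holm step-down procedure: returns True (reject) per test, in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The smallest p-value is compared with alpha/m, the next with alpha/(m-1), ...
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject


fwer = family_wise_error_rate(0.05, 12)
threshold = bonferroni_threshold(0.05, 12)
print(f"Chance of at least one false positive: {fwer:.3f}")
print(f"Bonferroni per-test threshold: {threshold:.5f}")

# Hypothetical p-values, purely to show the procedure.
print(holm_rejections([0.001, 0.04, 0.03, 0.2]))
```

Bonferroni and Holm control the family-wise error rate; if the team can tolerate some false discoveries in exchange for power, a false discovery rate procedure such as Benjamini-Hochberg is a common alternative.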

Q100NLPMid

You are building a named entity recognition (NER) system to extract product names, prices, and quantities from unstructured supplier invoices. The invoices are PDFs from 50 different suppliers with different layouts and fonts. You have 300 labeled invoices. Describe your end-to-end approach.

Q101Time SeriesMid

You are asked to build a model to forecast daily revenue for a B2B SaaS company one month ahead. The company has 3 years of daily data. Revenue has a weekly cycle, end-of-quarter spikes (Q-end enterprise deals), and grew 40% year-over-year. What features and model would you use?

Q102Bias & FairnessMid

A colleague proposes removing 'gender' and 'ethnicity' from a hiring model's feature set to make it fair. You point out that removing protected attributes is not sufficient for fairness. Explain why and what you would do instead.

Q103ML System DesignMid

A startup wants to build a system that automatically tags news articles with topics (politics, sports, finance, health, technology) in real time as articles are published. They publish 200 articles per day. How would you design this system from scratch on a limited budget?

Q104Model DeploymentMid

You trained a recommendation model offline using implicit feedback (clicks, views) from your e-commerce platform. The model has been running in production for 4 months and revenue-per-session is declining. How do you determine if the model is the cause and what is your next step?

Q105Deep LearningMid

You are training a convolutional neural network to classify skin lesion images (benign vs malignant). Your training dataset has 5,000 benign and 500 malignant images. After 20 epochs, training accuracy is 98% but validation recall for malignant (the clinically critical class) is only 45%. What are the problems and how do you fix them?

Q106SQL for Data ScienceMid

You have a table 'sessions' with columns session_id, user_id, start_time, end_time. Write a SQL query to find users who had overlapping sessions — sessions where end_time of one session is after start_time of the next session for the same user.
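A minimal sketch of one possible query, run against an in-memory SQLite database so it can be tried without any setup. The sample rows are invented for illustration, timestamps are assumed to be ISO-8601 text (so string comparison matches time order), and sessions that merely touch (one ends exactly when the next starts) are treated as non-overlapping.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sessions ("
    "session_id INTEGER PRIMARY KEY, user_id INTEGER, "
    "start_time TEXT, end_time TEXT)"
)

# Illustrative data: user 1 overlaps, user 2 only touches, user 3 has one session.
rows = [
    (1, 1, "2024-01-01 10:00:00", "2024-01-01 11:00:00"),
    (2, 1, "2024-01-01 10:30:00", "2024-01-01 11:30:00"),
    (3, 2, "2024-01-01 09:00:00", "2024-01-01 10:00:00"),
    (4, 2, "2024-01-01 10:00:00", "2024-01-01 11:00:00"),
    (5, 3, "2024-01-01 12:00:00", "2024-01-01 13:00:00"),
]
conn.executemany("INSERT INTO sessions VALUES (?, ?, ?, ?)", rows)

# Self-join: session a starts no later than session b, yet a is still
# running when b starts. Both rows must belong to the same user.
OVERLAP_QUERY = """
SELECT DISTINCT a.user_id
FROM sessions AS a
JOIN sessions AS b
  ON a.user_id = b.user_id
 AND a.session_id <> b.session_id
 AND a.start_time <= b.start_time
 AND a.end_time > b.start_time
ORDER BY a.user_id
"""

overlapping_users = [row[0] for row in conn.execute(OVERLAP_QUERY)]
print(overlapping_users)
```

On a large table, comparing each session with the next one per user through a `LEAD(start_time)` window function is usually cheaper than a self-join; the self-join is used here only because it runs on any SQLite version.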

Q107Experimentation & A/B TestingMid

You run an A/B test on a checkout page redesign. The experiment ran for 2 weeks with 50,000 users per arm. Conversion rate in treatment is 4.8% vs 4.5% in control (p=0.03). Your product manager wants to ship. Before approving, what are the 5 checks you would do?

Q108Feature EngineeringMid

You are working on a price elasticity model for a hotel chain. The raw data includes hotel rates, competitor rates, occupancy, and booking windows. Your model performance degrades whenever a hotel opens a new property nearby. How do you engineer features that are robust to this type of structural change?

Q109Model Selection & EvaluationSenior

You are evaluating a new ML model for detecting sepsis risk in ICU patients. The model outputs a probability score every hour per patient. The existing rule-based system has 75% sensitivity and 60% specificity. Your ML model has AUC=0.91. The clinical team asks whether AUC alone tells them whether to switch. What is your answer and what evaluation framework do you propose?

Q110Python & PandasMid

You are working with a pandas DataFrame where a column contains JSON strings: each cell has a JSON object with varying keys like '{"city": "Mumbai", "pin": "400001"}'. You need to expand this into separate columns efficiently. The DataFrame has 1 million rows. How do you do this?

Q111Unsupervised & ClusteringMid

You ran DBSCAN on a set of 2D customer location coordinates. DBSCAN labeled 40% of your points as noise (-1). Your manager asks why so many points are noise and whether you should switch to K-means instead. How do you explain and what do you recommend?

Q112Supervised LearningMid

You trained a random forest to predict whether a software project will exceed budget. In production, the model performs well on waterfall projects but poorly on agile projects. Your training data was 80% waterfall projects. What is the problem and how do you fix it?

Q113ML System DesignSenior

You are the lead ML engineer at a bank building a real-time personalized offer recommendation system. The system must generate a ranked list of 5 offers for each customer when they open the mobile app. The bank has 8 million customers, 200 active offers, and strict regulatory requirements that every recommendation must be explainable to the customer. Design the system.

Q114Statistics & InferenceSenior

Your growth team ran a 4-week A/B test on a checkout flow change. Results show p=0.03 but the conversion lift is only 0.2%. Finance wants to ship it immediately. How do you advise them, and what statistical issues do you flag?

Q115Statistics & InferenceMid

A healthcare startup asks you to build a readmission risk model. They want 99% recall at any cost. How do you explain the precision-recall trade-off and what threshold policy do you recommend?

Q116Feature EngineeringSenior

You're building a real-time fraud detection model at a fintech. Feature computation latency must be under 10ms. You have 200+ raw transaction fields and historical user behavior going back 2 years. Walk through your feature engineering architecture.

Q117Feature EngineeringMid

Your e-commerce recommendation model has a 'days since last purchase' feature. After deployment you notice its distribution in production is heavily skewed right with many users at 0 and a long tail to 730. What transformations do you apply and why?

Q118Model Selection & EvaluationSenior

You're leading ML for a B2B SaaS churn prediction project. The dataset has 50,000 accounts, 18-month history, heavy class imbalance (5% churn), and 120 features. Your manager wants a single model by Friday. How do you choose, evaluate, and justify?

Q119Model Selection & EvaluationMid

You compare a neural network with 88% ROC-AUC against a logistic regression with 84% ROC-AUC for a medical diagnosis task. The product team wants to ship the neural network. What other evaluations do you run before agreeing?

Q120Supervised LearningSenior

A logistics company's delivery time prediction model was trained on pre-pandemic data. Post-2020, RMSE jumped from 12 minutes to 38 minutes in production. Diagnose the issue and design a remediation plan.

Q121Supervised LearningMid

You're training a random forest for credit scoring and notice that test accuracy is 91% but training accuracy is 99%. What's happening and how do you fix it?

Q122Unsupervised & ClusteringSenior

A retail bank wants to cluster its 2 million customers for personalized marketing. You've run K-means with k=8 and silhouette=0.31. Leadership asks if 8 is the right number and if clusters are actionable. What do you do next?

Q123Unsupervised & ClusteringMid

You apply DBSCAN to a geolocation dataset of 500k points to identify delivery zones. DBSCAN returns 80% of points as noise. How do you diagnose and fix this?

Q124Deep LearningSenior

You're training a ResNet-50 for defect detection on an assembly line. Training loss drops to 0.08 but validation loss oscillates between 0.3 and 0.6 across epochs. The dataset has 12,000 images with 8 defect categories. What's your diagnosis and plan?

Q125Deep LearningMid

Your team is training a binary classification neural network and the loss is stuck at 0.693 from epoch 1. What is happening and how do you debug it?
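The value 0.693 is a useful clue in itself: it is approximately ln 2, which is the binary cross-entropy of a model that outputs 0.5 for every example. The sketch below checks this numerically; the labels are made up, and the loss is assumed to be binary cross-entropy with natural logarithms.

```python
import math


def binary_cross_entropy(y_true, p_pred):
    """Mean binary cross-entropy using natural logs."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Hypothetical labels; the result does not depend on the class mix
# when every prediction is exactly 0.5.
labels = [0, 1, 1, 0, 1]
constant_loss = binary_cross_entropy(labels, [0.5] * len(labels))

print(f"loss at p=0.5: {constant_loss:.4f}")
print(f"ln(2):         {math.log(2):.4f}")
```

A loss pinned at this value from the first epoch therefore suggests the network is emitting a constant, uninformative output, which narrows the debugging to causes such as a learning rate that is too high or too low, dead activations, zero or saturated initialization, or labels that are disconnected from the inputs.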

Q126NLPSenior

An insurance company wants to classify claim narratives (avg 300 words) into 45 damage categories. They have 8,000 labeled examples. GPT-4 zero-shot accuracy is 61%, fine-tuned BERT is 74%. They want to get to 85%. What's your plan?

Q127NLPMid

You're building a sentiment analysis model for product reviews in Hindi-English code-mixed text. Standard English BERT performs at 64% accuracy. How do you improve it?

Q128Experimentation & A/B TestingSenior

Your e-commerce platform runs 30 concurrent A/B tests. The data science team reports 12 winners at p<0.05. The CTO asks why so many tests are winning simultaneously. How do you explain and fix the multiple testing problem?

Q129Experimentation & A/B TestingMid

You run an A/B test for a new onboarding flow. The test ran for 3 days and your manager declares the variant winner with p=0.04. You're skeptical. What problems do you raise?

Q130ML System DesignSenior

Design an ML system to power a personalized job recommendation engine for a platform with 5 million users and 200,000 active job postings, with sub-200ms p99 latency requirements.

Q131ML System DesignSenior

A streaming platform wants to detect anomalies in real-time viewing patterns to flag potential account-sharing violations. Design a detection system that handles 50 million events per hour.

Q132Model DeploymentSenior

You've trained a transformer model for document classification that achieves 91% accuracy. Inference latency is 340ms per document. The SLA requires 150ms. Walk through your optimization strategy.

Q133Model DeploymentMid

Your model works perfectly in the development environment but produces wrong predictions in production. After investigation you find feature distributions differ between environments. What caused this and how do you prevent it?

Q134Time SeriesSenior

A telecom company's monthly subscriber churn forecasting model has MAPE of 4% on validation but 22% on the last 3 months of actuals. Investigate and redesign the forecasting approach.

Q135Time SeriesMid

You're forecasting daily retail sales for 500 stores using a single global model. Some stores are new and have only 30 days of history. How do you handle this cold-start problem?

Q136Bias & FairnessSenior

Your resume screening model shows 73% pass-through rate for men and 58% for women applying for engineering roles. Legal flags this as potential disparate impact. How do you investigate and remediate?

Q137Bias & FairnessMid

You build a loan default prediction model. Accuracy is 87% overall, but for applicants from rural postcodes it's 71%. How do you diagnose and address the performance gap?

Q138Python & PandasSenior

You have a Pandas pipeline processing 50GB of transaction data daily on a single machine with 16GB RAM. The job crashes with MemoryError. How do you redesign it without moving to a distributed system?

Q139Python & PandasMid

A colleague's data cleaning script runs in 45 minutes on a 2M-row DataFrame. You're asked to optimize it. After reviewing the code you find a `df.iterrows()` loop with conditional logic. What do you do?

Q140SQL for Data ScienceSenior

You need to compute 7-day rolling retention cohorts for a mobile app with 10 million users and 500 million events in BigQuery. The query is timing out after 10 minutes. How do you redesign it?

Q141SQL for Data ScienceMid

A product manager asks you to identify users who made at least 3 purchases within any 30-day rolling window using a PostgreSQL events table. Write the query approach and explain the logic.

Q142Statistics & InferenceSenior

A pharmaceutical company uses Bayesian A/B testing to compare two drug formulations for pain relief. The posterior shows 87% probability that formulation B reduces pain more than A. The clinical lead wants to halt the trial and switch to B. How do you advise?

Q143Feature EngineeringSenior

You're building a credit card default prediction model for a bank. Your dataset has 3 years of monthly payment history per customer. How do you encode this temporal payment sequence as features for a gradient boosted tree?

Q144Model Selection & EvaluationMid

You're evaluating a regression model to predict apartment rental prices. You report R² = 0.89 to stakeholders. Your manager says R² alone is not enough. What else do you report and why?

Q145Supervised LearningMid

You're training a multi-label text classifier to tag support tickets with issue categories (billing, technical, shipping, etc.). Each ticket can have multiple tags. What loss function and evaluation metrics do you use?

Q146Unsupervised & ClusteringSenior

An online education platform wants to use clustering to identify student learning archetypes from event logs. You have 500 features from clickstream, quiz scores, video watch rates, and forum activity. Describe your end-to-end approach.

Q147Deep LearningSenior

You're training a recommendation model using a two-tower architecture (a user tower and an item tower, scored by dot-product similarity). In production serving, embeddings for new items need to propagate within 5 minutes. How do you design this?

Q148NLPSenior

You deployed a GPT-4-based customer service bot for a bank. After 2 weeks, the compliance team flags that the bot occasionally gives incorrect account balance information and cites wrong policy clauses. How do you remediate this?

Q149Experimentation & A/B TestingSenior

Your streaming platform wants to test a new thumbnail algorithm that might increase click-through rate. However, thumbnails also affect watch time and subscriber growth. How do you design an A/B test that handles multiple correlated metrics?

Q150ML System DesignMid

A startup asks you to design an ML pipeline for predicting customer lifetime value (LTV) for a SaaS product with 10,000 customers and monthly billing data. They have no MLOps infrastructure. What do you build?

Q151Model DeploymentSenior

Your model serving infrastructure processes 50,000 requests per second at peak. You need to deploy a new model version with zero downtime and the ability to instantly roll back. Describe your deployment strategy.

Q152Time SeriesSenior

You're forecasting hourly electricity demand for a national grid operator. The series has daily, weekly, and annual seasonality, plus external drivers like temperature and public holidays. How do you model all these components jointly?

Q153Time SeriesMid

You use ARIMA to forecast weekly web traffic for a news site. The model works well normally but produces wildly wrong forecasts during major news events like elections. How do you handle this?

Q154Bias & FairnessSenior

Your team is auditing a hiring algorithm used by a large tech company. You find that equalized odds is violated: the model's TPR for women is 62% vs. 78% for men. The company argues overall accuracy is equal. How do you respond and what remediation do you recommend?

Q155Python & PandasSenior

You need to run a hyperparameter tuning job over 500 configurations for a neural network. Each training run takes 2 hours on a single GPU. How do you organize and execute this efficiently?

Q156SQL for Data ScienceSenior

You're analyzing funnel drop-off in Snowflake for a SaaS product with steps: signup, email verification, profile setup, first action, subscription. Users can take days between steps. Write the query design for computing per-step conversion and median time-to-complete.

Q157Statistics & InferenceMid

You're told your logistic regression model has p-values below 0.001 for all 15 features. A senior colleague says this is suspicious. Why might this be and what does it actually mean?

Q158Model Selection & EvaluationSenior

You're comparing five ML models on a leaderboard for an internal competition using test set accuracy. A team member updates their submission 30 times before the deadline, improving from 81% to 91%. Why is this a problem and how should the competition be redesigned?

Q159Deep LearningMid

You're training a sentiment classifier using a pre-trained BERT model. After fine-tuning for 10 epochs you notice validation accuracy peaked at epoch 3 and has been declining since. What's happening and what do you change?

Q160NLPMid

You're building a named entity recognition (NER) system to extract drug names and dosages from clinical notes. You have 3,000 labeled sentences. BERT-base NER achieves F1=0.71. How do you improve performance?

Q161Experimentation & A/B TestingMid

You're running an A/B test on a pricing page. After 2 weeks the control conversion rate is 4.2% and variant is 4.5%. The test has 10,000 users per arm. Is this result statistically significant and what is your recommendation?
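The significance part of this question can be checked directly with a pooled two-proportion z-test. This is a minimal sketch under stated assumptions: a two-sided test, the normal approximation, and conversion counts of 420 and 450 derived from the stated rates and 10,000 users per arm. The function name is illustrative.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# 4.2% of 10,000 = 420 conversions (control); 4.5% of 10,000 = 450 (variant).
z, p_value = two_proportion_z_test(420, 10_000, 450, 10_000)
print(f"z = {z:.2f}, two-sided p = {p_value:.2f}")
```

Under these assumptions z is roughly 1.04 and p is roughly 0.30, so the observed difference is well within what chance alone would produce at this sample size. The recommendation then turns on a power calculation for the smallest lift worth detecting, not on this result.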

Q162ML System DesignSenior

A ride-sharing company wants to predict surge pricing in real time across 500 cities simultaneously. The model must incorporate supply (driver locations, availability) and demand (ride requests, events) signals updated every 60 seconds. Design this system.

Q163Model DeploymentMid

After deploying a new recommendation model, user engagement metrics start declining gradually over 3 weeks. No errors appear in logs. How do you investigate and what monitoring would have caught this earlier?

Q164Bias & FairnessMid

Your team trains a content moderation model on historical human-moderated decisions. You notice the model flags AAVE (African American Vernacular English) text at 2x the rate of Standard American English for the same underlying content. How do you address this?

Q165Python & PandasMid

You need to merge two DataFrames: one with 5 million customer records and one with 50 million transaction records, joining on customer_id. The merge is taking 8 minutes. How do you speed this up?

Q166SQL for Data ScienceMid

You're asked to find the second highest salary in each department from an employees table. Write the approach and explain why a naive LIMIT 1 OFFSET 1 fails in this context.
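A minimal sketch of one approach, run against an in-memory SQLite database. The column names (`department`, `salary`) and the sample rows are assumptions for illustration; the question only names the `employees` table. `LIMIT 1 OFFSET 1` applies to the whole result set, so it returns a single row overall instead of one per department, and it can return the top salary again when two people tie for first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)"
)

# Illustrative data: Eng has a tie at the top, HR has a single salary.
rows = [
    ("a", "Eng", 100), ("b", "Eng", 100), ("c", "Eng", 90), ("d", "Eng", 80),
    ("e", "Sales", 70), ("f", "Sales", 60),
    ("g", "HR", 50),
]
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", rows)

# For each department, take the largest salary that is strictly below
# that department's maximum. Ties at the top collapse to one value, and
# departments with only one distinct salary drop out of the result.
SECOND_HIGHEST_QUERY = """
SELECT e.department, MAX(e.salary) AS second_highest
FROM employees AS e
WHERE e.salary < (
    SELECT MAX(m.salary)
    FROM employees AS m
    WHERE m.department = e.department
)
GROUP BY e.department
ORDER BY e.department
"""

second_highest = list(conn.execute(SECOND_HIGHEST_QUERY))
print(second_highest)
```

A `DENSE_RANK()` window function partitioned by department and filtered to rank 2 is another common way to express the same logic; whether a department with no second salary should be omitted or returned as NULL is a requirement to confirm with the interviewer.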

Q167Feature EngineeringMid

You're building a price prediction model for used cars. The 'car model' feature has 4,200 unique values with many rare models seen only once in training. How do you encode this high-cardinality categorical feature?

Q168Supervised LearningSenior

A hospital uses your sepsis early-warning model in the ICU. A nurse reports the model fires alerts every 2-3 hours per patient, causing staff to ignore them. How do you redesign the alerting system without reducing true positive recall?

Q169Statistics & InferenceMid

You fit a linear regression model and find that your residuals are autocorrelated (Durbin-Watson statistic of 1.2). What does this mean and how do you fix it?

Q170Statistics & InferenceSenior

Your e-commerce company ran a 90-day pricing experiment. The p-value is 0.03, but your finance partner says revenue impact is only $2K/month versus a $500K engineering cost to productionize. How do you reconcile statistical significance with business significance, and what do you recommend?

Q171Statistics & InferenceMid

A healthcare startup's logistic regression model outputs probabilities, but the calibration plot shows predicted probabilities cluster around 0.6-0.8 even for patients who rarely get the disease (true prevalence 5%). Why does this happen and how do you fix it?

Q172Statistics & InferenceSenior

You are analyzing a 3-year panel dataset of retail store sales across 200 locations. A naive OLS regression finds that adding more staff increases sales by $400/hour. Your manager wants to use this to justify hiring. What causal identification problem exists and how would you address it?

Q173Feature EngineeringMid

You're building a churn prediction model for a SaaS product. Your raw data includes user event logs with timestamps. The product team insists on using 'days since signup' as a feature, but you suspect it introduces target leakage. Walk through your analysis.

Q174Feature EngineeringSenior

A fintech's fraud detection model is trained monthly on the prior 6 months of transactions. After 3 months in production, precision drops from 87% to 61% with no change in recall. The fraud team reports fraudsters are using new merchant category codes not seen in training. Design a feature strategy that degrades gracefully against distribution shift.

Q175Feature EngineeringSenior

Your recommendation system for a streaming platform uses user-item interaction history, but 60% of your catalog consists of items with fewer than 10 interactions ('cold items'). Standard collaborative filtering performs poorly here. What feature engineering approaches bridge the cold-start gap?

Q176Model Selection & EvaluationMid

You trained a gradient boosting classifier for loan default prediction. AUC-ROC is 0.91, but when you deploy to production the business team says the model approves too many bad loans. What evaluation metrics should you have used and how do you fix this now?

Q177Model Selection & EvaluationSenior

Two ML teams at your company are debating model selection for a medical image classification task: one team favors a fine-tuned ResNet-50, the other a custom 3-layer CNN trained from scratch. Both achieve 94% test accuracy. How do you break the tie and which would you choose?

Q178Model Selection & EvaluationSenior

You're selecting a model for a real-time bidding system where predictions must complete in under 5ms and you process 500,000 requests per second. Your offline evaluation shows XGBoost at AUC 0.84 and a deep neural network at AUC 0.87. Walk through your production model selection decision.

Q179Supervised LearningMid

You're training a random forest on a dataset with 500 features for customer lifetime value prediction. Training AUC is 0.95 but validation AUC is 0.72. Feature importance shows the top 3 features account for 80% of total importance. What do you do?

Q180Supervised LearningSenior

A supply chain company wants to predict whether a shipment will be delayed, but 95% of shipments arrive on time. You try logistic regression, and it achieves 95% accuracy by predicting 'no delay' for everything. How do you build a model that's actually useful?

Q181Supervised LearningSenior

You've inherited a mortgage default model from a team that used gradient boosting with 200 features. Regulators now require the model to be 'explainable' in writing to individual applicants. How do you handle this without starting from scratch, and what are the risks of your approach?

Q182Unsupervised & ClusteringMid

You ran k-means clustering on customer purchase data and got k=5 clusters. Your marketing manager asks 'are these the real customer segments?' How do you evaluate the quality of these clusters and give an honest answer?

Q183Unsupervised & ClusteringSenior

You're clustering 50 million user sessions for a news platform to create behavior-based audience segments. In-memory algorithms like k-means fail due to memory constraints. How do you scale clustering to this volume without losing cluster quality?

Q184Unsupervised & ClusteringSenior

A telecom's network operations team uses clustering to detect anomalous tower behavior. They complain that k-means misses rare failure patterns affecting fewer than 0.1% of towers. What unsupervised approach better handles this use case, and how do you tune it?

Q185Deep LearningSenior

You're training a ResNet-50 on a medical imaging dataset of 15,000 images. After 20 epochs, training loss is 0.08 but validation loss starts increasing after epoch 8. You've tried dropout 0.5 and weight decay 1e-4. What else do you try and in what order?

Q186Deep LearningMid

You're debugging a multi-layer neural network where gradients are vanishing—the loss barely changes after epoch 1, and gradients in the first two layers are on the order of 1e-8. Your network uses sigmoid activations and 12 layers. What do you change?

Q187Deep LearningSenior

Your company wants to train a large language model for internal code search with 200GB of proprietary code. You have a budget of 8 A100 GPUs for 2 weeks. How do you design the training strategy to make the most of these constraints?

Q188NLPSenior

You built a BERT-based sentiment classifier for customer reviews, achieving 91% accuracy on a balanced test set. After deploying it to analyze support tickets, the model consistently misclassifies technical complaints ('the API returns 500 errors') as neutral, missing negative sentiment. How do you diagnose and fix this?

Q189NLPMid

You're building a named entity recognition system for a legal document platform. Your training data has 10,000 labeled sentences, but the entity types ('contract clause', 'liability term') rarely appear in general NLP pre-training corpora. How do you maximize performance with limited data?

Q190NLPSenior

A retail company deploys a RAG-based product Q&A system using OpenAI embeddings and a vector database. After 2 months, customers complain the system gives outdated answers about product availability despite the database being refreshed nightly. What are the root causes and how do you fix them?

Q191Experimentation & A/B TestingSenior

Your e-commerce platform's A/B test shows a new checkout flow increases conversion rate by 2.3% (p=0.01). However, your data scientist colleague argues the result is invalid because users were exposed to both variants over multiple sessions. Explain the problem and propose a valid experimental design.

Q192Experimentation & A/B TestingMid

You're running an A/B test on a recommendation algorithm. After 5 days, conversion rate is 4.1% in control and 4.6% in variant (p=0.04). Your product manager wants to call it and ship. What concerns do you raise?

Q193Experimentation & A/B TestingSenior

A gaming company runs experiments on matchmaking algorithms, but the network effects make user-level randomization invalid—changing matchmaking for 10% of users affects everyone in their matches. How do you design a valid experiment in this network-effect setting?

Q194ML System DesignSenior

Design an ML system for a rideshare company that predicts driver supply and rider demand at 15-minute intervals for the next 2 hours across 500 city zones, to inform surge pricing. What features, models, infrastructure, and monitoring would you build?

Q195ML System DesignSenior

A bank wants to build a real-time credit limit increase recommendation system. Customers should receive proactive limit increase offers within 30 seconds of a qualifying event (e.g., salary credited). Design the end-to-end ML and data pipeline.

Q196ML System DesignMid

You're asked to build a content moderation ML system that flags toxic comments on a user-generated content platform with 5 million comments per day. You have a team of 10 moderators. How do you design the human-in-the-loop workflow?

Q197Model DeploymentSenior

You're deploying a deep learning model for real-time image classification on edge devices (mobile phones) in a developing market with intermittent connectivity. The model is 450MB and takes 800ms on a mid-range phone. How do you make this deployable?

Q198Model DeploymentMid

Your ML model serving layer on Kubernetes has P99 latency of 250ms on Monday mornings due to traffic spikes. The average latency is 45ms. How do you diagnose the cause and what solutions do you implement?

Q199Model DeploymentSenior

Your team is debating between a shadow mode deployment and a canary deployment for rolling out a new fraud detection model in a payments company. The old model has been in production for 2 years. Make the case for the right approach and describe what you monitor.

Q200Time SeriesSenior

A manufacturing plant's equipment failure prediction model was built using 3 years of sensor data. After 6 months in production, it starts generating false alarms 3x more frequently than at launch, while missing real failures. The plant has added no new equipment. What do you investigate?
