Data Scientist / ML Engineer Interview Questions
200 scenario-based questions with detailed model answers, organized by skill and by tool.
Your e-commerce team planned a 2-week A/B test on a new checkout flow and saw a 4% lift in conversion. The p-value is 0.03. Leadership wants to ship immediately, but you notice the test was stopped early on day 10 when results looked good. How do you handle this?
A product analyst says your churn prediction model is 92% accurate and therefore excellent. The dataset has 8% churners and 92% non-churners. How do you explain why this accuracy number is misleading and what metrics you'd use instead?
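The arithmetic behind this scenario can be checked directly. The sketch below assumes a hypothetical population of 1,000 customers with the 8%/92% split from the question and a trivial baseline that never predicts churn; the counts and variable names are illustrative only.

```python
# Hypothetical labels: 1 = churner, 0 = non-churner (8% / 92% split).
y_true = [1] * 80 + [0] * 920

# A trivial baseline that always predicts "no churn".
y_pred = [0] * len(y_true)

# Confusion-matrix counts computed by hand (no external libraries).
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)
# Guard against division by zero when nothing is predicted positive.
recall = tp / (tp + fn) if (tp + fn) else 0.0
precision = tp / (tp + fp) if (tp + fp) else 0.0

print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f}")
```

The baseline matches the quoted 92% accuracy while catching no churners at all, which is why recall, precision, and PR-AUC on the churn class are usually more informative here.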
You're building a real-time fraud detection model at a payments company. Transaction volume is 50k/second. Your data science team proposes computing 200 rolling aggregate features (e.g., spend in last 1h, 24h, 7d) at inference time. What are the risks and how would you architect this?
You have a categorical feature 'city' with 3,000 unique values and 20% of them appear fewer than 5 times in your training data. One-hot encoding explodes your feature space. What encoding strategies would you evaluate and what's your decision process?
You trained a gradient boosting model for loan default prediction that achieves AUC-ROC 0.89 on your hold-out set. Three months after deployment, the model's Gini coefficient in production dropped from 0.78 to 0.51. The loan book and economic conditions haven't changed dramatically. What do you investigate first?
You're comparing a logistic regression baseline (AUC 0.81) against an XGBoost model (AUC 0.85) for a credit scoring use case. Your manager says 'just ship the XGBoost—it's better.' What questions do you raise before agreeing?
You're training a customer lifetime value regression model at a subscription SaaS company. The target variable (12-month LTV) has a heavily right-skewed distribution—most customers have LTV of $50-200, but 2% have LTV above $5,000. A naive RMSE-optimized model mostly ignores the high-value tail. How do you fix this?
You have a binary classification problem with 1 million training samples. You try a random forest with 500 trees (AUC 0.84) and XGBoost with default hyperparameters (AUC 0.83). Your colleague says random forest is better. You suspect XGBoost is undertuned. How do you proceed?
You're tasked with segmenting 2 million B2B customers for a telecom company using behavioral usage data (call minutes, SMS, data consumption, roaming frequency). You run K-means with K=8. Your sales director says some segments are 'too broad' and wants finer segments. How do you evaluate and refine the segmentation without just increasing K?
You run K-means on customer transaction data and get well-separated clusters in 2D UMAP plots. When you hand the segments to the marketing team, they say the clusters don't make business sense—high-value and low-value customers are mixed. What went wrong and how do you fix it?
You're training a ResNet-50 image classifier on a dataset of 50,000 medical X-rays to detect pneumonia. Training accuracy reaches 98% but validation accuracy plateaus at 71%. You've tried dropout and L2 regularization but the gap persists. What's your structured debugging approach?
You're training a text classification model using BERT fine-tuning on a 10,000-sample customer support ticket dataset. After 3 epochs, training loss is 0.05 but validation loss is 1.8. What are three concrete things you'd try?
You've deployed a BERT-based product review classifier at an e-commerce company. Three months in, a new product category (smart home devices) launches. Accuracy on smart home reviews drops to 62% while overall accuracy stays above 85%. How do you handle this without full retraining every week?
You're building a named entity recognition (NER) system for a legal document processing pipeline. Your spaCy NER model trained on general web text achieves F1 of 0.55 on legal contracts. A lawyer suggests you just need more data. Is she right? What else could improve this significantly?
Your growth team runs 15 simultaneous A/B tests on your mobile app. When you analyze results, 4 tests show significant positive effects (p < 0.05). The team wants to ship all 4 winners. You flag that this is likely wrong. Explain why and what corrective analysis you'd run.
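A rough sketch of the multiple-comparisons arithmetic for this scenario follows. It assumes the 15 tests are independent, and the p-values are made up for illustration (four below 0.05, as in the question); the `benjamini_hochberg` helper is a hand-written version of the standard step-up procedure, not a library call.

```python
ALPHA = 0.05
N_TESTS = 15

# Chance of at least one false positive if all 15 nulls are true
# (assumes independent tests).
fwer = 1 - (1 - ALPHA) ** N_TESTS
expected_false_positives = N_TESTS * ALPHA
bonferroni_threshold = ALPHA / N_TESTS


def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the null is rejected at FDR q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    # Find the largest rank k with p_(k) <= (k / m) * q.
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values: 4 "winners" below 0.05, 11 clearly null results.
p_values = [0.001, 0.012, 0.03, 0.045,
            0.08, 0.15, 0.22, 0.31, 0.40, 0.48,
            0.55, 0.63, 0.74, 0.86, 0.95]

bh_rejected = benjamini_hochberg(p_values, q=ALPHA)
bonferroni_rejected = [p <= bonferroni_threshold for p in p_values]

print(f"FWER across {N_TESTS} tests: {fwer:.3f}")
print(f"Expected false positives: {expected_false_positives:.2f}")
print(f"Survive Benjamini-Hochberg: {sum(bh_rejected)}")
print(f"Survive Bonferroni: {sum(bonferroni_rejected)}")
```

With these illustrative p-values only one of the four nominal winners survives either correction, which is the kind of result the corrective analysis is meant to surface.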
You're running an A/B test on a new recommendation algorithm for a music streaming app. Your primary metric is 7-day retention, but you notice that session length increased significantly in the treatment group during the first 3 days. Should you call the test early based on this secondary metric?
Design a recommendation system for a news app with 5 million daily active users and 50,000 articles published per day. The system must serve personalized feeds with p99 latency under 100ms. Walk me through your architecture from data ingestion to serving.
A fintech startup asks you to design an ML system to approve or reject personal loan applications in real time (decision within 2 seconds). They have 3 years of historical loan data, no existing ML infrastructure, and a team of 2 data scientists. What would you build first and what would you defer?
Your ML team deployed a churn model to production via a REST API. Six weeks later, DevOps reports the inference service is using 4x more memory than provisioned and occasionally OOM-killing. The model hasn't changed. What are the likely causes and how do you debug?
You've trained a scikit-learn pipeline that includes a TF-IDF vectorizer and a logistic regression model for text classification. You pickle it and deploy to production. Two weeks later the service throws a deserialization error. What went wrong and how do you prevent this going forward?
You're a data scientist at a retail chain. Your demand forecasting model (Prophet) was achieving MAPE of 12% before COVID. During the pandemic and recovery period (2020-2022), MAPE spiked to 45%. Now in 2024, leadership wants the model fixed. How do you approach rebuilding it?
You're forecasting daily electricity consumption for a utility company. Your ARIMA model has low training error but the residuals show clear autocorrelation in the ACF plot at lag 7. What does this tell you and how do you fix it?
Your hiring algorithm at an HR tech company shows that female candidates are 23% less likely to be shortlisted than male candidates with equivalent qualifications. Legal says this creates disparate impact liability. How do you diagnose the source of the bias and what mitigation strategies are available?
You're reviewing a credit scoring model and find that it performs well overall (AUC 0.82) but has AUC of 0.68 for applicants from rural zip codes. A colleague says 'rural customers are a small segment, so overall performance is fine.' How do you respond?
You have a Pandas DataFrame with 50 million rows representing transaction events. When you run `.groupby('customer_id').agg({'amount': 'sum', 'event_time': 'max'})`, the process runs for 25 minutes and eats 32GB of RAM. How do you speed this up?
You're doing feature engineering on a training dataset with 200 features. You need to compute pairwise correlations, handle missing values, and run outlier detection—all inside a scikit-learn Pipeline to prevent train-val leakage. A junior engineer says 'just preprocess the whole DataFrame before splitting.' How do you explain the leakage risk and structure the pipeline correctly?
You're tasked with computing 30-day rolling average revenue per user from a raw events table with 2 billion rows in BigQuery. A colleague wrote a solution using a self-join. It's been running for 90 minutes. How do you rewrite it to be efficient?
Your data science team needs to compute a 'next purchase date' feature for each customer-product pair in a 500M row orders table for training a repurchase prediction model. Naive implementations keep timing out. Walk me through an efficient approach in SQL and explain the pitfall of the obvious first attempt.
You're analyzing whether a new pricing algorithm caused a revenue increase of $2M/month. The rollout was not randomized—it was deployed to high-traffic markets first. A stakeholder claims the revenue increase proves the algorithm works. What's wrong with this causal claim and how would you establish a more rigorous estimate?
You're building a real-estate price prediction model. 'Neighborhood' is a key predictor with 800 unique values. Some neighborhoods have only 3 sales in your training data. You try target encoding and observe severe overfitting on rare neighborhoods. How do you address this rigorously?
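One common remedy for the rare-category problem in this question is to shrink each neighborhood's mean toward the global mean. The sketch below assumes made-up sale prices (in thousands) and a smoothing weight `m` chosen arbitrarily; in practice `m` would be tuned and the encoding computed out-of-fold so that a row's own target never informs its encoded value.

```python
def smoothed_target_encoding(targets_by_category, m):
    """Shrink each category mean toward the global mean.

    encoded = (n * category_mean + m * global_mean) / (n + m),
    where n is the number of rows in the category and m is the
    smoothing weight (a hypothetical hyperparameter here).
    """
    all_values = [v for values in targets_by_category.values() for v in values]
    global_mean = sum(all_values) / len(all_values)
    encoded = {}
    for category, values in targets_by_category.items():
        n = len(values)
        encoded[category] = (sum(values) + m * global_mean) / (n + m)
    return encoded, global_mean


# Hypothetical data: one neighborhood with 3 sales, one with 97.
sales = {
    "rare_neighborhood": [900.0, 950.0, 1000.0],
    "common_neighborhood": [400.0] * 97,
}

encoded, global_mean = smoothed_target_encoding(sales, m=20)

raw_rare_mean = sum(sales["rare_neighborhood"]) / 3
print(f"global mean: {global_mean:.1f}")
print(f"rare raw mean: {raw_rare_mean:.1f} -> encoded: {encoded['rare_neighborhood']:.1f}")
print(f"common encoded: {encoded['common_neighborhood']:.1f}")
```

The three-sale neighborhood is pulled most of the way toward the global mean, while the well-populated one barely moves, so the model sees far less noise from the tail.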
You train five models with 10-fold cross-validation. The best model has mean CV AUC 0.854 ± 0.012. The second-best has 0.849 ± 0.009. Your stakeholder says pick the best CV score. What's the statistically correct approach and what would you actually recommend?
You're building a multi-label text classifier to tag customer support tickets with one or more of 40 product categories. 15 of the 40 categories have fewer than 100 training examples. Standard binary cross-entropy fine-tuning on BERT gives micro-F1 of 0.71 but macro-F1 of 0.42. How do you improve the tail-category performance?
You're fine-tuning a large language model (a 7B-parameter LLaMA variant) for a domain-specific task. The training job uses 8 A100 GPUs. After 3 days of training, loss is still not converging: it oscillates between 2.1 and 2.4. What do you diagnose and fix?
You've deployed a RAG (retrieval-augmented generation) pipeline for a legal research assistant. Users report that the system confidently answers questions with incorrect case law citations that look plausible but don't exist. How do you diagnose and mitigate this hallucination problem?
You're running an A/B test at a ride-sharing company where treatment users get a surge pricing warning 10 minutes before a known demand spike. You see that treatment group users have 15% higher ride completion. But your data scientist flags that there may be network effects—treatment users booking ahead might displace control users during the spike. How do you handle interference in this experiment?
A startup wants to build a 'similar products' feature for their e-commerce site with 100,000 SKUs. They have purchase history, product descriptions, and images. They have no ML infrastructure. What's the simplest working system you'd build in 2 weeks?
You need to deploy a PyTorch model that runs inference on 4K video frames in real time (30 fps) on edge devices (NVIDIA Jetson AGX) with no internet connectivity. The model currently takes 150ms per frame on a cloud GPU. How do you approach optimization and deployment?
You're building a predictive maintenance model for 500 industrial machines. Each machine generates sensor readings (temperature, vibration, pressure) at 1Hz for 3 years. Machine failures are rare—2% of machine-days have a failure within the next 7 days. How do you structure this as an ML problem and what architecture would you use?
Your NLP-based resume screening model shows that candidates with names common in certain ethnic groups are 18% less likely to pass the first screen, even controlling for education and experience. The model doesn't use name as a feature. How is this possible and what do you do?
You need to build a feature engineering function that processes 10,000 individual customer DataFrames (each ~5,000 rows) sequentially in a production batch job. The current implementation takes 4 hours. A teammate suggests switching to Dask. Is Dask the right fix? What would you try?
You're building a feature for 'days since last purchase' for a churn model. Your SQL query uses `DATEDIFF(CURRENT_DATE, MAX(order_date))` grouped by customer_id. During model validation, you discover this feature has 0% importance in the tree model. A senior engineer says the feature is fine. What's likely wrong?
Your manager asks you to determine whether men or women have higher average order values on your platform. You run a t-test and get p=0.04. She asks 'so women spend 15% more—can we now target our campaigns accordingly?' What do you tell her?
You're tasked with anomaly detection on server infrastructure logs at a cloud provider—500 million log entries per day across 10,000 servers. You need to flag unusual server behavior in near real time (within 5 minutes of occurrence). How do you design this system?
You're building a model to predict which of 10,000 daily job applicants to advance to a phone screen. You have 3 years of historical data. Your head of recruiting says calibration is critical because the model score will be used directly to prioritize recruiter time. How do you ensure and validate calibration?
You're building a product image quality classifier to reject blurry or poorly lit product photos. You have 5,000 labeled images (3,500 good, 1,500 bad) and a team asking for 95% precision on the 'bad' class to avoid incorrectly rejecting seller photos. How do you set the decision threshold and validate it?
You need to detect duplicate customer support tickets submitted through multiple channels (email, chat, web form). Tickets are typically 50-200 words. You have no labeled pairs of duplicates. How do you build a duplicate detection system without labeled training data?
You're an analyst at a food delivery app. You run an A/B test where treatment users see restaurant photos on the search results page. After 2 weeks, there's no statistically significant difference in order conversion rate (p=0.41). Your PM says 'photos don't work—let's drop the idea.' What do you tell her?
You're the lead data scientist at an insurance company designing a claims severity prediction model that will directly influence settlement offers. The model will process 500 new claims per day. How do you design for fairness, auditability, and human oversight without slowing down operations?
Your time series model for forecasting weekly store sales is trained on 3 years of data. When you validate with a random 20% holdout split, MAPE is 8%. When you validate with a temporal holdout (last 6 months), MAPE is 19%. Your manager asks why there's such a big discrepancy. What do you explain?
You've built a content recommendation model for a news platform. After deployment, editorial staff notice the model consistently recommends less content from women journalists and underrepresented ethnic groups compared to white male journalists, even when controlling for article engagement. What do you investigate?
You're asked to find all customers who purchased from category A and category B but NOT category C, using a transactions table in Pandas. Write the logic and explain any performance considerations for a 10M row dataset.
You need to calculate 30-day cohort retention for a mobile app: for each cohort (users by sign-up week), what percentage returned on day 30? You have a user_events table with user_id, event_type, and event_date. Write the query logic and explain the key challenge.
You're asked to build an email spam classifier that will process 10 million emails per day for a mid-size enterprise. The system must have false positive rate under 1% (legitimate emails incorrectly marked as spam). How do you design the model and evaluation strategy?
You're analyzing a clinical trial dataset where a new drug reduces hospitalization rate from 12% (control) to 8% (treatment). The p-value is 0.001. A physician argues this is clinically insignificant. You argue it might be meaningful. How do you reconcile statistical and clinical significance, and what metrics communicate this best to clinicians?
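The effect-size measures that usually matter to clinicians can be derived from the two rates in the question. A minimal sketch, using only the 12% and 8% figures given above (variable names are illustrative):

```python
control_rate = 0.12    # hospitalization rate in the control arm
treatment_rate = 0.08  # hospitalization rate in the treatment arm

# Absolute risk reduction: difference in event rates.
arr = control_rate - treatment_rate
# Relative risk reduction: ARR as a share of the control rate.
rrr = arr / control_rate
# Number needed to treat to prevent one hospitalization.
nnt = 1 / arr

print(f"ARR = {arr * 100:.1f} percentage points")
print(f"RRR = {rrr * 100:.1f}%")
print(f"NNT = {nnt:.0f} patients")
```

A 4-point absolute reduction means treating roughly 25 patients to avoid one hospitalization; whether that is clinically meaningful depends on cost, side effects, and the severity of the outcome, none of which the p-value captures.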
You're building a binary classifier using logistic regression. After training, your ROC-AUC on the validation set is 0.79. A data scientist suggests you should definitely try a random forest next. Do you agree? What additional analysis would you do first?
You are building a churn model for a SaaS platform. One of your engineers added a feature 'days_to_churn' computed from the future cancellation event. Your validation AUC jumps from 0.78 to 0.97. You're suspicious. How do you detect and confirm this is data leakage?
Your fintech firm ran a 6-week A/B test on a new loan-approval model. The p-value is 0.048, but the ops team noticed the treatment arm had 15% more high-income applicants due to a routing bug. How do you salvage valid conclusions from this contaminated experiment?
You are a data scientist at a SaaS company. The marketing team asks whether the 8% increase in trial-to-paid conversion last quarter was statistically significant. All you have is a CSV of weekly conversion rates for 12 weeks before and 13 weeks after the campaign launched. Walk through your analysis.
You are building a real-time churn-risk scorer for a B2B SaaS platform. The raw event stream arrives via Kafka. Account-level features like 'days since last login' or 'seats used / seats purchased ratio' must be precomputed and served in under 50 ms. Describe your feature pipeline architecture end-to-end.
You join an e-commerce team building a product recommendation model. The item catalog has 2 million SKUs with high cardinality categorical fields: brand (12,000 unique), category (800 unique), and seller (150,000 unique). Raw one-hot encoding is infeasible. What feature encoding strategy do you use, and why?
A healthcare startup asks you to pick between a gradient-boosted tree model (AUC 0.91) and a logistic regression (AUC 0.87) for predicting 30-day hospital readmission. The model output will be shown to clinicians to trigger care-coordination calls. Which model do you recommend, and what factors decide the answer?
You trained a binary classifier on imbalanced data (2% positive class). Your colleague reports F1=0.82 and declares success. You are skeptical. What additional metrics and tests do you run to validate whether the model is actually useful?
You are building a fraud detection model at a payment processor handling 50,000 transactions per second. The model must score each transaction in under 10 ms. Your current XGBoost model scores in 2 ms on a single core but the team wants to add 300 new features from graph topology (degree centrality, community label). How do you evaluate whether adding these features is feasible without breaking the latency SLA?
You trained a gradient boosted model to predict employee attrition. During feature importance analysis, you notice 'employee ID' ranks in the top 5 features. The model's validation AUC is 0.93. What does this indicate and what do you do?
Your retail client wants to segment 8 million customers for personalized marketing campaigns. K-means on the full dataset takes 40 minutes per run and the marketing team wants to re-segment weekly as new transaction data arrives. How do you redesign the clustering pipeline to meet operational requirements?
You applied K-means with K=5 to cluster website users by behavioral features. A business stakeholder reviews the output and complains that two clusters look nearly identical. What steps do you take to diagnose and fix this?
You are training a ResNet-50 for defect detection on semiconductor wafer images at a chip manufacturer. Training loss decreases steadily but validation loss plateaus after epoch 12 and then diverges. The dataset has 50,000 images, 8 defect classes, and significant class imbalance (one rare defect class has only 200 examples). Diagnose and fix this.
You are training a multi-layer neural network for tabular credit scoring data and find that training is unstable — loss spikes randomly every few batches. Batch size is 64, learning rate is 0.01 with Adam. What do you investigate and how do you stabilize training?
Your team deployed a BERT-based sentiment classifier for customer support tickets at a telecom company. In production, the model performs well on broadband-related tickets but poorly on new 5G plan complaints. You have 500 labeled 5G tickets and cannot afford full fine-tuning. What is your strategy to adapt the model?
You are building a document classifier to route incoming legal contracts to the right team. You have 2,000 labeled contracts across 12 contract types. A colleague suggests fine-tuning GPT-4 via the OpenAI API. You suggest a different approach. What is it and why?
You run a pricing A/B test at a subscription streaming service. The test variant increases subscription price by $2/month. After 4 weeks you observe a 1.1% decrease in new subscriptions but a 5.8% increase in revenue per subscriber. Your CEO wants to ship immediately. You recommend waiting. Why?
You are setting up an A/B test for a new checkout flow. The product manager says 'run it for one week and stop when p < 0.05'. You push back. Explain the statistical problem and propose a correct stopping rule.
Design an end-to-end ML system for a food delivery platform that predicts estimated delivery time (ETA) in real time at the moment a customer places an order. The system must handle 20,000 concurrent orders at peak, update ETAs every 30 seconds as driver location changes, and achieve a median absolute error of under 3 minutes.
An autonomous vehicle company asks you to design the data flywheel for improving their lane-change maneuver model over time. The fleet generates 1 TB of sensor data per vehicle per day across 2,000 vehicles. How do you select, label, store, and retrain efficiently without drowning in data?
Your team shipped a new XGBoost model to production via a canary rollout (10% traffic). Within 6 hours the on-call alert fires: the canary model's positive prediction rate has dropped from 18% to 4%. The old model is still serving 90% of traffic at a 17-19% rate. What is your incident response playbook?
You need to deploy a trained scikit-learn pipeline (StandardScaler + RandomForestClassifier) for batch scoring of 5 million customer records nightly. The scoring job must complete in under 30 minutes. Walk through your deployment and optimization approach.
You manage demand forecasting for a grocery chain with 500 stores and 30,000 SKUs. A traditional ARIMA-based pipeline runs store-SKU combinations independently and takes 18 hours to generate next-week forecasts. Leadership wants forecasts refreshed daily. How do you redesign the system?
You trained a Prophet model to forecast weekly sales for a retail client. The model performed well for 3 months, then suddenly started under-predicting by 30% for 6 consecutive weeks. The client says no promotions or external events changed. What do you investigate?
Your credit risk model is deployed at a bank. An internal audit shows that Black applicants are approved at a 62% rate vs 79% for white applicants, even though both groups have similar median FICO scores. Leadership asks you to 'fix the model' while maintaining the same overall approval rate and default rate. Describe your investigation and the technical and ethical path forward.
You are asked to build a resume screening model to help an HR team shortlist candidates for software engineering roles. A colleague says 'just train it on the last 5 years of hiring decisions — those are the ground truth.' Why is this dangerous, and what alternative do you propose?
You have a pandas DataFrame with 150 million rows representing user clickstream events. Reading the raw CSV takes 8 minutes and groupby operations are slow enough to time out your Jupyter session. Colleagues suggest Dask; your infrastructure has a 32-core machine with 256 GB RAM. What is your end-to-end performance strategy?
A junior team member's data preprocessing script runs for 45 minutes on a 2-million-row DataFrame and uses a for-loop calling apply() row by row to compute a feature. You are asked to review and speed it up. Describe your approach and the vectorized alternative.
You are a senior data scientist at a ride-sharing company. Write a SQL query to compute the 7-day rolling retention rate: for each cohort of drivers who completed their first ride on a given day, what percentage were still active (completed at least one ride) in each of the 7 days following their first ride day?
You have a table of e-commerce orders with columns: order_id, user_id, order_date, revenue. Write a query to identify users who made purchases in three or more consecutive months, and return their user_id and the starting month of their longest consecutive streak.
Your company runs a multi-armed bandit for content recommendation. After 3 weeks, one arm has a 23% CTR and another has 21% CTR. The engineering lead says 'the bandit already chose the winner, just deploy the 23% arm.' You recommend a held-out A/B test instead. Defend your position with specific statistical reasoning.
You are building a credit card transaction fraud model. The feature 'merchant category code' (MCC) has 750 unique values and is critical for fraud detection but highly imbalanced — 5 MCCs account for 60% of transactions while 400 MCCs have fewer than 100 transactions in training data. How do you encode MCC to maximize signal and handle the tail?
You are building a document relevance ranker for an internal enterprise search engine. Your colleague used accuracy as the evaluation metric and reports 94% accuracy. You suspect this is misleading. What metrics would you actually use, and why?
You are a principal ML engineer at an insurance company. The actuarial team built a linear regression to predict claim severity. Your task is to replace it with an ML model that outperforms it. However, the actuarial team requires the model's predictions to monotonically increase with 'years since last claim' and decrease with 'safety device installed.' How do you enforce these monotonic constraints while preserving model power?
You are tasked with identifying patient sub-populations in a hospital EHR dataset with 200 clinical features (lab values, diagnoses, medications, vitals) for 80,000 patients. Many features are missing (15-60% missingness per feature) and the features are on very different scales. How do you approach this clustering problem?
Your team is fine-tuning a large language model (a 7B-parameter LLaMA variant) for document summarization on 4 A100 GPUs (80GB VRAM each). Training crashes after 200 steps with a CUDA out-of-memory error. The batch size is 8 and sequence length is 2048. How do you diagnose and resolve this?
You deployed a RAG (Retrieval-Augmented Generation) system for an internal knowledge base at a consulting firm. Users complain that the system sometimes returns confident but wrong answers — it pulls irrelevant chunks from the vector store and the LLM still generates a plausible-sounding response. How do you diagnose and improve retrieval precision?
You work at a social media platform. You want to measure the causal effect of showing users a mental-health resource banner on their engagement patterns. A traditional A/B test is not feasible because showing the banner to some users but not others on the same social graph creates network interference — a treated user's friends may change their behavior too. How do you design this study?
You are asked to design a machine learning pipeline to classify incoming customer support emails into 8 intent categories and route them to the correct team in under 5 seconds. The system must handle 10,000 emails per hour at peak. Sketch the end-to-end architecture.
Your ML platform team is asked to support multiple data science teams deploying models with different requirements: sub-100ms latency for real-time fraud, batch scoring nightly for CRM, and GPU inference for an NLP model. Design a unified model serving platform that handles all three patterns without requiring separate bespoke infrastructure for each.
A logistics company asks you to build an anomaly detection system for shipment transit times. Normal transit time for ground shipping is 3-7 days depending on distance; delays beyond 10 days should trigger an alert. The data has strong day-of-week seasonality, occasional holiday spikes, and a secular upward trend since COVID. How do you approach this?
Your NLP model for automated job description scoring rates descriptions on 'inclusivity' (gender-neutral language, accessibility statements). An internal audit shows the model consistently rates descriptions from job families dominated by women (nursing, teaching, social work) as less inclusive than descriptions from male-dominated fields (software engineering, finance), even when language is identical. What is causing this and how do you fix it?
You need to join a 10-million-row Spark DataFrame of user events with a 50-row lookup table of campaign metadata. A junior engineer wrote a join operation that triggers a full shuffle across the Spark cluster and takes 45 minutes. How do you fix this and explain the mechanism?
You are analyzing the impact of a product feature release at a SaaS company. Write a SQL query to compute a difference-in-differences estimate: compare the change in weekly active users (WAU) for accounts that adopted the feature in week 10 versus accounts that never adopted it, measuring 4 weeks before and 4 weeks after the adoption week.
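One possible shape for the difference-in-differences query is sketched below, run against an in-memory SQLite database so it can be executed as-is. The table name `account_weekly_wau`, its columns, and the toy numbers are all assumptions made for illustration; a real schema would likely need a join to derive the `adopted` flag.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account_weekly_wau ("
    "account_id INTEGER, week INTEGER, wau REAL, adopted INTEGER)"
)

# Toy data: account 1 adopts in week 10, account 2 never adopts.
rows = []
for week in range(6, 15):
    is_post = week > 10
    rows.append((1, week, 130.0 if is_post else 100.0, 1))
    rows.append((2, week, 90.0 if is_post else 80.0, 0))
conn.executemany("INSERT INTO account_weekly_wau VALUES (?, ?, ?, ?)", rows)

# Pre window: weeks 6-9. Post window: weeks 11-14. Week 10 is excluded.
# AVG ignores the NULLs produced by non-matching CASE branches.
query = """
SELECT
  (AVG(CASE WHEN adopted = 1 AND week BETWEEN 11 AND 14 THEN wau END)
   - AVG(CASE WHEN adopted = 1 AND week BETWEEN 6 AND 9 THEN wau END))
  -
  (AVG(CASE WHEN adopted = 0 AND week BETWEEN 11 AND 14 THEN wau END)
   - AVG(CASE WHEN adopted = 0 AND week BETWEEN 6 AND 9 THEN wau END))
  AS did_estimate
FROM account_weekly_wau
"""
did_estimate = conn.execute(query).fetchone()[0]
print(f"DiD estimate: {did_estimate:.1f} weekly active users")
```

In the toy data adopters gain 30 WAU and non-adopters gain 10, so the estimate is 20. The estimate is only credible if the two groups were on parallel trends before week 10, which is worth checking on the pre-period data.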
Your team is testing 12 different UX changes simultaneously via separate A/B tests, all sharing the same user base. Your manager says each test has p < 0.05 so all 12 winners should ship. You disagree. Explain why and propose a correction.
You are building a named entity recognition (NER) system to extract product names, prices, and quantities from unstructured supplier invoices. The invoices are PDFs from 50 different suppliers with different layouts and fonts. You have 300 labeled invoices. Describe your end-to-end approach.
You are asked to build a model to forecast daily revenue for a B2B SaaS company one month ahead. The company has 3 years of daily data. Revenue has a weekly cycle, end-of-quarter spikes (Q-end enterprise deals), and grew 40% year-over-year. What features and model would you use?
A colleague proposes removing 'gender' and 'ethnicity' from a hiring model's feature set to make it fair. You point out that removing protected attributes is not sufficient for fairness. Explain why and what you would do instead.
A startup wants to build a system that automatically tags news articles with topics (politics, sports, finance, health, technology) in real time as articles are published. They publish 200 articles per day. How would you design this system from scratch on a limited budget?
You trained a recommendation model offline using implicit feedback (clicks, views) from your e-commerce platform. The model has been running in production for 4 months and revenue-per-session is declining. How do you determine if the model is the cause and what is your next step?
You are training a convolutional neural network to classify skin lesion images (benign vs malignant). Your training dataset has 5,000 benign and 500 malignant images. After 20 epochs, training accuracy is 98% but validation recall for malignant (the clinically critical class) is only 45%. What are the problems and how do you fix them?
You have a table 'sessions' with columns session_id, user_id, start_time, end_time. Write a SQL query to find users who had overlapping sessions — sessions where end_time of one session is after start_time of the next session for the same user.
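A self-join is one straightforward way to express the overlap condition. The sketch below runs it in an in-memory SQLite database with made-up rows; timestamps are stored as ISO-formatted text so that string comparison matches chronological order, which is an assumption about the storage format rather than something stated in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sessions ("
    "session_id INTEGER, user_id INTEGER, start_time TEXT, end_time TEXT)"
)

# Hypothetical sessions for two users.
rows = [
    (1, 10, "2024-01-01 09:00", "2024-01-01 10:00"),
    (2, 10, "2024-01-01 09:30", "2024-01-01 11:00"),  # overlaps session 1
    (3, 10, "2024-01-01 12:00", "2024-01-01 13:00"),
    (4, 20, "2024-01-01 09:00", "2024-01-01 10:00"),
    (5, 20, "2024-01-01 10:00", "2024-01-01 11:00"),  # touches, no overlap
]
conn.executemany("INSERT INTO sessions VALUES (?, ?, ?, ?)", rows)

# Two intervals overlap when each starts before the other ends.
# a.session_id < b.session_id avoids self-matches and duplicate pairs.
query = """
SELECT DISTINCT a.user_id
FROM sessions AS a
JOIN sessions AS b
  ON a.user_id = b.user_id
 AND a.session_id < b.session_id
 AND a.start_time < b.end_time
 AND b.start_time < a.end_time
ORDER BY a.user_id
"""
overlapping_users = [row[0] for row in conn.execute(query)]
print(overlapping_users)
```

The strict inequalities treat back-to-back sessions as non-overlapping. On large tables the per-user self-join can be expensive; a window-function variant that compares each session's start to the running maximum of earlier end times for the same user is a common alternative.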
You run an A/B test on a checkout page redesign. The experiment ran for 2 weeks with 50,000 users per arm. Conversion rate in treatment is 4.8% vs 4.5% in control (p=0.03). Your product manager wants to ship. Before approving, what are the 5 checks you would do?
You are working on a price elasticity model for a hotel chain. The raw data includes hotel rates, competitor rates, occupancy, and booking windows. Your model performance degrades whenever a hotel opens a new property nearby. How do you engineer features that are robust to this type of structural change?
You are evaluating a new ML model for detecting sepsis risk in ICU patients. The model outputs a probability score every hour per patient. The existing rule-based system has 75% sensitivity and 60% specificity. Your ML model has AUC=0.91. The clinical team asks whether AUC alone tells them whether to switch. What is your answer and what evaluation framework do you propose?
You are working with a pandas DataFrame where a column contains JSON strings: each cell has a JSON object with varying keys like '{"city": "Mumbai", "pin": "400001"}'. You need to expand this into separate columns efficiently. The DataFrame has 1 million rows. How do you do this?
You ran DBSCAN on a set of 2D customer location coordinates. DBSCAN labeled 40% of your points as noise (-1). Your manager asks why so many points are noise and whether you should switch to K-means instead. How do you explain and what do you recommend?
You trained a random forest to predict whether a software project will exceed budget. In production, the model performs well on waterfall projects but poorly on agile projects. Your training data was 80% waterfall projects. What is the problem and how do you fix it?
You are the lead ML engineer at a bank building a real-time personalized offer recommendation system. The system must generate a ranked list of 5 offers for each customer when they open the mobile app. The bank has 8 million customers, 200 active offers, and strict regulatory requirements that every recommendation must be explainable to the customer. Design the system.
Your growth team ran a 4-week A/B test on a checkout flow change. Results show p=0.03 but the conversion lift is only 0.2%. Finance wants to ship it immediately. How do you advise them, and what statistical issues do you flag?
A healthcare startup asks you to build a readmission risk model. They want 99% recall at any cost. How do you explain the precision-recall trade-off and what threshold policy do you recommend?
You're building a real-time fraud detection model at a fintech. Feature computation latency must be under 10ms. You have 200+ raw transaction fields and historical user behavior going back 2 years. Walk through your feature engineering architecture.
Your e-commerce recommendation model has a 'days since last purchase' feature. After deployment you notice its distribution in production is heavily right-skewed, with many users at 0 and a long tail out to 730. What transformations do you apply and why?
You're leading ML for a B2B SaaS churn prediction project. The dataset has 50,000 accounts, 18-month history, heavy class imbalance (5% churn), and 120 features. Your manager wants a single model by Friday. How do you choose, evaluate, and justify?
You compare a neural network with 88% ROC-AUC against a logistic regression with 84% ROC-AUC for a medical diagnosis task. The product team wants to ship the neural network. What other evaluations do you run before agreeing?
A logistics company's delivery time prediction model was trained on pre-pandemic data. Post-2020, RMSE jumped from 12 minutes to 38 minutes in production. Diagnose the issue and design a remediation plan.
You're training a random forest for credit scoring and notice that test accuracy is 91% but training accuracy is 99%. What's happening and how do you fix it?
A retail bank wants to cluster its 2 million customers for personalized marketing. You've run K-means with k=8 and silhouette=0.31. Leadership asks if 8 is the right number and if clusters are actionable. What do you do next?
You apply DBSCAN to a geolocation dataset of 500k points to identify delivery zones. DBSCAN returns 80% of points as noise. How do you diagnose and fix this?
You're training a ResNet-50 for defect detection on an assembly line. Training loss drops to 0.08 but validation loss oscillates between 0.3 and 0.6 across epochs. The dataset has 12,000 images with 8 defect categories. What's your diagnosis and plan?
Your team is training a binary classification neural network and the loss is stuck at 0.693 from epoch 1. What is happening and how do you debug it?
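A quick check on why 0.693 is a telling number: it is ln 2, the binary cross-entropy of a model that outputs 0.5 for every example regardless of the label. The sketch below uses made-up labels to show this.

```python
import math

def binary_cross_entropy(labels, probs):
    """Mean binary cross-entropy for 0/1 labels and predicted P(y=1)."""
    total = 0.0
    for y, p in zip(labels, probs):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

labels = [0, 1, 1, 0, 1]              # hypothetical labels
uninformative = [0.5] * len(labels)   # a model that always predicts 0.5
stuck_loss = binary_cross_entropy(labels, uninformative)

print(round(stuck_loss, 3), round(math.log(2), 3))
```

A loss pinned at this value from the first epoch suggests the network is not using its inputs at all, which narrows the search to things like dead activations, a learning rate of zero or far too high, or labels disconnected from features.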
An insurance company wants to classify claim narratives (avg 300 words) into 45 damage categories. They have 8,000 labeled examples. GPT-4 zero-shot accuracy is 61%, fine-tuned BERT is 74%. They want to get to 85%. What's your plan?
You're building a sentiment analysis model for product reviews in Hindi-English code-mixed text. Standard English BERT performs at 64% accuracy. How do you improve it?
Your e-commerce platform runs 30 concurrent A/B tests. The data science team reports 12 winners at p<0.05. The CTO asks why so many tests are winning simultaneously. How do you explain and fix the multiple testing problem?
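To make the multiple testing problem concrete: with 30 tests at a 0.05 threshold, about 1.5 false positives are expected even if no variant has any real effect. The sketch below computes that figure and then applies the Benjamini-Hochberg step-up procedure, one common way to control the false discovery rate. The p-values are hypothetical and the at-least-one probability assumes independent tests.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return indices of hypotheses rejected by the Benjamini-Hochberg rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    # Find the largest rank k whose sorted p-value is <= (k / m) * fdr.
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])

n_tests, alpha = 30, 0.05
expected_false_positives = n_tests * alpha          # if every null is true
prob_at_least_one = 1 - (1 - alpha) ** n_tests      # assumes independent tests

# Hypothetical p-values from ten experiments.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
naive_winners = [i for i, p in enumerate(p_values) if p < alpha]
bh_winners = benjamini_hochberg(p_values, fdr=0.05)

print(expected_false_positives, round(prob_at_least_one, 3))
print(naive_winners, bh_winners)
```

In this made-up example the naive threshold declares five winners, while the corrected procedure keeps only the two strongest results.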
You run an A/B test for a new onboarding flow. The test ran for 3 days and your manager declares the variant winner with p=0.04. You're skeptical. What problems do you raise?
Design an ML system to power a personalized job recommendation engine for a platform with 5 million users and 200,000 active job postings, with sub-200ms p99 latency requirements.
A streaming platform wants to detect anomalies in real-time viewing patterns to flag potential account sharing violations. Design the detection system for 50 million events per hour.
You've trained a transformer model for document classification that achieves 91% accuracy. Inference latency is 340ms per document. The SLA requires 150ms. Walk through your optimization strategy.
Your model works perfectly in the development environment but produces wrong predictions in production. After investigation you find feature distributions differ between environments. What caused this and how do you prevent it?
A telecom company's monthly subscriber churn forecasting model has MAPE of 4% on validation but 22% on the last 3 months of actuals. Investigate and redesign the forecasting approach.
You're forecasting daily retail sales for 500 stores using a single global model. Some stores are new and have only 30 days of history. How do you handle this cold-start problem?
Your resume screening model shows 73% pass-through rate for men and 58% for women applying for engineering roles. Legal flags this as potential disparate impact. How do you investigate and remediate?
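A first screening step is often the four-fifths rule from the US EEOC Uniform Guidelines: the selection rate of the lower-rate group divided by that of the higher-rate group is compared against 0.8. The sketch below applies it to the rates in the question. It is a screening heuristic under that assumption, not a legal determination.

```python
def adverse_impact_ratio(rate_lower_group, rate_higher_group):
    """Ratio of selection rates; values under 0.8 fail the four-fifths screen."""
    return rate_lower_group / rate_higher_group

ratio = adverse_impact_ratio(0.58, 0.73)   # women vs. men pass-through rates
flagged = ratio < 0.8

print(round(ratio, 3), flagged)
```

The ratio comes out just under 0.8, so the gap would be flagged for deeper investigation, for example by comparing pass-through rates among equally qualified candidates.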
You build a loan default prediction model. Accuracy is 87% overall, but for applicants from rural postcodes it's 71%. How do you diagnose and address the performance gap?
You have a Pandas pipeline processing 50GB of transaction data daily on a single machine with 16GB RAM. The job crashes with MemoryError. How do you redesign it without moving to a distributed system?
A colleague's data cleaning script runs in 45 minutes on a 2M-row DataFrame. You're asked to optimize it. After reviewing the code you find a `df.iterrows()` loop with conditional logic. What do you do?
You need to compute 7-day rolling retention cohorts for a mobile app with 10 million users and 500 million events in BigQuery. The query is timing out after 10 minutes. How do you redesign it?
A product manager asks you to identify users who made at least 3 purchases within any 30-day rolling window using a PostgreSQL events table. Write the query approach and explain the logic.
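One way to express the rolling-window logic, sketched here in SQLite through Python's sqlite3 rather than PostgreSQL so it is runnable as-is. The table layout (user_id, event_type, event_time) and the sample rows are assumptions. The idea: for each purchase, look back two purchases for the same user; if that earlier purchase is at most 30 days before, three purchases fall inside one 30-day window. In PostgreSQL the date arithmetic would typically use an interval comparison instead of julianday.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (user_id INTEGER, event_type TEXT, event_time TEXT);
INSERT INTO events VALUES
  (1, 'purchase', '2024-01-01'), (1, 'purchase', '2024-01-15'), (1, 'purchase', '2024-01-29'),
  (2, 'purchase', '2024-01-01'), (2, 'purchase', '2024-02-15'), (2, 'purchase', '2024-03-30'),
  (3, 'purchase', '2024-01-01'), (3, 'purchase', '2024-03-01'),
  (3, 'purchase', '2024-03-10'), (3, 'purchase', '2024-03-20'),
  (3, 'page_view', '2024-03-21');
""")

ROLLING_SQL = """
WITH purchases AS (
  SELECT user_id, event_time,
         LAG(event_time, 2) OVER (PARTITION BY user_id ORDER BY event_time)
           AS two_purchases_ago
  FROM events
  WHERE event_type = 'purchase'
)
SELECT DISTINCT user_id
FROM purchases
WHERE julianday(event_time) - julianday(two_purchases_ago) <= 30
ORDER BY user_id
"""

qualifying_users = [row[0] for row in conn.execute(ROLLING_SQL)]
print(qualifying_users)
```

Whether the boundary is inclusive (exactly 30 days apart counts) is a definition to confirm with the product manager; the sketch treats it as inclusive.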
A pharmaceutical company uses Bayesian A/B testing to compare two drug formulations for pain relief. The posterior shows 87% probability that formulation B reduces pain more than A. The clinical lead wants to halt the trial and switch to B. How do you advise?
You're building a credit card default prediction model for a bank. Your dataset has 3 years of monthly payment history per customer. How do you encode this temporal payment sequence as features for a gradient boosted tree?
You're evaluating a regression model to predict apartment rental prices. You report R² = 0.89 to stakeholders. Your manager says R² alone is not enough. What else do you report and why?
You're training a multi-label text classifier to tag support tickets with issue categories (billing, technical, shipping, etc.). Each ticket can have multiple tags. What loss function and evaluation metrics do you use?
An online education platform wants to use clustering to identify student learning archetypes from event logs. You have 500 features from clickstream, quiz scores, video watch rates, and forum activity. Describe your end-to-end approach.
You're training a recommendation model using a two-tower architecture (user tower, item tower with dot-product similarity). In production serving, embedding updates for new items need to propagate within 5 minutes. How do you design this?
You deployed a GPT-4-based customer service bot for a bank. After 2 weeks, the compliance team flags that the bot occasionally gives incorrect account balance information and cites wrong policy clauses. How do you remediate this?
Your streaming platform wants to test a new thumbnail algorithm that might increase click-through rate. However, thumbnails also affect watch time and subscriber growth. How do you design an A/B test that handles multiple correlated metrics?
A startup asks you to design an ML pipeline for predicting customer lifetime value (LTV) for a SaaS product with 10,000 customers and monthly billing data. They have no MLOps infrastructure. What do you build?
Your model serving infrastructure processes 50,000 requests per second at peak. You need to deploy a new model version with zero downtime and the ability to instantly roll back. Describe your deployment strategy.
You're forecasting hourly electricity demand for a national grid operator. The series has daily, weekly, and annual seasonality, plus external drivers like temperature and public holidays. How do you model all these components jointly?
You use ARIMA to forecast weekly web traffic for a news site. The model works well normally but produces wildly wrong forecasts during major news events like elections. How do you handle this?
Your team is auditing a hiring algorithm used by a large tech company. You find that equalized odds is violated: the model's TPR for women is 62% vs. 78% for men. The company argues overall accuracy is equal. How do you respond and what remediation do you recommend?
You need to run hyperparameter tuning over 500 configurations of a neural network. Each training run takes 2 hours on a single GPU. How do you organize and execute this efficiently?
You're analyzing funnel drop-off in Snowflake for a SaaS product with steps: signup, email verification, profile setup, first action, subscription. Users can take days between steps. Write the query design for computing per-step conversion and median time-to-complete.
You're told your logistic regression model has p-values below 0.001 for all 15 features. A senior colleague says this is suspicious. Why might this be and what does it actually mean?
You're comparing five ML models on a leaderboard for an internal competition using test set accuracy. A team member updates their submission 30 times before the deadline, improving from 81% to 91%. Why is this a problem and how should the competition be redesigned?
You're training a sentiment classifier using a pre-trained BERT model. After fine-tuning for 10 epochs you notice validation accuracy peaked at epoch 3 and has been declining since. What's happening and what do you change?
You're building a named entity recognition (NER) system to extract drug names and dosages from clinical notes. You have 3,000 labeled sentences. BERT-base NER achieves F1=0.71. How do you improve performance?
You're running an A/B test on a pricing page. After 2 weeks the control conversion rate is 4.2% and variant is 4.5%. The test has 10,000 users per arm. Is this result statistically significant and what is your recommendation?
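The significance part of this question can be checked directly with a pooled two-proportion z-test. The sketch below assumes 4.2% and 4.5% of 10,000 correspond to 420 and 450 conversions and uses only the standard library; the function name is illustrative.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

z, p_value = two_proportion_z_test(420, 10_000, 450, 10_000)
print(round(z, 2), round(p_value, 2))
```

With these numbers z is about 1.04 and the two-sided p-value is roughly 0.3, so the observed difference is well within what chance alone would produce at this sample size.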
A ride-sharing company wants to predict surge pricing in real time across 500 cities simultaneously. The model must incorporate supply (driver locations, availability) and demand (ride requests, events) signals updated every 60 seconds. Design this system.
After deploying a new recommendation model, user engagement metrics start declining gradually over 3 weeks. No errors appear in logs. How do you investigate and what monitoring would have caught this earlier?
Your team trains a content moderation model on historical human-moderated decisions. You notice the model flags AAVE (African American Vernacular English) text at 2x the rate of Standard American English for the same underlying content. How do you address this?
You need to merge two DataFrames: one with 5 million customer records and one with 50 million transaction records, joining on customer_id. The merge is taking 8 minutes. How do you speed this up?
You're asked to find the second highest salary in each department from an employees table. Write the approach and explain why a naive LIMIT 1 OFFSET 1 fails in this context.
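A sketch of one common approach, run through Python's sqlite3 for convenience. The employees columns (name, department, salary) and the rows are assumptions. DENSE_RANK partitioned by department handles both issues with LIMIT 1 OFFSET 1: that clause applies once to the whole result set rather than per department, and with tied top salaries it can return the top value again.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER);
INSERT INTO employees VALUES
  ('Asha', 'A', 100), ('Ben', 'A', 100), ('Chen', 'A', 90), ('Dev', 'A', 80),
  ('Eli',  'B', 70),
  ('Fay',  'C', 50),  ('Gus', 'C', 40);
""")

SECOND_HIGHEST_SQL = """
WITH ranked AS (
  SELECT department, salary,
         DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS salary_rank
  FROM employees
)
SELECT DISTINCT department, salary
FROM ranked
WHERE salary_rank = 2
ORDER BY department
"""

second_highest = conn.execute(SECOND_HIGHEST_SQL).fetchall()
print(second_highest)
```

Department A has two people tied at 100, so the second-highest distinct salary is 90. Department B has a single employee and therefore returns no row.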
You're building a price prediction model for used cars. The 'car model' feature has 4,200 unique values with many rare models seen only once in training. How do you encode this high-cardinality categorical feature?
A hospital uses your sepsis early-warning model in the ICU. A nurse reports the model fires alerts every 2-3 hours per patient, causing staff to ignore them. How do you redesign the alerting system without reducing true positive recall?
You fit a linear regression model and find that your residuals are autocorrelated (Durbin-Watson statistic of 1.2). What does this mean and how do you fix it?
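For intuition on the statistic itself: Durbin-Watson is the sum of squared successive residual differences divided by the sum of squared residuals, and for large samples it is approximately 2(1 - r), where r is the lag-1 autocorrelation of the residuals. A minimal sketch with toy residuals:

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: near 2 means little lag-1 autocorrelation."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator

def implied_lag1_autocorrelation(dw):
    """Large-sample approximation DW ~ 2 * (1 - r), solved for r."""
    return 1 - dw / 2

alternating = [1.0, -1.0, 1.0, -1.0]   # toy residuals, negatively correlated
persistent = [1.0, 1.0, 1.0, 1.0]      # toy residuals, positively correlated

dw_alternating = durbin_watson(alternating)
dw_persistent = durbin_watson(persistent)
r_for_question = implied_lag1_autocorrelation(1.2)

print(dw_alternating, dw_persistent, round(r_for_question, 2))
```

Under that approximation, a value of 1.2 corresponds to a positive lag-1 autocorrelation of about 0.4.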
Your e-commerce company ran a 90-day pricing experiment. The p-value is 0.03, but your finance partner says revenue impact is only $2K/month versus a $500K engineering cost to productionize. How do you reconcile statistical significance with business significance, and what do you recommend?
A healthcare startup's logistic regression model outputs probabilities, but the calibration plot shows predicted probabilities cluster around 0.6-0.8 even for patients who rarely get the disease (true prevalence 5%). Why does this happen and how do you fix it?
You are analyzing a 3-year panel dataset of retail store sales across 200 locations. A naive OLS regression finds that adding more staff increases sales by $400/hour. Your manager wants to use this to justify hiring. What causal identification problem exists and how would you address it?
You're building a churn prediction model for a SaaS product. Your raw data includes user event logs with timestamps. The product team insists on using 'days since signup' as a feature, but you suspect it introduces target leakage. Walk through your analysis.
A fintech's fraud detection model is trained monthly on the prior 6 months of transactions. After 3 months in production, precision drops from 87% to 61% with no change in recall. The fraud team reports fraudsters are using new merchant category codes not seen in training. Design a feature strategy that degrades gracefully against distribution shift.
Your recommendation system for a streaming platform uses user-item interaction history, but 60% of your catalog consists of items with fewer than 10 interactions ('cold items'). Standard collaborative filtering performs poorly here. What feature engineering approaches bridge the cold-start gap?
You trained a gradient boosting classifier for loan default prediction. AUC-ROC is 0.91, but when you deploy to production the business team says the model approves too many bad loans. What evaluation metrics should you have used and how do you fix this now?
Two ML teams at your company are debating model selection for a medical image classification task: one team favors a fine-tuned ResNet-50, the other a custom 3-layer CNN trained from scratch. Both achieve 94% test accuracy. How do you break the tie and which would you choose?
You're selecting a model for a real-time bidding system where predictions must complete in under 5ms and you process 500,000 requests per second. Your offline evaluation shows XGBoost at AUC 0.84 and a deep neural network at AUC 0.87. Walk through your production model selection decision.
You're training a random forest on a dataset with 500 features to classify customers as high or low lifetime value. Training AUC is 0.95 but validation AUC is 0.72. Feature importance shows the top 3 features account for 80% of total importance. What do you do?
A supply chain company wants to predict whether a shipment will be delayed, but 95% of shipments arrive on time. You try logistic regression, and it achieves 95% accuracy by predicting 'no delay' for everything. How do you build a model that's actually useful?
You've inherited a mortgage default model from a team that used gradient boosting with 200 features. Regulators now require the model to be 'explainable' in writing to individual applicants. How do you handle this without starting from scratch, and what are the risks of your approach?
You ran k-means clustering on customer purchase data and got k=5 clusters. Your marketing manager asks 'are these the real customer segments?' How do you evaluate the quality of these clusters and give an honest answer?
You're clustering 50 million user sessions for a news platform to create behavior-based audience segments. In-memory algorithms like k-means fail due to memory constraints. How do you scale clustering to this volume without losing cluster quality?
A telecom's network operations team uses clustering to detect anomalous tower behavior. They complain that k-means misses rare failure patterns affecting fewer than 0.1% of towers. What unsupervised approach better handles this use case, and how do you tune it?
You're training a ResNet-50 on a medical imaging dataset of 15,000 images. After 20 epochs, training loss is 0.08 but validation loss starts increasing after epoch 8. You've tried dropout 0.5 and weight decay 1e-4. What else do you try and in what order?
You're debugging a multi-layer neural network where gradients are vanishing—the loss barely changes after epoch 1, and gradients in the first two layers are on the order of 1e-8. Your network uses sigmoid activations and 12 layers. What do you change?
Your company wants to train a large language model for internal code search with 200GB of proprietary code. You have a budget of 8 A100 GPUs for 2 weeks. How do you design the training strategy to make the most of these constraints?
You built a BERT-based sentiment classifier for customer reviews, achieving 91% accuracy on a balanced test set. After deploying it to analyze support tickets, the model consistently misclassifies technical complaints ('the API returns 500 errors') as neutral, missing negative sentiment. How do you diagnose and fix this?
You're building a named entity recognition system for a legal document platform. Your training data has 10,000 labeled sentences, but the entity types ('contract clause', 'liability term') rarely appear in general NLP pre-training corpora. How do you maximize performance with limited data?
A retail company deploys a RAG-based product Q&A system using OpenAI embeddings and a vector database. After 2 months, customers complain the system gives outdated answers about product availability despite the database being refreshed nightly. What are the root causes and how do you fix them?
Your e-commerce platform's A/B test shows a new checkout flow increases conversion rate by 2.3% (p=0.01). However, your data scientist colleague argues the result is invalid because users were exposed to both variants over multiple sessions. Explain the problem and propose a valid experimental design.
You're running an A/B test on a recommendation algorithm. After 5 days, conversion rate is 4.1% in control and 4.6% in variant (p=0.04). Your product manager wants to call it and ship. What concerns do you raise?
A gaming company runs experiments on matchmaking algorithms, but the network effects make user-level randomization invalid—changing matchmaking for 10% of users affects everyone in their matches. How do you design a valid experiment in this network-effect setting?
Design an ML system for a rideshare company that predicts driver supply and rider demand at 15-minute intervals for the next 2 hours across 500 city zones, to inform surge pricing. What features, models, infrastructure, and monitoring would you build?
A bank wants to build a real-time credit limit increase recommendation system. Customers should receive proactive limit increase offers within 30 seconds of a qualifying event (e.g., salary credited). Design the end-to-end ML and data pipeline.
You're asked to build a content moderation ML system that flags toxic comments on a user-generated content platform with 5 million comments per day. You have a team of 10 moderators. How do you design the human-in-the-loop workflow?
You're deploying a deep learning model for real-time image classification on edge devices (mobile phones) in a developing market with intermittent connectivity. The model is 450MB and takes 800ms on a mid-range phone. How do you make this deployable?
Your ML model serving layer on Kubernetes has P99 latency of 250ms on Monday mornings due to traffic spikes. The average latency is 45ms. How do you diagnose the cause and what solutions do you implement?
Your team is debating between a shadow mode deployment and a canary deployment for rolling out a new fraud detection model in a payments company. The old model has been in production for 2 years. Make the case for the right approach and describe what you monitor.
A manufacturing plant's equipment failure prediction model was built using 3 years of sensor data. After 6 months in production, it starts generating false alarms 3x more frequently than at launch, while missing real failures. The plant has added no new equipment. What do you investigate?