MLOps Engineer Interview Questions
200 scenario-based questions with detailed model answers, organized skill-wise and tool-wise.
Your e-commerce recommendation model serving 50K RPS suddenly spikes to 800ms p99 latency after a routine model update. The model size grew from 200MB to 1.2GB. Kubernetes pods are not OOMKilled. Walk me through your diagnosis and remediation within a 30-minute SLA window.
A fraud detection model at a fintech company starts producing alerts at 3x the normal rate on a Monday morning. Business says no product change shipped. Your feature store shows all features are populated. How do you determine if this is concept drift, data drift, or a pipeline bug within two hours?
You inherited an Airflow DAG that runs a nightly model retraining pipeline. It fails silently about 20% of the time — no alerts fire, the DAG shows green, but the model artifact produced is identical to the previous day's version. How do you fix this?
Your team uses Feast with Redis for online serving and BigQuery for the offline store. The data science team reports that models trained offline score differently in production for the same entity IDs, even after controlling for data freshness. Describe how you would diagnose and fix this training-serving skew.
A data scientist on your team ran 200 experiments in MLflow over three weeks, trying different architectures for a churn prediction model. Now they need to identify the top-5 runs by validation AUC, reproduce the exact environment of run #47, and promote the best model to staging. Walk through the exact steps.
Your ML CI pipeline runs model training end-to-end on every pull request, taking 45 minutes and costing $800/month on GPU spot instances. The team has 12 engineers pushing 3-5 PRs daily. Design a tiered CI strategy that keeps feedback fast and costs under $200/month without sacrificing safety.
Your team uses the MLflow Model Registry and has 15 models in production across 4 environments. A compliance audit requires you to demonstrate that every model in production can be traced back to the exact training data, code commit, and experiment run that produced it. What gaps do you expect to find and how do you close them?
A teammate accidentally ran dvc push without committing the corresponding dvc.lock changes to Git, and now the main branch is inconsistent: the remote DVC cache has data that no Git commit references. Three other engineers are blocked. How do you recover?
You are running distributed PyTorch training jobs on Kubernetes using Kubeflow's PyTorchJob operator. Jobs complete successfully in a dev cluster but fail with NCCL timeout errors in prod after about 20 minutes of training. The clusters have the same node types. Describe your debugging approach.
Your team serves a BERT-base model for document classification via TorchServe on A10G GPUs. GPU utilization sits at 12% even during peak traffic, yet p95 latency is 180ms, well above the 50ms SLA. The product team says buying more GPUs is not an option. What do you do?
You need to roll out a new recommendation model for a streaming platform's homepage. The model was trained on six additional months of data and shows a 4% lift in offline NDCG. Design a canary deployment strategy that validates online metrics before full rollout, accounting for novelty bias in user behavior.
A data scientist reports that running the exact same training script with the same data on two different machines produces models with a 3% accuracy difference. The script sets all random seeds. How do you track down the source of non-determinism?
Your ML platform serves 20 models across 8 teams. You have basic infrastructure metrics (CPU, memory, latency) but no model-level observability. The CTO wants a single dashboard showing model health across all teams within one sprint. What do you build, in what order?
Your ML training and inference infrastructure costs $45K/month on AWS. Leadership wants a 40% reduction without degrading model quality or SLAs. You have three months. What's your structured approach?
You need to deploy a scikit-learn model as a REST API that handles 500 RPS with a 30ms p99 latency SLA. The model is a gradient boosted tree with 300 estimators. You have a single 4-core VM. How do you architect the serving stack?
You are migrating a team from ad-hoc Jupyter notebooks to a production Kubeflow Pipelines setup. The team has 6 data scientists, none of whom have used KFP before. Training currently takes 2 hours and runs manually once a week. Design the migration strategy and the final pipeline architecture.
Your team is designing a feature store for a ride-sharing platform. You need to serve driver location and traffic features at under 5ms for surge pricing decisions. The team is debating between building on Redis vs. Apache Cassandra. How do you drive the decision?
You are setting up drift monitoring for a loan approval model. The model uses 45 input features including income, credit score, employment type, and loan amount. You have 30 days to implement monitoring from scratch. What do you build first and why?
You need to set up a GitHub Actions workflow that trains a model, evaluates it against the current production model, and only promotes it if it beats the champion by at least 1% on F1. The model trains in 40 minutes on GPU. How do you structure the workflow?
A Kubeflow training job is consuming all GPU resources on the cluster, blocking other teams' workloads for six hours. The cluster has 8 A100 nodes. How do you implement resource isolation going forward without requiring teams to change their job specs?
You want to shadow-test a new NLP model for customer support ticket classification before exposing it to any real users. The current model classifies 10,000 tickets per day. Describe the shadow deployment architecture and what you measure.
Six months after a model was deployed to production, a regulatory body requests a complete reproduction of the model training process — exact data, code, and environment. The original engineer has left the company. The MLflow run exists but the model card is sparse. Walk through your investigation and reproduction attempt.
You're designing a multi-tenant model serving platform for an enterprise SaaS company. Each tenant needs model isolation, separate SLAs (gold: 50ms p95, silver: 200ms p95), and cost attribution per tenant. You have a shared GPU cluster. How do you architect this?
Your organization has three teams each running their own MLflow tracking server. The model registry is fragmented — the same model family exists in all three servers under different names. Leadership wants a unified model catalog with cross-team discoverability. How do you consolidate without disrupting active experiments?
Your ML team processes 500GB of training data monthly. Every retraining run requires the full dataset, but 90% of the data is unchanged between runs. DVC pulls are taking 45 minutes and blocking pipeline runs. How do you optimize?
After an incident where a model's feature pipeline silently started producing NaN values for 6 hours before anyone noticed, your team is asked to implement real-time data quality monitoring on the feature pipeline. The pipeline processes 200,000 records per hour. Design the solution.
Your team runs a daily batch inference job that scores 10 million customer records for a marketing propensity model. The job currently runs on 20 on-demand p3.2xlarge instances and takes 3 hours, costing $900/run. Reduce the cost by at least 50% without missing the morning delivery SLA.
Your Airflow DAG orchestrates a 12-step ML pipeline. The fourth step (feature engineering) is the most expensive at 2 hours. When the pipeline fails at step 10, the team restarts from step 1, wasting 3 hours every time. The failure rate at step 10 is 15%. Design a robust retry and resumption strategy.
Your team is debating whether to implement point-in-time correct feature retrieval for an NLP model that predicts customer churn. Some engineers argue it adds unnecessary complexity. Make the case for or against it, citing specific scenarios where getting it wrong causes business harm.
Your team has a model that powers a critical pricing engine. The model is updated weekly. A post-deployment incident revealed that last week's model was promoted without any human approval because the CI pipeline had a bug in the gate condition. Design a model governance framework that prevents unauthorized promotions.
Your team deploys a computer vision model using PyTorch on NVIDIA T4 GPUs. During load testing you notice that increasing batch size from 1 to 32 improves throughput by 8x but increases p99 latency from 12ms to 95ms. The SLA requires both high throughput (1000 RPS) and under 50ms p99. How do you resolve the tension?
You are rolling out an updated credit scoring model at a bank. Regulatory requirements mandate that for any individual applicant, you must be able to explain why they received a different decision versus the old model. Design the canary strategy accounting for these explainability requirements.
Two data scientists ran the same experiment independently using different laptops and got AUC scores of 0.84 and 0.81 on the same dataset. Management is questioning the team's process. What systemic changes do you implement to prevent this from happening again?
Your team's model serving platform processes 5 million inferences per day across 12 models. You need to design a distributed tracing strategy that lets you trace a single inference request end-to-end from the API gateway through feature retrieval, model scoring, and post-processing, with under 1% performance overhead.
You are building a Kubeflow Pipeline for a time-series forecasting model that trains on the last 12 months of data. How do you handle the data partitioning to avoid temporal leakage while also ensuring the pipeline is idempotent and can be re-run for any target date?
You run a content moderation model that classifies social media posts. The model performs well on English content but your platform just expanded to three new markets: Portuguese, Hindi, and Japanese. You're receiving complaints about inconsistent moderation decisions. Design a multi-lingual drift and performance monitoring strategy.
Your team wants to implement model performance regression testing in CI that catches accuracy drops before deployment. The full test set has 500,000 samples and takes 90 minutes to evaluate. Balancing speed and sensitivity, design a practical CI evaluation strategy.
You need to serve 15 different models (different architectures, sizes, and frameworks) efficiently on a small cluster of 4 GPUs without spinning up 15 separate serving processes. Walk through your multi-model serving design.
A data scientist accidentally logged 10,000 runs into the wrong MLflow experiment over the past month. The runs need to be moved to the correct experiment without losing any metadata, metrics, or artifacts. How do you do this programmatically?
Your team wants to implement autoscaling for model inference workloads on Kubernetes. The traffic pattern is highly bursty — 10x spikes for 15-minute windows during business hours. Standard HPA based on CPU is too slow to respond. Design a custom autoscaling solution.
Your team wants to track not just raw data but also intermediate pipeline outputs (featurized data, train/test splits) in DVC to make pipeline stages independently reproducible. The intermediate files are 50-200GB each. Design the DVC pipeline configuration.
You deployed a new version of a recommendation model as a canary at 5% traffic. After 2 hours you notice the canary has 0.3% higher click-through rate but also 15% higher inference latency. How do you decide whether to continue rollout, pause, or roll back?
You are onboarding a new ML use case to an existing Feast feature store. The new model for predicting delivery ETA needs real-time driver location features updated every 10 seconds and historical trip features updated nightly. How do you design the feature view configuration?
Your team has an online feature store backed by Redis running 24/7 for a model that serves 200 RPS during business hours but drops to 5 RPS overnight. The Redis cluster costs $3,000/month. Design a cost-optimized architecture that maintains the serving SLA.
Your team needs to serve a large language model (13B parameters) for document summarization at under 2 seconds p95 latency. You have a budget for 4 A100 GPUs. The model currently runs at 8 seconds p95 on a single A100. Walk through your optimization and serving architecture.
Your team is introducing a new requirement: every model deployed to production must have a model card committed to the repository. You need to enforce this automatically without adding manual steps for the data scientists. How do you implement this?
After a production model starts generating garbage predictions silently for 48 hours before anyone notices, your team is mandated to implement alerting. You have Prometheus and Grafana already deployed. Design three distinct alert rules that would have caught this incident.
Two teams are using the same MLflow Model Registry. Team A has a model named 'UserSegmentation' in production. Team B wants to register a completely different model with the same name for a different use case. How do you resolve the naming conflict and establish a naming governance policy?
You are profiling a PyTorch inference pipeline and find that GPU utilization is 85% but p99 latency is still above SLA. The GPU is not the bottleneck. What are the top three places to look and how do you instrument each one?
Your company is preparing for SOC 2 Type II audit. The auditors want evidence that your ML models were trained on authorized data only, that training jobs are isolated from each other, and that any given model version can be reproduced within 30 days of its training date. How do you design the evidence trail?
You are asked to build a real-time ML pipeline that processes streaming data from Kafka, runs feature computation, and updates a feature store every 30 seconds. Airflow is your current orchestration tool. A colleague suggests using Airflow for this too. Evaluate the approach and recommend an architecture.
Your team deployed a model six months ago that used to perform at 0.89 AUC. The model has not been retrained since. Business metrics have degraded but the model's confidence scores look normal. The drift detection system shows no feature drift. Explain what could be happening and what you would investigate.
Your team stores 50TB of ML artifacts (model checkpoints, training datasets, feature exports) in S3. The monthly storage bill is $1,200. An audit reveals that 60% of stored objects are over 2 years old and were never accessed again after initial training. Design a cost reduction plan.
A data scientist is frustrated that their PyTorch training job crashes with OOMKilled after 45 minutes, but GPU memory metrics from nvidia-smi show only 60% utilization at the time of the crash. Diagnose the issue.
Your e-commerce recommendation model running on TensorFlow Serving starts showing p99 latency of 800ms instead of the usual 120ms after a routine Kubernetes node pool upgrade. No code changed. How do you triage and resolve this within the SLA window?
A data science team wants to deploy a PyTorch model for real-time fraud detection at a fintech startup. They have no existing serving infrastructure. You need to have something production-ready in two weeks. What stack do you pick and why?
Your Kubeflow Pipeline for a weekly churn-prediction model retrain is silently producing stale models — it completes successfully but the output model artifact is identical to last week's. No errors appear in the UI. How do you diagnose and fix the root cause?
You are migrating a batch ML scoring pipeline from a weekly cron script to Apache Airflow. The script reads from Snowflake, runs a scikit-learn model, and writes predictions to PostgreSQL. A data engineer says Airflow is overkill. How do you make the case and structure the DAG?
Your company uses Feast for online feature serving. During a production incident, engineers discover that the fraud model is reading feature values that are 47 minutes stale due to a materialization job backlog. Transactions worth $2M were misscored. Walk through how you prevent this class of incident.
A startup is building its first feature store using Feast with Redis online store and BigQuery offline store. A data scientist asks why they need both — why not just query BigQuery for everything including online serving? How do you explain the architectural trade-off?
Six weeks after deploying a credit scoring model at a lending platform, the approval rate drops 18% with no model change. The business team is escalating. You suspect covariate shift. Walk through your investigation and the decision tree for remediation.
You are implementing model monitoring for a demand-forecasting model at a grocery chain. The model has 40 input features and predicts 14-day demand. Your manager asks you to pick a single metric to monitor. What do you choose and how do you set the alert threshold?
A data scientist on your team has been running experiments for three months and storing results in a shared Jupyter notebook. You are asked to migrate to MLflow. The scientist is resistant, saying MLflow adds overhead. How do you make the migration smooth and prove the value?
Your organization runs MLflow at scale with 50 data scientists, 200 active experiments, and a PostgreSQL backend. Experiment queries have become slow — simple run comparisons take 45 seconds. You need to fix this without migrating off MLflow. What do you do?
Your ML team's GitHub Actions pipeline takes 55 minutes per merge to main, mostly in model training and evaluation steps. Engineers are merging PRs once a day instead of continuously. How do you redesign the CI pipeline to get merge-to-deploy under 15 minutes?
You join a team where data scientists push model code directly to the main branch and manually deploy by copying files to a server. You need to introduce a CI/CD pipeline without disrupting their workflow. What is your 30-day plan?
Your company has three teams using the same MLflow Model Registry with 200+ registered models. A production incident reveals that a model in the 'Production' stage was silently overwritten by a different team's model with the same name. How do you redesign the registry governance?
A data scientist trained a model three months ago, registered it in MLflow, and it has been in production since. They now want to retrain with new data but are worried about losing the old model. Explain model versioning semantics and the safe workflow.
Your ML team uses DVC with an S3 remote for all training datasets. During a model debugging session, you discover that two engineers ran the same DVC-tracked pipeline on what they believed was identical data, but got different model performance. The DVC hashes match. What happened?
You inherit a project where training data is stored as 50GB CSV files on a shared NFS drive with no versioning. Multiple team members modify the files in place. You are tasked with introducing DVC without breaking the current workflow. How do you proceed?
You are running 30 inference pods on GKE. During a load test, you observe that five pods are receiving 80% of the traffic while 25 pods are nearly idle. The Kubernetes Service is using the default kube-proxy load balancing. Why is this happening and how do you fix it?
A data scientist wants to run a GPU training job on your Kubernetes cluster. They submit a pod spec requesting 4 GPUs, but the pod stays in 'Pending' state indefinitely. The cluster has GPU nodes with available capacity. How do you debug this?
Your BERT-based document classification model running on an A100 GPU has GPU utilization at 12% and throughput of 45 requests/second. The product team needs 400 requests/second from the same hardware. Walk through your optimization approach.
You are deploying a fine-tuned LLaMA-7B model for an internal documentation Q&A tool at a mid-size tech company. Budget is limited — you have one A10 GPU. A developer asks why you are not using the full FP32 model. Explain your quantization choices.
You are rolling out a new recommendation model at a streaming service with 50M daily active users. The new model has better offline AUC but the product team is nervous about click-through rate in production. Design a canary strategy that gives statistical confidence while limiting blast radius.
A team wants to shadow-deploy a new NLP model for customer support ticket routing. The current model runs in production. Explain how you implement shadow mode technically and what pitfalls to watch for.
A regulatory audit at your financial services firm requires you to reproduce the exact predictions made by a credit model 18 months ago on a specific batch of loan applications. The model was trained in 2023 and has since been retrained. How do you approach this?
Two data scientists ran the same experiment code on the same data with the same hyperparameters but got different test AUC results (0.847 vs 0.851). They are arguing about whose result is correct. How do you diagnose the non-determinism?
Your ML serving platform runs 15 models across three business units. The SRE team is asking for a unified observability solution that covers both infrastructure metrics and model-quality metrics in one place. How do you design this and what are the key instrumentation points?
You are responsible for a churn prediction model that scores 500K customers nightly. The data science team says the model 'feels wrong' based on customer feedback but the batch job completes successfully with no errors. What observability do you add?
Your organization's AWS ML infrastructure bill hit $280K last month. The CFO is asking for a 40% reduction within 90 days without reducing model coverage or SLAs. Walk through your cost audit and reduction strategy.
Your team runs four separate Kubernetes clusters for ML workloads (dev, staging, prod, and a dedicated training cluster). A cloud architect suggests merging them into two clusters to cut costs. What are the trade-offs and how do you decide?
Your team is moving from a monolithic FastAPI inference service to a microservices architecture where each model is a separate service. After the migration, end-to-end prediction latency jumps from 95ms to 340ms. No individual model service got slower. Diagnose and resolve.
You are designing an Airflow-based ML pipeline for a healthcare AI company. The pipeline must comply with HIPAA: no PHI in logs, all data encrypted at rest and in transit, and an audit trail of who accessed what data and when. How do you architect the pipeline?
Your team is debating whether to build a custom feature store or adopt Tecton for a real-time ML platform. You have 20 models in production, a Kafka streaming infrastructure, and a Spark batch ETL pipeline. Make the build-vs-buy recommendation with concrete criteria.
A natural language processing model for intent classification at a telecom call center starts degrading silently after a product rebrand. The model accuracy drops from 0.91 to 0.73 over six weeks without any alert firing. How do you implement retroactive monitoring and prevent recurrence?
You are designing an experiment tracking system for a team of 15 data scientists who run 50-100 experiments per day. After six months, the MLflow artifact store on S3 has grown to 40TB and costs $1,200/month. How do you reduce costs without losing reproducibility?
Your model retraining pipeline is fully automated and triggers on new data. Last Tuesday, a corrupted data batch caused the pipeline to retrain and auto-deploy a model with 0.61 AUC (down from 0.87). The bad model served production traffic for 11 hours before anyone noticed. Design the safeguards.
Your team has model versions stored in MLflow Registry, Docker images in ECR, and Helm chart values in a Git repo. A junior engineer asks how to roll back the production model to the previous version. Walk through the full rollback procedure.
Your data science team uses DVC but an external data vendor delivers weekly CSV updates by overwriting the same S3 path. Every week DVC reports the data as changed even when the content is identical. The team wastes GPU time retraining on unchanged data. How do you fix this?
Your ML platform team needs to support both GPU training jobs (bursty, can wait) and real-time inference pods (constant load, latency-sensitive). Both run on the same GKE cluster. How do you architect node pools, scheduling, and resource quotas to prevent them from interfering?
A vision model for medical image analysis has strict latency requirements of under 200ms per scan. Running on a T4 GPU, batch size 1, it takes 380ms. You cannot change the model architecture. What optimization techniques do you apply and in what order?
You are deploying a new pricing model at a retail company. Business stakeholders insist on zero risk — they want to test the model before any customer sees a new price. How do you structure the validation without a live canary?
A data scientist reports that a model they trained six months ago cannot be reproduced. The original training script exists in Git but the model metrics are different when rerun. Describe your systematic checklist for tracking down the irreproducibility.
Your company runs 25 ML models in production. After a major Kubernetes version upgrade, three models started producing incorrect predictions but no alerts fired. The issue was traced to a subtle preprocessing bug introduced by an update to a transitive library dependency. How do you design observability to catch this class of issue?
Your team runs a weekly batch scoring pipeline on a cluster of 50 large EC2 instances. The pipeline runs 6 hours every Sunday morning and the instances sit idle the other 162 hours per week. Your manager asks for a cost-optimized solution.
A machine learning model for document processing is deployed behind a REST API. Under load testing, it handles 200 requests per second fine, but at 250 requests per second the error rate jumps to 40% with HTTP 503 responses. How do you diagnose and fix this capacity issue?
Your Airflow DAG for ML preprocessing is failing intermittently with OOM errors on a 16GB RAM node. The task reads a 4GB Parquet file and performs feature engineering. Memory usage should be under 12GB but spikes to 20GB+ during a join operation. How do you debug and fix this?
A junior data scientist on your team discovers they can get slightly better model performance by computing a feature differently from its feature store definition. They propose maintaining two versions of the feature. What is the risk and how do you handle this?
You join a company with a loan default prediction model that has been in production for two years with no monitoring. The business team reports model accuracy has been 'declining for a while.' How do you quickly assess the current state without access to real-time monitoring infrastructure?
You notice that data scientists on your team are logging 150+ metrics per training run in MLflow, including per-class precision and recall for a 100-class classification model. The MLflow UI becomes unusable for comparison. How do you improve the experiment tracking hygiene?
Your team's GitHub Actions workflow for model training and testing uses a 40GB Docker image that takes 18 minutes to pull before every CI job. How do you reduce the Docker image overhead in CI?
You are building an ML platform for a large enterprise with five separate business units, each with their own data governance and model ownership requirements. Design a multi-tenant model registry architecture.
Your team's DVC pipeline takes 4 hours to run end-to-end: 30 minutes data preprocessing, 3 hours training, and 30 minutes evaluation. A data scientist just changed one line in the evaluation script. How does DVC's caching help, and what are its limits?
You need to run a distributed PyTorch training job across 4 nodes with 8 GPUs each. You choose Kubernetes with the PyTorch Operator (kubeflow/training-operator). Walk through the configuration and the common failure modes.
A team deploys two separate model endpoints on the same A100 GPU server — one for sentiment analysis (low traffic) and one for NER (high traffic). GPU utilization for sentiment analysis is 3% and NER is 65%. A product manager asks if you can add a third model. Explain MIG and how it helps.
You are running a canary at 5% traffic for a new recommendation model. After 48 hours, the canary model shows CTR +2.3% (positive) but average session length -1.8% (negative). The A/B testing dashboard shows both are statistically significant. How do you make the promotion decision?
Your team wants to publish a research paper based on results from your production ML system. A reviewer asks for a complete reproducibility package. What does this include and how do you generate it from your existing MLOps infrastructure?
You are setting up logging for a real-time fraud detection ML service running on Kubernetes. The security team requires that no personally identifiable information appears in logs, but the ML team needs enough context to debug model failures. How do you balance these requirements?
Your team has 15 ML experiments running simultaneously on a shared GPU cluster. You notice that three experiments have been running for 96 hours with no progress — the loss curve flatlined at hour 8. The cluster costs $12/hour. How do you address this operationally and prevent it going forward?
You are deploying a multi-modal model (vision + text) that requires two different preprocessing pipelines running on different hardware (GPU for image preprocessing, CPU for text tokenization). The product team wants a single unified API endpoint. How do you architect the serving layer?
Your Airflow ML pipeline triggers a retraining job whenever new data arrives in S3. Last week, 47 small files arrived within 5 minutes, triggering 47 separate retraining runs that overwhelmed the training cluster. How do you redesign the trigger logic?
A global e-commerce company runs a recommendation model on TensorFlow Serving behind an NGINX load balancer. Black Friday traffic spikes 40x and p99 latency jumps from 18ms to 620ms. The model is 2.1GB. Walk through your triage and resolution approach.
Your team uses FastAPI to serve a scikit-learn churn model. The data science team wants to deploy a new XGBoost version. A junior engineer suggests just overwriting the pickle file on the server and restarting the container. What is wrong with this approach and how do you fix it?
A fintech company's Kubeflow Pipeline for credit-risk model retraining runs nightly and occasionally fails silently — it completes with a green status but the deployed model has lower AUC than the previous version. The team only discovers this after complaints from the risk desk the next morning. Diagnose and fix.
Your Airflow DAG for a retail demand-forecasting pipeline has a task that runs a Python training script, but it keeps failing with memory errors on the Airflow worker node, which only has 8GB RAM. The model trains fine locally on a 32GB laptop. What is your approach?
A ride-hailing company uses Feast as a feature store. The online store (Redis) serves real-time predictions, but data scientists report that features used during serving differ from features used during training by a subtle timestamp skew, causing the model's offline metrics to be 4% better than online. How do you debug and resolve this?
Your team is building a fraud detection system. A data scientist wants to add a feature: number of transactions by a card in the last 5 minutes. She plans to compute this in the inference service at request time with a database query. Why is this problematic, and what is the MLOps-correct architecture?
A healthcare imaging startup deploys a radiology triage model. Six months after launch, sensitivity drops from 94% to 87% with no model change. The engineering team suspects data drift. You have prediction logs but no access to ground truth labels in real time (radiologists review asynchronously). How do you monitor for drift without labels?
You inherit a production NLP sentiment model serving customer support tickets. The team has never set up monitoring. You have three weeks to implement a monitoring system before a major product launch doubles ticket volume. What do you build and in what order?
A team of 12 data scientists all use the same MLflow tracking server. After six months, the server has 50,000 runs, and queries are timing out. The artifact store (S3) has 4TB of logged model artifacts including many redundant checkpoints. How do you address performance and storage hygiene?
A data scientist on your team says she cannot reproduce a production model because her MLflow experiment was accidentally deleted. You have the model pickle in S3 and the training code in git, but the hyperparameters and dataset version used are lost. How do you prevent this in the future?
A logistics company has a route-optimization model that takes 6 hours to train on a 32-GPU cluster. The CI pipeline runs full training on every pull request, making the cycle time unworkable and costing $2,400 per PR. Design a tiered CI strategy that keeps quality gates without this cost.
Your team's GitHub Actions workflow for a text classification model fails intermittently with OOM errors during the testing phase. The test matrix includes a small unit test and an integration test that loads the full 1.2GB BERT model. How do you restructure this?
A media streaming company manages 40+ models across recommendation, ads, and content moderation in a shared MLflow Model Registry. The data science and platform teams disagree on promotion criteria — DS wants to promote any model with better offline metrics; platform wants a mandatory 7-day canary. How do you design a governance framework?
You are setting up an MLflow Model Registry for a team transitioning from an ad-hoc deployment process where model files were copied to servers manually. Three models are already in production with no tracking. How do you safely backfill these models into the registry?
A computer vision team uses DVC with an S3 remote for dataset versioning. The dataset is 800GB of labeled images. After a contractor incorrectly relabeled 15,000 images, the team needs to roll back the dataset to a specific git commit from three weeks ago. Walk through the exact steps and failure modes to watch for.
Your team's DVC pipeline for a tabular ML model has five stages: ingest, clean, featurize, train, and evaluate. A data engineer modified the cleaning step, but `dvc repro` re-runs all five stages instead of just the affected ones. Why, and how do you fix the pipeline definition?
A data science platform team runs training jobs as Kubernetes Jobs with GPU node pools on GKE. Engineers report that jobs frequently queue for 20-30 minutes waiting for a GPU node to provision, even though the cluster autoscaler is enabled. Diagnose and optimize the scheduling latency.
Your team deploys ML inference services as Kubernetes Deployments. A new model requires 12GB GPU memory, but all nodes have 16GB GPUs. After deploying 2 replicas, a third replica fails to schedule with 'Insufficient nvidia.com/gpu'. The cluster has 4 GPU nodes. Explain the failure and how you resolve it.
A real-time bidding platform runs a CTR prediction model on GPU. The model is a 3-layer MLP with 128-dim embeddings for 200 categorical features. Inference p99 latency must be under 5ms. Currently p99 is 14ms. The model is served via TorchServe on V100 GPUs with a batch size of 1. Walk through your optimization approach.
Your team wants to deploy a 7B-parameter LLM for internal document summarization. The model requires 14GB in FP16. You have access to two A10G GPUs (24GB each). A data scientist suggests just loading the model on one GPU and leaving the other idle. What are the trade-offs, and what is a better configuration?
A food delivery platform is deploying a new delivery-time estimation model. The previous canary rollout caused a bad user experience because the canary model predicted significantly longer times for certain cuisine types, leading to order cancellations. How do you design a safer canary strategy that catches cuisine-specific regressions early?
You are implementing a shadow deployment for a new credit-scoring model alongside the existing production model. Both models receive the same loan application data. A colleague suggests logging both predictions to the same database table to simplify analysis. What are the problems with this, and how do you structure the logging?
A regulatory audit requires your team to reproduce the exact predictions made by a fraud detection model deployed 18 months ago for a specific set of transactions. The model has since been replaced twice. How do you retrieve and run the historical model, and what gaps might make exact reproduction impossible?
A data scientist on your team reports that rerunning the same training script with the same dataset produces models with slightly different accuracy each time. The team needs reproducible training for compliance. What are all the sources of non-determinism you need to address?
A senior platform engineer at a telecom company notices that the ML serving cluster has high CPU utilization but low GPU utilization at the same time. The inference service serves a transformer model. This seems contradictory. What is happening and how do you diagnose it?
Your ML inference service has no structured logging. When a prediction goes wrong, the on-call engineer has to grep through raw application logs to find the input features, model version, and output for a specific request ID. Design an observability schema for ML inference that makes debugging tractable.
A startup's ML infrastructure bill is $85,000/month, primarily from always-on GPU inference servers serving a recommendation model that sees 95% of its traffic between 8 AM and 11 PM. The CTO wants to cut costs by 40% without degrading user experience. Design a cost reduction architecture.
Your team trains a large NLP model weekly using a 16-GPU cluster on AWS EC2. The training job takes 11 hours. An engineer suggests switching to spot instances to cut costs but is worried about job failure. How do you architect a fault-tolerant training setup that uses spot instances safely?
A trading platform serves a market-risk scoring model that must deliver predictions within 2ms at the 99.9th percentile. The model is a gradient boosted tree (XGBoost) with 500 trees and 8 input features. Current p99.9 is 11ms on CPU. GPU is not used because the model is too small. What are your optimization levers?
A machine learning platform team at a bank needs to run the same training pipeline for 200 different customer-segment models, each with a slightly different dataset and hyperparameter configuration. Running them sequentially takes 3 days. How do you parallelize this using Kubeflow Pipelines?
A retail company's data science team wants to compute a feature: customer's purchase frequency in the last 30 days, segmented by product category. The feature needs to be available for both online inference (latency < 50ms) and offline training (historical data going back 3 years). Describe the full feature store architecture for this feature.
An insurance company deploys a property valuation model. After a regional earthquake, the model begins underestimating claim values by 35% for affected properties. The monitoring system did not alert. Conduct a post-mortem on the monitoring gap and redesign the monitoring strategy.
A data scientist asks you to help them compare two models trained with different preprocessing pipelines on the same dataset. One used standard scaling, the other used quantile normalization. Both were tracked as MLflow runs, but their preprocessing code was not logged. How do you retroactively associate code versions with MLflow runs?
A data science team at a healthcare company wants to implement automated model retraining triggered by data drift. The compliance team requires every model update to have a human sign-off before going to production. Design a CI/CD workflow that satisfies both requirements.
Your company is consolidating three separately managed model registries (one each from data science, ML engineering, and the AI product team) into a single enterprise registry. Each team uses different metadata schemas, different promotion criteria, and different naming conventions. How do you execute this migration?
A data science team is starting a new NLP project with a dataset that is updated weekly by a data engineering team. They ask whether to version the dataset with DVC or just pull from S3 with a fixed prefix. Make the case for DVC and describe how the handoff workflow should work.
A platform team is deploying an online learning model that updates weights every 30 minutes based on incoming data. The model state is stored in a Kubernetes Pod's memory. A pod restart (from OOM, node eviction, or deployment) causes the model to lose recent learning and reset to the last checkpoint, which is 2 hours stale. Redesign the state management.
A research team at a pharmaceutical company wants to run molecular docking simulations using a graph neural network on 10,000 molecules per batch overnight. Currently each molecule takes 4 seconds on a single A100 GPU. Total time: 11 hours. They need to scale to 1 million molecules within the same overnight window. Redesign the computation.
You are deploying a new version of a content moderation model on a social media platform. The previous model had a false-positive rate of 3% (incorrectly flagging legitimate content). The new model reduces FPR to 1.5% but the recall drops from 96% to 93% on the evaluation set. How do you design the canary to validate the trade-off in production?
A data science team at a bank trains a credit model that passed model validation 8 months ago. The risk committee now requires the model to be retrained from scratch to verify that original results can be reproduced before they extend the model's use to a new product line. The original training environment was a single analyst's laptop. What is your worst-case assessment and recovery plan?
A platform team builds an MLOps platform used by 50 data scientists. The platform team has excellent infrastructure observability (pod CPU/memory, latency, error rates) but the data scientists complain they have no visibility into their model's business performance. How do you add ML-specific observability without giving data scientists access to Kubernetes or Prometheus internals?
A recommendation engine at a streaming service runs 15 separate ML models, each served by its own dedicated Kubernetes Deployment with GPU pods. Average GPU utilization across all 15 is 8%. The total GPU bill is $120,000/month. Design a model consolidation architecture.
You deploy a new version of an image classification model as a Kubernetes rolling update. During the rollout, users report inconsistent results — the same image sometimes returns different labels. The rollout is 50% complete. What is happening, and what are the immediate and long-term fixes?
Your team's Airflow-managed training pipeline starts failing every Monday morning. The failure is intermittent — sometimes it completes on the second attempt. Looking at the logs, you see database connection timeout errors at the data ingestion step. The pipeline runs on a weekly schedule. Diagnose the most likely causes.
A financial services firm wants to shadow-test a new credit risk model for 90 days before any production traffic is moved. The shadow model receives all inputs but its outputs are never used for decisions. After 45 days, the compliance team wants to see a statistical comparison report. Design the data collection and analysis framework.
Two data scientists on your team train the same model architecture with the same hyperparameters and claim to use the same dataset, but consistently get different evaluation metrics. You suspect the issue is in how the training/test split is performed. How do you investigate and fix this?
Your ML inference service returns HTTP 200 for every request, including ones where the model fails to compute a prediction and returns a default value. The ops team has zero visibility into how often this fallback is triggered. Describe the observability changes you would make.
A startup is spending $12,000/month on MLflow tracking server infrastructure and artifact storage. The tracking server runs on a dedicated EC2 r5.4xlarge and S3 stores 8TB of model artifacts and experiment data. A founder asks if this is reasonable for a 5-person data science team. Diagnose and right-size.
A data scientist asks you to set up alerting for their house price prediction model. They want an alert when 'the model starts performing badly'. How do you translate this vague requirement into concrete monitoring specifications?
A data science team wants to use GitHub Actions to automate testing of their scikit-learn pipeline. Currently, all tests run on a GitHub-hosted runner that has 2 vCPU and 7GB RAM. The full test suite including a training step takes 45 minutes and regularly times out. How do you restructure?
Your team is designing a feature store for a new loan approval model. A compliance officer says all feature transformations must be auditable and reproducible for any historical loan decision. How does this compliance requirement change your feature store design?
A data science team submits ad-hoc GPU training jobs as Kubernetes Jobs using kubectl. Platform engineers notice that a single user submitted 20 concurrent GPU jobs, starving other teams. How do you implement fair resource sharing without blocking legitimate batch workflows?
A real-time bidding platform's CTR prediction model starts degrading in performance every Friday afternoon without any model or infrastructure changes. Engineers suspect concept drift related to weekend user behavior. How do you confirm this hypothesis and build a monitoring system that automatically adapts to this weekly pattern?
Your company is migrating from MLflow 1.x to MLflow 2.x and also switching artifact storage from local NFS to S3. There are 200,000 existing runs with artifacts on NFS. The migration must have zero downtime and allow continued new experiment logging during the migration. Design the migration plan.
A platform team is designing a multi-tenant ML platform on Kubernetes for 10 internal teams. Each team needs isolated training environments, shared inference infrastructure, and the ability to use GPUs. Design the namespace and RBAC architecture with security isolation without creating excessive operational overhead.
Your team wants to use DVC to version a streaming dataset that grows by 50GB per day. You cannot store the full historical dataset in S3 because it grows by 18TB per year and the cost would be prohibitive. How do you version an ever-growing streaming dataset with DVC without storing every historical version?
A platform team discovers that data scientists are manually promoting models to production by directly editing the model registry stage via the MLflow UI, bypassing all CI/CD gates. This has caused two production incidents. How do you enforce that only the automated pipeline can promote models to production?
A retail recommendation model's click-through rate drops sharply over two weeks during the holiday shopping season. The data science team argues this is expected behavior and not a model problem. The platform team wants to roll back the model. How do you adjudicate this disagreement and establish a process for future seasonal events?
Your e-commerce recommendation model deployed on TorchServe handles 12K RPS during flash sales, but p99 latency spikes from 40ms to 380ms under peak load. The model itself is a 200M-parameter transformer. Walk through your diagnosis and remediation plan.
You're asked to deploy a scikit-learn fraud detection model as a REST API. Your team currently uses FastAPI, Docker, and Kubernetes. The model file is 2.4GB. What deployment strategy do you choose and why?
Your Kubeflow Pipeline for a weekly churn model retraining runs for 6 hours but starts failing at the feature engineering step after 3 hours due to a Kubernetes pod OOM kill. The step processes 500GB of raw events. How do you fix this without simply giving the pod more memory?
An Airflow DAG for your daily model retraining pipeline intermittently fails with a 'BrokenPipeError' when writing Parquet files to GCS mid-task. The task has been running for 45 minutes before the error. What is the likely cause and how do you make the pipeline resilient?
Your team uses Feast with a Redis online store. During a post-deployment review, you discover the fraud model is reading stale features in production—sometimes 8 hours old—while your batch materialization job runs hourly. Explain the root cause and redesign the feature freshness architecture.
Your data science team wants to reuse features computed for model A in model B, but the two teams store features in separate Pandas DataFrames and there's no shared catalog. How do you introduce a feature store with minimal disruption to existing workflows?
A retail demand forecasting model's MAE degrades from 12% to 31% over six weeks following a product catalog expansion from 10K to 45K SKUs. No data drift alert fired. Why might your drift detection have missed this, and how do you redesign the monitoring?
Your production NLP sentiment classifier's prediction distribution shifts from 60% positive / 40% negative to 45% / 55% over two weeks following a new product launch. Your monitoring flags this as concept drift. How do you determine if this is real drift or a legitimate change in customer sentiment?
Your ML team runs 50-100 experiments per week across 8 data scientists, all tracked in a shared MLflow server backed by PostgreSQL and S3. Query performance is degrading—experiment comparison queries now take 45 seconds. How do you diagnose and fix the performance problem?
A data scientist on your team says 'I ran the same training script twice with the same MLflow run ID and now I can't tell which metrics belong to which run.' What went wrong and how do you prevent this class of problem?
You're designing a CI/CD pipeline for a credit scoring model that must meet regulatory requirements: every model in production must have an auditable lineage from raw data to deployed artifact, and any model update must be reviewed and signed off by a compliance officer before deployment. How do you build this into your pipeline?
Your team's ML CI pipeline takes 2.5 hours to run because it retrains the model from scratch on every pull request. Engineers are bypassing CI to merge faster. How do you redesign the pipeline to be fast enough that engineers actually use it?
Your organization has 15 models across three business units all registered in MLflow Model Registry. A senior data scientist wants to roll back a production fraud model to version 12 from version 15 after a performance regression, but the deployment pipeline only reads the 'Production' stage. Describe the safe rollback procedure and what process gaps this incident reveals.
A new team member accidentally registered a model trained on PII data to your shared MLflow Model Registry and pushed it to the 'Staging' stage. You need to remove all traces of this model before it gets promoted to production. What steps do you take?
Your team adopted DVC with an S3 remote six months ago. A data scientist reports that running 'dvc repro' produces different model metrics than the last tagged experiment, even though the code and dvc.lock file are unchanged. Diagnose this.
Your data team stores raw training data in an S3 bucket. You want to start using DVC to version this data, but the bucket already has 2TB of existing data in various folders with no versioning history. How do you onboard this data into DVC without disrupting existing workflows?
You run ML training jobs as Kubernetes Jobs on a GPU cluster. After a cluster autoscaler upgrade, GPU jobs are now stuck in Pending state for 20-30 minutes before nodes provision. Previously, provisioning took under 3 minutes. What do you investigate?
A team wants to run hyperparameter tuning using Optuna with 50 parallel trials, each requiring 1 GPU. Your Kubernetes cluster has 20 GPU nodes with 4 GPUs each (80 GPUs total), but other teams also use this cluster. How do you schedule this fairly without starving other workloads?
A computer vision model serving on A100 GPUs achieves only 35% GPU utilization during peak traffic (8K images/second). The model is a ResNet-50 for image classification. Your team lead asks why you're wasting 65% of a $3/hour GPU. How do you diagnose and improve utilization?
Your team deploys a BERT-base model for intent classification on NVIDIA T4 GPUs. The model handles 200 requests/second adequately, but latency spikes when request length varies widely (from 10 to 512 tokens). Explain why and how you mitigate this.
You're rolling out a new version of a revenue-critical product recommendation model using a canary deployment. After 2 days at 10% canary traffic, click-through rate (CTR) on the canary is 3.1% vs. 2.8% on the baseline, a seemingly positive signal. Your manager wants to accelerate to 50%. What concerns do you raise and what additional analysis do you require?
You want to deploy a shadow copy of your new demand forecasting model alongside production to validate its predictions before going live. The production model runs in a Kubernetes Deployment serving a gRPC API. How do you architect the shadow deployment without affecting production latency?
Six months after deploying a customer lifetime value (CLV) model, a regulatory audit requires you to reproduce the exact predictions made on a specific date. Your team used no experiment tracking at the time. What systems do you check and what is the honest assessment of whether full reproduction is possible?
Two data scientists independently run the same training script on the same dataset and get different model weights, despite both setting random seeds at the top of the script. How do you systematically find and fix all sources of non-determinism?
Your ML platform serves 40 models across 12 teams. The SRE team complains that ML model health is a black box—they can't tell from their existing Prometheus/Grafana stack whether a model is 'healthy.' Design a model observability framework that integrates with their existing tooling.
Your model serving logs are in CloudWatch, but your ML engineering team uses Python and finds CloudWatch queries painful for debugging model behavior. How do you build a lightweight debugging workflow without replacing CloudWatch?
Your ML training infrastructure on AWS costs $180K/month. A cost review shows 40% of that is idle compute between training runs. Your training jobs run 3-4 times per day, each lasting 2-4 hours, on p3.8xlarge instances. Design a cost reduction strategy targeting a 35% reduction without degrading training throughput.
Your team's Kubeflow Pipelines runs on a GKE cluster. The cluster's 20 CPU nodes stay provisioned 24/7 even though training jobs only run from 6 AM to 10 PM. How do you automate scaling to zero during off-hours and validate that it doesn't break anything?
You're deploying a large language model (7B parameters, LLaMA-based) for an internal enterprise search tool. The p99 latency requirement is under 2 seconds for a 500-token generation. You have 4 A10G GPUs in a Kubernetes cluster. How do you architect the serving infrastructure?
You need to build a nightly batch inference pipeline that scores 50 million customer records using a trained XGBoost model. The current approach uses a single Airflow task that loads all data into a Pandas DataFrame and runs predict(). This works for 5M records but runs out of memory at 50M. Redesign the pipeline.