Data Analyst / BI Analyst Interview Questions
200 scenario-based questions with detailed model answers, organized skill-wise and tool-wise. Filter by topic, level or keyword, reveal the answer — then pressure-test yourself in a real mock.
Your e-commerce company's revenue report shows a 15% drop last Tuesday, but the engineering team says no deployments happened. You have access to raw transaction tables. Walk through your SQL-based investigation—what queries do you write first, and what are the most likely culprits?
A product manager asks you to calculate 7-day rolling retention for a mobile gaming app. The events table has user_id, event_name, and event_timestamp. She wants to see cohort retention by the week users first installed. How do you structure this query?
You're auditing a finance team's SQL report used for monthly revenue recognition. You discover the query uses a LEFT JOIN between invoices and payments without a WHERE clause on payment status, effectively double-counting partially paid invoices. The report has been used in board presentations for six months. How do you handle this?
A marketing analyst built a stacked bar chart showing channel attribution across 12 months, with 9 color-coded channels. The VP of Marketing says the chart is 'impossible to read.' What's wrong with this design, and how would you redesign it?
Your team is building a real-time operational dashboard for a logistics company showing package delivery status across 50 cities. The data refreshes every 5 minutes. A product manager wants animated transitions and a globe visualization. You have limited front-end resources. How do you push back and what do you build instead?
Your Tableau Server extract refreshes are failing silently every night for a key sales dashboard. By morning, executives are looking at two-day-old data but the dashboard shows no error state. Walk through how you diagnose and fix both the immediate failure and the alerting gap.
A colleague built a Power BI report with 15 visuals on one page, and it takes 45 seconds to load. Your manager asks you to optimize it. What's your diagnostic process, and what are the three most common causes of slow Power BI reports?
Your company is deciding between Tableau and Power BI for a new enterprise BI platform serving 300 analysts, half of whom are in finance (heavy Excel users) and half in product and engineering (Python/SQL users). The IT team prefers Azure. How do you frame this decision?
You're designing a weekly performance dashboard for a SaaS customer success team. The team lead wants to track 22 different metrics. How do you prioritize which metrics to show, and how do you structure the information hierarchy?
You've delivered a beautifully designed executive dashboard for a retail chain's COO. Three months later, adoption is near zero—the COO checks it once a week at most, and regional managers don't use it at all. What went wrong and how do you recover?
A business unit head wants a single 'company health score' shown prominently on the executive dashboard as one number. You think this is a bad idea. How do you handle this conversation, and what do you build instead?
Your fintech company's fraud detection model flags 0.3% of transactions. The compliance team says the false positive rate is unacceptably high and is causing customer complaints. Your data science colleague says the model is 99.7% accurate. Who's right, and how do you reconcile this?
You're analyzing weekly active users for a subscription app. The metric has been flat for three months at 50,000 WAU, but your CEO says growth is stalling. A junior analyst says 'it's fine, the numbers haven't changed.' What deeper analysis would you run?
An e-commerce team reports that average order value increased by $12 after a UX redesign. The sample size is 500,000 orders. The p-value is 0.001. A VP declares it a huge win and wants to roll out the redesign company-wide immediately. What do you check before agreeing?
A finance analyst shares an Excel model used for monthly P&L reporting. You discover formulas like =C14+C15+C16+C17 instead of =SUM(C14:C17), hardcoded values scattered in formula cells, and no documentation. You're asked to audit and improve it. Where do you start?
Your company's sales operations team runs a 15-tab Excel workbook with 200 VLOOKUP formulas connecting five data sources. Recalculation takes 3 minutes and crashes once a week. You've been asked to fix it without rebuilding it entirely in a BI tool. What do you do?
You're running a nightly dbt pipeline that transforms raw Salesforce CRM data into a marketing attribution model. One morning you find that 40,000 lead records silently disappeared from the output table overnight. No dbt test failed. How do you investigate and prevent this in the future?
You're joining two datasets: a user activity log (100 million rows, daily events) and a user profile table (5 million rows). Your Spark job is timing out. What are the likely causes and how do you fix them?
Your company is migrating from an on-premise Informatica ETL pipeline to dbt + Snowflake. The legacy pipeline has 300 transformation jobs, some of which have undocumented business logic buried in stored procedures. How do you approach the migration without breaking downstream reports?
A senior VP asks you for a report showing 'everything about our customer churn.' You have two weeks and a data warehouse full of relevant tables. How do you handle the initial scoping conversation, and what do you deliver?
You've delivered an analysis showing that a major product initiative your company spent $2M on over 6 months has had zero measurable impact on the target KPI. The product VP is about to present this initiative as a success to the board. How do you handle this?
Two business units are in conflict over which one 'owns' a customer who uses both products. Each unit's analyst has produced a different revenue attribution report, and the CFO has asked you, as the central data team lead, to resolve it. What's your process?
You've joined a B2B SaaS company where the sales team tracks 'revenue' as TCV (total contract value) at signing, but the finance team tracks it as recognized revenue. The marketing team tracks MQL volume. At the weekly leadership meeting, everyone quotes different numbers and the CEO is frustrated. How do you fix this?
Your product team wants to use 'number of logins' as the primary engagement metric for a B2B project management tool. You think this is a bad choice. How do you make your case and what do you propose instead?
You're defining the North Star Metric for a two-sided marketplace connecting freelance designers with startups. The head of growth wants DAU; the head of revenue wants GMV; the head of product wants 'matches made.' How do you facilitate this decision?
You're running an A/B test on a checkout flow redesign for an online retailer. After 2 weeks, treatment shows +8% conversion rate, p=0.04. Your VP of Product wants to ship immediately. What do you check before agreeing, and what would make you refuse to call it a win?
Your email marketing team ran an A/B test on subject lines. Version A got 22% open rate, Version B got 24% open rate, with 5,000 emails per variant. They declare B the winner and plan to use it for the next 200,000-person send. What concerns do you raise?
Your growth team runs 50 A/B tests per quarter. The experiment lead reports an 'experiment win rate of 70%.' Your data science VP suspects p-hacking. How do you audit the experimentation program?
You've completed a six-week analysis showing that the company's most profitable customer segment is shrinking while the company is growing revenue overall. The CEO needs to present this to investors next month. How do you structure the narrative?
You're presenting monthly marketing attribution results to a team of brand managers who are not data-literate. Your analysis shows that performance marketing is under-credited in last-touch models and brand campaigns are over-credited. How do you communicate this without the room going defensive?
You've been asked to present a complex cohort retention analysis to a board of directors who have 8 minutes for your section. The full analysis is 40 slides. How do you prepare your 8-minute version, and what do you leave out?
You have a Pandas DataFrame with 10 million rows of clickstream data. Filtering and aggregation operations are taking 4 minutes each. Your colleague says 'just use Spark.' Before rewriting everything, what Pandas-native optimizations do you try?
You're building an automated weekly report pipeline in Python that pulls data from a Postgres database, runs a cohort analysis, and emails a formatted Excel report to 20 stakeholders. It's been running in production for 3 months. What can go wrong, and how do you build it to be resilient?
A data analyst on your team wrote a 600-line Python script that generates the company's most-watched sales report. It has no tests, uses global variables, and mixes data fetching, transformation, and rendering in one function. You're asked to refactor it. How do you approach this without breaking production?
After a major data migration to a new warehouse, your business users start reporting that revenue figures in dashboards don't match the ERP system. The discrepancy is about 3% and appears only in certain regions. How do you systematically investigate and resolve this?
Your daily user activity table suddenly has 40% NULL values in the device_type column starting two days ago. No one reported a change. How do you investigate?
You're implementing a data quality framework for a healthcare analytics team. The data includes patient records, claims data, and lab results, with regulatory requirements for completeness and accuracy. How do you design the DQ framework?
You're building the data model for a multi-tenant SaaS platform that bills customers on a usage-based model (per API call, per GB stored, per active seat). The billing team needs monthly invoices; the product team needs daily usage analytics; the finance team needs ARR forecasting. How do you design the core tables?
You're asked to design a data model to track sales rep performance at a company with complex hierarchical territories (regions → districts → teams → reps) and a product hierarchy (category → subcategory → SKU). The VP of Sales wants to slice quota attainment at every level. How do you structure this?
A media company has three content types (articles, videos, podcasts), each with different engagement metrics. The analytics team is struggling because separate tables exist per content type with incompatible schemas. You're asked to unify them. What are the trade-offs between different modeling approaches?
Your team's analytics database has a slow-running query that joins an orders table (50M rows) with a customers table (2M rows) on customer_id. The query takes 8 minutes in Postgres. No indexes exist. What do you do?
Your manager asks you to calculate the 'average time to purchase' for an e-commerce site. When you report the mean as 4.2 days, the product team says it doesn't match their intuition. What's going on, and what's the better metric to report?
You're asked to build a customer 360 table that joins CRM data, web analytics, support tickets, and billing data, all with slightly different customer identifiers. Email is in CRM, user_id is in web analytics, account_id is in billing, and ticket_email is in support. How do you approach the identity resolution problem?
A sales analyst built a Tableau dashboard that filters to show only closed-won deals. She now wants to add a conversion funnel showing all pipeline stages. Adding the new chart breaks all the existing filters. How do you help her fix this?
Your team is launching a public data transparency report for a healthcare nonprofit showing COVID outcome disparities across racial groups. You're the lead analyst. What are the visualization ethics considerations, and how do they change your design choices?
You're asked to build a single dashboard for both a data-savvy engineering VP and a less technical HR director. Both use the same workforce attrition data but have very different needs. How do you design for both audiences in one tool?
A growth team asks you to track 'virality' as a KPI. They define it as 'number of referrals generated per month.' You think this definition is insufficient. What's wrong with it, and what do you propose?
You want to test whether adding a progress bar to an onboarding flow improves completion rates. You only have 200 new signups per week. How do you design an experiment that gives you a reliable answer within 8 weeks?
You've been asked to present a complex multi-year trend analysis to a team of non-technical regional sales managers at a quarterly business review. The analysis shows three interacting variables: headcount growth, quota per rep, and attainment rate. How do you structure a 10-minute presentation?
You inherit a Jupyter notebook that generates a weekly churn report. Every cell takes forever to re-run because the raw data query returns 50 million rows and Pandas loads them all into memory. How do you fix this without a full rewrite?
You discover that 8% of rows in your transaction table have negative amounts. Your initial assumption is data corruption, but your colleague says it might be legitimate. How do you investigate before deciding what to do?
You're modeling subscription revenue for a SaaS company with monthly and annual plans, mid-cycle upgrades, pro-rated billing, and mid-cycle cancellations. The finance team wants MRR to be accurate to the dollar. How do you design the MRR table?
You're tasked with building a near-real-time anomaly detection system for a payment processing company that needs to flag unusual merchant transaction volumes within 5 minutes of occurrence. You have Kafka and a Postgres analytical database. How do you architect the detection logic?
Your data shows a strong negative correlation (r = -0.72) between customer support ticket volume and retention rate across your 50 customer accounts. Your VP says this proves that bad support causes churn. How do you challenge this interpretation?
You've delivered a report to a regional sales director who responds with 'these numbers don't look right' but can't tell you specifically what's wrong. How do you handle this feedback?
Your e-commerce company runs a Redshift cluster. A critical daily revenue report that used to finish in 4 minutes now takes 47 minutes. The query joins orders, order_items, products, and users across 200M rows. Walk through your complete investigation and remediation approach.
A product manager asks you to calculate 30-day rolling retention for a mobile game with 5M daily active users. Your events table has (user_id, event_date, event_type). Write the approach and describe the key SQL constructs you'd use.
A fintech startup's fraud team asks you to flag accounts where the total transaction amount in any rolling 7-day window exceeds $50,000. The transactions table has 800M rows. They need results in under 10 minutes on Snowflake. Design the solution.
Your head of growth reviews a line chart you built showing weekly signups. She says the chart is 'hard to read' but can't explain why. You look at it and see 12 overlapping colored lines for 12 acquisition channels. How do you redesign it?
You are designing a real-time operations dashboard for a logistics company that tracks 10,000 active shipments. The NOC team needs to identify delayed shipments within seconds of viewing. Describe your complete visualization design decisions, including chart types, color, and data update strategy.
Your Power BI report serving 400 users refreshes from a 50GB Azure Synapse table every 30 minutes, but refresh jobs are failing after 20 minutes with a gateway timeout. The dataset cannot be moved. What is your remediation plan?
A Tableau dashboard you published last week shows different totals than the source Excel file that a regional sales manager maintains manually. The difference is $240K. How do you systematically debug this discrepancy?
You are migrating 80 Tableau dashboards to Power BI for a bank. Halfway through, stakeholders discover that Power BI's DAX-calculated measures produce different numbers than Tableau's LOD expressions for the same KPIs. How do you manage this?
You are asked to build a single dashboard for both the CEO (weekly strategic review, 5 minutes of attention) and the growth team (daily deep-dives, 45-minute sessions). They insist on one URL. How do you design this?
A healthcare SaaS company's customer success team uses a Looker dashboard to monitor patient engagement scores. After a UI refresh, the team's CSAT dropped 18 points because they said the dashboard was 'confusing.' Diagnose and fix this.
You are designing a self-serve analytics portal for 500 non-technical business users at a retail chain. Usage data shows that 90% of users view only 3 of the 40 available reports. How do you redesign the portal?
You work at a ride-share company. The data science team claims that a new driver incentive program increased average earnings per trip by $1.20 based on a two-sample t-test. You are suspicious. What would you audit before accepting this result?
A B2B SaaS company tracks monthly active users but the metric doubled after engineering changed the definition of 'active' from 'logged in' to 'performed any API call.' The old CEO dashboard shows a 2x spike. How do you handle this?
Your company's NPS score dropped from 52 to 44 in Q2. The VP of Customer Success insists the product team broke something. You have response-level data with timestamps, respondent segments, and verbatim comments. How do you investigate the root cause?
A finance analyst sends you a 15,000-row Excel pricing model with nested IFs, VLOOKUP chains, and hardcoded values scattered across six sheets. She asks you to add a new discount tier without breaking anything. What is your process?
Your sales operations team uses a shared Excel file to track pipeline across 12 regional reps. Each week, two reps accidentally overwrite each other's data. Leadership refuses to move to a proper CRM. What do you do?
An airline's data engineering team runs an Airflow DAG that loads 12 million flight records daily from an SFTP file drop into Snowflake. Three times in the past month, the file arrived 2 hours late, causing downstream BI reports to show yesterday's data. How do you redesign the pipeline to handle this reliably?
You receive a CSV file from a third-party vendor containing customer address data that needs to be merged into your CRM. The CSV has 40,000 rows with inconsistent address formatting, duplicate emails, and mixed-case names. Walk through your cleaning process.
Your dbt project runs 280 models daily on BigQuery. A recent PR added 3 new models that caused the full run time to increase from 22 minutes to 68 minutes. How do you diagnose and fix the performance regression?
A regional VP emails you at 9 PM saying the Q3 sales dashboard shows his region underperforming by 22% and he wants to present it to the board tomorrow morning. You know the data has a known processing lag. How do you respond?
Two senior business partners — the head of marketing and the head of finance — are each claiming your data supports their conflicting positions on whether a campaign was profitable. They want you to adjudicate. How do you handle this politically charged situation?
The CEO asks you to present 'the key takeaway' from a 6-month customer churn analysis to the board in 3 minutes. You have 40 slides of detailed analysis. How do you prepare and structure those 3 minutes?
A subscription edtech startup asks you to define their North Star metric. They have course completions, revenue, DAU, and NPS. Walk through your reasoning for choosing one and what makes it a good North Star.
A marketplace platform defines GMV as its headline metric. The board is questioning whether GMV growth is real since cancellations and returns are growing equally fast. How do you redesign the metric framework?
Your company measures engineering team productivity using 'story points delivered per sprint.' The engineering VP says this metric is being gamed — teams inflate story point estimates. How do you help them design a better metric system?
You are running an A/B test on a checkout flow for a furniture e-commerce site with 20,000 monthly transactions. After 3 weeks, the test variant shows a 14% lift in conversion but p=0.06. The CPO wants to ship it. How do you respond?
You are asked to design an A/B test for a new email subject line to improve open rates. The current open rate is 22%. How do you determine sample size and duration, and what pitfalls do you flag to the team?
Your company runs 15 simultaneous A/B tests across the homepage. A post-hoc analysis finds 6 of them 'significant' at p<0.05. The growth team is celebrating. What is the problem and how do you fix the analysis?
You have just completed a 3-month analysis showing that the company's highest-revenue segment is actually its least profitable when factoring in support costs and churn. Senior leadership has always publicly championed this segment. How do you present this finding?
You are presenting a demand-forecasting model to a retail board. The model has an 8% MAPE which is technically excellent, but in absolute terms it was off by $4.2 million in last quarter's peak season. A board member says 'your model was wrong by $4.2 million.' How do you respond?
You have three months of data showing a clear 2-standard-deviation anomaly in returns for one product category. You believe it is an early signal of quality defect, but the ops team dismisses it as 'normal variation.' How do you build a compelling case?
You are loading a 4GB CSV of clickstream data into pandas on a 16GB RAM laptop. The read_csv call throws a MemoryError. Walk through three concrete approaches to handle this without moving to a different tool.
A data science team at an insurance company runs a weekly pandas pipeline that takes 6 hours on a single machine. The pipeline reads 80M policy records, computes rolling features, and outputs a model training dataset. How would you optimize it?
You need to merge two dataframes: transactions (5M rows, transaction_id, customer_id, amount) and customers (200K rows, customer_id, segment, region). After the merge you notice the transactions dataframe grew to 7M rows. What happened and how do you fix it?
You are the lead analyst for a media company. The ad revenue team discovers that their revenue reporting system has been double-counting a specific advertiser's spend for 8 months due to a pipeline bug. The error amounts to $3.2M in overstated revenue. How do you manage the remediation?
A marketing analyst tells you that the new user sign-up count in your dashboard is 30% higher than what Salesforce shows for the same period. Both systems say they are the source of truth. How do you resolve the discrepancy?
You join a healthcare analytics team and discover that the patient data warehouse has no documented data lineage, no freshness SLAs, and no quality checks. Leadership wants a data quality framework built in 60 days. What do you build and in what order?
A telecom company wants to model customer lifetime value for its 8 million prepaid subscribers. Prepaid subscribers have no contract, irregular top-ups, and high churn. How do you approach the LTV model differently than for postpaid contract customers?
A hotel chain asks you to model occupancy rate at the property level. You have 3 years of booking data but the chain has been expanding — 12 new hotels opened in the last year. How do you handle the cold-start problem for new properties?
A two-sided marketplace for freelancers defines 'a successful match' differently in its product team, finance team, and legal team. This causes three different match-rate metrics to diverge. You are asked to create a single canonical definition. How do you proceed?
A SaaS company wants to analyze funnel conversion: users go through Trial → Onboarding → Active → Churned. Some users skip steps. Write the approach to compute step-by-step and overall funnel conversion from an events table.
You build a model to predict which customers will churn in the next 30 days. The model's accuracy is 96%. The product manager is thrilled. You are not. Explain why and what you would report instead.
You are asked to design a mobile-first weekly business review dashboard for store managers at a grocery chain. Each manager oversees 1 store. The dashboard must load in under 3 seconds on a 4G connection. What design choices do you make?
A healthcare provider sends you patient appointment data in a flat CSV where multiple appointment dates for the same patient are stored as comma-separated values in a single field (e.g., '2024-01-05,2024-01-19,2024-02-03'). You need one row per appointment. Describe the transformation.
A business analyst on your team built a Power BI report with 14 slicers on a single page. Users say the report is slow and confusing. How do you refactor it?
You need to present a year-over-year revenue decline of 12% to a sales team that is already demoralized. How do you frame and structure the presentation to be honest, motivating, and actionable?
A marketplace for professional services uses average session duration as a proxy for user engagement. You argue this is the wrong metric. Make the case and propose two better alternatives.
A travel booking platform ran an experiment where the treatment group saw social proof ('47 people viewed this hotel today') on the booking page. The experiment ran for 4 weeks. Conversion rate improved 8% but revenue per booking dropped 4.3%. Overall revenue is unclear. How do you analyze and present this?
The head of engineering says your analysis proving a data pipeline failure caused a revenue impact is wrong because the code was correct. You have query-level evidence showing the impact. How do you handle this disagreement?
You are building a reusable Python library for your analytics team that standardizes how they compute cohort retention tables. Describe the API design, key implementation decisions, and how you would test it.
A financial services firm runs a monthly regulatory reporting process entirely in Excel, with 3 analysts manually copying data between 22 spreadsheets. Errors occurred twice in the past year, leading to regulatory resubmissions. You are asked to improve the process without replacing Excel. How?
You run a dbt test suite and notice that a 'not_null' test on orders.customer_id has been passing for 6 months. Today you discover that the ETL pipeline has been inserting a placeholder value of -1 for missing customer IDs rather than NULL. How do you respond?
You have built a rigorous churn prediction model. Legal raises concerns that the model's output could be used to discriminate against protected-class customers in service prioritization. How do you engage with legal and what changes do you make to the model or its use?
A food delivery startup wants to measure 'delivery reliability' as a KPI. Different teams define it differently. Walk through how you would create a single operational definition and implement it in data.
An energy company tracks power consumption per building every 15 minutes, producing 96 readings per building per day across 5,000 buildings. A downstream report needs the daily peak 30-minute demand (average of the two consecutive highest readings) per building. Design the query.
You are presenting user growth data for an investor update. MAU grew from 200K to 600K over 18 months. However, 70% of the growth came in the last 2 months due to a viral campaign. An investor asks whether growth is 'organic and sustainable.' How do you build the chart that honestly answers this?
Your e-commerce company's monthly revenue SQL query suddenly returns results 18% lower than the finance team's Excel model for the same period. Both claim to be correct. How do you resolve the discrepancy and prevent it from recurring?
You are asked to calculate 7-day rolling average daily active users in BigQuery for a mobile gaming app. The event table has 2 billion rows and your query times out. Walk through how you would optimize it.
A fintech startup asks you to build a cohort retention analysis showing 90-day retention by acquisition channel. The transactions table has 50 million rows with no user-level cohort table. How do you design and execute this from scratch in Snowflake?
A product manager insists on using a 3D pie chart to present feature adoption rates across 8 product features in a board presentation. How do you handle this request?
You are designing a real-time operations dashboard for a logistics company tracking 12,000 daily shipments. The operations team complained the previous dashboard caused 'alert fatigue' — too many red indicators for issues that resolved themselves. How do you redesign it?
Your bar chart comparing regional sales performance is being misread because one region serves enterprise accounts with deal sizes 10x larger than other regions. How do you visually communicate both volume and deal size simultaneously?
Your Power BI report connected to a 200 million row Azure Synapse Analytics table takes 45 seconds to load. The business wants sub-5-second load times. Describe your optimization strategy without losing analytical depth.
A Tableau dashboard you published last week suddenly shows data from two days ago instead of today. Your data source is a live connection to a PostgreSQL database. Walk through how you diagnose this.
Your company is migrating 40 Tableau dashboards to Power BI after acquiring a firm that standardized on Microsoft stack. You have 8 weeks. What is your migration strategy and how do you handle the inevitable LOD calculation rewrites?
The CEO asks for a single 'company health dashboard' that shows everything important. Marketing wants CAC and LTV. Finance wants ARR and burn rate. Product wants DAU/MAU. How do you scope and design this without creating an unreadable mess?
You are building a supply chain dashboard for warehouse managers who work 12-hour shifts and often access it on a 10-inch tablet. What specific design decisions do you make for this form factor and use case?
Six months after launch, your customer support dashboard has been abandoned by the team it was built for. Usage analytics show it peaked at 40 views on day 1 and now gets 3 views per week. How do you investigate and revive it?
Your recommendation engine team claims a 15% lift in click-through rate from their new model. When you audit the experiment, you find the holdout group received no recommendations at all, not a previous model. Why is this a problem and how do you design a proper evaluation?
A marketing team reports that email open rate jumped from 22% to 31% last month. Your manager asks if this is statistically significant or noise. How do you test this?
Your NPS score dropped from 52 to 44 between Q2 and Q3. The VP wants to know if this is a real signal or sampling variance. You ran surveys with 800 respondents in Q2 and 600 in Q3. Walk through your analysis.
You inherit a 15-tab Excel financial model from a departing analyst. You need to verify its accuracy before presenting results to the CFO next week. What is your audit approach?
The sales operations team manages a 50,000-row CRM export in Excel. It is slow, breaks formulas when filtered, and loses data when multiple people edit. You've been asked to 'fix it'. What do you propose?
Your dbt pipeline that transforms raw Stripe payment events into a revenue model starts failing silently — it completes without errors but produces revenue numbers 8% lower than expected. How do you diagnose this?
You receive a 2 GB CSV of customer transaction history from a vendor. It has 200,000 rows with inconsistent date formats (some MM/DD/YYYY, some YYYY-MM-DD, some 'Jan 15, 2024'). Walk through your cleaning approach.
You are building an Airflow pipeline to load daily sales data from an on-premise Oracle database into Snowflake. The Oracle team warns that queries during business hours slow down their OLTP system. How do you design around this constraint?
You present analysis showing that a key product feature's engagement has been declining 12% month-over-month for 5 consecutive months. The product VP dismisses your finding as 'seasonal' without providing any seasonality evidence. How do you handle this professionally?
A sales director asks you to 'prove' that their team's new training program increased deal close rates. Before looking at the data, what questions do you ask to scope this analysis properly?
Finance asks for a report by Friday. Product asks for a dashboard by Thursday. Your manager has given you a high-priority data quality investigation that is already two days in. How do you manage this conflict?
Your SaaS company's Customer Success team reports 94% retention. The CEO's board deck shows 87% retention for the same period. Both teams swear their number is correct. How do you resolve this and establish a single source of truth?
You are asked to create a single KPI to measure the health of a two-sided marketplace (buyers and sellers). What metric do you propose and why?
The growth team wants to measure 'activation' for a B2B project management SaaS. Three team members have three different definitions. How do you facilitate alignment and land on a measurable, predictive definition?
Your experimentation platform shows a pricing page A/B test reached 95% statistical significance after 3 days with a 12% lift in conversions. Your head of growth wants to ship immediately. What is your recommendation and why?
You run an A/B test on an email subject line and get a 28% open rate for variant A and 24% for variant B, but your sample size was only 300 per variant. How do you communicate this result to the marketing team?
You are analyzing an experiment where the control and treatment groups have significantly different pre-experiment conversion rates due to a randomization bug that assigned more high-intent users to treatment. How do you salvage the analysis?
You have 3 months of churn analysis that clearly shows customer support response time above 4 hours is the strongest predictor of churn. The support team lead disagrees and attributes churn to pricing. How do you present this finding persuasively without creating organizational conflict?
Your analysis is technically correct but the 15-slide deck you prepared is putting executives to sleep during your review. What do you change for the next presentation?
You are presenting an uncomfortable finding: a flagship product's core metric has been flat for two years despite significant investment. How do you present this to a room of senior leaders who are emotionally invested in the product?
You are reading a 5 GB CSV into pandas and your laptop's kernel crashes from memory. How do you handle a dataset larger than RAM in Python without switching to a different language?
You need to merge two pandas DataFrames: 10 million rows of transaction data with 500,000 rows of customer demographics. After the merge, your row count jumps from 10 million to 14 million. Diagnose and fix this.
You are using a for-loop to apply a function row-by-row to a 2 million row pandas DataFrame. It takes 45 minutes to run. How do you reduce execution time without rewriting in a different language?
Three weeks after a data pipeline migration to a new vendor's warehouse, the head of marketing reports that their campaign attribution numbers look 'off'. You don't know what 'off' means yet. How do you investigate systematically?
You notice that 15% of rows in a customer table have NULL values in the 'country' field. Before deciding how to handle them, what questions do you ask?
You are implementing data quality monitoring for a critical revenue reporting pipeline that processes $50M in monthly transactions. What framework do you put in place, and how do you avoid alert fatigue?
A healthcare analytics team asks you to model patient 'engagement' as a single metric. Patients include both preventive care users and chronic disease management users. Why is a single engagement metric problematic and how do you propose to model it?
An edtech startup wants to model 'learning progress' for students taking asynchronous video courses. Three PMs have three definitions. How do you determine the right one?
You join a Series B marketplace startup as their first senior analyst. There is no data warehouse, no dbt, no standard metric definitions. The CEO wants a 'north star metric' by end of month. How do you approach this?
You are asked to identify the top 10% of customers by lifetime value, but the orders table contains duplicated order IDs due to a known ETL bug. How do you write the query correctly despite the duplicates?
A regional sales manager modifies a published Tableau dashboard to fit their team's view and accidentally breaks a key calculated field for all 200 other users. How should your team have prevented this?
Your company's average order value (AOV) increased by $12 last quarter. The CEO is excited and mentions it in an earnings call. You notice the number might be misleading. What issue are you spotting?
You are designing a marketing performance dashboard. The CMO wants 'real-time' data. Your data warehouse refreshes every 4 hours. How do you handle this expectation gap?
Your company tracks Monthly Active Users (MAU) as its primary engagement metric. A product manager argues this metric is masking declining quality engagement. How do you evaluate this claim?
You receive a customer data file from a partner where all monetary values are stored as strings with currency symbols ('$1,234.56', '€2,100.00', '£890.00'). You need to convert these to numeric values for analysis. Describe your approach.
You run an A/B test on a checkout flow redesign. Conversion rate improves 8% but average order value drops 5%. How do you present this result to the product team?
You are presenting monthly sales performance to a regional team of 25 non-technical account managers. They are intimidated by data. How do you structure the session to make it engaging and actionable?
You are asked to automate a monthly reporting process that currently takes 3 hours of manual copy-paste between Excel and a Salesforce export. What options do you evaluate?
An executive sponsors a data initiative that your analysis shows will not achieve its stated ROI target of $2M based on your modeled assumptions. They have already announced it internally. How do you handle this?
You need to calculate day-over-day retention for a mobile app across 90 days for 500,000 users. The raw events table has 300 million rows. Write the query strategy and explain the performance choices.
Your user registration table shows 2,000 users with the email address 'test@test.com'. How do you handle these in an analysis of user acquisition trends?
A media company wants to model 'content performance' across articles, podcasts, and short videos. Each format has different consumption patterns and monetization models. How do you create a unified performance framework?
Your e-commerce company's weekly revenue SQL query runs in 45 seconds on Redshift, blocking the Monday morning dashboard refresh. The query joins orders, customers, and product tables across 500M rows. Walk through your systematic optimization approach and the specific changes you'd make.
A product manager asks you to calculate 30-day rolling retention for a mobile gaming app. The events table has user_id, event_date, and event_type columns with about 2M rows per day. Write the logic and explain the edge cases you must handle.
You're a senior analyst at a SaaS company. Finance needs MRR waterfall: new MRR, expansion MRR, contraction MRR, and churn MRR broken out monthly. Your subscriptions table has plan changes but no explicit churn flag. How do you derive the four components reliably?
You're presenting monthly sales trends to a regional VP. You've built a line chart showing 12 months of data, but the VP says it's 'hard to read' and asks for a bar chart instead. How do you decide which chart type to use, and how do you handle the disagreement professionally?
A fintech startup's investor deck shows a chart with a dual Y-axis comparing weekly transaction volume and average ticket size. The axes are scaled so both lines appear to move in near-perfect sync. The CEO wants to use this chart to argue causality. What's the problem and how do you fix it?
Your Power BI report used by 200 sales reps refreshes once daily from an Azure Synapse warehouse. Reps are complaining that numbers don't match what they see in Salesforce. You trace the issue to a complex DAX measure. Walk through your debugging process.
You've built a Tableau dashboard for a logistics company showing on-time delivery rates by region. Users report that clicking a region filter doesn't update the KPI summary tiles at the top of the dashboard. What's most likely causing this and how do you fix it?
A healthcare analytics team wants to migrate 40 Tableau dashboards to Power BI in 90 days. You're leading the migration. What are the three highest-risk areas and how do you mitigate them?
You're designing a real-time operations dashboard for a call center with 500 agents. The operations manager wants to show 25 metrics on a single screen. How do you push back and structure the design?
Six months after launch, your executive dashboard is used by only 3 of the 12 intended users. The other nine say it's 'not useful.' You have two weeks to diagnose and fix this. What's your process?
You're designing a multi-tenant SaaS dashboard where each customer sees only their own data but the underlying dataset is shared. The engineering team wants to use row-level security in Tableau. What are the implementation risks and how do you test it correctly?
Your streaming platform's data science team ran an A/B test on a new recommendation algorithm. The test shows a statistically significant 4% lift in 7-day retention (p=0.02). The product lead wants to ship immediately. You're skeptical. What red flags do you probe for before agreeing?
A marketing analyst presents a cohort analysis showing that users who receive email campaigns have 40% higher LTV than users who don't. The CMO wants to triple the email budget based on this finding. What's the statistical problem and what analysis would you run instead?
You're analyzing North Star metric data for a B2B SaaS product. The weekly active accounts metric is flat for three months despite new features shipping. Engineering says the metric is broken; product says users just aren't adopting. How do you diagnose which is true?
You inherit an Excel model from a colleague who left the company. The monthly P&L workbook has 15 sheets, circular references, and volatile functions like INDIRECT across thousands of cells. The CFO needs it updated by Friday. How do you approach this safely?
Your team's weekly sales reconciliation process takes three analysts half a day each using Excel. It involves VLOOKUPs across five workbooks, manual copy-paste, and frequent errors. You have a mandate to fix it but no engineering resources. What do you build?
You're responsible for an Airflow pipeline that ingests Salesforce data into Snowflake nightly. Last night the DAG failed at 2 AM and the sales team has no data for their 8 AM standup. Walk through your incident response and the changes you make to prevent recurrence.
You're ingesting a vendor CSV file that arrives daily with inconsistent date formats — sometimes MM/DD/YYYY, sometimes YYYY-MM-DD, sometimes written as 'January 5, 2024.' How do you handle this robustly in a production pipeline?
Your dbt project has grown to 300 models across five domains. Build times are exceeding 40 minutes and the data engineering team is complaining. You have one week to cut build time by 50%. What's your plan?
You've completed a two-week analysis showing that a major product initiative has had zero measurable impact on revenue. The product VP who championed the initiative is in the room when you present. How do you deliver this finding?
Three different business units — marketing, sales, and finance — are each requesting conflicting definitions for 'monthly active customer.' You're the analytics lead. How do you resolve this without losing credibility with any of the three teams?
An executive sponsor forwards you a competitor's quarterly report showing they grew 40% YoY while your company grew 10%. They want you to explain the gap by end of week. You have no access to their internal data. What analysis do you actually run?
You're joining a Series B marketplace startup as their first data analyst. The CEO uses 'GMV,' the CFO uses 'net revenue,' and the head of growth uses 'paid GMV.' None of these are defined anywhere. In your first 30 days, how do you establish a metric governance foundation?
Your subscription box company defines churn as 'customers who cancel their subscription.' A new analyst joins and defines it as 'customers who don't renew.' The monthly churn numbers differ by 3 percentage points. Which is right and how do you adjudicate?
Your company runs 50 concurrent A/B tests on the same user base. The VP of Product wants to ship the top 5 winning tests this week. As the analytics lead, what problems does this create and how do you manage it?
You're analyzing an A/B test on a checkout flow. The test showed a 12% lift in conversion rate but a 4% drop in average order value. Overall revenue per visitor is essentially flat. How do you interpret this and what do you recommend?
A growth team at a ride-hailing company ran a surge pricing A/B test. After 3 weeks they found a significant positive effect on rides completed but cannot get the experiment unit right — some users appear in both control and treatment groups. How do you salvage the analysis?
You've done a 3-month analysis on customer churn. Your findings are in a 45-slide deck. Your time slot in the leadership QBR is 10 minutes. How do you prepare and what do you actually show?
You discover that a supply chain disruption is driving a metric drop your company has been attributing to a recent product change. Engineering has already rolled back the product change based on the original (incorrect) analysis. How do you handle this situation?
You're presenting a multi-year customer LTV model to the board of a consumer fintech. The model has 12 assumptions. One board member with a quant background starts drilling into your discount rate assumption. How do you handle the scrutiny without losing the room?
You're reading a 4GB CSV file into pandas and your machine runs out of memory halfway through. You need to compute the average transaction amount grouped by merchant_category. Walk through your solutions in order of practicality.
You're building a Python-based data quality pipeline for a retail company's daily sales feed. The feed has 50 columns and ~500K rows per day. How do you design automated checks that are both comprehensive and maintainable over a 2-year horizon?
You're handed a pandas codebase that takes 3 hours to process a daily analytical job. The code uses iterrows() extensively. How do you systematically improve performance without breaking the logic?
Your company's marketing attribution model is showing 130% of conversions attributed (more than 100%). The model uses last-touch attribution. The data engineering team says the data is fine. How do you investigate?
A junior analyst reports that your customer table has 15% null values in the 'country' field, but the field was marked as required in the source application. Where do the nulls come from and how do you remediate them?
Book a mock interview with a senior Data Analyst / BI Analyst mentor — structured scorecard, replay, and a gap plan.