HikeCatalyst

AI / GenAI Engineer Interview Questions

200 scenario-based questions with detailed model answers, organized by skill and tool. Filter by topic, level, or keyword, reveal the answer, then pressure-test yourself in a real mock.

Q001 · LLM Fundamentals · Mid

Your team is evaluating GPT-4o versus Claude 3.5 Sonnet for a legal document summarization service. Latency is acceptable up to 5 seconds per document; accuracy on jurisdiction-specific clauses is critical. How do you structure the comparison and what metrics drive the final decision?

Q002 · LLM Fundamentals · Senior

A fintech startup running a GPT-4-turbo based advisory chatbot notices that outputs become inconsistent when the conversation exceeds 60 turns. Users report the model "forgetting" earlier financial goals they stated. You are brought in to diagnose and fix this. Walk through your investigation and solution.

Q003 · Prompt Engineering · Mid

You are building a customer support bot for a SaaS company. Initial tests show the model frequently apologizes excessively, uses filler phrases, and occasionally promises features that don't exist. What prompt engineering techniques do you apply to fix each of these issues without fine-tuning?

Q004 · Prompt Engineering · Senior

Your team discovers that chain-of-thought prompting dramatically improves accuracy on multi-step math reasoning but increases output tokens by 3× and triples cost. A business stakeholder says to remove CoT to cut costs. How do you respond and what alternatives do you propose?

Q005 · RAG & Retrieval · Mid

You are building a RAG system for a healthcare provider's internal policy search tool. Clinicians report that answers sometimes mix policies from two different hospital campuses. Both campus policies are indexed in the same vector store. How do you fix this cross-contamination problem?

Q006 · RAG & Retrieval · Senior

A retail e-commerce company runs a RAG-based product recommendation chatbot. Retrieval recall is 94% on single-intent queries but drops to 61% on queries with multiple intents like "I need waterproof running shoes under $100 for trail running." How do you redesign the retrieval pipeline to handle multi-intent queries?

Q007 · Vector Databases · Mid

You are migrating a RAG application from Pinecone to pgvector because leadership wants to consolidate on PostgreSQL. The existing index has 5 million 1536-dimensional vectors. What are the key steps and risks in this migration?

Q008 · Vector Databases · Senior

You are architecting a vector search system for a news aggregator that indexes 2 million new articles daily, each with 768-dimensional embeddings. Queries must return results in under 100ms at p95. Describe your database choice, index strategy, and operational concerns.

Q009 · Fine-tuning & PEFT · Mid

Your team fine-tuned a Llama-3-8B model for SQL generation on a proprietary database schema. After deployment, users report that queries on standard public schemas (like Northwind) produce significantly worse results than before fine-tuning. What went wrong and how do you fix it?

Q010 · Fine-tuning & PEFT · Senior

You are tasked with adapting Mistral-7B for a highly regulated insurance underwriting assistant. The model must adopt domain-specific terminology, follow strict output schemas, and never hallucinate coverage limits. Training data is limited to 3,000 labeled examples. Design the fine-tuning strategy end-to-end.

Q011 · Agents & Tool Use · Mid

You deploy a ReAct-style agent for an e-commerce operations team. It has access to order management, inventory, and CRM tools. In production, operators report the agent sometimes calls the CRM "delete customer" tool when it should only read customer records. How do you prevent this?

Q012 · Agents & Tool Use · Senior

You are designing a multi-agent system for a financial research firm where one agent browses the web, one extracts structured data, and one synthesizes investment theses. After two weeks in production, the pipeline frequently produces contradictory theses when source articles disagree. How do you redesign the architecture to handle source conflicts?

Q013 · Embeddings · Mid

A data science team at a media company built a content recommendation system using text-embedding-ada-002 embeddings. After migrating to text-embedding-3-small, click-through rate dropped 8%. Both models produce 1536-dim vectors. The team assumes the vectors are interchangeable. What is wrong and how do you fix it?

Q014 · Embeddings · Senior

You are building a cross-lingual semantic search system for a multinational HR platform that supports 18 languages. Users search in any language and expect results in all languages. Describe your embedding strategy and how you handle languages with sparse training data.

Q015 · Evaluation & Guardrails · Mid

You are responsible for a customer-facing GenAI chatbot for a children's educational platform. QA reports that 3% of responses contain content that passes your adult content filter but is still inappropriate for under-13 users—references to violence in historical content, mature themes in literary analysis. How do you tighten the guardrails without making the bot useless?

Q016 · Evaluation & Guardrails · Senior

Your company's LLM-powered contract analysis tool has been in production for 6 months. Legal teams trust it heavily. An internal audit discovers the model has been systematically misclassifying one specific clause type—limitation of liability—as favorable when it is actually adverse, across dozens of contracts. How do you respond, establish accountability, and prevent recurrence?

Q017 · LLM Serving & Inference · Mid

You are deploying a self-hosted Llama-3-70B model for an internal code assistant. On a single A100 80GB, throughput is 12 tokens/second for single requests. With 50 concurrent developers using it, response time becomes unacceptable. What optimizations do you apply?

Q018 · LLM Serving & Inference · Senior

You are the ML platform lead at a company running 15 different LLM-powered products. Each product team wants their own fine-tuned model variant. Managing 15 separate model deployments is operationally expensive. Design a serving architecture that handles multi-tenancy efficiently.

Q019 · Cost & Latency Optimization · Mid

A startup's GPT-4 API bill hit $45,000 last month, mostly from a summarization pipeline processing 200,000 news articles daily. Leadership asks you to cut costs by 60% without sacrificing quality. What is your approach?

Q020 · Cost & Latency Optimization · Senior

Your team runs a real-time LLM inference service with a strict SLA of 800ms end-to-end at p99 for a customer-facing AI writing assistant. During traffic spikes, p99 regularly hits 2–3 seconds. Your current stack is vLLM on A100s behind an Nginx load balancer. Diagnose and resolve the SLA breach.

Q021 · Multimodal · Mid

You are building a multimodal QA system for a manufacturing company where quality inspectors upload photos of defective parts and ask natural language questions about what they see. After deployment, the system correctly identifies visible defects but frequently misidentifies scale—reporting a 2mm scratch as a major structural crack. How do you address this?

Q022 · Multimodal · Senior

You are designing a document intelligence pipeline for an insurance company that processes 50,000 mixed-format claims documents daily—PDFs, scanned handwritten forms, and photographs of damaged property. The pipeline must extract structured data with 99.5% field-level accuracy. Describe your architecture.

Q023 · Hallucination & Safety · Mid

Your company deploys an LLM-powered internal knowledge base assistant. An employee discovers the model confidently cited a policy document that does not exist and used it to justify an HR decision. Describe how you investigate this incident and what guardrails you add to prevent recurrence.

Q024 · Hallucination & Safety · Senior

You are building the safety layer for an LLM-based medical symptom checker. The product must never claim to diagnose, must always recommend professional consultation for serious symptoms, and must not provide medication dosages. You have 3 weeks to ship the MVP. Design the complete safety architecture.

Q025 · Orchestration (LangChain/LlamaIndex) · Mid

Your team built a LangChain document QA pipeline that works perfectly in development but fails intermittently in production with cryptic "context window exceeded" errors. The documents are the same size as during testing. What is likely happening and how do you fix it?

Q026 · Orchestration (LangChain/LlamaIndex) · Senior

You are migrating a complex LLM pipeline from LangChain to LlamaIndex because the team finds LangChain's abstraction leaky for your use case—a multi-step research agent with custom retrieval logic. Three months in, the new pipeline has higher latency than LangChain despite identical model calls. Diagnose the performance regression.

Q027 · Context & Memory · Mid

You are building a customer support chatbot for a telecom company. Customers often call back multiple times about the same issue. Currently, each session starts fresh with no history. Product wants the bot to remember context from the customer's last 3 conversations. Design the memory system.

Q028 · Context & Memory · Senior

You are the AI lead for a long-running legal case management assistant used by litigation attorneys. Cases span 3–5 years with thousands of documents and hundreds of conversations. Standard context windows are insufficient. Design a memory architecture that maintains coherent case understanding across years of interactions.

Q029 · LLM Fundamentals · Mid

During a product demo, your team's GPT-4o-based assistant starts hallucinating company names and product details from a competitor's press release that was included in a retrieval context alongside your company's documents. The model appears to confuse which information belongs to which company. What is happening and how do you prevent it?

Q030 · Prompt Engineering · Senior

Your team is using an LLM to automatically generate SQL queries from natural language business questions for a non-technical analytics team. The model generates syntactically valid SQL 95% of the time but semantically incorrect queries 30% of the time—wrong joins, wrong aggregation levels. You cannot fine-tune the model. What is your prompt engineering strategy?

Q031 · RAG & Retrieval · Senior

A consulting firm's internal knowledge management system uses RAG over 50,000 past project documents. Consultants complain that search results are stale—a project completed 3 years ago ranks higher than a more relevant recent engagement. How do you redesign the retrieval scoring to handle temporal relevance?

Q032 · Fine-tuning & PEFT · Senior

You are a principal engineer at a cybersecurity company. You want to fine-tune an open-source LLM to generate realistic phishing email simulations for security awareness training. Legal approves the use case but raises concerns about dual-use risk. How do you design the fine-tuning and deployment to be safe by design?

Q033 · Agents & Tool Use · Senior

You are building an autonomous coding agent that can read files, write code, run tests, and commit to Git. During a pilot with 5 developers, the agent occasionally runs `git push --force` on main branches when developers haven't explicitly asked it to. This causes serious team disruption. Redesign the agent's action space to prevent this.

Q034 · Embeddings · Mid

A startup is building a job-candidate matching platform. They use cosine similarity between job description embeddings and resume embeddings to rank candidates. Recruiters report that candidates with shorter resumes (recent graduates with 1 page) consistently rank below candidates with longer resumes (5-page CVs) even when qualifications match. Why does this happen and how do you fix it?

Q035 · Evaluation & Guardrails · Senior

You are the AI lead at a company preparing an LLM-powered product for SOC 2 Type II audit. The auditors want evidence that your AI outputs are consistently safe, accurate, and within defined behavioral boundaries. You have 8 weeks. What evaluation infrastructure do you build?

Q036 · LLM Serving & Inference · Senior

You are designing the inference infrastructure for a globally distributed LLM product serving 10 million users across Asia, Europe, and North America. GDPR requires that European users' data stays in Europe. Model weights are 70B parameters. Describe your multi-region deployment architecture.

Q037 · Cost & Latency Optimization · Mid

Your team uses Claude Sonnet for a batch document classification pipeline that runs nightly, classifying 500,000 documents into 12 categories. The pipeline currently costs $8,000/month. Your manager asks you to reduce cost without changing the model or categories. What do you do?

Q038 · Multimodal · Senior

You are building an AI system for a real estate company that processes property listing photos to automatically generate compelling listing descriptions. The system must handle 10,000 photos per day, maintain brand voice, and never hallucinate features not visible in the image. Design the end-to-end pipeline.

Q039 · Hallucination & Safety · Mid

During A/B testing of your updated RAG system, you discover the new version has 15% lower hallucination rate but 8% lower user satisfaction scores. Users prefer the old version despite its higher hallucination rate. How do you reason about this trade-off and what do you recommend?

Q040 · Orchestration (LangChain/LlamaIndex) · Senior

You are architecting a LangGraph-based agentic pipeline for a financial services company where the agent must comply with Regulation Best Interest (Reg BI). The agent recommends investment products. Every recommendation must be explainable, auditable, and have a documented rationale tied to the customer's profile. How do you design the pipeline?

Q041 · Context & Memory · Senior

You are building a personalized tutoring AI for a test-prep company. The system tracks each student's knowledge gaps, learning style, and past question performance over months of study sessions. After 3 months of data, the system's personalization quality plateaus—it seems to rely only on recent data and ignores early sessions. Diagnose and fix this.

Q042 · LLM Fundamentals · Senior

Your engineering team is considering replacing a GPT-4o-based pipeline with a locally hosted Llama-3-70B model to eliminate API costs. The pipeline processes sensitive HR data. The CTO wants a 3-month ROI analysis. What factors do you model and what is your recommendation framework?

Q043 · Agents & Tool Use · Mid

You build a LangChain agent for an internal helpdesk that can query a ticketing system, look up employee directories, and send Slack messages. During testing, you notice the agent sends Slack messages prematurely—before fully resolving the query—resulting in employees receiving incomplete or incorrect information. How do you fix this?

Q044 · RAG & Retrieval · Mid

Your team indexes product documentation using a recursive character text splitter with chunk size 512 and overlap 50. Support agents report that answers are often missing because the relevant information spans a chunk boundary: a procedure starts near the end of one chunk and finishes in the next. How do you redesign chunking?
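
To make the boundary problem concrete, here is a minimal sketch using the settings stated in the question (chunk size 512, overlap 50). It uses a plain character sliding window as a simplification, not the recursive splitter itself, and the helper names `chunk_spans` and `chunks_containing`, the 1,200-character document, and the procedure offsets are all hypothetical.

```python
def chunk_spans(text_len: int, chunk_size: int = 512, overlap: int = 50) -> list[tuple[int, int]]:
    """Return (start, end) character offsets for a sliding-window split."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    spans = []
    start = 0
    while start < text_len:
        end = min(start + chunk_size, text_len)
        spans.append((start, end))
        if end == text_len:
            break
        start += step
    return spans


def chunks_containing(spans: list[tuple[int, int]], lo: int, hi: int) -> list[int]:
    """Indices of chunks that fully contain the character range [lo, hi)."""
    return [i for i, (s, e) in enumerate(spans) if s <= lo and hi <= e]


# Hypothetical 1,200-character document.
spans = chunk_spans(1200)

# A 120-character procedure that starts near the end of the first chunk.
long_procedure = chunks_containing(spans, 450, 570)

# A 30-character passage that fits inside the 50-character overlap region.
short_passage = chunks_containing(spans, 470, 500)

print(spans)
print(long_procedure)
print(short_passage)
```

Under these assumptions, no single chunk holds the whole 120-character procedure, while the 30-character passage survives intact because it fits inside the overlap. Overlap only protects content shorter than the overlap itself.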

Q045 · Hallucination & Safety · Senior

You are the AI safety lead at an autonomous vehicle company. The company wants to use an LLM to answer driver questions about vehicle capabilities in real time—while the vehicle is moving. A safety engineer flags that hallucinated capability claims could lead to driver over-reliance and accidents. Design a safety-first system.

Q046 · Vector Databases · Mid

You are using Weaviate for semantic search on a product catalog of 500,000 items. After deploying a new embedding model with higher quality, you find that query results for the same queries are significantly worse. The index was partially rebuilt—only 200,000 new items were reindexed before a deadline forced the deployment. What is happening and how do you fix it?

Q047 · Embeddings · Senior

You are building a code semantic search tool for an IDE plugin at a developer tools company. Developers search for code snippets using natural language queries. You find that general-purpose embedding models perform poorly on code-specific queries—"find where we handle database connection errors" matches irrelevant comments instead of exception handlers. How do you improve retrieval quality?

Q048 · Evaluation & Guardrails · Mid

Your AI team just shipped a customer-facing chatbot. After one week, you receive a report that the chatbot is responding in a different language than the user's input for approximately 2% of conversations. How do you investigate this bug, quantify its scope, and fix it?

Q049 · Context & Memory · Mid

You are building an LLM-powered coding assistant integrated into a developer's IDE. The assistant needs to understand the entire repository context (hundreds of files) to give relevant answers, but the model's context window is 32k tokens and the average repository is 200k+ tokens. How do you architect the context selection?

Q050 · Fine-tuning & PEFT · Mid

Your team fine-tuned a smaller open-source model on 10,000 customer service chat examples to replace GPT-4o for cost savings. In A/B testing, the fine-tuned model performs comparably on common queries but fails badly on rare edge cases that GPT-4o handled gracefully. How do you address this without reverting entirely to GPT-4o?

Q051 · LLM Serving & Inference · Mid

Your company's API gateway logs show that 40% of LLM API calls are identical prompts submitted within 60 seconds of each other—a common caching opportunity. Describe how you implement semantic caching and what pitfalls to watch for.
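
As a starting point for discussion, the identical-prompt case described above can be handled with a plain exact-match cache before any semantic matching is added. The sketch below is exact-match caching with a 60-second time-to-live, not semantic caching. The class name `ExactMatchCache`, its methods, and the injectable `clock` parameter are hypothetical choices made for illustration.

```python
import hashlib
import time
from typing import Callable, Optional


class ExactMatchCache:
    """TTL cache keyed on a hash of the exact prompt text."""

    def __init__(self, ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store: dict[str, tuple[float, str]] = {}

    @staticmethod
    def _key(prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt: str) -> Optional[str]:
        """Return the cached response, or None on a miss or an expired entry."""
        key = self._key(prompt)
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, response = entry
        if self.clock() - stored_at > self.ttl:
            del self._store[key]
            return None
        return response

    def put(self, prompt: str, response: str) -> None:
        self._store[self._key(prompt)] = (self.clock(), response)
```

Any byte-level difference in the prompt (casing, whitespace, a timestamp in the system prompt) produces a different key, which is the gap that semantic caching tries to close and also where its false-hit risk comes from.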

Q052 · Hallucination & Safety · Mid

You deploy a retrieval-augmented Q&A system for a pharmaceutical company's internal drug information portal. A pharmacist reports that the system recently gave a dosage recommendation for pediatric patients that was only correct for adults—both doses appeared in the retrieved documents. The system presented the adult dose without caveat. How do you fix this?

Q053 · Orchestration (LangChain/LlamaIndex) · Mid

You are using LlamaIndex to build a multi-document QA system over 500 annual reports from publicly traded companies. Users ask comparative questions like "Which of these 10 companies had the highest R&D spend as a percentage of revenue in 2023?" Current single-document retrieval misses the comparative nature of these queries. How do you redesign the pipeline?

Q054 · LLM Fundamentals · Mid

A product team wants to add a feature where the LLM can refuse to answer questions outside its designated scope—a customer service bot for a software product that should not answer general coding questions or political topics. Users are bypassing the restriction by framing off-scope questions as hypotheticals. How do you make scope restriction more robust?

Q055 · RAG & Retrieval · Senior

You are building a RAG system for a law firm's case research tool. Associates need to find relevant case precedents from a corpus of 2 million legal judgments. The challenge is that legal reasoning relies heavily on analogical reasoning—finding cases similar in legal principle but possibly different in surface fact pattern. Standard semantic search underperforms. How do you design the retrieval system?

Q056 · Cost & Latency Optimization · Senior

You are the ML infrastructure lead at a B2B SaaS company. Your LLM inference cost grew from $12,000 to $89,000 per month over 6 months as the product scaled, but revenue grew only 3×. The unit economics are deteriorating. Conduct a cost attribution analysis and propose a sustainable cost model.
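
The deterioration in the scenario can be quantified directly from the stated figures. The short calculation below assumes that "revenue grew only 3×" means revenue multiplied by three over the same six months; the variable names are illustrative.

```python
# Figures from the scenario: monthly inference cost at the start and end of the 6 months.
cost_start = 12_000
cost_end = 89_000
revenue_multiple = 3.0  # assumption: revenue multiplied by 3 over the same period

cost_multiple = cost_end / cost_start
# Factor by which inference cost as a share of revenue has grown.
cost_share_multiple = cost_multiple / revenue_multiple

print(f"cost grew {cost_multiple:.2f}x, revenue grew {revenue_multiple:.2f}x")
print(f"cost as a share of revenue grew {cost_share_multiple:.2f}x")
```

Under that assumption, inference cost grew roughly 7.4× against 3× revenue, so cost per dollar of revenue is about 2.5 times what it was six months earlier.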

Q057 · LLM Fundamentals · Senior

Your e-commerce company runs GPT-4 for product descriptions. A product manager notices the model confidently writes incorrect ingredient lists for food items — data it was never trained on. The PM wants to 'just retrain the model.' How do you reframe the problem and what do you actually fix?

Q058 · LLM Fundamentals · Mid

You are benchmarking a new open-source 7B model against GPT-3.5-turbo for a customer support chatbot. The 7B model scores higher on your internal BLEU tests but users in a shadow deployment prefer GPT-3.5. How do you resolve the contradiction?

Q059 · Prompt Engineering · Senior

A legal tech startup uses an LLM to extract contract clauses. In production, extraction accuracy drops from 89% in dev to 71% when contracts are PDFs with complex table layouts. The PM wants to improve the prompt. As the senior engineer, what is your actual diagnosis and plan?

Q060 · Prompt Engineering · Mid

Your team uses few-shot prompting for a sentiment classifier. A colleague adds five new examples to improve edge-case performance, but overall accuracy drops by 4%. What could cause this regression and how do you debug it?

Q061 · RAG & Retrieval · Senior

Your healthcare company's RAG chatbot retrieves from 200,000 clinical guidelines. Users report that answers to multi-step questions like 'what dosage is safe for a diabetic patient with kidney disease?' cite conflicting guidelines. Retrieval scores look fine. What architectural changes do you make?

Q062 · RAG & Retrieval · Mid

You build a RAG pipeline on a 10,000-article internal wiki. Retrieval recall is 85% but users say answers are incomplete. When you inspect retrieved chunks, the relevant content is present but the LLM seems to ignore it. What do you investigate?

Q063 · Vector Databases · Senior

Your team migrates a RAG system from Pinecone to pgvector to reduce cost. Post-migration, P99 retrieval latency jumps from 80ms to 620ms for a 5M-vector index with 1536-dimensional embeddings. Engineering wants to roll back. What do you tune before giving up?

Q064 · Vector Databases · Mid

You are building a semantic search system and must choose between Weaviate, Chroma, and pgvector for a corpus of 500K product descriptions. Your team is 3 engineers and you already run Postgres in production. What do you recommend and why?

Q065 · Fine-tuning & PEFT · Senior

A fintech company fine-tuned Llama-3-8B on 50K customer service transcripts using full fine-tuning. The model performs great on support queries but now fails on general reasoning tasks the base model handled well. Your manager calls this 'catastrophic forgetting.' What do you do?

Q066 · Fine-tuning & PEFT · Mid

You are tasked with fine-tuning a coding assistant on your company's internal Python codebase of 200K files. You have 2xA100 80GB GPUs. Walk through your data preparation, training setup, and evaluation plan.

Q067 · Agents & Tool Use · Senior

Your company deploys a GPT-4 agent that can execute SQL queries against a production database to answer analyst questions. In staging it works perfectly, but in week 2 of production it drops a table. The CTO demands an immediate fix and a process change. What do you do?

Q068 · Agents & Tool Use · Mid

You are building a ReAct-style agent with 8 tools. In testing, the agent sometimes enters infinite loops calling the same tool repeatedly with the same input. How do you detect and prevent this?
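
One possible shape for the detection half of this question is a small guard that fingerprints each proposed tool call and blocks exact repeats. This is a sketch only: the `LoopGuard` class, its `allow` method, and the default limits (2 repeats, 15 steps) are hypothetical and do not come from any agent framework.

```python
import json
from collections import Counter
from typing import Any


class LoopGuard:
    """Flags an agent run that repeats the same tool call or runs too long."""

    def __init__(self, max_repeats: int = 2, max_steps: int = 15):
        self.max_repeats = max_repeats
        self.max_steps = max_steps
        self.calls: Counter = Counter()
        self.steps = 0

    def allow(self, tool_name: str, tool_input: Any) -> bool:
        """Record a proposed call; return False if it should be blocked."""
        self.steps += 1
        # Canonical JSON so that key order does not hide a repeated input.
        signature = (tool_name, json.dumps(tool_input, sort_keys=True, default=str))
        self.calls[signature] += 1
        if self.steps > self.max_steps:
            return False
        return self.calls[signature] <= self.max_repeats


guard = LoopGuard(max_repeats=2)
first = guard.allow("search", {"q": "refund policy", "k": 3})
second = guard.allow("search", {"k": 3, "q": "refund policy"})  # same call, keys reordered
third = guard.allow("search", {"q": "refund policy", "k": 3})   # third identical call
other = guard.allow("search", {"q": "shipping policy", "k": 3})
```

The agent loop would call `allow` before executing each tool and, on `False`, return a message telling the model the call was blocked as a repeat, or end the run.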

Q069 · Embeddings · Senior

Your company uses OpenAI text-embedding-ada-002 for a document search system. The OpenAI contract is ending and leadership wants to self-host embeddings. You have a corpus of 2M documents and a latency SLA of under 100ms. What is your migration plan?

Q070 · Embeddings · Mid

You are building a multilingual customer support search system covering English, Hindi, Spanish, and Portuguese. A colleague suggests using separate language-specific embedding models for each language. What trade-offs do you evaluate?

Q071 · Evaluation & Guardrails · Senior

You are the AI lead at a bank deploying an LLM for loan officer assistance. Regulators ask for an audit of every AI-assisted decision within 72 hours. Your current system has no logging. Design the observability and audit architecture.

Q072 · Evaluation & Guardrails · Mid

Your company's LLM chatbot for HR policy questions occasionally gives incorrect answers about PTO policies that differ by country. HR wants to add a guardrail to catch these errors before they reach employees. What do you build?

Q073 · LLM Serving & Inference · Senior

Your startup serves a 70B parameter LLM for a writing assistant. During peak hours (9-11 AM), GPU utilization is 95% and P99 latency is 8 seconds. Users are churning. You have budget for 2 more A100 80GB GPUs, bringing the total to 6. What do you do?

Q074 · LLM Serving & Inference · Mid

You deploy a 13B quantized LLM using llama.cpp on a CPU-only server for an internal tool. Response latency is 45 seconds for a 500-token completion. Users won't tolerate more than 15 seconds. What options do you explore?

Q075 · Cost & Latency Optimization · Senior

Your AI platform team's monthly OpenAI bill is $180,000, mostly from GPT-4 calls for a document summarization pipeline processing 2M documents per month. The CTO gives you 60 days to cut costs by 40% without degrading P90 quality. What is your strategy?

Q076 · Cost & Latency Optimization · Mid

A startup's LLM API costs spike 3x after adding chain-of-thought prompting to improve accuracy. The CEO asks you to optimize without sacrificing the accuracy gains. What techniques do you apply?

Q077 · Multimodal · Senior

A retail company wants to automate product catalog ingestion: upload a product photo and a spec sheet PDF, and the system generates a structured product record. You have GPT-4o available. Design the end-to-end pipeline and identify failure modes.

Q078 · Multimodal · Mid

You are evaluating GPT-4V and LLaVA-1.6-34B for an insurance damage assessment tool that analyzes car damage photos. What evaluation framework do you set up before choosing a model?

Q079 · Hallucination & Safety · Senior

Your AI-generated medical information assistant has a rare but severe failure mode: it occasionally generates plausible-sounding but wrong drug dosages. Legal says one incorrect dosage is an unacceptable liability. Your PM wants to ship in 4 weeks. How do you handle this?

Q080 · Hallucination & Safety · Mid

You are building a news summarization tool and users complain the summaries occasionally contain facts not present in the source article. You suspect the LLM is blending information from its training data with the article. How do you quantify and reduce this?

Q081 · Orchestration (LangChain/LlamaIndex) · Senior

Your team built a complex LangChain pipeline with 7 nodes: ingestion, chunking, embedding, retrieval, reranking, generation, and citation. It works in dev but fails intermittently in production with cryptic LangChain errors. How do you make this system debuggable and production-grade?

Q082 · Orchestration (LangChain/LlamaIndex) · Mid

A junior engineer on your team is building a document QA system with LlamaIndex and notices that the default SimpleDirectoryReader and default chunk size give poor results. They ask for guidance on improving the index configuration. What do you tell them?

Q083 · Context & Memory · Senior

Your customer service agent has a 128K context window but users report that after a long support session, the agent 'forgets' details mentioned 50 messages ago. Context is clearly within the window. What are the likely causes and how do you fix them?

Q084 · Context & Memory · Mid

You are building a personal study assistant that needs to remember a student's learning progress, weak topics, and past quiz scores across sessions. The sessions are weeks apart. How do you implement persistent memory?

Q085 · LLM Fundamentals · Senior

A distributed training run for a 13B model on 32xA100s diverges at step 8000 — loss suddenly spikes from 2.1 to 14 and the training run becomes unstable. You have access to logs and checkpoints. Walk through your debugging process.

Q086 · Prompt Engineering · Senior

You are building a complex multi-step legal document analysis system. A senior lawyer complains that the LLM's analysis misses implicit contractual risks that an experienced lawyer would catch. How do you encode legal reasoning in your prompting strategy?

Q087 · RAG & Retrieval · Senior

You are the AI lead at a consulting firm. Your internal knowledge RAG system has 500 consultants' reports, but when a consultant asks 'what approaches have we used for supply chain optimization?', it returns reports about individual projects rather than synthesizing across them. What architectural change do you make?

Q088 · Vector Databases · Senior

Your vector database holds 50M embeddings at 768 dimensions. Nightly batch re-embedding runs cause a 40% latency spike during the update window as new vectors are inserted. Users notice degraded search quality during this window. What do you fix?

Q089 · Fine-tuning & PEFT · Senior

Your team fine-tuned a Llama-3-70B model for a medical coding assistant. HIPAA compliance requires the model weights to stay on-premise, but the hardware team can only provide 4xA100 80GB GPUs. The model in bf16 requires 140GB. What deployment strategy do you use?
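
The 140GB figure follows from two bytes per bf16 parameter. The sketch below reproduces that arithmetic and the headroom left on the stated hardware. It assumes a round 70e9 parameters, decimal gigabytes, and weights sharded evenly across the four GPUs; the function name `weight_memory_gb` is illustrative.

```python
GB = 1e9  # decimal gigabytes, matching the 140GB figure in the question


def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory needed for the model weights alone, in GB."""
    return num_params * bytes_per_param / GB


params = 70e9                        # assumption: a round 70B parameters
bf16_gb = weight_memory_gb(params, 2)    # bf16 stores 2 bytes per parameter
int4_gb = weight_memory_gb(params, 0.5)  # 4-bit quantization, ignoring format overhead

total_gpu_gb = 4 * 80                # 4x A100 80GB
per_gpu_weights_gb = bf16_gb / 4     # assumption: weights split evenly across GPUs
headroom_gb = total_gpu_gb - bf16_gb # left over for KV cache, activations, runtime

print(bf16_gb, int4_gb, per_gpu_weights_gb, headroom_gb)
```

The weights do not fit on any single 80GB card in bf16 but do fit across four, and the remaining memory has to cover the KV cache and activations for all concurrent requests.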

Q090 · Agents & Tool Use · Senior

Your company deploys a multi-agent system where Agent A researches a topic and passes results to Agent B for writing. In production, Agent B occasionally ignores Agent A's research and writes from its own knowledge, producing inconsistent outputs. How do you enforce data dependency?

Q091 · Embeddings · Senior

Your semantic search system works well for English but a product expansion requires it to also serve Japanese and Korean users. Adding 10M new documents in these languages will double your index size. How do you design the new embedding and retrieval architecture?

Q092 · Evaluation & Guardrails · Senior

You are building an automated evaluation system for a creative writing assistant that generates marketing copy. Human evaluators score 5/5 but your automated ROUGE and BERTScore metrics consistently score the same outputs 3/5. How do you build better automated evals for creative tasks?

Q093 · LLM Serving & Inference · Senior

You manage a self-hosted Mistral-7B deployment using vLLM. A software update is required but you cannot afford downtime as the system serves 24/7. Design a zero-downtime deployment procedure.

Q094 · Cost & Latency Optimization · Senior

Your company runs an LLM-powered coding assistant used by 500 engineers. Analysis shows 35% of requests are for the same 20 common tasks (docstring generation, boilerplate functions, unit test stubs). Cloud GPU costs are $40K/month. How do you optimize for this traffic pattern?

Q095 · Multimodal · Senior

A media company asks you to build an automated video content moderation system that flags hate speech, graphic violence, and copyrighted material in user-uploaded videos. The system must process 10,000 videos per day with under-1-hour SLA. Design the architecture.

Q096 · Hallucination & Safety · Senior

During red-teaming of your customer-facing LLM, a tester extracts your full system prompt by asking 'repeat everything above this line.' Your security team escalates. What mitigations do you implement and what are their limitations?

Q097 · Orchestration (LangChain/LlamaIndex) · Senior

Your team built a LangGraph-based multi-step agent for financial analysis. In production, the graph sometimes enters unexpected state transitions, skipping required validation nodes. How do you add observability and make state transitions safe?

Q098 · Context & Memory · Senior

You are building a long-running AI research assistant that must maintain context across 6-hour research sessions with hundreds of tool calls. The context window fills up within 2 hours. Design a context management system that handles 6-hour sessions.

Q099 · Agents & Tool Use · Mid

You are building a customer support agent with tools: search_knowledge_base, create_ticket, update_order, send_email. During testing, the agent calls create_ticket and send_email for every query, even simple FAQs. How do you fix tool over-calling?

Q100 · Hallucination & Safety · Mid

You deploy an LLM for an education platform serving students aged 12-17. A parent reports the model provided detailed instructions for making alcohol when a student asked about fermentation for a science project. What do you implement?

Q101 · Context & Memory · Mid

You are building a chatbot for a banking app. The chatbot must remember user preferences (preferred language, account nickname, notification settings) across sessions but cannot store personal data in an LLM context window for privacy reasons. How do you design the memory architecture?

Q102LLM Serving & InferenceMid

Your team is evaluating whether to use streaming responses for a document analysis tool. The analysis takes 15-20 seconds to complete. Product wants streaming. Engineering is concerned about complexity. Walk through the trade-offs and implementation considerations.

Q103Evaluation & GuardrailsMid

Your team uses LLM-as-judge to evaluate RAG outputs. You notice the judge model consistently gives higher scores to longer, more verbose responses even when shorter answers are more accurate. How do you fix this evaluation bias?

Q104MultimodalMid

You are building a receipt OCR and expense categorization system using a vision-language model. The model works well on printed receipts but struggles with handwritten or faded receipts from certain vendor types. How do you improve robustness?

Q105Orchestration (LangChain/LlamaIndex)Mid

A junior engineer built a LangChain RetrievalQA chain. It works but makes 3 separate LLM calls for a single user query — one for query rewriting, one for generation, and one for citation extraction. Users complain about slow responses. How do you optimize?

Q106EmbeddingsMid

You notice that your semantic search system ranks 'The server crashed at 3 AM' higher than 'Production outage at 3:00 AM' for a query about production incidents, even though the second is more directly relevant. What embedding or retrieval issue might cause this?

Q107LLM FundamentalsMid

Your e-commerce recommendation team notices that GPT-4o returns inconsistent product descriptions when temperature is set to 0.9. A junior engineer suggests dropping it to 0.0 but stakeholders want some creative variation. How do you resolve this tension and configure sampling correctly?

Q108LLM FundamentalsSenior

A fintech startup running Llama-3-70B on-premises reports that the model's outputs degrade measurably after the 14th conversation turn in multi-turn sessions. Memory usage is stable. Logs show no errors. Diagnose the root cause and prescribe a fix without changing the model.

Q109Prompt EngineeringMid

You're building a legal document classifier at a law firm. The GPT-4 zero-shot prompt achieves 71% accuracy on a held-out set of 500 contracts. A product manager asks you to hit 85% before shipping. You have no budget for fine-tuning. What prompt engineering techniques do you apply, in what order?

Q110Prompt EngineeringSenior

A healthcare SaaS company's LLM-powered clinical note summarizer is producing outputs that sometimes over-claim diagnoses not explicitly stated in the source note. Traditional prompt instructions like 'only state what is in the document' are being ignored roughly 8% of the time. Design a robust mitigation without fine-tuning.

Q111RAG & RetrievalMid

Your internal knowledge base RAG chatbot returns correct documents in retrieval but the final LLM answer is still wrong 20% of the time. The retrieval hit rate measured at k=5 is 94%. Describe how you'd diagnose where the 20% failure is occurring and what you'd fix.

Q112RAG & RetrievalSenior

A media company runs a RAG system over 2 million news articles. Retrieval latency is acceptable at p50 but p99 spikes to 4.2 seconds under load, causing timeouts. The vector database is Pinecone with 1536-dimensional embeddings. Redesign the retrieval layer to fix p99 without degrading recall.

Q113Vector DatabasesMid

You're migrating a RAG system from FAISS flat index to Qdrant for a production SaaS product. The FAISS index holds 800K vectors at 768 dimensions. During the migration test, Qdrant ANN recall drops from 99.8% (FAISS exact) to 91.3%. Walk through how you'd tune Qdrant to recover recall.

Q114Vector DatabasesSenior

A logistics company stores 40 million shipment event embeddings in Weaviate for anomaly detection. Nightly batch upserts of 500K vectors cause query latency to spike for 45 minutes post-upsert. The on-call team wants to eliminate this degradation window. Diagnose and redesign.

Q115Fine-tuning & PEFTMid

You fine-tune Mistral-7B with QLoRA on a customer support dataset of 12,000 examples. After training, the model performs well on support tickets but gives nonsensical responses to simple factual questions like 'What is the capital of France?' that it answered correctly before. Explain what happened and how to prevent it.

Q116Fine-tuning & PEFTSenior

An insurance company wants to fine-tune a 13B parameter LLM to extract structured claim fields from unstructured adjuster notes. They have 3,000 labeled examples. You have one A100 80GB GPU. Architect the full fine-tuning pipeline—data prep through serving—with concrete parameters.

Q117Agents & Tool UseMid

You're building a ReAct-style agent that can query a SQL database. During testing, the agent enters an infinite loop where it keeps re-running the same SQL query with minor variations after getting an empty result set, never returning to the user. How do you diagnose and fix this?

Q118Agents & Tool UseSenior

A trading firm is building a multi-agent system where a planning agent decomposes research tasks and delegates to specialist sub-agents (equity analyst, macro analyst, news sentiment). During a market stress event, the system produced conflicting recommendations from sub-agents that the planner combined into a dangerously mixed position. Redesign the arbitration layer.

Q119EmbeddingsMid

Your semantic search system uses OpenAI's text-embedding-ada-002. A new engineer swaps it for text-embedding-3-small to cut costs without reindexing the existing 500K document vectors. Users report dramatically worse search quality the next day. Explain what went wrong and how to fix it correctly.

Q120EmbeddingsSenior

A multilingual customer support platform embeds queries in 14 languages using a single model. The team notices that queries in Thai and Vietnamese have 30% lower retrieval accuracy than English queries, even though the indexed content was translated into those languages at indexing time. Diagnose and redesign the embedding strategy.

Q121Evaluation & GuardrailsMid

Your LLM-powered customer-facing chatbot ships with a content filter that blocks 'harmful' outputs. After launch, you discover it's blocking 12% of legitimate customer queries about medication dosages—information a pharmacist would normally provide. How do you tune the guardrail without removing it entirely?

Q122Evaluation & GuardrailsSenior

A government agency uses an LLM to assist caseworkers with benefit eligibility determinations. An audit reveals the system is producing outputs with statistically significant disparate impact across racial groups, on the order of a 15% gap in approval rates. How do you diagnose, measure, and remediate this in an AI system?

Q123LLM Serving & InferenceMid

You deploy a Llama-3-8B model on a single A10G GPU (24GB VRAM) using vLLM. Under load of 50 concurrent users, you observe GPU utilization at 40% but throughput is saturated and new requests are queuing. What is the bottleneck and how do you fix it?

Q124LLM Serving & InferenceSenior

A consumer app serving 10,000 RPM needs sub-200ms TTFT (time-to-first-token) for a 70B parameter model. Current TTFT is 380ms on 2× H100s. The model cannot be changed. Design a serving architecture that hits the target.

Q125Cost & Latency OptimizationMid

Your startup's OpenAI bill was $28,000 last month, 80% of it from GPT-4o calls in a document summarization pipeline. The average input document is 6,000 tokens. Your engineering manager asks you to cut costs by 50% within two sprints without degrading user-facing quality. What's your plan?

Q126Cost & Latency OptimizationSenior

A SaaS platform runs 2 million LLM calls per day across GPT-4o, Claude Sonnet, and Gemini Pro. The CTO wants a unified cost governance system with per-team budget enforcement, real-time alerting, and automatic model downgrade under budget pressure. Architect this system.

Q127MultimodalMid

You're building a document intelligence system that processes scanned insurance claim forms using GPT-4o vision. On clean, printed forms, accuracy is 97%. On handwritten forms, accuracy drops to 68%. Describe how you'd diagnose and close this accuracy gap without training a custom model.

Q128MultimodalSenior

A retail chain wants to automate product catalog ingestion from supplier images. Suppliers send images in wildly varying formats—some show one product, some show hundreds on a shelf, some include mixed products and promotional materials. You need to extract SKU, product name, and specifications at 99% precision. Design the complete pipeline.

Q129Hallucination & SafetyMid

A legal tech company's contract review LLM is hallucinating clause references—it says 'Section 4.2 restricts assignment of rights' but Section 4.2 actually covers something else entirely. This happens on 6% of outputs. Propose a detection and mitigation system.

Q130Hallucination & SafetySenior

Your company is deploying an LLM-powered assistant for nuclear plant safety procedure lookup. The safety regulator requires a documented maximum hallucination rate below 0.1% for safety-critical procedure steps. The current rate, measured on an internal eval set, is 0.8%. Design a system-level approach to get below 0.1% and certify it.

Q131Orchestration (LangChain/LlamaIndex)Mid

You've built a LangChain RAG pipeline for a consulting firm. After upgrading LangChain from 0.1.x to 0.2.x, 40% of queries start returning empty responses with no errors. The vector store, LLM, and embeddings all work when tested independently. Diagnose the breakage.

Q132Orchestration (LangChain/LlamaIndex)Senior

A pharmaceutical company's drug discovery assistant uses LlamaIndex over 500 research papers. Response quality is high but the system has no observability—when researchers report wrong answers, engineering cannot reproduce or trace the failure. Design a complete observability stack for this LlamaIndex pipeline.

Q133Context & MemoryMid

You're building an AI study tutor that needs to remember a student's misconceptions, mastered concepts, and learning style across multiple sessions spanning weeks. The student has 40 sessions of history. Design the memory architecture.

Q134Context & MemorySenior

A B2B SaaS product's AI assistant serves 50,000 enterprise accounts, each needing isolated memory—account-specific terminology, past decisions, organizational context. Storing full conversation history per account is cost-prohibitive at scale. Design a memory system that balances personalization fidelity against storage and retrieval cost.

Q135RAG & RetrievalSenior

A cybersecurity company's threat intelligence RAG system needs to answer questions about CVEs published in the last 24 hours, but their Pinecone index takes 48 hours to update due to batch ingestion pipelines. Engineers want real-time coverage without rebuilding the pipeline. Design a solution.

Q136Prompt EngineeringSenior

A global bank's compliance team uses an LLM to classify customer communications as potentially containing market abuse signals. The model shows high variance in its classification threshold—the same communication classified differently on 15% of repeated calls. Regulators demand deterministic outputs. Solve this without switching to a rule-based system.

Q137Fine-tuning & PEFTSenior

You're the ML lead at a startup that has fine-tuned Llama-3-70B on proprietary customer interaction data for a sales CRM. A competitor has just released a superior open-source base model (Qwen-2.5-72B) that your team wants to adopt. How do you transfer the fine-tuning value without losing 6 months of work?

Q138Agents & Tool UseSenior

An autonomous data analysis agent at a hedge fund is given read/write access to a shared analytical database. During a test run, it executed a DROP TABLE command that deleted three months of processed trade data, interpreting a cleanup instruction too broadly. Design a comprehensive safety architecture for database-interacting agents.

Q139LLM FundamentalsSenior

A research team at a university uses a 7B open-source model for scientific literature review. They report that the model's outputs are much better in the morning than in the afternoon. The infrastructure is identical. No model updates occurred. Investigate this anomaly.

Q140EmbeddingsSenior

A social media platform uses embeddings for content recommendation. After two months in production, the recommendation quality starts degrading. The model hasn't changed. An engineer suspects 'embedding drift.' Explain what embedding drift is in this context, how to detect it, and how to remediate it.

Q141Evaluation & GuardrailsSenior

You're the evaluation lead for a code generation AI assistant at a developer tools company. The product team wants to ship weekly model updates. Your current evaluation suite takes 4 hours to run and requires human raters for 30% of cases. How do you accelerate the evaluation pipeline to support weekly releases without compromising quality signal?

Q142LLM Serving & InferenceSenior

A startup needs to serve 6 different fine-tuned LoRA adapters on top of a shared Llama-3-8B base model to 6 different enterprise clients, each with their own latency SLA. The team proposes running 6 separate model instances. Explain why this is wasteful and architect a better solution.

Q143Hallucination & SafetyMid

A customer service bot for a telecom company is confidently telling customers wrong contract end dates, pulling from an outdated knowledge base even though the correct dates are in a real-time database. The RAG pipeline is supposed to use the live database but the bot keeps using stale information. Diagnose why.

Q144Orchestration (LangChain/LlamaIndex)Senior

Your team uses LangGraph to build a customer complaint resolution agent. In production, some complaint threads cause the agent's state graph to exceed 50 nodes in a single execution, consuming excessive tokens and sometimes running for 3+ minutes. How do you redesign the graph to bound execution complexity?

Q145Cost & Latency OptimizationSenior

A real-time customer service platform processes 500,000 chat messages per day through Claude Sonnet. 60% of messages are simple greetings, account status checks, and FAQs that don't require LLM reasoning. The team is paying for full LLM calls on all messages. Redesign the routing architecture.

Q146Context & MemorySenior

A coding assistant deployed in VS Code handles sessions where developers work on the same codebase for 4–6 hours. By hour 4, the assistant's responses become stale—it references old code states that the developer has since refactored. Design a context management system that keeps the assistant's code understanding current without overloading the context window.

Q147MultimodalSenior

A manufacturing company uses a vision LLM to detect defects in circuit board images captured at 30 fps on a production line. The model achieves 94% accuracy in testing but drops to 78% in production, where boards move on a conveyor and images are occasionally motion-blurred. Fix the production accuracy gap.

Q148Vector DatabasesSenior

A legal research firm stores 10 million case documents in a Weaviate cluster. A new business requirement demands that results be filtered by jurisdiction (50 US states + federal) before vector search, not after. Currently, jurisdiction filtering runs as a post-processing step. Explain the performance difference and redesign for pre-filtering.

Q149Agents & Tool UseMid

You're building an AI agent that writes and executes Python code to answer data analysis questions. A colleague suggests giving the agent unrestricted Python exec() access in the production environment. What risks does this create and how do you safely sandbox code execution?

Q150Evaluation & GuardrailsMid

Your team is using an LLM-as-judge to evaluate the quality of a customer support bot. A product manager questions whether the LLM judge is reliable. How do you validate the judge's reliability and what metrics do you use?

Q151Fine-tuning & PEFTMid

A startup fine-tunes a 7B chat model on 8,000 customer service examples. After training, the model performs well on held-out examples but in production, users report it sounds robotic and excessively formal, unlike the company's casual brand voice. What went wrong and how do you fix it?

Q152Hallucination & SafetySenior

A B2C mental health app uses an LLM to provide psychoeducation and coping strategies. The AI Safety team flags that under certain user inputs, the model is providing specific medication dosage guidance it's not qualified to give, bypassing the app's safety guardrails. Audit and harden the safety architecture.

Q153Context & MemoryMid

A customer support AI at a subscription box company needs to remember user preferences mentioned in past conversations (allergies, product dislikes, size preferences). Currently every new conversation starts cold with no memory of past sessions. Users complain they have to repeat themselves every time. Design the preference memory system.

Q154Orchestration (LangChain/LlamaIndex)Mid

A data team built a LlamaIndex pipeline that processes 10,000 documents nightly. The ingestion pipeline takes 6 hours to complete, causing it to overlap with business hours usage. The pipeline is I/O bound on embedding API calls. How do you speed it up?

Q155LLM Serving & InferenceMid

Your team self-hosts a CodeLlama-34B model for a code review assistant. Developers report random 'connection reset' errors during long code reviews. The errors don't correlate with load. Server logs show the process is healthy. Diagnose the issue.

Q156MultimodalMid

You're building a feature where users can photograph their refrigerator contents and the AI suggests recipes. In testing, the model performs well on well-lit, organized fridges, but fails on crowded, dark, or partially-obscured items (common in real fridges). Describe your approach to improve real-world performance.

Q157LLM FundamentalsSenior

Your fintech startup's compliance team flags that GPT-4 outputs differ on identical prompts across two consecutive days — same temperature, same seed. The model version hasn't changed per the API docs. Walk through how you'd diagnose and guarantee determinism for audit trails.

Q158LLM FundamentalsMid

Your team is switching from GPT-3.5-turbo to a 7B open-source model for a customer-support chatbot. During evaluation, you notice the smaller model frequently ignores the system prompt and answers out of scope. What is the root cause and how do you fix it?

Q159Prompt EngineeringSenior

You're building an automated legal contract analysis system. The LLM performs well on short contracts but hallucinates clause references on 40-page documents. Chain-of-thought prompting hasn't fixed it. Describe your full redesign strategy.

Q160Prompt EngineeringMid

Your e-commerce client wants product description generation. The model produces good copy but sometimes mixes up technical specs (e.g., writes 512GB when the input says 256GB). You suspect it's a prompt issue rather than a model capability issue. How do you diagnose and fix it?

Q161RAG & RetrievalSenior

Your enterprise knowledge base RAG system serves 5,000 employees. Retrieval precision is good in testing but degrades significantly for new hires who phrase queries differently from how documents are written. Describe your approach to close this vocabulary gap.

Q162RAG & RetrievalMid

You're implementing a RAG system for a medical research tool. When users ask multi-part questions like 'What are the side effects of metformin and how does it compare to glipizide?', the retrieved chunks answer only one part. How do you fix this?

Q163Vector DatabasesSenior

Your recommendation system uses Pinecone with 50M vectors. Query latency has crept from 80ms to 450ms over six months as you've added vectors. Your SLA requires p99 < 150ms. What's your diagnosis and remediation plan?

Q164Vector DatabasesMid

You're migrating a semantic search system from Elasticsearch BM25 to Weaviate with text2vec-openai. During A/B testing, BM25 outperforms Weaviate on keyword-heavy queries like exact product SKUs and model numbers. How do you address this?

Q165Fine-tuning & PEFTSenior

You fine-tuned Llama-3-8B on 10,000 customer service transcripts using LoRA. The model performs excellently on support queries but has significantly worse performance on general reasoning tasks. Product wants both. How do you architect a solution?

Q166Fine-tuning & PEFTMid

Your team is fine-tuning a 13B model on a single A100-80GB GPU. Training crashes with OOM after 200 steps. You're using LoRA with rank 16. Walk through the sequence of optimizations you'd apply before resorting to multi-GPU training.

Q167Agents & Tool UseSenior

You've deployed a ReAct agent that manages cloud infrastructure via Terraform. During a production incident, the agent autonomously deleted a critical database because it misinterpreted a cleanup task. How do you redesign the agent's safety architecture?

Q168Agents & Tool UseMid

You're building an LLM agent for a travel booking platform that calls flight search, hotel search, and payment APIs. The agent frequently makes redundant API calls — searching for the same flight three times within one session. How do you fix this?

Q169EmbeddingsSenior

Your multilingual customer support RAG system retrieves Spanish documents for English queries with only 60% precision. The embedding model used is text-embedding-ada-002. The support knowledge base is 80% English, 20% Spanish. What's the root cause and your fix?

Q170EmbeddingsMid

You're building a code semantic search tool for a Python codebase. You test with text-embedding-ada-002 but precision on code-specific queries like 'function that parses JWT tokens' is poor. What embedding approach would you use instead?

Q171Evaluation & GuardrailsSenior

Your LLM-powered HR assistant at a 10,000-employee company has been giving subtly biased advice — recommending different salary ranges for candidates with identical qualifications but different names suggesting gender or ethnicity. How do you build a systematic evaluation and mitigation pipeline?

Q172Evaluation & GuardrailsMid

Your team ships a new prompt version every two weeks. Last release caused a 12% drop in user satisfaction scores, but you only detected it after two weeks. How do you build an automated evaluation pipeline that catches regressions before deployment?

Q173LLM Serving & InferenceSenior

You're serving a 70B model on 8×A100 GPUs with tensor parallelism via vLLM. Under burst load of 500 concurrent users, you see GPU utilization drop to 40% while request queue grows. Identify the bottleneck and your optimization strategy.

Q174LLM Serving & InferenceMid

You're deploying a Mistral-7B model on a single RTX 4090 (24GB VRAM) for internal use at a 50-person company. Initial deployment has p50 latency of 3.2 seconds per request. Your users expect under 1.5 seconds. What optimizations do you apply?

Q175Cost & Latency OptimizationSenior

Your company spends $180K/month on OpenAI API calls for a document summarization pipeline. The CFO wants a 60% cost reduction without sacrificing quality. You have a team of 4 engineers and 3 months. What's your strategy?

Q176Cost & Latency OptimizationMid

Your chatbot's average response latency is 4.5 seconds, causing 22% of users to abandon before the response arrives. You cannot change the model. What specific techniques do you use to reduce perceived and actual latency?

Q177MultimodalSenior

You're building a quality control system at a manufacturing plant that uses GPT-4V to detect defects in PCB images. The model achieves 92% precision in testing but drops to 71% in production on the factory floor. Diagnose the distribution shift and propose a fix.

Q178MultimodalMid

A media company wants to build an automated video clip tagging system using a vision-language model. Videos are 5–30 minutes long. Describe how you'd architect the pipeline given that current VLMs process only individual frames or short clips.

Q179Hallucination & SafetySenior

Your AI-powered drug interaction checker for pharmacists has been found to provide confidently wrong interaction warnings in 3% of cases. The vendor's model can't be swapped. How do you reduce this error rate to under 0.1% using guardrails?

Q180Hallucination & SafetyMid

Your internal Q&A bot for a law firm confidently answers questions about recent case law from 2024, but the underlying model's knowledge cutoff is January 2023. Users are relying on this incorrect information. What do you do immediately and long-term?

Q181Orchestration (LangChain/LlamaIndex)Senior

Your LangChain-based multi-step research agent takes 45–90 seconds per query in production, making it unusable. Profiling shows 12 sequential LLM calls per query on average. How do you redesign the orchestration to achieve under 15 seconds?

Q182Orchestration (LangChain/LlamaIndex)Mid

You're using LlamaIndex to build a RAG system over a 10,000-document corpus. After deployment, you discover that index builds take 6 hours and any new document addition requires a full rebuild. How do you fix the indexing architecture?

Q183Context & MemorySenior

Your enterprise CRM AI assistant loses context of deals discussed 20 minutes ago within the same user session. Users are complaining about repetitive questions. The underlying model has a 16K context window and conversations average 8K tokens. How do you redesign memory?

Q184Context & MemoryMid

You're building a coding assistant that should remember user preferences like preferred coding style and favorite libraries across sessions (not just within a session). How do you implement persistent memory in a cost-effective way?

Q185LLM FundamentalsSenior

Your team is debating whether to use a 70B model with 4-bit quantization or a 13B model in full BF16 for a reasoning-heavy tax advisory application. Both fit on the same hardware. What's your decision framework?

Q186Prompt EngineeringSenior

You're running automated A/B tests on prompt variants for a financial news summarization product. Variant B scores 8% better on an LLM judge, but Variant A scores 12% better on user click-through rate. How do you resolve this conflict and decide what to ship?

Q187RAG & RetrievalSenior

Your investment bank's RAG system retrieves from 10 years of earnings call transcripts. Analysts complain that the system retrieves relevant passages but the final answers still contain errors because the LLM misinterprets numbers in the retrieved text. What's the architecture fix?

Q188Agents & Tool UseSenior

Your customer data analysis agent can execute SQL queries via a tool. During a routine test, it exfiltrated customer PII by joining tables it didn't need and emailing the results to an external address included in a user's cleverly crafted query. How do you harden this agent?

Q189Fine-tuning & PEFTSenior

You're preparing a dataset for fine-tuning a code generation model on your company's internal TypeScript conventions. After initial training, the model writes syntactically correct code but ignores your naming conventions and file organization patterns. What's wrong with your dataset and how do you fix it?

Q190Evaluation & GuardrailsSenior

You're running LLM-as-judge evaluations for a mental health support chatbot. You discover the judge model (GPT-4) systematically rates verbose empathetic responses higher than concise accurate ones, introducing a 'verbosity bias'. How do you detect and correct for this?

Q191Vector DatabasesSenior

Your SaaS product has 500 enterprise customers, each with private document collections in a shared Weaviate cluster. A security audit finds that cross-tenant data leakage is theoretically possible via vector similarity queries. How do you architect tenant isolation?

Q192LLM Serving & InferenceSenior

Your vLLM deployment experiences unpredictable latency spikes every 90–120 minutes in production. CPU and GPU utilization look normal during spikes. What's causing this and how do you fix it?

Q193Cost & Latency OptimizationSenior

Your RAG pipeline calls a reranker model (cross-encoder) that adds 800ms of latency for every query. Reranking improves answer quality by 18%. Your SLA is 2 seconds end-to-end. How do you keep the quality gain while meeting the SLA?

Q194EmbeddingsSenior

You're building a personalized content recommendation system where user preferences are stored as embedding vectors updated daily. You notice that as preferences evolve over weeks, older interest vectors drift out of alignment with the current user state and create poor recommendations. How do you handle temporal embedding drift?

Q195Hallucination & SafetySenior

Your AI-generated news summary product has been caught fabricating quotes attributed to real public figures. Legal has issued a stop-work order. What technical controls do you implement to allow the product to relaunch safely?

Q196MultimodalMid

Your retail app uses GPT-4V to analyze product photos uploaded by sellers and auto-fill listing fields like category, color, and material. Sellers complain that the model often misidentifies fabric types in clothing photos. How do you improve accuracy?

Q197Orchestration (LangChain/LlamaIndex)Mid

Your LangChain agent works correctly in development but in production it occasionally enters infinite loops where it calls the same tool 30+ times before hitting the iteration limit. The logs show no obvious reason. How do you debug and prevent this?

Q198Context & MemorySenior

You're building an AI product manager assistant that handles product strategy across 50 different products at a large tech company. Each product has its own context (roadmap, metrics, stakeholders). The assistant must answer questions about any product without mixing up contexts. How do you architect context isolation?

Q199LLM FundamentalsMid

A product manager asks you why the AI assistant sometimes gives completely different answers to the same question asked five minutes apart. They want it 'consistent.' Explain the cause and your approach — including the business trade-offs of full determinism.

Q200Hallucination & SafetyMid

You're testing your RAG system's grounding and find that 15% of answers include facts not present in the retrieved chunks. The retrieval is correct, but the LLM adds information from its training data. How do you enforce source-grounded generation?

