HikeCatalyst

DevOps / SRE / Platform Engineer Interview Questions

200 scenario-based questions with detailed model answers, organized by skill and tool. Filter by topic, level, or keyword, reveal the answer, then pressure-test yourself in a real mock interview.

Q001 · Kubernetes · Senior

Your e-commerce platform's checkout service pods are OOMKilled repeatedly during Black Friday traffic spikes. The requests spike 10x over baseline in 90 seconds. The team wants to just raise memory limits. What is your actual diagnostic and remediation plan?

Q002 · CI/CD Pipelines · Mid

Your GitHub Actions pipeline for a Node.js monorepo with 12 services takes 28 minutes end-to-end. Developers are skipping PR checks and merging directly to main. How do you reduce pipeline time while keeping quality gates?

Q003 · Terraform & IaC · Senior

A colleague ran terraform apply manually from their laptop against production and accidentally deleted an RDS subnet group, causing a 40-minute database outage. The state file is now out of sync. Walk through your immediate recovery and the policy changes you implement to prevent recurrence.

Q004 · Observability (Metrics/Logs/Traces) · Senior

Your distributed payment service has p99 latency of 800ms but p50 is 45ms. Users report intermittent failures, but your error rate dashboard shows under 0.1%. Metrics and logs seem clean. How do you find the root cause?

Q005 · Incident Response & Postmortems · Mid

You are the on-call engineer. At 2:47 AM you get paged: the API gateway is returning 503s for 60% of requests to a fintech app processing overnight batch transactions. You have 5 minutes before the SLA breach window opens. Walk through your first 10 minutes.

Q006 · SLO/SLI & Error Budgets · Senior

Your team has a 99.9% availability SLO for a B2B SaaS product. In the past 30 days you have consumed 87% of your error budget. Engineering wants to ship a major database migration next week. How do you handle the conversation and the decision?
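
The error-budget arithmetic behind this scenario can be sketched as follows. This is a minimal illustration that assumes a time-based availability SLI over a 30-day window (the question does not say whether availability is measured by time or by requests). The function names are hypothetical.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Total allowed downtime, in minutes, for an availability SLO over a window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def remaining_budget_minutes(slo: float, window_days: int, consumed_fraction: float) -> float:
    """Downtime still available after a fraction of the budget has been spent."""
    return error_budget_minutes(slo, window_days) * (1.0 - consumed_fraction)


# Numbers from the scenario: 99.9% SLO, 30-day window, 87% of the budget consumed.
total = error_budget_minutes(0.999, 30)
left = remaining_budget_minutes(0.999, 30, 0.87)
print(f"total budget: {total:.1f} min, remaining: {left:.1f} min")
# prints "total budget: 43.2 min, remaining: 5.6 min"
```

Under these assumptions, the migration would have to fit inside roughly five and a half minutes of remaining downtime, which is a useful number to bring to the conversation.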

Q007 · Docker & Containers · Mid

A data science team's Docker image for a model-serving API is 8.4 GB and takes 18 minutes to push and pull in CI. The image contains conda, CUDA drivers, and a PyTorch model checkpoint. How do you reduce image size and pipeline time?

Q008 · Cloud (AWS/GCP/Azure) · Senior

Your AWS workload runs entirely in us-east-1. A major us-east-1 partial outage takes down your RDS primary and three of six AZs for 90 minutes. Post-incident, leadership asks for a multi-region DR plan with an RTO of 15 minutes and an RPO of 60 seconds. How do you architect this?

Q009 · Networking · Mid

A microservice deployed in Kubernetes cannot reach an external payment API on port 443. The service works in staging but not production. Both clusters use the same base AMI. How do you systematically debug this connectivity issue?

Q010 · Secrets & Security · Senior

During a security audit, you discover that your Kubernetes cluster has 40+ Deployments with database passwords stored as base64-encoded Kubernetes Secrets, all readable by any developer with kubectl get secret rights. Design a migration to a proper secrets management system without downtime.

Q011 · GitOps (ArgoCD/Flux) · Mid

Your team adopts ArgoCD for the first time. A developer pushes a broken Helm chart to main and ArgoCD auto-syncs it to production, causing a 20-minute outage. The team wants to disable ArgoCD. How do you defend and fix the process?

Q012 · Capacity & Autoscaling · Senior

A media streaming platform serving 2 million concurrent viewers uses Kubernetes HPA on CPU. During a major sports event, the stream quality degrades before HPA triggers new pods. Pods are at 60% CPU but the service is visibly struggling. Diagnose and redesign the scaling strategy.

Q013 · On-call & Alerting · Mid

Your team receives 120 PagerDuty alerts per week and engineers are showing burnout symptoms. Alert fatigue is causing slow response times. The VP of Engineering asks you to cut alerts by 60% without reducing incident detection. How do you approach this?

Q014 · Reliability & Chaos · Senior

Your organization wants to introduce chaos engineering to a payments platform processing $2M/day. Leadership is nervous about running experiments in production. How do you design and pitch a chaos engineering program that gains their confidence?

Q015 · Platform/IDP & Dev Experience · Senior

A 200-engineer organization has 15 platform engineers supporting 30 product teams. Each product team follows different deployment patterns, has its own Dockerfile conventions, and opens infra tickets for every new service. Design an Internal Developer Platform to reduce platform team toil and improve developer autonomy.

Q016 · Cost Optimization · Mid

Your AWS bill jumped from $180k to $310k month-over-month. No major new features shipped. The engineering team has no idea where the $130k increase came from. How do you investigate and identify the root cause within 48 hours?

Q017 · Kubernetes · Mid

A developer reports that a new Deployment is stuck in Pending state. The cluster has plenty of node capacity shown in kubectl get nodes. No error is visible in the pod description's Events field. How do you debug this?

Q018 · CI/CD Pipelines · Senior

Your organization has 80 microservices, each with its own Jenkins pipeline. The platform team spends 30% of their time maintaining pipeline configurations across repos. You are asked to design a pipeline-as-a-platform strategy. What do you propose?

Q019 · Terraform & IaC · Mid

You inherit a Terraform codebase with all AWS infrastructure in a single state file and a single main.tf that is 4,000 lines long. The team is scared to make changes because plan output is unpredictable. How do you refactor it safely?

Q020 · Observability (Metrics/Logs/Traces) · Mid

A retail platform's Prometheus instance is consuming 400 GB of memory and crashing weekly. You have 2,000 exporters and 15 million active time series. How do you reduce cardinality and stabilize the system without losing critical visibility?

Q021 · Incident Response & Postmortems · Senior

After a 3-hour production outage affecting 500,000 users of a healthcare SaaS platform, your postmortem reveals 5 contributing factors and no single root cause. The CTO asks you to present the findings and action items to the board. How do you structure this?

Q022 · SLO/SLI & Error Budgets · Mid

You are defining SLOs for a new internal developer platform. The platform serves 200 engineers and has no existing SLA. How do you choose meaningful SLIs and set appropriate SLO targets without historical data?

Q023 · Docker & Containers · Senior

Your security team flags that 60% of your production containers are running as root. The application teams say they cannot change this because their apps need root for port 80. You need to remediate this across 40 services in 6 weeks. How do you execute this?

Q024 · Cloud (AWS/GCP/Azure) · Mid

Your team is migrating a self-managed PostgreSQL cluster on EC2 to Amazon Aurora PostgreSQL. The database is 2 TB and serves an e-commerce platform with a 99.9% uptime SLA. The migration window is Saturday midnight to 4 AM. How do you plan and execute this?

Q025 · Networking · Senior

Your Kubernetes cluster uses Calico CNI. After scaling from 50 to 200 nodes, you observe intermittent packet drops between pods on different nodes. Node-to-node ping works. Only pod-to-pod cross-node traffic is affected. What do you investigate?

Q026 · Secrets & Security · Mid

A developer accidentally commits AWS access keys to a public GitHub repository. You are the on-call SRE and you get the alert from GitHub's secret scanning partner. The keys belong to a CI service account with broad S3 write access. What do you do in the first 15 minutes?

Q027 · GitOps (ArgoCD/Flux) · Senior

You are implementing GitOps for a multi-tenant SaaS platform where each customer has an isolated Kubernetes namespace. There are 300 tenants, growing at 20 per month. Manual namespace creation and configuration is a bottleneck. How do you automate tenant onboarding with GitOps?

Q028 · Capacity & Autoscaling · Mid

Your EKS cluster's Cluster Autoscaler is provisioning new nodes, but pods remain in Pending state for 8-12 minutes after the autoscaler runs. Developers are complaining that auto-scaling is not working. What is happening and how do you fix it?

Q029 · On-call & Alerting · Senior

You join a company where on-call rotation is shared across the full engineering team of 40 engineers and there is no runbook for any alert. A new engineer paged at 3 AM resolved an incident by rebooting a server, not knowing the actual cause. How do you build an on-call program that scales with the team?

Q030 · Reliability & Chaos · Mid

Your team uses Kubernetes and wants to verify that the application handles pod evictions gracefully during a node upgrade. You have 3 replicas of a stateless API service. How do you test this without impacting production users?

Q031 · Platform/IDP & Dev Experience · Mid

Developers at your company report that setting up a local development environment takes 2-3 days for a new hire. The setup involves 14 manual steps, installing 8 tools at specific versions, and configuring VPN and secrets. How do you reduce this to under 2 hours?

Q032 · Cost Optimization · Senior

Your GCP Kubernetes workloads run on n2-standard-16 instances exclusively. The platform team discovers that average CPU utilization across all nodes is 18% and average memory utilization is 22%. You are asked to cut the compute bill by 40% without impacting performance. How do you approach this?

Q033 · Kubernetes · Senior

Your company runs a multi-tenant Kubernetes cluster serving 50 teams. A noisy-neighbor incident causes one team's batch job to consume all available CPU on 30% of nodes, slowing response times for 15 other teams. How do you architect guardrails to prevent this from happening again?

Q034 · CI/CD Pipelines · Mid

Your team's deployment pipeline promotes from dev to staging to production automatically on every merge to main. After a bad deploy, the production rollback required 25 minutes because the pipeline had to re-run all stages. How do you redesign the pipeline to enable faster rollback?

Q035 · Terraform & IaC · Senior

Your organization uses multiple cloud providers: AWS for compute, GCP for data pipelines, and Cloudflare for DNS and CDN. The Terraform codebases for each are managed by separate teams with no shared standards. You are tasked with creating a unified IaC governance model. How do you approach this?

Q036 · Observability (Metrics/Logs/Traces) · Senior

You are building observability for a new event-driven microservices platform using Kafka. Traditional request-response tracing does not work well across async message boundaries. How do you implement distributed tracing across Kafka producers and consumers?

Q037 · Incident Response & Postmortems · Mid

You run a postmortem for a database failover that took 8 minutes instead of the expected 90 seconds. The postmortem identifies 3 action items. Six weeks later, none of the action items are done and the same failure mode recurs. How do you fix the postmortem process?

Q038 · SLO/SLI & Error Budgets · Senior

Your team maintains a real-time fraud detection service that must respond in under 200ms for 99.5% of requests. A partner team proposes a new ML model update that will increase average latency from 80ms to 130ms. How do you evaluate whether this update is safe to ship given your SLO?

Q039 · Docker & Containers · Mid

A Java Spring Boot service takes 90 seconds to start in Kubernetes, causing liveness probe failures and restart loops during rolling updates. The app has a 4 GB heap and loads configuration from a remote Config Server on startup. How do you fix this?

Q040 · Cloud (AWS/GCP/Azure) · Senior

Your Azure Kubernetes Service cluster's pods cannot pull images from Azure Container Registry during a regional Azure Active Directory outage. The outage lasts 2 hours. How do you redesign your image pull architecture to be resilient to identity provider failures?

Q041 · Networking · Mid

Your load balancer is routing traffic unevenly: one backend instance handles 70% of requests while two others handle 15% each. All three instances have equal capacity and are healthy. The load balancer algorithm is set to round-robin. Why is this happening and how do you fix it?

Q042 · Secrets & Security · Senior

Your platform uses HashiCorp Vault for secrets management. A Vault audit log review reveals that a Kubernetes service account is reading 400 secrets per minute continuously — far more than expected for the service it belongs to. You suspect a security incident but want to confirm before revoking access. How do you investigate?

Q043 · GitOps (ArgoCD/Flux) · Mid

Your team wants to use Flux to manage Kubernetes resources but several engineers are uncomfortable because they fear that direct kubectl edits during an incident will be overwritten by Flux. How do you address this concern and design the incident workflow?

Q044 · Capacity & Autoscaling · Senior

Your machine learning inference service runs on GPU nodes in Kubernetes. The models take 8 minutes to load into GPU memory. HPA scales the service, but new pods are useless for 8 minutes after creation. Users see slow response times during traffic spikes despite autoscaling. How do you solve this cold-start problem?

Q045 · On-call & Alerting · Mid

You are setting up monitoring for a new PostgreSQL database cluster serving a logistics platform. The database has no existing dashboards or alerts. What are the five most critical metrics you instrument first and what thresholds do you use?

Q046 · Reliability & Chaos · Senior

You are a principal SRE at a global airline booking platform. The platform has a 99.99% uptime SLA (about 52 minutes of downtime per year). Engineering wants to move from the current monthly release cadence to a weekly one. How do you evaluate the risk and provide a recommendation?
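
One way to frame the risk is to spread the annual downtime allowance across releases. The sketch below is a deliberately pessimistic simplification: it assumes all downtime is caused by releases and is shared evenly among them, neither of which the question states. The names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def annual_downtime_budget(sla: float) -> float:
    """Minutes of downtime per year permitted by an uptime SLA."""
    return (1.0 - sla) * MINUTES_PER_YEAR


def budget_per_release(sla: float, releases_per_year: int) -> float:
    """Downtime allowance per release if every release gets an equal share."""
    return annual_downtime_budget(sla) / releases_per_year


annual = annual_downtime_budget(0.9999)
monthly = budget_per_release(0.9999, 12)
weekly = budget_per_release(0.9999, 52)
print(f"annual: {annual:.2f} min, per monthly release: {monthly:.2f} min, "
      f"per weekly release: {weekly:.2f} min")
# prints "annual: 52.56 min, per monthly release: 4.38 min, per weekly release: 1.01 min"
```

The per-release allowance shrinks as cadence increases, but smaller and more frequent releases also tend to carry less change each, so this number is a starting point for the discussion rather than a verdict.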

Q047 · Platform/IDP & Dev Experience · Senior

Your platform team is responsible for a Kubernetes-based internal PaaS. Developers frequently deploy services with no resource limits set, no liveness probes, and no PodDisruptionBudgets. The team has tried documentation and training. How do you enforce production-readiness standards at the platform level?

Q048 · Cost Optimization · Mid

Your team runs 15 microservices on AWS ECS Fargate. The monthly Fargate bill is $45,000 but Datadog APM shows average CPU utilization at 8% and memory at 14% across all tasks. How do you right-size the tasks and estimate the savings?

Q049 · CI/CD Pipelines · Senior

Your organization's security team mandates that all container images must be signed and signature verification must occur before deployment to production. You have 60 Dockerfiles across 20 teams. Design an end-to-end supply chain security implementation.

Q050 · Cloud (AWS/GCP/Azure) · Mid

You need to design a secure, scalable architecture for a fintech startup that processes financial transactions on AWS. The system must comply with PCI DSS requirements. The engineering team has limited security expertise. What is your proposed architecture?

Q051 · Kubernetes · Senior

Your e-commerce platform's checkout service is experiencing 30-second pod startup times during Black Friday traffic spikes, causing failed readiness checks and cascading failures. The cluster is on GKE with n2-standard-8 nodes. Walk through your diagnosis and remediation strategy.

Q052 · CI/CD Pipelines · Mid

Your team's GitHub Actions pipeline takes 45 minutes end-to-end for a Java microservice. Product is complaining the slow feedback loop blocks iteration. The pipeline runs unit tests, integration tests against a real Postgres, Docker build, and pushes to ECR. How do you cut it to under 10 minutes?

Q053 · Terraform & IaC · Senior

A team at a fintech has been making manual changes to production AWS infrastructure for six months. You've been asked to bring it under Terraform management without downtime. The stack includes VPCs, RDS clusters, EKS, and IAM roles across three accounts. What's your migration approach?

Q054 · Observability (Metrics/Logs/Traces) · Senior

Your distributed payments service spans 12 microservices across AWS regions. A customer reports intermittent payment failures affecting roughly 2% of transactions, but your error rate dashboard shows no spike. How do you instrument and diagnose this gap?

Q055 · Incident Response & Postmortems · Mid

You are on-call when PagerDuty fires at 2 AM: the API gateway for a healthcare SaaS product is returning 503s for all endpoints. You have access to Datadog, AWS Console, and kubectl. Walk through your first 15 minutes.

Q056 · SLO/SLI & Error Budgets · Senior

Your team's payment processing service has a 99.95% availability SLO for the quarter. Six weeks in, a 4-hour partial outage has consumed 60% of the error budget. Engineering wants to ship a major database migration next week. How do you make this decision?

Q057 · Docker & Containers · Mid

A Node.js microservice Docker image has grown to 1.8 GB over six months of feature additions. Deploy times have increased 4x and the image scanner reports 47 CVEs, mostly in base OS packages. How do you remediate both issues in a single refactor?

Q058 · Cloud (AWS/GCP/Azure) · Senior

A media streaming startup is migrating from a single AWS region (us-east-1) to active-active multi-region across us-east-1 and eu-west-1. The platform uses DynamoDB, S3, CloudFront, and ECS Fargate. What are the three hardest technical problems and how do you solve each?

Q059 · Networking · Mid

Developers are reporting intermittent connection timeouts between two EKS microservices in the same VPC. The services are in different namespaces. Ping works fine but HTTP calls fail roughly 5% of the time with a 'connection reset by peer' error. What do you investigate?

Q060 · Secrets & Security · Senior

During a security audit, you discover that a 50-engineer startup has database passwords and API keys committed as plaintext in Kubernetes ConfigMaps across 30 microservices, stored in a GitHub repo with 200+ commits of history. What's your remediation plan and how do you prevent recurrence?

Q061 · GitOps (ArgoCD/Flux) · Mid

Your team just adopted ArgoCD for a Kubernetes platform. Under pressure during an incident, a developer applied a hotfix directly to production with kubectl apply. Now ArgoCD shows the app as OutOfSync. How do you handle the immediate sync decision, and what process do you implement to prevent this pattern?

Q062 · Capacity & Autoscaling · Senior

A logistics SaaS processes daily batch jobs that cause CPU spikes to 100% on its EKS cluster every morning between 6 and 8 AM, causing delays for real-time order management requests. Node autoscaling takes 3-4 minutes to provision new nodes. How do you architect a solution that eliminates the contention?

Q063 · On-call & Alerting · Mid

Your team receives 200+ PagerDuty alerts per week, but post-analysis shows 80% are either noisy signals that resolve themselves or alerts for non-customer-impacting issues. Engineers are burning out. How do you restructure the alerting philosophy and tooling?

Q064 · Reliability & Chaos · Senior

You're introducing chaos engineering to a fintech platform that has never run fault injection in production. Legal and security teams are resistant, citing compliance (PCI-DSS) and customer data risk. How do you design a chaos program that satisfies these constraints while still generating useful reliability signal?

Q065 · Platform/IDP & Dev Experience · Senior

Engineering leadership wants to build an Internal Developer Platform to reduce the time from 'code merged' to 'running in staging' from 45 minutes to under 5 minutes for 200 developers across 80 microservices. What's your architecture and rollout strategy?

Q066 · Cost Optimization · Mid

Your AWS bill jumped 40% month-over-month without any known traffic increase. The bill is $180,000/month, mostly EC2 and RDS. You have Cost Explorer access. Walk through your investigation and the top three likely causes.

Q067 · Kubernetes · Mid

A Kubernetes Deployment has 10 replicas but metrics show only 3-4 pods are receiving traffic. The others show as Running and Ready. The service is a ClusterIP backed by an Nginx Ingress. What's causing the uneven distribution and how do you fix it?

Q068 · Terraform & IaC · Mid

A junior engineer ran terraform destroy in the wrong workspace and deleted a production RDS database with 3 years of customer data. Backups exist in AWS. Walk through the recovery process and the controls you implement to prevent this from happening again.

Q069 · Observability (Metrics/Logs/Traces) · Mid

Your team is migrating from Datadog to a self-hosted Grafana + Prometheus + Loki + Tempo stack for cost reasons. The migration must not break on-call alerting for 14 teams during the transition. How do you run the migration without a visibility gap?

Q070 · Incident Response & Postmortems · Senior

A postmortem for a 2-hour outage at your B2B SaaS company reveals the root cause was a correct change — a database index was added to fix a slow query — but the index build locked the table and blocked writes. Three similar incidents have occurred in the past year. How do you prevent the fourth?

Q071 · GitOps (ArgoCD/Flux) · Senior

Your organization runs 5 Kubernetes clusters (dev, staging, prod-us, prod-eu, prod-apac) all managed by ArgoCD. A single ArgoCD instance manages all five. You're being asked to evaluate whether to keep the hub-and-spoke model or move to per-cluster ArgoCD instances. What are the trade-offs and what's your recommendation?

Q072 · Docker & Containers · Senior

A security team audit flags that 12 microservices in your Kubernetes cluster run containers as root (uid 0), and three of them mount the Docker socket. This is a production fintech environment. Prioritize the remediation work and explain the actual risk of each finding.

Q073 · Cloud (AWS/GCP/Azure) · Mid

Your team manages a multi-account AWS setup using Control Tower. A new product team needs a standalone AWS account for a PCI-DSS payment component, isolated from the main organization. They need it provisioned with baseline security guardrails in 48 hours. How do you do this?

Q074 · Networking · Senior

You're designing the network architecture for a multi-tenant SaaS platform on AWS where tenants must be cryptographically isolated from each other. Each tenant can generate up to 10 Gbps burst traffic. You have 200 tenants and the platform team has 4 engineers. What's your architecture?

Q075 · Secrets & Security · Mid

A developer reports that when they run kubectl logs on any pod in the cluster, they can see other teams' application logs including database connection strings logged at startup. What's the immediate mitigation and the longer-term fix?

Q076 · Capacity & Autoscaling · Mid

Your HPA-scaled application is thrashing — scaling up to 40 pods, then down to 8 pods, up to 35 pods, every 5-10 minutes, causing rolling restarts and connection drops. CPU utilization oscillates between 85% and 15%. How do you stabilize this?
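
The oscillation is easier to reason about with the HPA's core calculation in hand. Kubernetes documents it as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), skipped when the ratio is within a tolerance (0.1 by default). The sketch below is a simplified model only: the 50% target utilization is an assumption (the question does not give one), and stabilization windows, pod readiness, and min/max replica bounds are ignored.

```python
import math


def desired_replicas(current_replicas: int, current_util: float,
                     target_util: float, tolerance: float = 0.1) -> int:
    """Simplified HPA replica calculation for a single utilization metric."""
    ratio = current_util / target_util
    # The HPA skips scaling when the ratio is close enough to 1.0.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * current_util / target_util)


# Assumed target of 50% CPU utilization (hypothetical, not from the scenario).
print(desired_replicas(8, 85, 50))   # 8 pods at 85% CPU   -> prints 14
print(desired_replicas(40, 15, 50))  # 40 pods at 15% CPU  -> prints 12
print(desired_replicas(10, 52, 50))  # within tolerance    -> prints 10
```

Each recommendation is proportional to how far the metric sits from the target, so a metric that swings between 85% and 15% produces large corrections in both directions unless something damps them.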

Q077 · Reliability & Chaos · Mid

Your team wants to run a GameDay for a retail e-commerce platform before the holiday season. You have one week to design and execute it. The hypothesis is that the checkout service can tolerate a database primary failover without losing transactions. How do you design the exercise?

Q078 · Platform/IDP & Dev Experience · Mid

Developers on your platform are spending an average of 2 hours per sprint dealing with flaky CI tests that fail intermittently, requiring re-runs. The test suite has 3,000 tests across 20 services. How do you systematically address flakiness without just adding retries?

Q079 · Kubernetes · Senior

Your organization runs a multi-tenant Kubernetes cluster shared by 8 product teams. One team's batch job consumed all available CPU on a node, causing memory pressure and OOMKills on neighboring pods from other teams. How do you implement fair, enforceable resource isolation?

Q080 · CI/CD Pipelines · Senior

Your company is shifting to a trunk-based development model with feature flags, replacing long-lived feature branches. Currently the CI/CD pipeline is designed around branches with separate environments per branch. How do you redesign the pipeline and what's the impact on your environment strategy?

Q081 · Observability (Metrics/Logs/Traces) · Senior

You're implementing distributed tracing for a microservices platform that handles HIPAA-regulated patient data. Your engineers want full request traces including request bodies for debugging. Security and compliance require no PHI in trace data. How do you implement tracing that satisfies both?

Q082 · SLO/SLI & Error Budgets · Mid

Your team is defining SLOs for a new search service that has no historical data. Product wants a 99.99% availability SLO. Engineering pushes back saying 99.9% is more realistic. How do you resolve this and what's your process for setting the initial SLO?

Q083 · Terraform & IaC · Senior

Your Terraform codebase has grown to 15,000 lines across 200 resource blocks in a single state file. Plan and apply operations take 8+ minutes due to AWS API rate limiting. A team of 6 engineers frequently gets state lock conflicts. How do you restructure this?

Q084 · On-call & Alerting · Senior

You're the SRE lead and the VP of Engineering is asking you to justify the team's 24/7 on-call rotation to the board. Last quarter you had 14 incidents, MTTD averaged 8 minutes, MTTR averaged 47 minutes. A board member asks why MTTR is so high and whether automation could replace on-call humans. Present your assessment.

Q085 · Cost Optimization · Senior

A data platform team on GCP is running BigQuery queries that cost $45,000 per month. Analytics engineers run ad-hoc queries without cost visibility and there's no governance. You need to cut costs by 50% within 90 days without significantly impacting analyst productivity. How?

Q086 · Kubernetes · Mid

A team wants to run a stateful database (PostgreSQL) in Kubernetes using a StatefulSet. Your organization has historically used managed RDS for all databases. What are the arguments for and against, and what would make you approve or reject the proposal?

Q087 · Cloud (AWS/GCP/Azure) · Senior

Your company's Azure landing zone was designed for 10 subscriptions but has grown to 85 subscriptions organically. Policy compliance across subscriptions is inconsistent, tagging is absent, and the security team has no visibility. You have 3 months to bring order. What's your plan?

Q088 · GitOps (ArgoCD/Flux) · Mid

Your ArgoCD ApplicationSet is configured to deploy a microservice to 12 clusters using the cluster generator. A bad Helm chart version was pushed to the Git repo and ArgoCD started syncing it to all 12 clusters simultaneously. How do you limit the blast radius and recover?

Q089 · Networking · Mid

After migrating from HAProxy to Istio service mesh for internal microservice traffic, developers report that some services now experience 3-5ms additional latency per request compared to pre-Istio benchmarks. Product says this is unacceptable for a real-time recommendations service. How do you approach this?

Q090 · Incident Response & Postmortems · Senior

Six months after a major outage, your postmortem action items still haven't been completed. Engineering is prioritizing features over reliability work. You are the SRE lead presenting to the VP of Engineering. How do you make the case and what mechanism do you propose?

Q091 · Docker & Containers · Mid

Your team runs Docker Compose in development and Kubernetes in production. Developers frequently report 'it works on my machine but not in staging.' You've identified that the Docker Compose setup diverges significantly from the Kubernetes manifests. How do you close the gap?

Q092 · SLO/SLI & Error Budgets · Senior

You manage SLOs for a platform with 40 microservices. Currently each service has its own SLO dashboard but no aggregated user journey view. A major customer reports a bad experience that wasn't caught by any individual service SLO. How do you implement user journey SLOs?

Q093 · Capacity & Autoscaling · Senior

A gaming company's backend experiences 50x traffic spikes at game launch events lasting 2-4 hours, then returns to baseline. Reserved Instances cover baseline but the spike costs $80,000 per launch in on-demand pricing. You need to cut the spike cost by 60%. What's your architecture?

Q094 · Secrets & Security · Senior

Your platform team is implementing zero-trust networking for a microservices architecture. Services currently use IP-based network policies for authorization. Explain how you'd migrate to workload identity-based mTLS authorization using SPIFFE/SPIRE, and what breaks during the migration.

Q095 · Platform/IDP & Dev Experience · Senior

You are designing a golden path for containerizing new microservices at a 500-person engineering organization. Currently every team builds their Dockerfile from scratch, leading to 80+ unique Dockerfile patterns. How do you standardize without creating a bottleneck?

Q096 · Reliability & Chaos · Senior

Your SRE team uses chaos experiments to test reliability, but after 12 months you've hit diminishing returns — the same steady-state failures are being tested repeatedly. A senior engineer suggests running cascading failure scenarios. What cascading failure experiments would you design for a microservices platform?

Q097 · Cloud (AWS/GCP/Azure) · Mid

Your team uses AWS Lambda for a critical data processing pipeline that must process events within 30 seconds. After a deploy, you're seeing 15-second cold starts affecting roughly 8% of invocations, breaching your latency SLO. How do you diagnose and eliminate the cold start problem?

Q098 · Kubernetes · Mid

A Kubernetes upgrade from 1.27 to 1.29 is needed for security patches. The cluster runs 15 production services. During a previous upgrade, a team experienced 5 minutes of downtime because their PodDisruptionBudget was misconfigured. How do you ensure zero downtime this time?

Q099 · CI/CD Pipelines · Senior

Your organization wants to implement supply chain security for software artifacts, including provenance attestation and SBOM generation. Currently the CI builds Docker images and pushes to ECR with no signing. What are the components of a production-grade artifact supply chain security system?

Q100 · On-call & Alerting · Mid

You're inheriting on-call for a legacy Java monolith that has no dashboards, no traces, and only application error logs stored in CloudWatch. A customer reports 'slowness' in the payment workflow. How do you triage this with the tools available?

Q101 · Cost Optimization · Mid

Your company spends $15,000/month on CloudWatch Logs ingestion and storage for an EKS cluster producing 50 GB/day of container logs. Most of the logs are DEBUG-level application traces that are never queried in production. How do you reduce costs by 70% without losing important signal?

Q102 · Terraform & IaC · Mid

Two engineers on your team made conflicting Terraform changes to the same module in different branches. When the first was applied, it changed an AWS security group. When the second was applied, it reverted that security group change and added a new rule, causing a brief compliance violation. How do you prevent this class of conflict?

Q103 · Observability (Metrics/Logs/Traces) · Mid

Your team has instrumented 20 microservices with Prometheus metrics, but the engineering manager complains that the dashboards are too technical (showing HTTP request rate and memory usage) and don't reflect whether the product is working. What's the process for building business-level observability?

Q104 · GitOps (ArgoCD/Flux) · Senior

Your team is evaluating migrating from ArgoCD to Flux v2 for a new Kubernetes platform. Both are GitOps tools but have different architectures. What are the genuine technical trade-offs, and in what scenarios is each the better choice?

Q105 · Kubernetes · Senior

A fintech company's payment-processing pods are evicted every 90 minutes during peak trading windows. The nodes are at 85% memory utilization and the HPA is not scaling out. Walk through your diagnosis and the changes you'd make to prevent future evictions.

Q106 · CI/CD Pipelines · Mid

Your team's GitHub Actions pipeline for a Node.js microservice takes 22 minutes end-to-end. Developers are complaining about slow feedback. Identify the biggest opportunities to cut that time in half and describe the trade-offs of each.

Q107Terraform & IaCSenior

During a Terraform apply in your staging environment, the state file gets corrupted after a colleague force-pushes a conflicting change mid-run. Production uses the same module structure. Describe your recovery process and the guardrails you'd put in place to prevent recurrence.

Q108Observability (Metrics/Logs/Traces)Senior

A distributed e-commerce checkout flow spans 6 microservices. Customers report intermittent 5-10 second latency spikes at checkout, but your per-service dashboards all show p99 latency under 200ms. How do you find the root cause?

Q109Incident Response & PostmortemsMid

You are the incident commander for a P1 outage affecting 40% of users on a SaaS platform. The on-call engineer has been debugging solo for 45 minutes with no resolution. What do you do in the next 15 minutes?

Q110SLO/SLI & Error BudgetsSenior

Your API gateway SLO is 99.9% availability over a 30-day rolling window. The team burned 60% of the error budget in the first 10 days due to a botched deployment. Engineering wants to push a major feature next week. How do you handle the conversation?
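The numbers in this scenario can be made concrete with a little arithmetic. The sketch below is an illustration, not a model answer. It assumes the error budget is counted in minutes of full unavailability over the 30-day window, and the function names are hypothetical.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Total allowed downtime, in minutes, for the given SLO and window."""
    return (1.0 - slo) * window_days * 24 * 60


def burn_rate(fraction_consumed: float, days_elapsed: float, window_days: int) -> float:
    """Budget consumption relative to an even spend (1.0 means exactly on pace)."""
    return fraction_consumed / (days_elapsed / window_days)


budget = error_budget_minutes(0.999, 30)  # 99.9% over 30 days
consumed = 0.60 * budget                  # 60% burned, per the scenario
remaining = budget - consumed
rate = burn_rate(0.60, 10, 30)            # burned within the first 10 days

print(f"budget={budget:.2f} min, consumed={consumed:.2f} min, "
      f"remaining={remaining:.2f} min, burn rate={rate:.2f}x")
```

A burn rate above 1.0 means the budget would run out before the window ends if the same pace continued, which is usually the figure that frames the release conversation.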

Q111Docker & ContainersMid

A data science team's Docker image for a model-serving API is 8.7 GB, causing 12-minute cold starts in Kubernetes. The image is built from a standard Python 3.10 base. What's your plan to reduce image size and cold-start time?

Q112Cloud (AWS/GCP/Azure)Senior

You are migrating a 200-node on-premises Hadoop cluster to GCP. The cluster processes 50 TB of data daily with 300+ Hive jobs scheduled via Oozie. The business requires zero data loss and under 4 hours of Hive job downtime during cutover. Design the migration.

Q113NetworkingMid

A Kubernetes service suddenly starts timing out for external clients even though pod health checks are passing and the pods are returning 200s to internal calls. Describe your investigation process for this network-layer issue.

Q114Secrets & SecuritySenior

A developer accidentally committed a valid AWS access key to a public GitHub repository 6 hours ago. The key has IAM permissions including S3 full access and EC2 describe. Walk through your incident response and the systemic changes you'd make.

Q115GitOps (ArgoCD/Flux)Senior

Your ArgoCD-managed production cluster drifts from the Git state because an SRE manually patched a ConfigMap during an incident. The drift isn't self-healing because automated sync ('syncPolicy.automated') is disabled for that Application. How do you handle the drift, the incident debrief, and the policy going forward?

Q116Capacity & AutoscalingMid

A gaming company's matchmaking service sees a 10x traffic spike every Friday evening. The current HPA takes 8 minutes to scale out enough pods, causing queue buildup and player frustration. How would you redesign the scaling strategy?

Q117On-call & AlertingMid

Your team receives 200+ PagerDuty alerts per week, but only about 20 lead to actual action. Engineers are experiencing alert fatigue and starting to silence notifications. How do you audit and fix your alerting stack?

Q118Reliability & ChaosSenior

Your team at a logistics company wants to start a chaos engineering program. The CTO is skeptical about injecting failures into production and prefers staging. How do you make the case for production chaos, design the first experiment, and measure its value?

Q119Platform/IDP & Dev ExperienceSenior

A 500-engineer organization has 12 platform teams each maintaining their own bespoke CI templates, Helm charts, and Terraform modules. New services take 3 weeks to scaffold and onboard. You've been asked to lead the internal developer platform initiative. What's your year-one roadmap?

Q120Cost OptimizationMid

Your AWS bill increased by $180,000 last quarter with no corresponding growth in users. Engineering leadership asks you to investigate. Describe your systematic approach to finding and addressing the cause.

Q121KubernetesSenior

A healthcare SaaS runs a multi-tenant Kubernetes cluster where one tenant's batch job consumed 80% of cluster CPU during business hours, causing latency spikes for other tenants. Describe the isolation model you'd implement without requiring separate clusters per tenant.

Q122CI/CD PipelinesSenior

Your organization is adopting trunk-based development but several teams still merge long-lived feature branches (2+ weeks old) that regularly cause merge conflicts and broken builds. How do you drive the cultural and technical transition to trunk-based development?

Q123Terraform & IaCMid

You inherit a Terraform codebase for a media company's AWS infrastructure with 600 resources all in a single state file. Plan operations take 9 minutes and the team is afraid to change anything. How do you refactor this safely?

Q124Observability (Metrics/Logs/Traces)Mid

Your Elasticsearch logging cluster is ingesting 800 GB of logs per day and costs $18,000/month. The team uses only 5% of those logs for debugging. Propose a tiered logging architecture that reduces cost without sacrificing debuggability.

Q125Incident Response & PostmortemsSenior

A postmortem for a 3-hour database outage at an insurance company identified 5 contributing factors including an undocumented manual runbook step. Six months later, the same failure mode recurs almost identically. What went wrong with the postmortem process, and how do you fix it?

Q126SLO/SLI & Error BudgetsMid

You're defining SLIs for a streaming video platform. The platform team suggests using server-side HTTP 200 rate as the availability SLI. Why might this be inadequate, and what would you propose instead?

Q127Docker & ContainersSenior

A security audit found that 70% of your company's production containers run as root. Engineering says changing this will break applications. Design a migration plan that achieves non-root containers across all services without causing production incidents.

Q128Cloud (AWS/GCP/Azure)Mid

Your Azure Kubernetes Service cluster's nodes are in a single availability zone due to an original infrastructure oversight. The cluster runs a real-time messaging service handling 50,000 concurrent connections. How does the SRE team migrate to a multi-AZ setup with zero downtime?

Q129NetworkingSenior

Two microservices in the same Kubernetes cluster communicate over gRPC. After a Kubernetes upgrade, the calls intermittently fail with GOAWAY frames. The issue only occurs under high load. Diagnose and fix this.

Q130Secrets & SecurityMid

A developer wants to use environment variables in Kubernetes pods to pass database credentials from Kubernetes Secrets. Your security team flags this as insufficient. Explain the risk and propose a more secure secrets management pattern.

Q131GitOps (ArgoCD/Flux)Mid

Your team is deploying to three environments (dev, staging, prod) using ArgoCD. Currently all three environments share the same Helm values file and deployments to prod sometimes carry developer experiments. Design a promotion workflow that ensures prod only receives validated, intentional changes.

Q132Capacity & AutoscalingSenior

A machine learning inference service on Kubernetes has bursty GPU workloads that don't fit the standard HPA CPU/memory model. During peak, pods queue requests for 4 minutes before new GPU nodes are available. Design a scaling solution that keeps queue depth under 30 seconds.

Q133On-call & AlertingSenior

After six months of on-call, three engineers have quit citing burnout. The team averages 14 pages per engineer per week, with 40% of pages occurring between 2 AM and 5 AM local time. How do you redesign the on-call program to be sustainable without reducing reliability?

Q134Reliability & ChaosMid

A retail platform's checkout service has never had a formal resilience review. Black Friday is 8 weeks away and leadership wants assurance it will hold. Describe how you conduct an accelerated resilience review and what targeted hardening you'd prioritize.

Q135Platform/IDP & Dev ExperienceMid

Developers at a 200-person startup complain that creating a new microservice requires filing a ticket to the platform team, waiting 3 days, and getting back 14 different config files they don't understand. Propose a self-service scaffolding solution.

Q136Cost OptimizationSenior

A startup's engineering team runs all workloads on On-Demand EC2 instances and Fargate, spending $220,000/month. A financial review reveals 65% of the EC2 fleet runs at under 15% CPU utilization. Design a cost optimization strategy targeting a 40% reduction in 90 days.

Q137KubernetesMid

A pod in your Kubernetes cluster is stuck in 'Pending' state for 20 minutes. The Deployment was successful last week with the same configuration. Walk through your investigation.

Q138CI/CD PipelinesMid

Your team's deployment pipeline pushes directly to production on every merge to main with zero automated testing gates. A bad merge last month caused a 2-hour outage. Design a pipeline with appropriate safety gates for a B2B SaaS API with 200 customers.

Q139Terraform & IaCSenior

Your company has 15 AWS accounts managed by Terraform. A new compliance requirement mandates that all S3 buckets must have versioning and server-side encryption enabled within 30 days. How do you audit the current state and enforce the requirement at scale?

Q140Observability (Metrics/Logs/Traces)Senior

Your team runs Prometheus at a media company ingesting 15 million time series. Query latency on dashboards has degraded to 45-60 seconds and Prometheus is consuming 48 GB RAM. Describe your scaling strategy.

Q141Incident Response & PostmortemsMid

You are leading a postmortem for a database failover that took 18 minutes instead of the expected 2 minutes, causing an SLA breach for an enterprise customer. How do you structure the postmortem to avoid blame and produce durable action items?

Q142SLO/SLI & Error BudgetsSenior

A product team argues that your platform's 99.95% availability SLO is too strict and is blocking feature velocity. They want to relax it to 99.5%. As the SRE lead, how do you evaluate this request and what's your recommendation?

Q143Docker & ContainersMid

A container in your Kubernetes cluster is restarting every 15 minutes with OOMKilled status. The memory limit is set to 512Mi but the developer says the application should use less than 200 MB. Diagnose and resolve this.

Q144Cloud (AWS/GCP/Azure)Senior

Your company's DR strategy relies on an RTO of 4 hours and RPO of 1 hour for a multi-tier AWS application. A tabletop exercise reveals the actual failover would take 12 hours. Redesign the DR architecture to meet the stated objectives.

Q145NetworkingMid

Your application is experiencing intermittent DNS resolution failures in Kubernetes. The error appears as 'Temporary failure in name resolution' and affects 0.1% of requests. CoreDNS pods appear healthy. Diagnose the root cause.

Q146Secrets & SecuritySenior

During a SOC2 Type II audit, the auditor flags that your Kubernetes cluster has no network policies and all pods can communicate with all other pods. You have 80 services to secure. Design an incremental approach to achieve least-privilege network segmentation within 60 days.

Q147GitOps (ArgoCD/Flux)Senior

Your Flux-managed cluster has 50 HelmReleases. After a Flux controller upgrade, 15 HelmReleases enter a Failed state with 'upgrade retries exhausted'. Production is affected. How do you triage, recover, and prevent this class of failure in future upgrades?

Q148Capacity & AutoscalingMid

A Kubernetes cluster running 120 nodes in AWS is consistently at 45% average CPU utilization. Leadership wants to know why costs are high despite low utilization. Explain the structural reasons and propose optimizations.

Q149On-call & AlertingMid

You are asked to create runbooks for the 10 most common production alerts. The team has no existing documentation. Describe your process for building high-quality runbooks quickly, without requiring senior engineers to write them from scratch.

Q150Reliability & ChaosSenior

A team at a bank wants to conduct a pod failure injection experiment in production using Chaos Mesh but the change advisory board requires a formal impact assessment and rollback procedure before approving. Write the technical sections of that approval request.

Q151Platform/IDP & Dev ExperienceSenior

Engineers at your company spend an average of 45 minutes per day waiting for CI pipelines. A developer satisfaction survey ranks CI speed as the top pain point. You have a $50,000 annual budget to improve it. How do you prioritize and spend it?

Q152Cost OptimizationMid

Your Kubernetes cluster in GCP runs 300 nodes but engineers report that large batch jobs frequently fail to schedule because the cluster is 'full' even though average CPU is 35%. Explain the paradox and propose a cost-effective fix.

Q153KubernetesSenior

You are operating a Kubernetes cluster for a telecom company with strict latency requirements (p99 < 5ms). After a node autoscale event, latency spikes to 80ms for 90 seconds. Identify the cause and implement a fix.

Q154CI/CD PipelinesSenior

Your company is moving from a monolith to microservices. Each team wants their own independent CI/CD pipeline, but the platform team is concerned about duplication, security drift, and maintenance overhead across 40+ pipelines. Design a shared pipeline infrastructure.

Q155Terraform & IaCMid

A junior engineer on your team ran terraform destroy on the wrong workspace and deleted a production PostgreSQL RDS instance with 3 years of customer data. The RDS had automated backups enabled. Walk through your recovery steps and timeline.

Q156Observability (Metrics/Logs/Traces)Mid

A developer asks you to add monitoring for a new Node.js Express service being deployed to Kubernetes. Describe the complete observability setup you'd add from scratch, including what you'd instrument and why.

Q157Incident Response & PostmortemsSenior

A critical API at a transportation company went down for 45 minutes because an on-call engineer manually ran a database migration script in production without testing it in staging. This violated a documented change process. How do you handle the post-incident response, balancing accountability with psychological safety?

Q158SLO/SLI & Error BudgetsMid

You are asked to define an SLO for a batch data pipeline that runs nightly and loads data into a reporting database. The traditional latency/availability model doesn't obviously apply. How do you define meaningful SLIs?

Q159Reliability & ChaosMid

Your team's service has no circuit breakers and all retries are aggressive (3 retries with 0ms delay). During a downstream API degradation, your service amplified the load on the degraded API 4x, causing a full outage. What changes do you implement?
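The 4x figure follows directly from the retry policy: one original attempt plus three immediate retries. The sketch below shows that arithmetic alongside exponential backoff with full jitter, which is one commonly used mitigation and not necessarily the complete answer. The function names, the 0.1 s base delay, and the 5 s cap are assumptions chosen for illustration.

```python
import random
from typing import List, Optional


def amplification(max_retries: int) -> int:
    """Worst-case attempts per original request when every attempt fails."""
    return 1 + max_retries


def backoff_delays(max_retries: int, base_s: float = 0.1, cap_s: float = 5.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Exponential backoff with full jitter.

    The delay before retry n is drawn uniformly from
    [0, min(cap_s, base_s * 2**n)], so retries spread out over time
    instead of hitting the degraded dependency at once.
    """
    rng = rng or random.Random()
    return [rng.uniform(0.0, min(cap_s, base_s * 2 ** attempt))
            for attempt in range(max_retries)]


print("load multiplier with 3 immediate retries:", amplification(3))
print("example jittered delays (seconds):", backoff_delays(3))
```

Backoff alone only spreads the load; a retry budget or circuit breaker is what bounds the multiplier when the downstream stays degraded.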

Q160Platform/IDP & Dev ExperienceMid

Developers complain that the local development environment takes 2 hours to set up and frequently breaks when engineers switch between projects. The company runs a 15-service microservices architecture. Design a standardized local dev environment.

Q161Cost OptimizationSenior

Your company processes 500 TB of data per month in AWS using Glue ETL and stores results in S3, then queries them with Athena. The total data infrastructure bill is $85,000/month. A VP asks for a 50% reduction. Is it feasible and how?

Q162NetworkingSenior

Your company's microservices communicate over a flat network with no mutual TLS. A new regulation requires encryption in transit for all internal service-to-service communication. You have 60 services. Design the rollout without causing downtime.

Q163Secrets & SecurityMid

Your team is about to launch a new API that requires calling third-party services using API keys. There are 8 services, 3 environments, and the keys must be rotated every 90 days per security policy. Describe the secrets management architecture.

Q164GitOps (ArgoCD/Flux)Mid

A team wants to use ArgoCD to manage their Kubernetes manifests but also wants developers to be able to do quick hotfixes by running kubectl directly in emergencies. How do you design a policy that allows emergency break-glass changes while maintaining GitOps integrity?

Q165Capacity & AutoscalingSenior

A social media platform's analytics pipeline needs to process 200 million events per day, but events arrive unevenly — 80% arrive between 6 PM and midnight. The current static compute allocation over-provisions during off-peak and under-provisions during peak. Design an event-driven autoscaling architecture.

Q166KubernetesSenior

Your e-commerce company's checkout service is running on GKE. During Black Friday, pods start crashing with OOMKilled every 10 minutes. CPU usage is normal at 40%. Memory requests are set to 256Mi and limits to 512Mi. How do you triage and permanently fix this?

Q167CI/CD PipelinesMid

A fintech startup's GitHub Actions workflow deploys to production on every merge to main, but the pipeline takes 47 minutes end-to-end. Engineers are complaining the feedback loop is killing productivity. Walk through how you'd cut that to under 12 minutes without compromising safety.

Q168Terraform & IaCSenior

Your team manages 12 AWS accounts across three environments using a monolithic Terraform root module. A bug in a networking change wiped out prod subnets last quarter. The CTO wants a plan to restructure IaC to prevent blast radius issues. What do you propose?

Q169Observability (Metrics/Logs/Traces)Senior

A SaaS platform on AWS reports that p99 latency for their search API spikes to 8 seconds every day at 2 PM UTC but p50 stays under 200ms. CloudWatch shows no CPU or memory anomalies. How do you instrument and identify the root cause using observability tooling?

Q170Incident Response & PostmortemsMid

You are the incident commander for a P1 outage: a payment processor's webhook endpoint has been returning 503s for 22 minutes, affecting 3,000 merchants. Two engineers are debugging live. How do you manage the next 30 minutes, and what does your postmortem look like?

Q171SLO/SLI & Error BudgetsSenior

A B2B SaaS company defines availability as 'HTTP 2xx responses / total requests' with a 99.9% monthly SLO. Engineering keeps freezing deploys to protect the budget. The product team argues the freeze prevents shipping features. How do you redesign the SLO framework to resolve this tension?

Q172Docker & ContainersMid

Your team's Node.js microservice Docker image is 1.4 GB and takes 18 minutes to build in CI. Developers say local iteration is painfully slow. The Dockerfile starts with FROM node:18 and copies the entire repo. How do you optimize it?

Q173Cloud (AWS/GCP/Azure)Senior

A healthcare analytics company needs to migrate 40 TB of patient data from on-prem Oracle to AWS RDS PostgreSQL, with zero tolerance for data loss and a 4-hour maintenance window. How do you architect and execute this migration?

Q174NetworkingSenior

A microservices platform on EKS is experiencing intermittent 5-10 second connection delays between services. The network team says BGP is healthy and no packet loss is detected. Service-to-service calls use Kubernetes DNS (CoreDNS). How do you systematically isolate the root cause?

Q175Secrets & SecurityMid

A developer accidentally committed an AWS access key to a public GitHub repository 2 hours ago. The key has AdministratorAccess. Describe your immediate response and the longer-term process change you'd implement.

Q176GitOps (ArgoCD/Flux)Senior

Your platform team runs ArgoCD managing 80 applications across 4 clusters. A new compliance requirement mandates that no production deployment happens without a written approval from two staff engineers. How do you implement this in your GitOps workflow without breaking developer velocity?

Q177Capacity & AutoscalingSenior

A video streaming platform's transcoding workers on EKS scale out fine during traffic spikes but scale-in is dangerously slow — nodes drain for 40 minutes after a spike ends, leaving $15,000/day in idle compute. Transcoding jobs run 5-20 minutes each. How do you fix this?

Q178On-call & AlertingMid

Your team is averaging 23 pages per engineer per week, mostly from low-signal noise. Three engineers have escalated burnout concerns. You have two weeks to reduce alert volume by 60% before two engineers go on leave. What is your process?

Q179Reliability & ChaosSenior

A logistics company wants to start a chaos engineering program but their CTO is concerned about running experiments in production. Their platform handles 50,000 shipment updates per hour with no redundant infrastructure. How do you build a credible chaos program that satisfies the CTO and provides real value?

Q180Platform/IDP & Dev ExperienceSenior

A 200-engineer organization has 14 different ways to provision a new microservice — each team invented their own Terraform modules, CI templates, and K8s manifests. Onboarding a new service takes 3-5 days. The VP of Engineering wants this down to 2 hours. How do you build an Internal Developer Platform to solve this?

Q181Cost OptimizationMid

Your AWS bill jumped 40% month-over-month. Engineering leadership wants a root cause and a remediation plan in 48 hours. You have CloudWatch, Cost Explorer, and Terraform state access. Walk through your investigation process.

Q182KubernetesMid

A developer reports that their new deployment on EKS is stuck in Pending state. kubectl get pods shows 0/3 pods running. There have been no recent node changes. How do you diagnose and fix this systematically?

Q183CI/CD PipelinesSenior

Your team wants to implement a trunk-based development workflow for a monorepo containing 30 microservices. Currently each service has its own pipeline. Engineers are worried about unrelated service deployments blocking their merge. Design the CI/CD architecture.

Q184Terraform & IaCMid

You run terraform plan in your CI pipeline and see that it will destroy and recreate an RDS instance that currently holds production data. The Terraform code change only modified the instance type from db.t3.medium to db.t3.large. Why is this happening and how do you fix it safely?

Q185Observability (Metrics/Logs/Traces)Mid

Your team just migrated from ECS to EKS. Application metrics in CloudWatch are gone and the on-call team is flying blind. You have two days to restore full observability. What do you deploy and in what order?

Q186Incident Response & PostmortemsSenior

Your company's postmortems are written but action items are never completed — 68% remain open after 90 days. Engineering leadership says postmortems feel like theater. How do you redesign the process so action items actually get done?

Q187SLO/SLI & Error BudgetsMid

You're asked to define SLOs for a batch data pipeline that runs nightly to load customer analytics data into BigQuery. The pipeline has no user-facing traffic. What SLIs would you choose and how do you set the initial targets?

Q188Docker & ContainersSenior

A security audit flags that 7 of your production containers are running as root and have --privileged mode enabled. The applications are vendor-provided images you cannot modify. How do you harden these containers without breaking the applications?

Q189Cloud (AWS/GCP/Azure)Mid

Your GCP project's Cloud Run service is randomly timing out under moderate load. The service is stateless, processes image uploads, and calls a Cloud Vision API. GCP logs show no errors, but users report 30-second hangs followed by success. What do you investigate?

Q190NetworkingMid

A newly deployed Lambda function can read from DynamoDB but cannot connect to an internal REST API running on ECS inside a VPC. Both are in the same AWS account and region. How do you diagnose and fix this connectivity issue?

Q191Secrets & SecuritySenior

Your company runs 400 microservices on EKS, each needing database credentials, API keys, and TLS certs. Currently all secrets are stored in Kubernetes Secrets (base64 in etcd, unencrypted). A compliance audit requires secrets to be encrypted at rest and rotated every 90 days. Design the target state.

Q192GitOps (ArgoCD/Flux)Mid

A Flux-managed application is stuck in a reconciliation loop — the GitRepository source shows 'Ready: False' and the error is 'failed to checkout and determine revision.' The Git repo and network are healthy. What do you check?

Q193Capacity & AutoscalingMid

Your HPA is configured to scale on CPU usage with a target of 60%. During a traffic spike, CPU hits 90% but the HPA does not add pods for 8 minutes. By the time pods are added, the spike has passed. How do you tune the HPA to respond faster?
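As background for this scenario, the replica count the HPA aims for comes from the ratio of the observed metric to the target. The sketch below reproduces that documented calculation, including the default 10% tolerance band. The starting count of 10 replicas is a hypothetical value, since the scenario does not state one.

```python
import math


def desired_replicas(current_replicas: int, current_util: float,
                     target_util: float, tolerance: float = 0.1) -> int:
    """desired = ceil(current * currentMetric / targetMetric).

    If the ratio is within the tolerance band around 1.0, the HPA
    leaves the replica count unchanged.
    """
    ratio = current_util / target_util
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# CPU at 90% against a 60% target, starting from a hypothetical 10 replicas.
print(desired_replicas(10, 90, 60))
```

The formula shows how many pods are requested, not how quickly they arrive, so it does not by itself explain the 8-minute delay in the scenario.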

Q194On-call & AlertingSenior

Your organization wants to implement SLO-based alerting to replace the current threshold-based alert system. You have 15 services, each with distinct traffic patterns, managed in Prometheus. Design the alerting framework including burn rates, windows, and runbooks.
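The relationship between burn rates and windows mentioned here can be sketched numerically. A burn-rate threshold is the rate at which a chosen fraction of the error budget would be spent within the alert window. The window and budget pairs below are illustrative values often used in multi-window burn-rate alerting, not requirements from the scenario, and the function name is hypothetical.

```python
def burn_rate_threshold(budget_fraction: float, alert_window_h: float,
                        slo_window_h: float = 30 * 24) -> float:
    """Burn rate that consumes `budget_fraction` of the error budget
    within `alert_window_h` hours of a `slo_window_h`-hour SLO window."""
    return budget_fraction * slo_window_h / alert_window_h


# (fraction of budget, alert window in hours): illustrative pairs only.
for fraction, window_h in [(0.02, 1), (0.05, 6), (0.10, 72)]:
    threshold = burn_rate_threshold(fraction, window_h)
    print(f"{fraction:.0%} of budget in {window_h}h -> burn rate {threshold:.1f}x")
```

Short windows with high thresholds catch fast burns and typically page; long windows with low thresholds catch slow burns and are often routed to tickets instead.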

Q195Reliability & ChaosMid

A retail platform's checkout service has a dependency on a third-party tax calculation API. During load testing, when the tax API is slow (>2s), checkout becomes completely unavailable. The tax API has no SLA. How do you make the checkout service resilient to this dependency?

Q196Platform/IDP & Dev ExperienceMid

Developers at a 50-person startup complain that setting up a local development environment takes 2-3 days and frequently breaks when teammates update dependencies. How do you standardize the dev environment to get new engineers productive in under 2 hours?

Q197Cost OptimizationSenior

A machine learning team runs GPU instances 24/7 on AWS for model training. Monthly EC2 GPU cost is $85,000. Jobs run for 2-12 hours and complete predictably. The team says they 'need on-demand because jobs can't be interrupted.' How do you cut this cost by at least 40% without lying to them?

Q198KubernetesSenior

You are migrating a stateful Redis cluster from VMs to Kubernetes. The Redis cluster serves 200,000 operations per second with a replication factor of 3. Describe the architecture, the migration strategy, and the operational concerns you'd document for the on-call team.

Q199CI/CD PipelinesSenior

A platform team is evaluating whether to build their own CI/CD system versus adopting a managed solution. They have 500 engineers, 200 microservices, multi-cloud deployments (AWS + Azure), and a requirement to run some jobs in an air-gapped environment. Walk through your recommendation.

Q200Observability (Metrics/Logs/Traces)Senior

Your distributed system has 30 microservices generating 500 GB of logs per day. The logging cost in Datadog is $45,000/month and growing. You need to cut costs by 50% without losing critical debugging capability. What is your log management strategy?
