Cloud Engineer / Cloud Architect Interview Questions

200 scenario-based questions with detailed model answers, organized by skill and tool. Filter by topic, level, or keyword and reveal the answer, then pressure-test yourself in a real mock.

Q001 | AWS Compute (EC2/Lambda/ECS/EKS) | Mid

Your EC2 Auto Scaling group keeps launching and terminating instances every few minutes, and the on-call channel is flooded with flapping alerts. Walk me through how you would diagnose the scaling policy, health check, and warm-up settings causing this thrash.

Q002 | AWS Compute (EC2/Lambda/ECS/EKS) | Senior

A Lambda behind API Gateway starts timing out only during the 11 AM IST traffic spike, though average duration is 200ms. You suspect cold starts plus a VPC ENI bottleneck. How do you confirm the root cause and decide between provisioned concurrency and re-architecting?

Q003 | AWS Compute (EC2/Lambda/ECS/EKS) | Mid

An ECS Fargate service shows tasks stuck in PENDING for twenty minutes after a deploy, and the previous version is still serving traffic. What sequence of checks — capacity, ENI limits, image pull, task role — do you run before rolling back?

Q004 | AWS Compute (EC2/Lambda/ECS/EKS) | Senior

Your EKS cluster upgrade from 1.27 to 1.29 broke a third of the workloads because of deprecated APIs nobody audited. How do you structure a repeatable upgrade pipeline — API deprecation scanning, canary node groups, rollback strategy — so this never surprises you again?

Q005 | AWS Compute (EC2/Lambda/ECS/EKS) | Mid

A batch workload on Spot instances keeps losing nodes mid-job and the team wants to move everything to On-Demand, tripling cost. How do you redesign the job for interruption tolerance instead — checkpointing, capacity-optimized allocation, mixed instance policies?
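
A candidate might ground the checkpointing part of the answer with a sketch like the one below. It is only an illustration under stated assumptions: the job is an ordered list of independent items, the checkpoint is a local JSON file (a real job would more likely use durable shared storage), and `interrupt_at` is a hypothetical parameter that simulates a Spot reclaim.

```python
import json
import os
import tempfile


def process_with_checkpoints(items, checkpoint_path, work, interrupt_at=None):
    """Process items in order, resuming from the last saved index.

    interrupt_at (hypothetical) simulates a Spot reclaim by stopping
    just before the item at that index is processed.
    """
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            start = json.load(fh)["next_index"]

    results = []
    for index in range(start, len(items)):
        if interrupt_at is not None and index == interrupt_at:
            return results, False  # interrupted; progress is already on disk
        results.append(work(items[index]))
        # Write to a temp file and rename, so a kill mid-write
        # cannot leave a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as fh:
            json.dump({"next_index": index + 1}, fh)
        os.replace(tmp_path, checkpoint_path)
    return results, True


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "job.ckpt")
    first, done = process_with_checkpoints(list(range(10)), path, lambda x: x * x, interrupt_at=4)
    print(first, done)   # work done before the simulated reclaim
    second, done = process_with_checkpoints(list(range(10)), path, lambda x: x * x)
    print(second, done)  # replacement node resumes at item 4
```

The point to draw out in an interview is that the replacement node repeats no completed work, which is what makes Spot interruptions a cost trade-off rather than a reliability problem.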

Q006 | AWS Compute (EC2/Lambda/ECS/EKS) | Senior

After a kernel patch, one EC2 instance family in ap-south-1 shows 30% higher CPU steal and your latency SLO is burning. The vendor AMI team denies any change. How do you isolate whether this is the AMI, the instance generation, or noisy neighbors?

Q007 | AWS Compute (EC2/Lambda/ECS/EKS) | Senior

Your EKS pods intermittently fail DNS resolution under load, and CoreDNS metrics show throttling. The app team blames the platform; you suspect conntrack exhaustion and ndots behavior. Describe your investigation and the node-level and cluster-level fixes you would weigh.

Q008 | AWS Compute (EC2/Lambda/ECS/EKS) | Mid

A developer shipped a Lambda that recursively triggers itself via the same S3 bucket it writes to, and concurrency hit the account limit overnight. How do you stop the loop safely, restore other workloads, and add guardrails against recursive invocation?

Q009 | AWS Compute (EC2/Lambda/ECS/EKS) | Senior

You inherit 400 EC2 instances launched over five years with no AMI lineage, mixed instance generations, and IMDSv1 still enabled. Outline your plan to baseline, patch, enforce IMDSv2, and migrate to a golden-image pipeline without breaking the legacy apps running on them.

Q010 | AWS Compute (EC2/Lambda/ECS/EKS) | Mid

An ECS rolling deployment drains connections too aggressively and users in India see 502s for two minutes on every release. Which combination of deregistration delay, health check grace, and minimum healthy percent would you tune, and how do you validate the fix?

Q011 | AWS Compute (EC2/Lambda/ECS/EKS) | Senior

Your platform team wants one giant multi-tenant EKS cluster; app teams want a cluster each. Capacity is in Mumbai with limited instance availability. How do you frame the trade-offs — blast radius, upgrade coordination, IRSA sprawl, cost — and what would you actually recommend?

Q012 | AWS Compute (EC2/Lambda/ECS/EKS) | Mid

A Lambda processing payment webhooks occasionally runs twice for the same event and finance noticed duplicate ledger entries. The team insists Lambda should be exactly-once. How do you explain the delivery semantics and retrofit idempotency without rewriting the whole function?
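
The retrofit can be illustrated with a thin idempotency wrapper around the existing handler logic. The sketch below is hedged: `IdempotencyStore` is an in-memory stand-in for a durable table with an atomic put-if-absent (it is not safe across concurrent invocations as written), and the `event_id` and `amount` fields are assumed names for whatever the payment provider actually sends.

```python
class IdempotencyStore:
    """In-memory stand-in for a durable table with put-if-absent semantics."""

    def __init__(self):
        self._seen = {}

    def claim(self, key):
        """Return True only for the first caller that claims this key."""
        if key in self._seen:
            return False
        self._seen[key] = "IN_PROGRESS"
        return True

    def complete(self, key, result):
        """Record the final result so duplicate deliveries can return it."""
        self._seen[key] = result

    def result_for(self, key):
        return self._seen.get(key)


def handle_webhook(event, store, ledger):
    """Record a ledger entry at most once per event, however often it is delivered."""
    key = event["event_id"]  # assumed field: a provider-supplied unique event ID
    if not store.claim(key):
        # Duplicate delivery: skip the side effect, return what we know.
        return store.result_for(key)
    ledger.append((key, event["amount"]))  # the side effect being protected
    result = {"status": "recorded", "event_id": key}
    store.complete(key, result)
    return result


if __name__ == "__main__":
    store, ledger = IdempotencyStore(), []
    event = {"event_id": "evt-123", "amount": 4999}
    print(handle_webhook(event, store, ledger))
    print(handle_webhook(event, store, ledger))  # redelivery of the same event
    print(len(ledger))
```

The design choice worth defending is that the existing business logic stays untouched; only the claim-then-complete bracket around it is new.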

Q013 | AWS Compute (EC2/Lambda/ECS/EKS) | Senior

During Diwali sale traffic, your EKS Cluster Autoscaler hit ap-south-1 capacity errors for the instance type you standardized on, and pods sat unschedulable for forty minutes. How do you redesign node provisioning — Karpenter, diversified instance pools, priority classes — to survive the next sale?

Q014 | AWS Storage (S3/EBS) | Mid

An S3 bucket holding customer KYC documents was found with a bucket policy allowing a wildcard principal on GetObject. It has been live for months. Walk through your immediate containment, how you assess whether data was actually accessed, and the preventive controls you add.

Q015 | AWS Storage (S3/EBS) | Senior

Your S3 bill jumped 4x in a month with no new data. Investigation hints at a misconfigured lifecycle rule fighting with Intelligent-Tiering and millions of small-object transitions. How do you untangle the request-cost explosion and design a lifecycle policy that actually saves money?

Q016 | AWS Storage (S3/EBS) | Mid

A nightly analytics job started failing with S3 503 SlowDown errors after the data team switched to writing millions of tiny files under a single date prefix. How do you fix the partitioning scheme and the writer behavior without breaking downstream consumers?

Q017 | AWS Storage (S3/EBS) | Senior

Someone deleted a production S3 prefix and you discover versioning was suspended two years ago and the last backup is stale. What do you attempt for recovery, what do you tell stakeholders, and how do you redesign protection — versioning, Object Lock, replication — afterward?

Q018 | AWS Storage (S3/EBS) | Mid

A database EC2 instance shows iowait spiking every evening, and you find its gp2 volume is exhausting burst credits. The DBA wants a bigger instance. How do you prove it is an EBS problem and choose between gp3 tuning and io2 provisioned IOPS?
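
Part of proving the EBS hypothesis is showing that the arithmetic fits the evening pattern. The sketch below uses the published gp2 model (baseline of 3 IOPS per GiB with a 100 IOPS floor and 16,000 ceiling, bursts to 3,000 IOPS, and a bucket of 5.4 million I/O credits). The volume sizes and demand figures in the usage example are made up for illustration.

```python
GP2_CREDIT_BUCKET = 5_400_000  # I/O credits in a full gp2 burst bucket
GP2_BURST_IOPS = 3_000         # burst ceiling for volumes whose baseline is below it
GP2_MIN_BASELINE = 100
GP2_MAX_BASELINE = 16_000


def gp2_baseline_iops(size_gib):
    """Baseline IOPS for a gp2 volume: 3 per GiB, clamped to [100, 16000]."""
    return min(max(3 * size_gib, GP2_MIN_BASELINE), GP2_MAX_BASELINE)


def seconds_until_burst_exhausted(size_gib, demanded_iops):
    """Seconds a full credit bucket lasts at the given demand.

    Returns None when demand does not exceed baseline, because the
    bucket then never drains.
    """
    baseline = gp2_baseline_iops(size_gib)
    # The volume cannot deliver more than its burst ceiling (or its
    # baseline, for large volumes where baseline exceeds the ceiling).
    effective = min(demanded_iops, max(GP2_BURST_IOPS, baseline))
    if effective <= baseline:
        return None
    return GP2_CREDIT_BUCKET / (effective - baseline)


if __name__ == "__main__":
    for size in (100, 200, 500):
        drain = seconds_until_burst_exhausted(size, 3_000)
        print(size, gp2_baseline_iops(size), round(drain / 60, 1))
```

If the computed drain time lines up with how long into the evening job iowait starts to climb, and the volume's burst-balance metric confirms it, that is strong evidence against resizing the instance.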

Q019 | AWS Storage (S3/EBS) | Senior

Your team restores an EBS snapshot for DR and the database takes hours to warm up because of lazy block loading, blowing the RTO. How do you redesign the restore path — fast snapshot restore, pre-warming, or replication — and what does each cost?

Q020 | AWS Storage (S3/EBS) | Mid

An application writes presigned S3 URLs into emails, and customers complain links expire before they open them, while security complains the expiry is already too long. How do you balance the two, and what alternatives to long-lived presigned URLs would you propose?

Q021 | AWS Storage (S3/EBS) | Senior

Cross-region replication to your DR bucket silently stopped three weeks ago after someone rotated the replication role, and nobody noticed until an audit. How do you detect replication lag continuously, backfill the missed objects, and make this failure impossible to miss again?

Q022 | AWS Storage (S3/EBS) | Mid

Finance flags thousands of orphaned EBS volumes and snapshots from terminated instances across the org. Some may back forgotten but important data. Describe your process to classify, safely archive or delete them, and the automation you put in place to stop the accumulation.

Q023 | AWS Storage (S3/EBS) | Senior

Your media platform serves files directly from S3 and the data-transfer-out line item now dwarfs storage cost, mostly to Indian mobile users. How do you restructure delivery — CloudFront, caching policy, compression, regional placement — and estimate the savings before committing?

Q024 | AWS Storage (S3/EBS) | Mid

After enabling default SSE-KMS on a busy bucket, applications start throwing KMS throttling errors during peak hours. How do you diagnose the request pattern, and when would you switch to S3 Bucket Keys versus raising KMS limits versus SSE-S3?

Q025 | AWS Storage (S3/EBS) | Senior

A compliance request demands you prove that objects in a financial-records bucket were never modified after upload. Versioning exists but governance-mode Object Lock was never enabled. What evidence can you actually produce from S3 metadata and CloudTrail, and what gap do you admit?

Q026 | Azure | Mid

Your Azure VM scale set in Central India fails to scale out during a sale event with allocation failures, though quota looks fine. Walk through how you distinguish regional capacity issues from quota, and the zonal and SKU diversification changes you would make.

Q027 | Azure | Senior

A subscription-wide Azure Policy assignment meant to enforce tagging accidentally used a deny effect and blocked every deployment across forty teams for two hours. How do you roll it out next time — audit first, exemptions, gradual scopes — and repair trust with the teams?

Q028 | Azure | Mid

An AKS cluster suddenly cannot pull images after the platform team enabled a private endpoint on Azure Container Registry. Nodes show ImagePullBackOff. Describe the DNS and network path you trace, and the fix for clusters in peered VNets.

Q029 | Azure | Senior

Your company merged with another that runs everything in a separate Azure AD tenant. Leadership wants unified access in a quarter. How do you evaluate cross-tenant Lighthouse delegation versus tenant migration, and what hidden costs — managed identities, B2B limits, conditional access — do you call out?

Q030 | Azure | Mid

An Azure Function on a Consumption plan processes invoices fine in testing but in production randomly stalls and restarts mid-batch. How do you investigate host restarts, timeout limits, and storage-account throttling, and when do you recommend moving to Premium plan?

Q031 | Azure | Senior

A storage account key used by a legacy app leaked in a public repo. The app has no managed identity support and rotating the key breaks three other undocumented consumers. Sequence your containment, discovery of consumers via logs, and migration off shared keys.

Q032 | Azure | Mid

Your App Service in South India shows 30-second cold responses every morning. The team blames Azure; you suspect Always On is off and the plan is undersized. How do you confirm with App Service diagnostics and decide between plan changes and warm-up tricks?

Q033 | Azure | Senior

An Express deployment of Azure Firewall plus forced tunneling broke outbound connectivity for AKS, and pods cannot reach the API server. Describe the asymmetric routing problem you suspect, how you verify it with effective routes, and the UDR design that fixes it.

Q034 | Azure | Mid

Cost Management shows an Azure subscription burning lakhs monthly on a dev environment, mostly premium disks attached to deallocated VMs and forgotten public IPs. How do you build an automated hygiene sweep without deleting something a team secretly depends on?

Q035 | Azure | Senior

You must design private connectivity from on-prem Chennai datacenter to Azure with 99.99% availability for a trading workload. Compare ExpressRoute with VPN failover against dual ExpressRoute circuits, and explain how you would test failover without scaring the trading desk.

Q036 | Azure | Mid

A developer granted Contributor at subscription scope to a service principal for a quick demo, and it has been suspiciously busy ever since, in a pattern that resembles crypto-mining. How do you audit what it actually did, scope it down, and set up alerts for privileged role assignments?

Q037 | Azure | Senior

Your Azure SQL geo-replica in the paired region lags minutes behind during bulk loads, and a DR drill showed unacceptable data loss. How do you quantify achievable RPO under real write load and decide between failover groups, ZRS alternatives, or application-level replication?

Q038 | Azure | Senior

After enabling Defender for Cloud across all subscriptions, the security score tanked and teams face hundreds of recommendations of mixed quality. How do you triage into a realistic remediation roadmap, suppress noise defensibly, and report progress to leadership without gaming the score?

Q039 | GCP | Mid

A GKE cluster in asia-south1 suddenly schedules nothing; nodes show Ready but pods stay Pending, citing taint mismatches after an Autopilot-to-Standard migration. Walk through how you audit taints, node pools, and admission policies to restore scheduling safely.

Q040 | GCP | Senior

Your BigQuery bill exploded because a dashboard tool runs SELECT * on a petabyte table every five minutes. How do you contain it today (custom quotas, reservations, materialized views) and restructure the dataset so analysts cannot accidentally do this again?

Q041 | GCP | Mid

A Cloud Run service intermittently returns 429s under modest load. You find concurrency set to 1 and max instances capped from an old cost scare. How do you load-test, retune concurrency and CPU allocation, and set sane autoscaling bounds?

Q042 | GCP | Senior

A service account key committed to a public repo was used to spin up GPU instances in three regions overnight. Describe your first hour — revocation, forensic scoping with audit logs, billing protection — and the org policies you enforce so keys stop existing.

Q043 | GCP | Mid

Your team enabled VPC Service Controls around BigQuery and Cloud Storage, and suddenly half the data pipelines from a partner project fail with perimeter violations. How do you map the legitimate access paths and design ingress and egress rules instead of disabling the perimeter?

Q044 | GCP | Senior

Leadership wants to move a latency-sensitive workload from AWS Mumbai to GCP asia-south1 for committed-use discounts. How do you benchmark real network latency for Indian users across both, model the egress costs of a gradual migration, and define go or no-go criteria?

Q045 | GCP | Mid

A Cloud SQL Postgres instance hits 100% disk and auto-grows repeatedly because a stuck replication slot retains WAL. The app team only sees rising cost. How do you identify the offending slot, fix replication, and alert on WAL retention going forward?

Q046 | GCP | Senior

Your org uses folder-level IAM in GCP and an audit found a basic Editor role inherited by 200 service accounts through a folder grant from 2021. How do you map effective access at scale, migrate to predefined and custom roles, and verify nothing breaks?

Q047 | GCP | Mid

After enabling Cloud CDN, users in tier-2 Indian cities still report slow image loads. Cache hit ratio is 40%. How do you investigate cache keys, vary headers, and origin behavior to push hit rates up without serving stale personalized content?

Q048 | GCP | Senior

A GKE workload using Workload Identity intermittently gets permission denied on Cloud Storage during scale-ups, then recovers. How do you confirm metadata-server race conditions versus IAM propagation, and what retry and architecture changes make the failure invisible to users?

Q049 | GCP | Mid

Your Dataflow streaming job lags hours behind during evening peaks, and autoscaling sits at max workers. How do you determine whether the bottleneck is a hot key, a slow sink, or shuffle, and what would you change first?

Q050 | GCP | Senior

You operate Shared VPC with host and service projects, and a new team needs Private Service Connect to a partner SaaS plus on-prem reachability. Walk through the IP planning, who owns which firewall rules, and the failure modes of overlapping ranges you must check.
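
The overlap check in the IP-planning step is easy to automate before any range is committed. The sketch below is a minimal illustration using Python's standard `ipaddress` module; the range names and CIDRs in the usage example are hypothetical, not taken from any real plan.

```python
import ipaddress
from itertools import combinations


def find_overlaps(named_ranges):
    """Return sorted name pairs whose CIDR ranges overlap.

    named_ranges maps a label (subnet, PSC endpoint range, on-prem
    block, and so on) to a CIDR string.
    """
    parsed = {name: ipaddress.ip_network(cidr) for name, cidr in named_ranges.items()}
    return [
        (name_a, name_b)
        for (name_a, net_a), (name_b, net_b) in combinations(sorted(parsed.items()), 2)
        if net_a.overlaps(net_b)
    ]


if __name__ == "__main__":
    plan = {
        "host-subnet": "10.10.0.0/20",   # hypothetical Shared VPC subnet
        "psc-endpoints": "10.10.8.0/24",  # hypothetical PSC endpoint range
        "on-prem": "192.168.0.0/16",      # hypothetical on-prem block
    }
    print(find_overlaps(plan))
```

Running a check like this in CI against the full address plan catches a collision before it turns into a routing ambiguity in production.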

Q051 | VPC & Networking | Mid

A new microservice in a private subnet cannot reach an external payment gateway, but a sibling service can. NACLs, security groups, and routes look identical at first glance. Describe your systematic packet-path walkthrough to find the asymmetry.

Q052 | VPC & Networking | Senior

Two acquired companies must merge VPCs and both use 10.0.0.0/16 throughout. Re-IPing production is deemed impossible this year. Compare your realistic options — NAT-based bridging, PrivateLink fronting, secondary CIDR migration — and the operational debt each one creates.

Q053 | VPC & Networking | Mid

Your NAT gateway bill quietly grew to dominate the network spend, and flow logs show terabytes flowing to S3 and ECR through it. Explain how you confirm the culprits and what gateway endpoints and interface endpoints you deploy to cut the cost.

Q054 | VPC & Networking | Senior

After migrating to Transit Gateway, a latency-sensitive service between two VPCs gained two milliseconds and the trading team noticed. How do you verify the data path, decide where VPC peering still beats TGW, and design a hybrid hub-and-spoke that allows exceptions?

Q055 | VPC & Networking | Mid

An ALB intermittently returns 504s to clients while target metrics look healthy. You suspect idle timeout mismatches between the ALB, the app server keep-alive, and an upstream proxy. Walk through how you prove it and the timeout hierarchy you enforce.

Q056 | VPC & Networking | Senior

A security review demands all east-west traffic between hundred-plus microservices be inspected. Centralized firewall appliances would double latency and cost. How do you frame alternatives — security groups as segmentation, GWLB inspection for selected flows, service mesh mTLS — and defend your recommendation?

Q057 | VPC & Networking | Mid

Developers complain DNS lookups for an internal service fail roughly once in a thousand requests across the VPC. How do you investigate Route 53 resolver limits, per-ENI DNS throttling, and client retry behavior to pin down the loss?

Q058 | VPC & Networking | Senior

You exposed an internal API to a partner via PrivateLink, and now they report connections dying at exactly 350 seconds. Their team blames you. How do you demonstrate where the idle timeout sits, and what keep-alive or design changes do you agree on?

Q059 | VPC & Networking | Mid

A subnet sized /24 for an EKS node group ran out of IPs in a week because every pod consumes one. The VPC has little free space left. What short-term mitigations and longer-term CNI or CIDR strategies do you weigh?
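
The exhaustion is easy to demonstrate with arithmetic. The sketch below assumes the default VPC CNI mode (one secondary IP per pod, no prefix delegation), AWS's five reserved addresses per subnet, and the usual max-pods formula of ENIs × (IPs per ENI − 1) + 2. The ENI limits in the usage example (3 ENIs, 10 IPv4 addresses each) are given as an example of a small general-purpose instance; check the figures for the instance type actually in use.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # first four addresses plus the last one


def usable_subnet_ips(cidr):
    """Addresses AWS will actually hand out in a subnet."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def max_pods_per_node(enis, ips_per_eni):
    """Default VPC CNI pod limit: each ENI loses one IP to its primary
    address; the +2 accounts for host-network pods."""
    return enis * (ips_per_eni - 1) + 2


def worst_case_ips_per_node(enis, ips_per_eni):
    """IPs a node holds when every ENI is attached and fully populated."""
    return enis * ips_per_eni


if __name__ == "__main__":
    usable = usable_subnet_ips("10.0.1.0/24")
    per_node = worst_case_ips_per_node(enis=3, ips_per_eni=10)
    print(usable)                       # addresses available in the /24
    print(max_pods_per_node(3, 10))     # pod ceiling per node
    print(usable // per_node)           # fully loaded nodes the subnet can hold
```

A handful of fully loaded nodes is enough to drain a /24, which is why the mitigations in the question are about IP consumption per pod rather than node count.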

Q060 | VPC & Networking | Senior

During an incident, cross-AZ traffic between your app and cache layer spiked latency and cost simultaneously. How do you implement AZ-aware routing and topology-aware hints without sacrificing resilience, and how do you quantify the cross-AZ data transfer you eliminated?

Q061 | VPC & Networking | Mid

After enabling VPC flow logs org-wide for an audit, the logging bill tripled. The auditors only need specific deny events and a few sensitive subnets. How do you redesign collection — filters, aggregation intervals, destinations — to satisfy them affordably?

Q062 | VPC & Networking | Senior

A multicast-dependent legacy app must move to AWS, which the team discovers does not support multicast natively in VPCs. Transit Gateway multicast exists but with constraints. How do you validate feasibility, and what application-level alternatives do you propose if it falls short?

Q063 | VPC & Networking | Senior

Your dual-stack rollout stalls because an IPv6-only client in a pilot cannot reach RDS, which sits IPv4-only behind legacy security groups. Map out the DNS64, NAT64, and endpoint changes required, and how you stage the migration without breaking IPv4 clients.

Q064 | IAM & Security | Mid

GuardDuty flags credential exfiltration: an EC2 instance role is being used from an IP outside AWS. The app on that instance serves production traffic. Walk through your first thirty minutes — containment without downtime, session revocation, and scoping what the attacker touched.

Q065 | IAM & Security | Senior

An access review reveals 90 IAM users with long-lived keys, some embedded in Android apps shipped to customers in India. You cannot rotate without breaking older app versions. Design the migration to temporary credentials and the compensating controls during the long tail.

Q066 | IAM & Security | Mid

A deploy pipeline suddenly fails with AccessDenied on actions it performed yesterday. Nobody admits changing anything. How do you use CloudTrail, IAM Access Analyzer, and policy version history to find whether an SCP, permission boundary, or inline policy changed?

Q067 | IAM & Security | Senior

Your org wants to ban wildcard actions in IAM policies, but a scan shows 600 violations, many in Terraform modules teams copied for years. How do you prioritize by blast radius, generate least-privilege replacements from access data, and roll out without breaking production?

Q068 | IAM & Security | Mid

A contractor's laptop with active SSO sessions was stolen in transit between Bengaluru offices. What do you invalidate, in what order, and how do you verify there were no actions taken from the device after the theft window began?

Q069 | IAM & Security | Senior

A cross-account role meant for a vendor's read-only monitoring was created without an external ID, and the vendor's own breach notification just arrived. Explain the confused-deputy exposure, how you assess misuse from CloudTrail, and the trust-policy pattern you mandate afterward.

Q070 | IAM & Security | Mid

Developers keep pasting secrets into Lambda environment variables, and a recent fork of the repo leaked one. Compare Secrets Manager, Parameter Store, and KMS envelope patterns for their workflows, and describe how you make the secure path the easy path.

Q071 | IAM & Security | Senior

Security wants every human action in production to go through just-in-time elevated access with approval, but the on-call team protests it will slow incident response at 3 AM. Design the break-glass mechanism, its audit trail, and the abuse detection around it.

Q072 | IAM & Security | Mid

An IAM Access Analyzer finding shows a KMS key policy grants decrypt to an account ID nobody recognizes. Removing the access might break an unknown integration. How do you trace actual usage of the key, identify the consumer, and remediate safely?

Q073 | IAM & Security | Senior

Your bug bounty program reports SSRF in an internal admin tool that can reach the EC2 metadata service. IMDSv2 is enforced on new instances but hundreds of legacy ones remain. Prioritize your response across the app fix, metadata hardening, and role-permission reduction.

Q074 | IAM & Security | Mid

After enabling SSO with your identity provider, a team's automation broke because it relied on an IAM user tied to a departed employee. How do you inventory all human-tied credentials powering automation and migrate them to roles without a big-bang outage?

Q075 | IAM & Security | Senior

A red team exercise showed they escalated from a compromised CI runner to org-admin in forty minutes via chained role assumptions. Walk through how you map assumable-role graphs across accounts, break the chains, and continuously detect new escalation paths.

Q076 | IAM & Security | Senior

Compliance mandates customer-managed keys with annual rotation for all data at rest, but you discover that several services hold snapshots encrypted with a key that a departed engineer scheduled for deletion; six days remain. What is your recovery sequence and the governance fix?

Q077 | Cost Optimization (FinOps) | Mid

Your AWS bill doubled month over month and the CFO wants answers by Friday. Cost Explorer shows the jump spread across data transfer, NAT, and a new account. Describe your triage workflow to produce a defensible breakdown and the top three actions.

Q078 | Cost Optimization (FinOps) | Senior

You bought three-year Reserved Instances heavily, then the org migrated half the fleet to Graviton and Kubernetes, stranding commitments. How do you quantify the waste, use the RI marketplace and exchanges, and redesign the commitment strategy around Savings Plans and forecast confidence?

Q079 | Cost Optimization (FinOps) | Senior

Engineering teams ignore your cost dashboards because nothing maps to their services. Tagging coverage is 40%. How do you drive tagging to enforceable coverage — tag policies, CI checks, backfill — and design showback reports teams actually act on?

Q080 | Cost Optimization (FinOps) | Senior

A unit-economics review shows your cost per transaction rose 40% while traffic grew, the opposite of expected economies of scale. How do you instrument cost per business metric across services, find the superlinear component, and decide which architectural fix pays back fastest?

Q081 | Cost Optimization (FinOps) | Mid

Dev and staging environments run 24x7 and together cost more than production. Teams claim they need them always available for global colleagues. Design the scheduling, on-demand wake-up, and exception process that cuts the spend without making developers in India resent you.

Q082 | Cost Optimization (FinOps) | Senior

Your anomaly detection fired on a 600% spike in CloudWatch costs; investigation shows a debug log statement shipped to production emitting high-cardinality custom metrics. How do you stop the bleed, recover what you can with AWS support, and gate metric creation going forward?

Q083 | Cost Optimization (FinOps) | Mid

A rightsizing tool recommends downsizing 200 instances, but the last time someone followed such advice blindly, a quarter-end batch job fell over. How do you validate recommendations against seasonal utilization patterns and stage the changes with rollback safety?

Q084 | Cost Optimization (FinOps) | Senior

Leadership wants 30% cloud savings this fiscal year; you estimate only 15% is achievable without re-architecture. Build the case: how do you separate quick wins, commitment optimization, and engineering-effort items, and present the trade-off between savings targets and roadmap velocity?

Q085 | Cost Optimization (FinOps) | Mid

Your Kubernetes clusters show 25% average CPU utilization, yet teams keep over-requesting resources because throttling once burned them. How do you use historical usage to set requests, introduce vertical scaling recommendations, and rebuild trust so bin-packing improves?

Q086 | Cost Optimization (FinOps) | Senior

An ML team's training spend jumped from lakhs to crores in a quarter on GPU instances, often idle between experiments. How do you implement GPU pooling, spot strategies with checkpointing, and budget alerts that interrupt waste without blocking a critical model launch?

Q087 | Cost Optimization (FinOps) | Mid

After a migration, you find duplicated spend: the old environment was never decommissioned and has been quietly billing for five months. How do you verify nothing live depends on it, decommission safely, and change the migration checklist so this cannot recur?

Q088 | Cost Optimization (FinOps) | Senior

Procurement is negotiating an enterprise discount agreement and asks you for a three-year usage forecast. Your org's usage history is two years of chaos. How do you build a forecast with confidence intervals, and what commitment structure protects you against being wrong?

Q089 | Cost Optimization (FinOps) | Mid

Cross-AZ data transfer between your app tier and Kafka brokers is now a top-five line item. The architects say multi-AZ is non-negotiable. How do you cut the transfer cost — rack awareness, fetch-from-closest-replica, compression — while keeping the resilience story intact?

Q090 | Migration Strategy | Mid

Mid-way through a lift-and-shift, you discover the core app hardcodes the on-prem datacenter's IP ranges and an undocumented dependency on a local SMB share. How do you re-plan the wave, and what discovery steps were missed that you would mandate next time?

Q091 | Migration Strategy | Senior

Your six-month datacenter exit has a hard deadline because the colo contract ends, but two business-critical apps failed performance testing in the cloud. Walk through your option analysis — extend colo for a subset, throw hardware-equivalent instances at it, or accept degraded SLOs.

Q092 | Migration Strategy | Mid

A database migration cutover window is four hours on a Sunday night IST, and your dry run took six. Identify where you would attack the timeline — pre-seeding, CDC replication, parallelism — and what go or no-go checkpoints you define inside the window.

Q093 | Migration Strategy | Senior

Post-migration, the application works but costs three times the business case estimate, mostly from chatty cross-tier traffic that was free on-prem. How do you explain the miss, re-baseline the case honestly, and sequence the re-architecture that recovers the economics?

Q094 | Migration Strategy | Senior

You are migrating 80 apps and the assessment spreadsheet says 60 are rehost candidates, but app owners keep insisting theirs is special. Design the intake interview and scoring rubric that separates genuinely complex apps from owners protecting their turf.

Q095 | Migration Strategy | Senior

During a phased migration, half the services moved to AWS while the rest stay on-prem, and end-to-end latency for a user transaction crossing the boundary four times is now unacceptable. How do you re-sequence the waves to respect the call-graph instead of org boundaries?

Q096 | Migration Strategy | Mid

An Oracle-to-Postgres migration assessment claims 85% schema compatibility, but the PL/SQL packages contain business logic nobody fully understands. How do you build a conversion-plus-validation pipeline and decide what becomes application code versus stored procedures?

Q097 | Migration Strategy | Senior

Your VMware estate is being squeezed by license cost hikes, and leadership wants everything on cloud-native VMware-as-a-service as a shortcut. Argue both sides: when does the managed VMware bridge make sense versus doubling down on true rehost-and-refactor, and what lock-in does each create?

Q098 | Migration Strategy | Mid

Two days after cutover, users report intermittent errors that never appeared in testing, and the rollback window expires tonight because CDC replication back to on-prem will be stopped. Define the evidence you need within hours to make the rollback call.

Q099 | Migration Strategy | Senior

A banking client requires zero-downtime migration of a payments service processing UPI transactions around the clock. Describe your strangler-pattern plan — traffic mirroring, dual-write risks, reconciliation jobs — and the metrics that gate each percentage increase of shifted traffic.

Q100 | Migration Strategy | Mid

The migration plan assumed AWS DMS would handle a 4 TB SQL Server database, but LOB columns and a custom CLR assembly keep breaking full load. What fallback paths — native backup restore to RDS, snowball-style transfer, log shipping — do you evaluate and how?

Q101 | Migration Strategy | Senior

Six months after declaring migration success, you find shadow infrastructure: teams quietly kept on-prem servers running because they never trusted the cloud build. How do you surface the full shadow estate, address the legitimate gaps that drove it, and finish the exit truthfully?

Q102 | Migration Strategy | Senior

You are asked to produce a 7R disposition for 300 applications in eight weeks with a team of three. Perfect analysis is impossible. Describe the tooling, sampling strategy, and confidence labeling you use so downstream wave planning knows which dispositions to re-verify.

Q103 | Multi-Region & DR | Mid

Your first DR drill in two years failed: the runbook referenced deleted resources, DNS failover TTLs were 24 hours, and the standby database credentials had rotated. Prioritize what you fix first and how you make drills a recurring, low-drama exercise.

Q104 | Multi-Region & DR | Senior

Leadership wants active-active across ap-south-1 and ap-southeast-1 after an outage, but the core database cannot do multi-master writes. Walk through how you decompose the system by consistency requirements and propose realistic per-service strategies instead of a blanket active-active promise.

Q105 | Multi-Region & DR | Mid

During a regional impairment, your health checks failed over DNS to the standby region, but stateful sessions broke and users were logged out mid-transaction. How do you redesign session handling and failover sequencing so the next failover is invisible?

Q106 | Multi-Region & DR | Senior

Your RPO commitment is five minutes, but a replication-lag audit shows the cross-region database replica regularly lags twenty minutes during nightly batch loads. How do you reconcile the commitment — throttle batches, redesign replication, or renegotiate the RPO with evidence?
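
Whichever option is chosen, the "with evidence" part needs a repeatable measurement. The sketch below is one possible shape for it, assuming replication lag is already sampled in seconds at a fixed interval; the sample values are invented for illustration and the percentile uses the simple nearest-rank method.

```python
import math


def rpo_report(lag_samples_seconds, rpo_seconds, percentile=95):
    """Summarize how often observed replication lag stays within the RPO."""
    if not lag_samples_seconds:
        raise ValueError("need at least one lag sample")
    ordered = sorted(lag_samples_seconds)
    breaches = sum(1 for lag in ordered if lag > rpo_seconds)
    # Nearest-rank percentile: smallest sample with at least
    # `percentile` percent of the samples at or below it.
    rank = math.ceil(percentile / 100 * len(ordered))
    return {
        "samples": len(ordered),
        "breaches": breaches,
        "fraction_within_rpo": (len(ordered) - breaches) / len(ordered),
        "worst_lag_seconds": ordered[-1],
        "percentile_lag_seconds": ordered[rank - 1],
    }


if __name__ == "__main__":
    # Invented samples: quiet daytime lag, then a nightly batch spike.
    samples = [20, 35, 40, 30, 900, 1200, 1150, 60]
    print(rpo_report(samples, rpo_seconds=300))
```

A report like this, split by time of day, shows whether the breach is confined to the batch window, which in turn decides between throttling batches and renegotiating the commitment.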

Q107 | Multi-Region & DR | Senior

A pilot-light DR setup has drifted: the standby AMIs are eight months old and a security group change never propagated. Design the automation that keeps the pilot light continuously deployable and the drift detection that alerts before an auditor or disaster finds it.

Q108 | Multi-Region & DR | Senior

Finance challenges the DR budget: the warm standby region costs 60% of production and has never been used. Defend or restructure it — tiering applications by RTO, scaling standby down, or moving cold tiers to backup-and-restore — with the math you would present.

Q109Multi-Region & DRMid

After failing over to the DR region during a drill, failback to primary corrupted recent writes because reverse replication was never tested. Describe the failback design — write fencing, reconciliation, cutover order — you implement so failback is as rehearsed as failover.

Q110Multi-Region & DRSenior

An RBI-regulated client requires DR within India, but your primary is ap-south-1 and the only other Indian region is ap-south-2 with fewer service offerings. Audit which of your services lack parity there and design around the gaps without leaving the country.

Q111Multi-Region & DRMid

Your S3-based data lake replicates cross-region, but a recent corruption bug replicated bad data to DR within minutes, defeating the purpose. How do you add point-in-time recoverability — versioning windows, backup vaults, delayed replicas — to protect against logical corruption?

Q112Multi-Region & DRSenior

A region-wide IAM or control-plane outage means you cannot make changes in the primary, and your failover automation itself runs in that region. Redesign the failover control plane so it works when the primary region's APIs are the thing that is down.

Q113Multi-Region & DRMid

Route 53 health checks flapped during a partial outage and traffic ping-ponged between regions for twenty minutes, making things worse. How do you tune health check thresholds, add manual override switches, and decide between failover and weighted routing policies?
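
The tuning idea behind asymmetric thresholds plus a manual override can be modelled in a few lines. This is a conceptual sketch only: `FailoverGate` and its fields are hypothetical names, and it does not reproduce how Route 53 evaluates health checks internally.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailoverGate:
    fail_threshold: int = 3       # consecutive failures before failing over
    recover_threshold: int = 10   # consecutive successes before failing back
    active: str = "primary"
    override: Optional[str] = None  # manual switch; wins over health signals
    _fails: int = 0
    _oks: int = 0

    def observe(self, primary_healthy: bool) -> str:
        """Feed one health-check result and return the region to serve from."""
        if self.override is not None:
            return self.override
        if primary_healthy:
            self._oks += 1
            self._fails = 0
            if self.active == "standby" and self._oks >= self.recover_threshold:
                self.active = "primary"
        else:
            self._fails += 1
            self._oks = 0
            if self.active == "primary" and self._fails >= self.fail_threshold:
                self.active = "standby"
        return self.active


gate = FailoverGate(fail_threshold=3, recover_threshold=5)
flapping = [False, True, False, True, False, False]
during_flap = [gate.observe(ok) for ok in flapping]
```

Requiring more consecutive successes to fail back than failures to fail over is what stops the ping-pong. The override gives the incident commander a way to pin traffic regardless of what the checks report.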

Q114Multi-Region & DRSenior

Your event-driven system uses regional Kafka clusters, and a DR drill revealed consumers in the standby region reprocessed three hours of events, sending duplicate notifications to customers. Design the offset-sync and idempotency strategy that makes regional failover safe for consumers.

Q115Multi-Region & DRSenior

You must define a company-wide DR tiering policy: hundreds of services, each owner claiming theirs is tier-1. Construct the objective criteria, the cost each tier implies, and the chargeback mechanism that makes owners honest about their real RTO needs.

Q116Databases in Cloud (RDS/DynamoDB)Mid

An RDS Postgres instance hit storage-full at 2 AM and went read-only, paging you. Autoscaling storage was off and a runaway temp table was the trigger. Walk through your recovery sequence and the monitoring thresholds you set so this pages nobody again.

Q117Databases in Cloud (RDS/DynamoDB)Senior

A DynamoDB table designed years ago now throttles on a hot partition every time a celebrity user posts. The partition key is userId. How do you confirm the hot key with CloudWatch Contributor Insights for DynamoDB and redesign the key schema with write sharding while the table stays live?
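
Write sharding usually means appending a deterministic suffix to the partition key so one user's writes spread over several partitions. A minimal sketch, assuming a hypothetical `item_id` attribute to hash on and a `userId#shard` key format (neither is given in the scenario):

```python
import hashlib
from typing import List


def sharded_key(user_id: str, item_id: str, shard_count: int) -> str:
    """Derive a stable shard suffix from the item ID so retries hit the same key."""
    digest = hashlib.sha256(item_id.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:4], "big") % shard_count
    return f"{user_id}#{shard}"


def all_shard_keys(user_id: str, shard_count: int) -> List[str]:
    """Keys a reader must query (and merge) to see all of one user's items."""
    return [f"{user_id}#{n}" for n in range(shard_count)]


key = sharded_key("celebrity-42", "post-1001", shard_count=8)
read_keys = all_shard_keys("celebrity-42", shard_count=8)
```

The trade-off to call out is on the read side: every read for that user becomes a fan-out across `shard_count` keys, so the shard count should be sized to the write hotspot rather than set arbitrarily high.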

Q118Databases in Cloud (RDS/DynamoDB)Mid

After a minor-version RDS maintenance window, application latency doubled. The team suspects plan regressions from the engine upgrade. How do you compare query plans before and after, stabilize the worst offenders, and change how you stage engine upgrades?

Q119Databases in Cloud (RDS/DynamoDB)Senior

Your Aurora cluster failed over and the application stayed broken for ten minutes because connection pools cached the old writer endpoint. Explain the DNS and driver behavior at fault, and the combination of smart drivers, RDS Proxy, and pool settings you would implement.

Q120Databases in Cloud (RDS/DynamoDB)Mid

A DynamoDB on-demand table's cost rose 5x after a new feature added a scan-based admin report run hourly. How do you find the offending access pattern from metrics, and what GSI or export-based redesign do you give the feature team?

Q121Databases in Cloud (RDS/DynamoDB)Senior

Compliance requires point-in-time recovery tested quarterly, but your last RDS PITR test restored to an instance that took four hours to become performant due to lazy loading and cold buffers. Redesign the restore validation so the tested RTO reflects reality.

Q122Databases in Cloud (RDS/DynamoDB)Mid

Read replicas were added to scale an RDS MySQL workload, but replica lag spikes during flash sales make users see stale order status. How do you measure lag impact, route critical reads to the writer selectively, and decide if caching beats more replicas?

Q123Databases in Cloud (RDS/DynamoDB)Senior

You inherit a single 12 TB RDS instance serving forty microservices, the org's biggest single point of failure. Decompose the path to per-domain databases — dependency mapping, dual-write hazards, CDC-based extraction — and what you do first for immediate risk reduction.

Q124Databases in Cloud (RDS/DynamoDB)Mid

A developer enabled DynamoDB Streams with a Lambda trigger that errors on malformed items, and the iterator age has been climbing for days, risking data loss at the 24-hour limit. Describe your triage, replay options, and the dead-letter design you add.

Q125Databases in Cloud (RDS/DynamoDB)Senior

Your team wants DynamoDB global tables for an India-plus-Singapore deployment, but the product has a strict requirement that a user's writes are immediately readable in their home region after failover. Explain the conflict-resolution semantics and where this requirement actually breaks.

Q126Databases in Cloud (RDS/DynamoDB)Mid

RDS Performance Insights shows your top wait event is lock contention from a nightly reconciliation job colliding with user traffic. The two never used to overlap, but global expansion has pushed user activity into the job's window. How do you confirm the collision and restructure the job without a rewrite?

Q127Databases in Cloud (RDS/DynamoDB)Senior

An audit reveals your RDS snapshots are unencrypted because the source instance predates the encryption mandate, and you cannot encrypt in place. Plan the encrypted-copy migration for a 6 TB production database with a 30-minute maximum downtime budget.

Q128Serverless ArchitectureMid

A serverless API works fine until a marketing campaign in India triggers ten times normal load, and downstream RDS connections exhaust instantly. Lambda scaled; the database did not. How do you retrofit connection pooling, throttling, and queue-based buffering without a rewrite?

Q129Serverless ArchitectureSenior

Your Step Functions workflow orchestrating order fulfilment silently stopped progressing for a subset of executions after a state machine update changed an input path. How do you detect schema-drift failures across versions, replay stuck executions, and version state machines safely?

Q130Serverless ArchitectureMid

Finance flags that your serverless stack costs more than the containerized service it replaced, driven by per-request pricing at sustained high throughput. How do you model the crossover point between Lambda and Fargate or EC2, and present the re-platforming case?
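
The crossover point falls out of a simple per-request cost model. In the sketch below every price and the fixed container cost are placeholder values passed in as parameters, not current AWS list prices. Substitute the real rates for your region before presenting anything.

```python
def lambda_monthly_cost(requests: float, avg_duration_s: float, memory_gb: float,
                        price_per_request: float, price_per_gb_second: float) -> float:
    """Request charge plus compute charge for a month of invocations."""
    per_request = price_per_request + avg_duration_s * memory_gb * price_per_gb_second
    return requests * per_request


def crossover_requests(fixed_monthly_cost: float, avg_duration_s: float, memory_gb: float,
                       price_per_request: float, price_per_gb_second: float) -> float:
    """Monthly request volume at which Lambda spend equals a fixed container bill."""
    per_request = price_per_request + avg_duration_s * memory_gb * price_per_gb_second
    return fixed_monthly_cost / per_request


# Placeholder inputs (assumptions, not quoted prices).
pricing = dict(avg_duration_s=0.2, memory_gb=0.5,
               price_per_request=0.0000002, price_per_gb_second=0.0000166667)
container_bill = 300.0  # hypothetical always-on Fargate or EC2 cost per month

breakeven = crossover_requests(container_bill, **pricing)
print(f"Lambda is cheaper below roughly {breakeven:,.0f} requests per month")
```

A fuller model would add the costs this one ignores, such as API Gateway charges, NAT traffic, and the engineering time to run containers.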

Q131Serverless ArchitectureSenior

A 15-minute Lambda timeout truncates your largest customers' export jobs, and the team keeps raising memory as a workaround. Re-architect the job — chunking with Step Functions, Fargate tasks, or S3 multipart assembly — and justify the option you would pick.

Q132Serverless ArchitectureMid

Tracing shows a single user request fans out through six Lambdas chained synchronously, and p99 latency stacks up beyond two seconds. How do you decide what to collapse, what to make asynchronous, and how to keep the trace coherent across the redesign?

Q133Serverless ArchitectureSenior

An incident revealed your serverless system has no usable local development story: developers test only in a shared dev account, stepping on each other and occasionally on production event buses. Design the per-developer isolation, ephemeral stacks, and event-mocking strategy you would implement.

Q134Serverless ArchitectureMid

A Lambda@Edge function added for A/B testing increased TTFB for users in tier-2 Indian cities and occasionally throws region-replication errors on deploy. How do you measure its real latency cost and evaluate CloudFront Functions or origin-side logic instead?

Q135Serverless ArchitectureSenior

Your API Gateway bill and limits are strained at a billion requests monthly, and the team debates moving to ALB plus Lambda or function URLs. Compare the trade-offs — authorizers, WAF, throttling, per-request cost — and define the migration test plan.

Q136Serverless ArchitectureMid

After a deploy, a Lambda's unhandled promise rejections started silently dropping records because the handler returned before async work finished. Walk through how you catch this class of bug — runtime behavior, DLQs, destination configs — and the code-review rule you add.

Q137Serverless ArchitectureSenior

You must implement a strict per-tenant rate limit and quota system in a serverless multi-tenant SaaS, where noisy tenants currently starve others through shared Lambda concurrency. Design the isolation — reserved concurrency tiers, token buckets in DynamoDB or Redis, API keys — and its failure modes.
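
The token-bucket part of such a design can be sketched in memory. This is an illustration under stated assumptions: `TokenBucket` and `TenantLimiter` are hypothetical names, and a real deployment would hold the bucket state in DynamoDB or Redis and update it atomically, which this sketch does not attempt.

```python
from typing import Dict


class TokenBucket:
    def __init__(self, rate_per_sec: float, capacity: float, now: float) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity      # start full so a new tenant can burst
        self.updated = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Refill based on elapsed time, then spend tokens if enough remain."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = max(self.updated, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class TenantLimiter:
    """One bucket per tenant, so a noisy tenant only drains its own tokens."""

    def __init__(self, rate_per_sec: float, capacity: float) -> None:
        self._rate = rate_per_sec
        self._capacity = capacity
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, tenant_id: str, now: float) -> bool:
        bucket = self._buckets.get(tenant_id)
        if bucket is None:
            bucket = TokenBucket(self._rate, self._capacity, now)
            self._buckets[tenant_id] = bucket
        return bucket.allow(now)


limiter = TenantLimiter(rate_per_sec=1.0, capacity=2.0)
noisy = [limiter.allow("tenant-a", now=0.0) for _ in range(3)]
quiet = limiter.allow("tenant-b", now=0.0)
```

The failure mode worth discussing is what happens when the shared state store is slow or unavailable: failing open lets noisy tenants through, while failing closed turns a limiter outage into a full outage.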

Q138Serverless ArchitectureMid

A scheduled EventBridge rule that triggers your nightly billing Lambda fired twice on one occasion, and customers were double-charged. How do you investigate the duplicate trigger, build idempotency into the billing path, and decide whether scheduled Lambdas suit billing at all?

Q139Serverless ArchitectureSenior

Cold starts on a JVM-based Lambda fleet breach your 500ms p99 contract every deploy as all containers cycle. Compare SnapStart, provisioned concurrency with gradual deployment, and a runtime rewrite, including the operational cost of each, and commit to a recommendation.

Q140Event-Driven (SQS/SNS/EventBridge/Kafka)Mid

An SQS consumer deployment with a new bug NACKed everything, and after several retries thousands of messages landed in the DLQ overnight. Describe your redrive strategy — fix verification, replay rate, poison-message isolation — without overwhelming the downstream system again.

Q141Event-Driven (SQS/SNS/EventBridge/Kafka)Senior

A Kafka topic's consumer group rebalances every few minutes during peak, stalling processing. You suspect max.poll.interval.ms violations caused by a slow downstream call. Walk through the metrics that confirm it and the consumer redesign you would apply: async processing, pause-resume, or cooperative rebalancing.

Q142Event-Driven (SQS/SNS/EventBridge/Kafka)Mid

Your team used SNS fan-out to five SQS queues, and one subscriber team reports missing roughly 1% of messages. Nothing errors. How do you investigate delivery — filter policies, queue permissions, raw delivery settings — and prove where messages disappear?

Q143Event-Driven (SQS/SNS/EventBridge/Kafka)Senior

An EventBridge bus has become the org's nervous system, but a schema change by one producer broke eleven consumers simultaneously last week. Design the governance — schema registry, contract testing, versioned detail-types, consumer canaries — that lets the bus evolve without these blast radii.

Q144Event-Driven (SQS/SNS/EventBridge/Kafka)Mid

A FIFO SQS queue throttles at exactly the per-message-group throughput limit because every message uses the same group ID, serializing the workload. How do you choose a better grouping key while preserving the ordering guarantees the business actually needs?

Q145Event-Driven (SQS/SNS/EventBridge/Kafka)Senior

During a regional incident, your MSK cluster lost a broker and under-replicated partitions piled up while producers with acks=all stalled, taking checkout down. Reassess the trade-offs — min.insync.replicas, acks, topic placement — and the load testing that would have exposed this.

Q146Event-Driven (SQS/SNS/EventBridge/Kafka)Mid

An order event consumed twice created duplicate shipments because the consumer's deduplication relied on best-effort SQS deduplication IDs beyond the five-minute window. Explain where the window failed and the idempotency-key design you implement at the consumer side.
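
A consumer-side idempotency key can be sketched as a "claim before acting" check. The in-memory store below stands in for a durable table with a conditional write. The key format, class, and function names are assumptions for illustration.

```python
from typing import Callable, Dict, List


class IdempotencyStore:
    """In-memory stand-in for a durable store with an atomic 'insert if absent'."""

    def __init__(self) -> None:
        self._seen: Dict[str, str] = {}

    def claim(self, key: str) -> bool:
        """Return True only for the first caller to present this key."""
        if key in self._seen:
            return False
        self._seen[key] = "claimed"
        return True


def handle_order_event(event: Dict[str, str], store: IdempotencyStore,
                       create_shipment: Callable[[str], None]) -> bool:
    """Create a shipment at most once per (order, event type) pair."""
    key = f"{event['order_id']}:{event['event_type']}"
    if not store.claim(key):
        return False  # duplicate delivery, safely ignored
    create_shipment(event["order_id"])
    return True


shipments: List[str] = []
store = IdempotencyStore()
event = {"order_id": "ord-7", "event_type": "ORDER_PAID"}
first = handle_order_event(event, store, shipments.append)
second = handle_order_event(event, store, shipments.append)
```

Unlike the queue's deduplication window, the key here lives as long as you keep the record, so a redelivery hours later is still rejected. A production version also has to decide what happens when the claim succeeds but the shipment call fails.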

Q147Event-Driven (SQS/SNS/EventBridge/Kafka)Senior

Your event-sourced system needs to replay nine months of Kafka events to rebuild a corrupted projection, but retention is thirty days and the archive lives in S3 as hourly dumps. Design the replay pipeline, ordering guarantees, and how live traffic merges in afterward.

Q148Event-Driven (SQS/SNS/EventBridge/Kafka)Mid

Queue depth on a payment-notification SQS queue grows every evening IST and drains by morning, adding hours of delay customers notice. How do you decide between scaling consumers on queue-depth metrics, batching changes, and smoothing the upstream producer?

Q149Event-Driven (SQS/SNS/EventBridge/Kafka)Senior

A new team wants to publish domain events to both EventBridge and Kafka because consumers live in both ecosystems, risking divergent dual-publishes. Architect the single-source-of-truth pattern — outbox, bridge connectors, ownership rules — and the consistency failure modes you must monitor.

Q150Event-Driven (SQS/SNS/EventBridge/Kafka)Mid

Your Lambda triggered by SQS keeps scaling until it exhausts database connections, despite a reserved concurrency cap, because batch size and concurrency interact in ways the team did not model. Explain the tuning levers and the back-pressure design you settle on.

Q151Event-Driven (SQS/SNS/EventBridge/Kafka)Senior

An incident postmortem shows an EventBridge rule misconfiguration silently dropped events for six days — no errors, no DLQ, just absence. Design the end-to-end delivery observability — canary events, per-rule metrics, archive-replay verification — that turns silent loss into a paged alert.

Q152Event-Driven (SQS/SNS/EventBridge/Kafka)Senior

You must migrate a self-managed Kafka cluster on EC2 to MSK with zero consumer-visible downtime, while sixty applications hardcode broker addresses. Sequence the migration — MirrorMaker topology, offset translation, client cutover waves — and the rollback story at each phase.

Q153Terraform/CloudFormationMid

A teammate ran terraform apply with a stale state file from a local backup, and now state and reality disagree across dozens of resources. Walk through your reconciliation — state pull, imports, targeted refresh — and the backend locking setup that prevents recurrence.

Q154Terraform/CloudFormationSenior

Your monolithic Terraform root module takes 25 minutes to plan and every change risks the whole stack. Design the decomposition into layered states — network, platform, app — including how you migrate resources between states without destroy-recreate on live infrastructure.

Q155Terraform/CloudFormationMid

A terraform plan in CI shows 47 unexpected changes after a provider version bump nobody pinned. How do you separate cosmetic diffs from destructive ones, introduce version pinning and lock files, and unblock the urgent change that started all this?

Q156Terraform/CloudFormationSenior

Someone manually fixed production during an incident, and the next pipeline apply reverted their fix, recreating the outage at 4 AM. Design the drift-detection and emergency-change workflow that lets humans act fast during incidents without the pipeline undoing them later.

Q157Terraform/CloudFormationMid

A CloudFormation stack update is stuck in UPDATE_ROLLBACK_FAILED on a production stack because a resource was deleted out-of-band. Walk through your options — continue rollback with skips, drift detection, resource import — and how you avoid making it worse.

Q158Terraform/CloudFormationSenior

Secrets keep leaking into Terraform state — RDS passwords, API keys — and state files are readable by every engineer via the shared backend bucket. Redesign secret handling and state access — ephemeral resources, external secret references, state encryption and IAM — with a migration plan.

Q159Terraform/CloudFormationSenior

Your team copies a 2,000-line Terraform module between projects and the copies have diverged for a year. How do you converge them into a versioned registry module, handle the breaking differences, and roll consumers forward without simultaneous big-bang upgrades?

Q160Terraform/CloudFormationSenior

A terraform destroy targeting a dev workspace ran against production because workspace selection relied on a local environment variable. Postmortem the failure modes and design the guardrails — separate accounts, pipeline-only applies, deletion protection, plan approvals — ranked by which you implement first.

Q161Terraform/CloudFormationMid

After you rename a resource in Terraform, the plan wants to destroy and recreate your production NAT gateways, which would drop traffic. Explain how you use moved blocks or terraform state mv to refactor safely, and how you verify the plan is truly a no-op.

Q162Terraform/CloudFormationSenior

Your org runs both CloudFormation (inherited) and Terraform (new standard), and resources increasingly depend across the boundary, causing deploy-order deadlocks. Define the interop strategy — SSM parameter contracts, data sources, migration priorities — and how you decide what migrates versus stays.

Q163Terraform/CloudFormationMid

A junior engineer's PR passed review, but apply failed halfway, leaving an ALB created and its listeners missing during business hours. How do you structure plans, apply stages, and health-checked deployments so partial applies cannot leave production half-built?

Q164Terraform/CloudFormationSenior

Policy-as-code is mandated: no public buckets, mandatory tags, approved instance types. Teams already bypass review with click-ops when the pipeline is slow. Design enforcement across OPA or Sentinel in CI plus SCP backstops, and the developer-experience work that removes the bypass incentive.

Q165Terraform/CloudFormationMid

Terraform state for a critical stack was corrupted during a failed migration to a new backend, and the latest readable snapshot is a week old. Describe your recovery — versioned backend objects, state surgery, selective imports — and what you change about backend versioning.

Q166Landing Zones & Multi-AccountMid

Your startup grew from one AWS account to one giant shared account with 30 teams, and a dev experiment just throttled production APIs through shared service quotas. Make the case for a multi-account split and outline the first three accounts you would carve out.

Q167Landing Zones & Multi-AccountSenior

You are designing a landing zone for a fintech with 50 expected accounts. Walk through your OU structure, the first five SCPs you would write, the identity model, and centralized logging design — plus the decisions you would deliberately defer until later.

Q168Landing Zones & Multi-AccountMid

An SCP intended to block accounts from leaving the organization also blocked a member account from updating its own alternate contacts, breaking a compliance task. How do you test SCP changes safely, and what staging structure do you build for policy rollout?

Q169Landing Zones & Multi-AccountSenior

A team spun up accounts outside Control Tower using a personal email, and one now hosts a customer-facing service. Describe the enrollment of brownfield accounts into the landing zone — guardrail conflicts, baseline retrofitting — and the detection that catches unmanaged accounts early.

Q170Landing Zones & Multi-AccountSenior

Developers complain it takes three weeks to get a new account with networking, DNS, and CI access. Design the account-vending automation — factory pipeline, baseline stacks, default budgets — that delivers a ready account in under a day with controls intact.

Q171Landing Zones & Multi-AccountSenior

Your hub-and-spoke shared-services VPC pattern hit Transit Gateway attachment and route limits as account count crossed 200. How do you re-architect — multiple TGWs, Cloud WAN, cell-based segmentation — and migrate attachments without breaking inter-account traffic?

Q172Landing Zones & Multi-AccountMid

Centralized CloudTrail and Config aggregation exist, but during an incident you discovered three accounts had Config recorder stopped for months. How do you continuously verify baseline services are running in every account and auto-remediate when someone turns them off?

Q173Landing Zones & Multi-AccountSenior

Two business units demand isolation so strict they want separate AWS Organizations, citing acquisition plans and regulatory walls. Argue the trade-offs — consolidated billing loss, duplicated guardrails, identity federation complexity — and the single-org alternative with hard boundaries you would counter-propose.

Q174Landing Zones & Multi-AccountMid

A shared ECR registry in the tooling account serves 80 accounts, and a misconfigured repository policy briefly let any AWS principal pull your proprietary images. How do you audit who actually pulled, fix the policy pattern, and add org-path conditions everywhere?

Q175Landing Zones & Multi-AccountSenior

Your landing zone predates IAM Identity Center; 60 accounts use hand-rolled cross-account roles with inconsistent names and trust policies. Plan the migration to centralized identity — permission-set design, role coexistence period, automation cutover — without locking out CI systems mid-flight.

Q176Landing Zones & Multi-AccountMid

The platform team owns the network account, and an app team's deploy failed because a subnet share via RAM was quietly un-shared during cleanup. How do you make shared-resource dependencies explicit, alert on share changes, and define ownership contracts between teams?

Q177Landing Zones & Multi-AccountSenior

Leadership asks what it would take to evacuate a compromised member account entirely within 48 hours — workloads, data, DNS — while preserving forensic evidence. Design the account-evacuation runbook and the standing capabilities that make it actually achievable.

Q178Compliance & Data ResidencyMid

A customer contract requires all their data to remain in India, but you discover your CDN logs and a third-party error tracker ship data to US regions. How do you inventory every egress path for that tenant's data and remediate without degrading observability?

Q179Compliance & Data ResidencySenior

Under the DPDP Act, your legal team asks engineering to support data principal erasure requests within a defined SLA, but personal data is scattered across S3 lakes, DynamoDB, backups, and Kafka retention. Architect the discovery, erasure, and proof-of-deletion pipeline, including the backup problem.

Q180Compliance & Data ResidencyMid

An RBI-regulated client's auditor asks you to demonstrate that payment data never left Indian regions in the past year. What evidence can you actually assemble — CloudTrail, flow logs, replication configs, support-case access logs — and where are the honest gaps?

Q181Compliance & Data ResidencySenior

Your SaaS wants to sell to EU customers, triggering GDPR data-localization preferences, while your single-region architecture sits in ap-south-1. Design the cell-based, region-pinned tenancy model — tenant routing, shared control plane versus regional data planes — and its operational cost.

Q182Compliance & Data ResidencyMid

A developer copied a production database snapshot into a dev account to debug an issue, violating your data-handling policy and possibly regulations. Walk through containment, the technical guardrails — snapshot-sharing SCPs, masked clones, synthetic data — and how you make safe debugging easy.

Q183Compliance & Data ResidencySenior

SOC 2 evidence collection consumes a sprint every quarter: screenshots, manual exports, ticket archaeology. Design the continuous-compliance pipeline — Config rules mapped to controls, automated evidence storage, drift alerts — and how you keep auditors happy with machine-generated proof.

Q184Compliance & Data ResidencyMid

Your logging pipeline accidentally captured Aadhaar numbers in application logs replicated to a third-party SIEM overseas. Sequence the response — scoping, purging across hot and cold storage, vendor coordination, disclosure assessment — and the log-scrubbing controls you implement after.

Q185Compliance & Data ResidencySenior

A government tender requires deployment on a MeitY-empanelled cloud with sovereign controls, but your stack leans on managed services unavailable there. How do you assess the porting gap, decide between containerizing substitutes and re-architecting, and price the sovereign variant honestly?

Q186Compliance & Data ResidencyMid

Encryption-at-rest is mandated everywhere, but a scan finds unencrypted EBS volumes, an old Elasticsearch domain, and SQS queues without KMS. Prioritize remediation by data sensitivity and re-encryption difficulty, and describe which fixes require downtime and which do not.

Q187Compliance & Data ResidencySenior

Legal asks whether AWS support engineers abroad accessing a console session for troubleshooting constitutes a cross-border data transfer for your regulated workload. How do you evaluate and configure the technical controls — support access restrictions, KMS key policies, session recording — behind your answer?

Q188Compliance & Data ResidencyMid

Your data warehouse team wants to use a US-region AI service to process customer text that may contain PII from Indian users. Define the evaluation you run — data flow mapping, anonymization feasibility, regional alternatives — before you sign off or refuse.

Q189Compliance & Data ResidencySenior

An acquisition closes and you inherit a codebase whose backups, DR region, and analytics pipeline all violate your industry's residency commitments. You have ninety days per the compliance attestation cycle. Triage what you fix, what you isolate, and what you disclose.

Q190Hybrid & On-Prem ConnectivityMid

Your site-to-site VPN between the Pune datacenter and AWS flaps several times daily, breaking long-lived database replication sessions. Walk through how you diagnose the tunnel drops — DPD timers, rekey mismatches, ISP path changes — and stabilize or redesign the link.

Q191Hybrid & On-Prem ConnectivitySenior

A Direct Connect circuit through a Mumbai colocation went down during a fiber cut, and your supposed VPN backup could not carry the replication traffic, blowing your RPO. Redesign the hybrid connectivity for graceful degradation, and define what traffic gets dropped first.

Q192Hybrid & On-Prem ConnectivityMid

After connecting on-prem AD-integrated DNS with Route 53 resolver endpoints, some cloud workloads intermittently resolve on-prem hostnames to stale records. How do you trace the forwarding chain, identify caching layers at fault, and set conditional forwarding rules correctly?

Q193Hybrid & On-Prem ConnectivitySenior

Your hybrid app spans on-prem Oracle and cloud microservices, and a 35ms round trip across the Direct Connect makes a chatty ORM unusable. Quantify the chattiness, then argue the options — caching tier, data replication into cloud, API facade on-prem — with their consistency costs.

Q194Hybrid & On-Prem ConnectivityMid

A factory site with unreliable 50 Mbps connectivity must sync sensor data to the cloud and tolerate day-long outages. Design the edge buffering, compression, and backfill strategy, and explain how you prevent backfill floods from starving live telemetry when the link returns.
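
One way to keep backfill from starving live telemetry is to serve live demand first and cap backfill at a share of the link. A minimal sketch, assuming a hypothetical `allocate_link` helper and a 50% backfill cap chosen purely for illustration:

```python
from typing import Tuple


def allocate_link(link_kbps: int, live_demand_kbps: int, backfill_demand_kbps: int,
                  backfill_cap_fraction: float = 0.5) -> Tuple[int, int]:
    """Return (live, backfill) bandwidth in kbps, with live traffic served first."""
    live = min(live_demand_kbps, link_kbps)
    backfill_cap = int(link_kbps * backfill_cap_fraction)
    backfill = min(backfill_demand_kbps, link_kbps - live, backfill_cap)
    return live, backfill


# 50 Mbps link, 8 Mbps of live telemetry, and a day of buffered data to drain.
live, backfill = allocate_link(50_000, 8_000, 1_000_000)
print(f"live={live} kbps, backfill={backfill} kbps")
```

The cap leaves headroom for live bursts at the cost of a longer catch-up time, which is the trade-off to quantify for a day-long outage.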

Q195Hybrid & On-Prem ConnectivitySenior

Security mandates that all cloud-bound traffic from on-prem traverse the corporate proxy, which now bottlenecks at 2 Gbps and adds 20ms, infuriating teams using cloud storage heavily. Negotiate the architecture — proxy bypass for trusted prefixes, TLS inspection scope, private endpoints — defensibly.

Q196Hybrid & On-Prem ConnectivityMid

An on-prem Kubernetes cluster joins your service mesh with cloud clusters, but cross-environment mTLS fails intermittently because the on-prem node clocks drift without reliable NTP. Describe how you confirm certificate-validity skew as the cause and harden time sync across environments.
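
The skew hypothesis can be checked by comparing a certificate's validity window with what a drifted node believes the time is. A minimal sketch with invented timestamps, where `rejected_by_skew` is a hypothetical helper rather than part of any mesh tooling:

```python
from datetime import datetime, timedelta, timezone


def rejected_by_skew(not_before: datetime, not_after: datetime,
                     true_now: datetime, node_clock_offset: timedelta) -> bool:
    """True if a node with the given clock offset sees the cert as invalid."""
    node_now = true_now + node_clock_offset
    return not (not_before <= node_now <= not_after)


issued = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
expires = issued + timedelta(hours=24)      # short-lived workload certificate
handshake = issued + timedelta(seconds=30)  # peer connects right after issuance

# A node running two minutes slow thinks the cert is not yet valid.
slow_node = rejected_by_skew(issued, expires, handshake, timedelta(minutes=-2))
synced_node = rejected_by_skew(issued, expires, handshake, timedelta(0))
```

This also explains why the failures are intermittent: they cluster just after issuance or rotation, when the validity window and the drifted clock disagree.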

Q197Hybrid & On-Prem ConnectivitySenior

Your BGP session over Direct Connect advertises on-prem routes, and an engineer accidentally announced a supernet overlapping a VPC CIDR, blackholing cloud-internal traffic for minutes. Design the route filtering, prefix limits, and change-control for routing that prevents a recurrence.
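
The change-control side of an answer could include a pre-merge check that rejects any proposed announcement overlapping cloud address space. A minimal sketch using Python's standard `ipaddress` module, where the prefixes and the function name are hypothetical:

```python
import ipaddress
from typing import Iterable, List


def overlapping_announcements(announced: Iterable[str],
                              protected: Iterable[str]) -> List[str]:
    """Return announced prefixes that overlap any protected (VPC) CIDR."""
    protected_nets = [ipaddress.ip_network(cidr) for cidr in protected]
    rejected = []
    for prefix in announced:
        net = ipaddress.ip_network(prefix)
        if any(net.overlaps(p) for p in protected_nets):
            rejected.append(prefix)
    return rejected


vpc_cidrs = ["10.20.0.0/16", "10.21.0.0/16"]       # hypothetical VPC ranges
proposed = ["10.0.0.0/8", "192.168.10.0/24"]       # hypothetical on-prem routes
rejected = overlapping_announcements(proposed, vpc_cidrs)
```

A check like this only guards the review step. Router-side prefix lists and prefix limits are still needed for the case where someone bypasses the pipeline.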

Q198Hybrid & On-Prem ConnectivityMid

Backups from on-prem to S3 saturate the shared Direct Connect every night, and morning replication jobs queue behind them. How do you implement bandwidth scheduling and QoS across the hybrid link, and when would you argue for a dedicated transfer path instead?

Q199Hybrid & On-Prem ConnectivitySenior

Leadership wants to run AWS Outposts in your Hyderabad datacenter for a latency-critical workload, but your facilities team has never hosted vendor-managed racks. Assess the operational realities — power and network prerequisites, service subset limits, update windows — and the exit strategy if it fails.

Q200Hybrid & On-Prem ConnectivitySenior

A latency-sensitive trading workload must stay on-prem for exchange colocation rules while its risk-analytics pipeline runs in cloud, and end-of-day spikes overwhelm the link. Architect the data flow — aggregation at the edge, delta sync, burst capacity — within the regulatory boundary.
