Application / Production Support Engineer Interview Questions
200 scenario-based questions with detailed model answers, organized skill-wise and tool-wise. Filter by topic, level or keyword, reveal the answer — then pressure-test yourself in a real mock.
An enterprise customer reports intermittent checkout failures nobody can reproduce in staging. You have one blurry screenshot and an angry executive sponsor demanding answers today. How do you narrow the problem space before going back to the customer for more data?
A customer insists a bug appeared 'after your last release', but the changelog shows nothing related to their workflow. Describe how you separate correlation from causation here without dismissing their claim or burning days bisecting releases.
Three different customers report similar slowness, but in three different modules, on three different product versions. How do you decide whether you are chasing one root cause or three, and how do you structure the parallel investigation?
You inherit a ticket that has bounced between four engineers over six weeks, each adding contradictory notes. The customer is furious. How do you rebuild a coherent picture of the issue and restart the investigation without repeating their work?
A bug reproduces only on the customer's production tenant during Monday morning peak, and they refuse to grant you access. What proxies, instrumentation, or controlled experiments do you use to test hypotheses you cannot directly observe?
The customer's 'critical bug' turns out to reproduce only when a third-party browser extension is installed. Walk through how you isolated that variable, and how you communicate a finding that essentially says the fault is not in your product.
You have two plausible root causes: a race condition in your product or a misconfigured load balancer on the customer side. Testing either disrupts their production. How do you design the cheapest experiment that discriminates between the two?
A reseller files issues on behalf of an end customer, so every reproduction question takes three days round-trip. How do you front-load your information requests so the investigation converges on a diagnosis with the fewest possible round-trips?
Your only clue is that the failure stopped for two weeks and returned exactly when daylight saving time ended in the customer's region. Describe how you turn that single observation into a structured hypothesis list and a test plan.
A customer demands you 'just restart everything' weekly because it makes their symptom disappear, but the workaround destroys the evidence. How do you negotiate keeping the system broken long enough to capture meaningful diagnostics?
After three weeks you prove the un-reproducible bug requires a specific locale, a month-end date, and a stale cache to align. How do you document this class of multi-factor issue so the next engineer recognizes the pattern in hours, not weeks?
Mid-investigation, the customer silently upgrades their environment and the symptom changes shape. Your earlier data is now suspect. How do you decide what evidence to keep, what to discard, and how to re-baseline the investigation?
A customer's issue disappears whenever they enable debug logging, which strongly suggests a timing-sensitive bug. What techniques do you use to observe the failure without altering the very conditions that cause it?
You suspect the customer's report actually describes two unrelated bugs they have merged into one ticket. How do you tease the symptoms apart and convince the customer to track them separately without seeming like you are deflecting?
Engineering closed the ticket as 'cannot reproduce' twice, but the customer hits the bug daily. As the senior support engineer, what do you change about the reproduction package on the third attempt to make the escalation finally land?
A customer's application server is at 99% disk but du and df disagree by 40GB, and they are panicking about an imminent outage. Which commands do you run to find deleted-but-open files holding the space, and how do you free it safely?
You receive a 2GB tar of /var/log from a crashed appliance and have one hour before the customer call. Which commands do you chain to surface the crash timeline quickly, and which files do you open first?
A customer says your agent process 'randomly dies' on their RHEL box with no monitoring installed. How do you use journalctl, dmesg, and exit codes to determine whether it crashed, was OOM-killed, or was stopped by an operator?
On a hardened customer server you have a read-only shell, no root, and no ability to install tools. How do you investigate high load, suspicious sockets, and a runaway process using only /proc and coreutils?
Your product's installer fails at 87% on the customer's Ubuntu host with a generic 'permission denied'. Walk through how you would use strace or carefully staged re-runs to pinpoint the exact file and syscall that fails.
A customer reports file corruption only when your service writes to their NFS mount. Describe how you test locking, caching, and mount options from the shell to distinguish an NFS misconfiguration from a genuine product bug.
The customer pastes a cron entry that 'works manually but never on schedule'. What environmental differences between an interactive shell and cron do you check first, and how do you prove which one is actually biting them?
During a P1 you must inspect a 30GB log on the customer's box without copying it off or freezing their terminal. Which streaming combinations of tail, grep, and awk do you use, and what do you refuse to run on a loaded host?
A customer claims your binary is 'eating all the CPU' and top shows 400% usage across threads. How do you use ps, pidstat, and /proc to identify which thread is spinning and what it is doing before escalating to engineering?
Your service fails to start on one of forty supposedly identical customer VMs, and the config files are byte-identical. What differences across kernel parameters, ulimits, SELinux state, and library versions do you hunt for, and in what order?
The customer's admin deleted a config file your service needs, but the process is still running with the file handle open. How do you recover the file contents from the running process before anyone restarts it?
A customer's backup job intermittently corrupts your product's data directory and they insist their script is fine. How would you audit their shell script for unsafe patterns like unquoted variables, missing locks, and race-prone temp files, then demonstrate the failure?
You need a customer with minimal Linux skills to gather diagnostics from an air-gapped server over the phone. How do you design a single copy-paste command block that is safe, complete, and produces one portable archive on the first try?
After a kernel patch, the customer's I/O latency tripled and they blame your product. Using iostat, vmstat, and per-device statistics, how do you build a before-and-after case showing where the regression actually lives?
An enterprise customer dumps 2GB of unstructured logs into a ticket and demands root cause by morning, and no timestamps line up across files. Describe your first hour: what you index, what you grep, and how you build a unified timeline.
In Splunk you can see the error spike, but the offending request ID never appears in the application log, only in the proxy log. How do you stitch identity across log sources that do not share correlation IDs?
The customer's log rotation deleted the exact window where the incident happened. What secondary evidence, such as metrics, audit trails, downstream logs, and database state, do you mine to reconstruct events you can no longer read directly?
Your ELK cluster shows the customer's error rate at 0.02%, but they insist 'everything is failing'. How do you reconcile sampled or dropped logs with customer perception, and verify your ingestion pipeline is not silently losing events?
Your team keeps grepping for the literal error string and missing variants that embed different locales and request IDs. How do you build a regex and a saved search that catches the whole error family without drowning in noise?
Two intertwined services log at different levels, and the failure only makes sense when both streams are interleaved by millisecond. Clock skew between the hosts is roughly 400ms. How do you correct for skew and prove the true ordering of events?
A customer sends logs containing PII they should never have shared, but you still need the diagnostic content. What is your immediate handling procedure, and how do you obtain a sanitized yet still useful dataset?
An error appears 50,000 times in the dump but is clearly a symptom, not the cause. Walk through how you de-duplicate, cluster, and time-order the log noise to find the one first-occurrence line that actually matters.
The customer's logs arrive in five formats: JSON, syslog, multi-line Java stack traces, CSV audit records, and a binary trace. What is your tooling approach to normalize them quickly enough to be useful during a live escalation?
You suspect a memory leak from gradually slowing response times, but the logs only capture errors, not resource state. What log-derived signals like GC lines, latency percentiles, and restart fingerprints let you confirm the leak without new instrumentation?
Splunk license limits mean the customer only ships WARN and above, and the bug manifests at INFO. How do you negotiate a targeted, time-boxed verbose capture that will not blow their ingestion quota or fill their disks?
A failure pattern hits only in logs from one of nine Kubernetes pods, but pods are recycled hourly and names keep changing. How do you write a query that tracks the faulty workload identity across pod churn?
Your grep across the dump returns 14,000 matches inside the failure window. Describe the funnel of filters on time, severity, thread, and request ID you apply to get down to a reviewable set within thirty minutes.
Engineering says 'the logs look fine' and the customer says 'it failed at 14:32'. You find nothing at 14:32 but a burst at 14:47. How do you investigate timezone, buffering, and batching effects before accusing anyone's clock?
A recurring stack trace was supposedly fixed three versions ago but still appears in this customer's logs. How do you verify which binary actually ran, ruling out stale jars, cached images, or partial deployments from the logs alone?
A customer's API calls fail with TLS handshake errors only from their Mumbai office while Singapore works fine, and their network team blames your endpoint. How do you use openssl s_client and path-tracing evidence to localize the fault?
After a certificate rotation, 5% of a customer's clients still fail with chain errors while the rest are fine. What differences in trust stores, caching, and intermediate certificate delivery do you investigate, and what proof do you collect?
A customer reports your webhook deliveries 'sometimes never arrive', yet their firewall logs show nothing dropped. Walk through how you use delivery logs, retry traces, and a controlled test endpoint to determine where the requests actually die.
DNS for your SaaS resolves correctly everywhere except inside one enterprise customer's network, where it returns a year-old IP. How do you trace their resolver chain, identify the stale layer, and prove it is their split-horizon setup?
Curl from the customer's server shows 200 OK, but their application gets connection resets against the same endpoint. What differences between the two clients, including TLS versions, proxies, SNI, and keep-alive, do you check first?
A customer behind a TLS-intercepting proxy reports random 502s on long-running uploads to your API. How do you design tests that separate the proxy's buffering and timeout behavior from genuine faults in your service?
An enterprise customer demands you explain why their HTTP requests succeed but responses take exactly 30 seconds. What does that suspiciously round number suggest, and how do you confirm a timeout-and-retry layer is hiding the real failure?
Packet captures from the customer show retransmissions and a shrinking TCP window during file transfers to your platform. How do you determine whether the bottleneck is their network path, an MTU mismatch, or your server's receive behavior?
A customer migrated to a new proxy and now your desktop agent cannot authenticate, though browsers work fine. How do you compare the two clients' proxy negotiation, including PAC files, NTLM, and CONNECT handling, to find the gap?
Your status page is green, but a customer's synthetic checks from three regions show 8% packet loss to your API for six hours. How do you investigate a peering or routing issue you do not control, and what do you tell them meanwhile?
A customer reports intermittent 'connection refused' to your service every day around 13:00 UTC. What time-correlated suspects, such as DNS TTL expiry, NAT table exhaustion, and scheduled jobs, do you investigate, and how?
mTLS between the customer's gateway and your ingestion endpoint breaks only for payloads over 1MB. Describe how you isolate whether renegotiation, proxy buffering, or a misconfigured size limit is responsible, using captures from both ends.
The customer insists 'the network is fine because ping works' while your agent times out on port 443. How do you demonstrate the difference between ICMP reachability and TCP/TLS service reachability with evidence their network team will accept?
A global customer sees sessions drop exactly every 60 minutes across all offices, while your token lifetime is 8 hours. What stateful middleboxes and idle-timeout behaviors do you suspect, and how do you instrument a session to catch the killer?
HTTP 403s from your CDN hit only one customer, and only on POST requests. How do you determine whether a WAF rule, their request headers, or a geo block is responsible, and reproduce it with curl so engineering can act?
A customer claims invoices vanished after a migration, but you suspect they were never created. Describe the queries you would run against audit and transaction tables to prove creation versus deletion, without write access to their database.
Your diagnostic query, which is fine on test data, ran twelve minutes on the customer's 800-million-row table and alarmed their DBA about load. How do you rewrite and stage investigative queries so they are safe on production-scale data?
Two reports disagree: the customer's BI tool says 9,412 active users while your product UI says 9,377. Walk through how you reconcile the two definitions and hunt down the 35-row discrepancy in the data itself.
A customer reports duplicate payment records appearing 'randomly', and you suspect retried transactions without idempotency keys. What queries would isolate the duplicate clusters, their timing distribution, and the client behavior that created them?
The customer's DBA will only run queries you send by email, one batch per day. How do you design a single diagnostic batch capturing schema, row counts, anomalies, and representative samples so you never waste a round-trip?
A nightly job has corrupted referential integrity, leaving 40,000 orphaned child rows accumulated over six months. The customer wants both an explanation and a fix. How do you quantify the blast radius and design a remediation script you would trust on their production?
A customer's query times out only on the first Monday of every month. What data-volume and statistics-related suspects do you check, and what evidence would you gather before telling them it is not a product bug?
You need to prove whether the customer's data issue came from your application or from their direct database edits. What forensic signals, such as updated_at anomalies, missing audit rows, and trigger gaps, would you query for, and how conclusive are they?
A support dashboard query you inherited uses SELECT * across four joins and breaks every time the customer's schema gains a column. How do you rewrite it to be resilient, and what does the failure teach you about writing diagnostic queries?
The customer reports slow searches, and their EXPLAIN plan shows a full table scan because an index 'disappeared'. How do you investigate index loss across failed migrations, replica differences, and accidental drops, and confirm which replica actually serves the application?
Timezone bugs have shifted some of the customer's timestamps by exactly 5.5 hours, but only on rows written by one integration. How do you write queries that identify the affected population precisely enough for a targeted correction?
A GDPR deletion request was processed, yet the customer says the user still appears in exports. How do you trace the record through soft-deletes, materialized views, replicas, and cached aggregates to find exactly where the deletion failed to propagate?
You suspect the customer's CSV import duplicated rows with subtle whitespace differences that the UI hides. What SQL would you use to detect near-duplicate keys at scale, and how do you present the evidence to a non-technical admin?
A customer's integration gets 401s from your API, but the same key works in your Postman. List the differences you check between their runtime call and your reproduction, including encoding, header casing, proxies, and clock skew, and how you capture their exact request.
An enterprise customer reports that 0.3% of API calls return 500, but only under their production load and never in your tests. How do you instrument and sample their failing requests without asking them to change application code?
The customer's webhook receiver returns 200, yet they claim they 'never got the data'. How do you use your delivery logs, response body capture, and a request-bin style test endpoint to show where the payload actually went?
After your API deprecated a field, one customer's integration began silently writing nulls instead of failing loudly, and they have lost three weeks of data. How do you reconstruct what was sent versus stored, and what do you change about deprecation handling?
A customer swears their curl command 'is identical to the docs' yet keeps getting 400. Their paste shows smart quotes and a truncated token. How do you debug user-mangled requests systematically rather than by staring at them?
Your rate limiter returns 429s to a customer who claims they send only 10 requests per second against a 100 rps limit. How do you investigate bursting, automatic retries, keys shared across their microservices, and window effects to explain the math?
A customer's POST works in Postman but fails from their Java service with a 411. What differences in content-length handling, chunked encoding, and client library defaults would you investigate, and how do you confirm the diagnosis with a capture?
Pagination on your API occasionally skips records for one large customer, silently breaking their nightly sync. How would you prove whether the cause is cursor invalidation under concurrent writes, their resume logic, or an indexing lag on your side?
An integration partner blames your API for timeouts, but your access logs show their requests completing in 90ms. Where between their client and your edge could the time be going, and what evidence would you collect from each layer?
A customer reports your API returns different results for the same request depending on which of your regions answers. How do you investigate replication lag, cache divergence, and header-driven routing, and craft reproduction calls that pin a specific region?
The customer's OAuth refresh flow fails once a day right around token expiry, then recovers on its own. Describe how you reproduce the expiry boundary deterministically in a test setup instead of waiting 24 hours per attempt.
A high-volume customer hit a breaking change in an API version they never pinned. How do you assess which of their call patterns are affected, build them a concrete migration map, and prevent unversioned callers from being blindsided next release?
A customer reports intermittent SSL errors only when calling your API from their CI pipeline, never from developer laptops. What environment differences would you have them capture, and what would a minimal failing reproduction look like?
Your API returns 200 with an empty body for roughly one in 5,000 calls from a single customer, and engineering cannot reproduce it. What request-level fingerprints, including headers, sizes, timing, and connection reuse, do you gather to corner the bug?
Engineering bounced your escalation back twice with 'need more info', and the enterprise customer is now copying their CTO on every reply. What does your third escalation contain that the first two did not, and how do you make it un-bounceable?
You have confirmed a product bug, but the fix will not ship for two months and the customer's renewal is in three weeks. How do you package the workaround, the timeline, and the risk story for both engineering and the account team?
A developer asks you to 'just get the customer on a call' for the fourth time instead of reading your reproduction doc. How do you fix the handoff format so engineering trusts written escalations, and when do you still say yes to the call?
Your escalation queue holds nine tickets and engineering will accept only two this sprint. Walk through how you rank them across revenue at risk, severity, breadth, and age, and how you justify the cut list to the account managers who lose out.
Engineering's root-cause analysis contradicts the evidence you collected from the customer's environment, and you believe they debugged the wrong code path. How do you challenge their conclusion constructively without torching the cross-team relationship?
A bug you escalated was closed as 'works as designed', but the design genuinely harms this customer's workflow. How do you convert the ticket from a defect dispute into a product gap conversation, and who do you bring into it?
You are handing off a nine-month-old escalation with 300 comments to a new engineering owner. How do you compress that history into a brief they will actually read, and what do you deliberately leave out?
The customer keeps adding new symptoms to an escalated ticket, and engineering complains about scope creep. How do you decide what belongs to the original escalation versus new tickets, and how do you reset expectations on both sides?
A critical patch from engineering fixed the customer's bug but introduced a worse regression in their environment. Engineering wants fresh logs; the customer wants a rollback and an apology. Sequence your next four hours and justify the order.
Engineering can only reproduce the bug using the customer's data, and legal will not let raw data leave the tenant. What anonymization, synthetic-data, or supervised-access options do you broker so the investigation can keep moving?
Support escalations to engineering at your company have a 40% bounce-back rate, and you are asked to fix the process. What does your escalation template require, what gets rejected at triage, and how do you measure the improvement?
An engineer privately tells you the bug is known, gnarly, and unlikely to be prioritized, while the customer asks you directly for a fix date. How do you handle the information asymmetry honestly without leaking internal candor?
Two enterprise customers hit the same bug, but their preferred fixes conflict: patching one breaks the other's workflow. How do you represent both fairly in the engineering discussion, and what compromise paths do you put on the table?
You suspect the bug lives in a partner vendor's component rather than your product, but the customer holds you accountable end to end. How do you run a three-party escalation without the customer becoming the messenger between vendors?
A P1 escalation went quiet because the assigned engineer left the company and the ticket was never reassigned, and the customer noticed before you did. What do you fix in your follow-the-ticket process, and what exactly do you tell the customer?
An enterprise customer's VP opens a call by saying your product cost them a million-dollar deal and demands someone be fired. You do not yet know whether the fault is yours. Script your first five minutes on that call.
You must tell a customer that the data they lost is unrecoverable because of their own misconfiguration, and they are already hostile. How do you deliver the finding without triggering a blame war, and what do you offer as the next step?
A customer pastes your private status update into a public forum, out of context, claiming you admitted fault. How do you respond publicly and privately, and what changes about the way you write customer updates afterward?
After a rough month of incidents, the customer's admin now rejects every explanation you give and demands engineering on every call. How do you rebuild credibility ticket by ticket, and what early wins do you deliberately engineer?
During an outage call with forty customer stakeholders, their CIO asks you point-blank whether their data was exposed. You genuinely do not know yet. Give your exact answer, and explain why each phrase in it is chosen.
A non-technical customer keeps describing the bug in metaphors like 'it hiccups, then it sulks', and deadlines are looming. What questioning techniques turn their narrative into reproducible steps without making them feel interrogated?
You discover that your previous update to the customer contained a factual error that understated impact, and they have already reported your numbers upward. How do you correct the record while limiting damage to their internal credibility and yours?
An angry customer writes a 1,200-word email at midnight listing seven grievances, three tickets, and one threat to churn. Structure your reply: what you answer in writing, what you split into separate tickets, and what you deliberately move to a phone call.
Your product genuinely caused the customer's outage, and they are drafting an RCA for their regulator. How do you stay transparent and useful to them while coordinating with your own legal team on what can go in writing?
A customer demands hourly updates on a ticket where nothing will change for three days while a long job reruns. How do you negotiate a sane cadence without appearing to go dark, and what does each scheduled update actually contain?
The customer's security team and platform team give you contradictory requirements in the same email thread, and each escalates when you follow the other. How do you force a single decision-maker from their side without overstepping your role?
You are taking over the named-support role for an account the previous engineer let rot: fourteen stale tickets and several broken promises. Draft your first-week plan and the opening message you send to the customer.
On a recorded call, the customer's engineer becomes personally abusive toward your junior teammate, and you are the senior on the line. What do you say in the moment, and what follow-up do you drive with the account team afterward?
A customer asks you to confirm in writing that the upcoming release will 'definitely fix' their issue, and you are about 80% confident. How do you phrase a commitment that is honest, useful for their planning, and survivable if you are wrong?
Three stakeholders at the customer, an end user, an IT director, and a procurement lead, each need a different version of the same incident update. Describe how you layer one set of facts into three communications without any inconsistency between them.
Monday 9am: 60 new tickets from a weekend release, four flagged P1 in Zendesk, and two of those clearly mislabeled. Walk through your first hour of triage and how you defend downgrading an enterprise customer's self-declared P1.
A customer games the SLA by reopening resolved tickets to keep the clock running and accumulate breach credits. How do you address the pattern with evidence, fix the workflow that allows it, and avoid turning it into a contractual fight?
Your queue's first-response SLA is 30 minutes, but real diagnosis takes hours, so engineers send empty 'we are looking into it' replies. How do you make first responses genuinely useful to the customer without blowing the SLA?
ServiceNow shows your team breaching resolution SLA on 18% of P2s, but most of those breaches are tickets stuck waiting on customers. How do you redesign states, clock-pause rules, and customer nudges so the metric reflects reality?
A single enterprise customer files every issue as Urgent, drowning out genuine emergencies, and their contract guarantees priority handling. How do you re-educate them on severity definitions without violating the contract or souring the relationship?
You are merging two acquired support teams: one lives in JIRA Service Management, the other in Zendesk, with incompatible severity scales. Design the unified severity matrix and a migration plan that does not break in-flight SLA clocks.
A ticket bounced between three queues for nine days because each team's routing rules rejected it, and the customer just escalated to their account executive. How do you rescue the immediate ticket and then close the routing hole behind it?
Leadership wants average resolution time cut by 30%, and you know the easy path is closing tickets prematurely. What metric set do you propose instead so speed does not cannibalize quality, and how do you instrument it in your ticketing system?
An outage just generated 300 near-identical tickets in two hours. Describe how you use ticket linking, a parent incident record, and bulk macros so the team works the problem instead of drowning in the queue.
A strategic customer's contract specifies a 15-minute P1 response around the clock, but your follow-the-sun coverage has a known 90-minute gap they have not yet hit. Do you disclose, staff, or renegotiate, and how do you sequence those moves?
You return from two weeks of leave to 31 tickets where your backup only added 'monitoring' notes, and three customers feel abandoned. How do you triage your re-entry, and what handover artifact do you build so this never happens again?
Your CSAT sits at 4.6, yet churned customers' exit interviews cite support as a top frustration. How do you investigate the gap between survey scores and actual sentiment, and which ticket-level signals do you start mining first?
A customer demands their ticket stay open 'until we are confident', though the fix shipped and was verified three weeks ago, and the aging ticket is skewing your metrics. How do you close it without making them feel dismissed?
An audit finds 12% of your team's JIRA tickets had severity downgraded right before SLA breach with no customer-facing justification. How do you investigate whether this is gaming or legitimate retriage, and what controls do you add either way?
A regional outage floods support with 800 tickets an hour while you are the incident's customer-communications lead. Status page, ticket macros, enterprise calls, internal updates: sequence your first thirty minutes and explain the ordering.
Mid-incident, an enterprise customer demands their own dedicated bridge call while your only two engineers are deep in debugging. How do you satisfy a contractual dedicated-communications clause without pulling engineers off the actual fix?
The incident commander declares recovery, but your support queue still shows new failure reports trickling in from one geography. How do you challenge the all-clear with data, and what do you hold back from customers in the meantime?
Your status page says 'degraded performance' but customers are experiencing total failure and calling you a liar on social media. Who owns the wording, and how do you get it corrected fast without bypassing the incident process?
During a 14-hour P1, three of your support engineers have been on the bridge the entire time and quality is visibly slipping. Design the rotation schedule, handoff notes, and decision log that keep continuity through the night.
A major customer missed the incident entirely and discovers it from your postmortem email a week later, and they are furious about the silence. How do you audit why notification targeting failed them, and what do you change going forward?
Half the impact of the ongoing outage comes from your vendor's API, and their status page is silent while customers hold you responsible. How do you run vendor escalation and customer messaging in parallel without throwing the vendor under the bus?
You are the first to notice that five 'unrelated' tickets in twenty minutes share a failure signature, and no incident has been declared. Walk through how and when you pull the trigger on a major-incident declaration, and what evidence you bring.
Post-incident, enterprise customers demand an RCA within five days, but engineering's root cause is still at hypothesis level. How do you write an interim RCA that is honest about uncertainty yet specific enough for their change advisory boards?
During an outage, your workaround restores service but silently drops a small percentage of transactions. Support knows; the bridge has not decided whether to publicize it. What position do you push for, and what do you tell customers meanwhile?
Two simultaneous P1s hit: one affecting 4,000 small customers, the other affecting your single largest account. You have one incident commander and a thin night shift. How do you split command, staffing, and communications between them?
After the outage, 600 tickets remain in the queue: duplicates, side effects, and a few genuinely new bugs hidden in the noise. Describe your sweep strategy for closing the flood without losing the real defects buried inside it.
The postmortem reveals support held the key diagnostic clue ninety minutes before engineering found it independently, but nobody on the bridge listened. How do you change incident roles and rituals so support's signal reaches the commander with weight?
Your company's outage lands during the customer's own peak event, one they had pre-announced to you, and their contract has a shared-risk clause. What extra obligations kick in during the incident, and how does your handling differ from a normal P1?
Legal asks you to soften the customer-facing incident summary, but customers are asking pointed questions the softened version will not survive. As the senior support voice, how do you negotiate wording that is both safe and credible?
The same password-reset edge case generated 400 tickets this quarter despite a KB article that covers it. How do you determine whether the article fails on discoverability, clarity, or accuracy, and how do you prove your fix moved the ticket metric?
Your KB has 3,000 articles, a third outdated after a major release, and no owner per article. Design a triage-and-ownership model that makes the corpus trustworthy again without freezing the support team for a quarter.
A customer followed a KB article precisely and it broke their integration, because the article was written for the previous API version. Handle the immediate ticket, then describe the versioning scheme you would introduce for the KB.
Self-service deflection is up 20%, but a key enterprise customer complains the KB now buries advanced content under beginner walkthroughs. How do you serve both audiences in one knowledge base without forking and double-maintaining it?
You wrote a workaround article during an incident and 5,000 customers bookmarked it. The proper fix has shipped and the workaround is now actively harmful. What is your process for retiring dangerous-but-popular content without stranding readers?
Support engineers hoard fixes in private notes because publishing to the KB 'takes too long and gets edited to death'. Redesign the contribution pipeline so knowledge flows out of tickets within a day, with quality controls that do not strangle it.
Search analytics show the top failed query on your portal is a feature name customers actually use, while your docs only use internal jargon for it. How do you systematically mine this vocabulary gap and fix it across the corpus?
An AI answer bot on your portal confidently served a deprecated procedure stitched together from two old articles, and a customer ran it against production. How do you contain the damage and re-architect the content so the bot cannot synthesize from stale sources?
Your KB metrics reward article views, so authors optimize titles for clicks while ticket resolution stays flat. What measurement model ties articles to actual deflection and resolution outcomes, and how do you collect that data?
A regulated enterprise customer asks for a private, version-pinned knowledge base matching their locked deployment, which is two versions behind. How do you weigh the maintenance cost, and what alternative do you propose if the answer is no?
Five near-duplicate articles answer the same SSO error slightly differently, and two of them contradict each other. Walk through the consolidation: which article survives, how redirects work, and how you stop the duplication pattern from recurring.
Your self-service portal now deflects 60% of basic tickets, so the tickets that do arrive are disproportionately hard, and team morale plus handle-time metrics are cratering. How do you re-baseline metrics, staffing, and skills for the new mix?
Your support dashboard fired 200 alerts overnight; two were real customer-impacting issues and the team missed one of them. Walk through how you triage the morning after, and how you start cutting the noise floor so this stops happening.
A customer's synthetic monitor catches one-minute outages your internal monitoring never sees, and they send you the screenshots weekly. How do you instrument the gap across edge, DNS, region, and auth path, and reconcile the two views of uptime?
An alert for queue depth pages support every night at 2am and auto-resolves by 2:20, so everyone now ignores it. How do you decide whether to fix the threshold, the schedule, or the underlying batch job, and who owns each fix?
After a tooling migration, alert routing silently dropped the support team from three critical notification policies, and you discovered it during a missed customer outage. How do you audit routing coverage and build a canary that catches silent drops?
A major customer demands access to your internal monitoring dashboards as a trust measure after an incident. What do you expose, what do you refuse and why, and what purpose-built alternative do you offer them instead?
Your alert says the API error rate is 0.5%, comfortably within tolerance, but the errors are concentrated entirely on one enterprise tenant who is at 100% failure. Redesign the alerting so per-tenant pain is never averaged away again.
Support keeps learning about degradations from angry tickets ten minutes before the monitoring alert fires. Where in the detection pipeline do you look for the lag, and what leading indicators from the ticket stream itself could you alert on instead?
You are asked to define which monitoring alerts should page support, which page engineering, and which page nobody at 3am. Lay out your decision framework across customer impact, actionability, and ownership, then apply it to a noisy disk-usage alert.
A flapping health check has opened and closed the same incident 14 times today, sending customers contradictory status notifications each time. How do you stop the customer-facing fallout first, and then address the check itself?
Grafana shows P99 latency doubled for a customer's region, yet every individual service dashboard looks normal. How do you hunt the cross-service composition effect, such as retries, fan-out, and queueing, that no single panel will ever show you?
An on-call support engineer acknowledged a critical alert at 3am and went back to sleep without acting, and the customer outage ran four more hours. How do you handle the person, the process, and the customer-facing narrative?
Your team receives alerts from five tools, including CloudWatch, Datadog, Pingdom, Zendesk triggers, and a homegrown script, with no deduplication. Design the consolidation: the source of truth, the correlation keys, and what you decommission first.
A customer asks why you did not proactively alert them when their nightly sync started failing, since 'you monitor everything'. The failing component sits in their environment. How do you reset expectations about monitoring boundaries, and what shared options do you offer?
Your synthetic checks pass because they hit a cached healthy path, while real users on cache-miss paths fail 30% of the time. How do you redesign the checks to exercise the failure path without hammering production caches?
A customer on AWS says your product 'loses connection to the database' every night at 00:00 UTC, and their RDS instance shows a maintenance window at exactly that time. How do you confirm the correlation and help them reconfigure without overstepping into their account?
A customer's EC2-hosted deployment of your product degrades every time they scale their fleet, and they blame your licensing checks. How do you investigate instance metadata throttling, NAT gateway limits, and connection pool exhaustion across their topology?
The customer's S3-based export integration started failing with AccessDenied after 'no changes on our side'. What recent-change suspects, including bucket policies, SCPs, credential rotation, and block-public-access settings, do you walk them through, and in what order?
Your product runs inside the customer's VPC and they report intermittent timeouts to your control plane, but their security team will not share VPC flow logs. What minimum evidence do you negotiate for, and what can you infer from your side alone?
A customer claims your SaaS is 'slow from AWS but fine from the office'. Their workloads sit in ap-south-1 and your nearest region is Singapore. How do you build an honest latency picture across regions and present them realistic options?
After an AWS availability-zone event, your customer's self-hosted deployment failed to fail over even though they followed your HA guide. How do you reconstruct what happened from CloudWatch, your product logs, and their topology, and fix the guide if it is wrong?
The customer's Lambda-based integration retries your webhook so aggressively during your brownouts that it amplifies the outage. How do you diagnose the retry storm from your side, and coach them toward backoff without redesigning their stack for them?
A customer's AWS bill spiked because your product's misconfigured log shipping wrote terabytes into CloudWatch Logs in their account, and they want compensation. How do you verify the cause, quantify the spike precisely, and handle the money conversation?
An enterprise customer restricts you to a read-only IAM role for troubleshooting in their account. Which AWS surfaces, such as CloudTrail, CloudWatch, and configuration history, give you the most diagnostic value, and how do you avoid tripping their security alarms?
Your agent in the customer's EKS cluster gets OOM-killed only on their spot-instance node groups. How do you investigate node-level memory pressure, eviction ordering, and your agent's resource limits, and what deployment guidance do you change afterward?
A customer says your product 'broke' after they moved from long-lived IAM user keys to SSO with temporary credentials. What changes about credential lifetime and refresh do you investigate in your product's AWS SDK usage, and how do you reproduce the failure?
An enterprise customer wants your SaaS to reach their private subnets via PrivateLink instead of the public internet, claiming the public path causes their failures. How do you validate their claim and assess the support implications before committing engineering?
The customer's autoscaling doubled their fleet during a sale event, and your per-node product silently hit its license cap and dropped data. How do you piece together the timeline from both sides' logs, and what do you change about the silent failure mode?
Cross-account access your product depends on broke when the customer's security team added a permissions boundary, but they insist nothing changed. How do you use CloudTrail and IAM policy simulation to prove the cause without sounding accusatory?
You keep hand-checking 40 customer tenants for a bad config flag after every release. Describe the Python or Bash tool you would build, how you make it safe to run against production APIs, and where you draw the read-only line.
A teammate's cleanup script, run against the wrong customer tenant, deleted 2,000 records. You are asked to prevent recurrence without banning scripts. What guardrails, such as dry-run defaults, tenant confirmation, and audit logs, do you mandate, and how do you enforce them?
A customer sends a 2GB log bundle weekly and you spend two hours manually extracting the same ten signals each time. Sketch the parsing script you would write, how you handle format drift between product versions, and how you share it with the team.
Your Bash one-liner for log triage has grown into 80 unreadable lines that only you can run. How do you decide what to rewrite in Python, what safety to add, and how do you turn a personal tool into a supported team utility?
Support needs to replay a customer's failing API sequence 500 times to catch a one-in-a-hundred failure. Write up your approach: scripting the replay, capturing full request and response pairs, and stopping safely on the first reproduction.
You wrote a script that pulls diagnostics from customer environments, and a security review flags that it could exfiltrate secrets in environment variables. How do you redesign the collection with allowlists, redaction, and customer review steps to keep their trust?
A nightly Python script your team relies on for queue reports broke when the ticketing API changed pagination, and it failed silently for nine days. What do you change about error handling, alerting, and output validation so silent failures cannot recur?
You are automating triage: a script reads new tickets, runs known-issue fingerprints, and posts a suggested diagnosis. How do you keep false positives from misdirecting human engineers, and how do you measure whether the bot actually earns its keep?
The customer will only let you run scripts their security team has read line by line, but your usual diagnostic tool is 600 lines with external dependencies. How do you produce a minimal, auditable, dependency-free version without losing critical coverage?
Your team's collection of one-off support scripts is scattered across laptops with no versioning, secrets baked in, and three different Python versions. Design the migration to a shared, governed repository, including what you do with scripts nobody admits owning.
You need to diff 12,000 configuration files across two customer environments to find why one fails. Describe the script you would write, how you normalize noise like timestamps and hostnames, and how you rank the differences that remain by likely relevance.
A support script that calls your admin API ended up embedded in a customer's cron after an engineer shared it 'just this once'. It now breaks on every API change and pages your team. How do you unwind the dependency and set a sharing policy?
During an incident you must extract per-customer error frequencies from streaming logs in real time, and your usual tooling is down. What do you improvise with awk, sort, and uniq under pressure, and what do you double-check before quoting the numbers?
You want every diagnostic script to leave an audit trail of exactly what it read in the customer's environment, for compliance. Design the logging contract: what gets recorded, where it is stored, and who can read it, without doubling the script's runtime.
Your manager asks you to automate the weekly SLA-breach report that takes you three hours in spreadsheets. Walk through how you script it against the Zendesk API, validate the numbers against the old method, and hand it off so it survives you leaving the team.
Book a mock interview with a senior Application / Production Support Engineer mentor — structured scorecard, replay, and a gap plan.