DevOps Engineer Interview Questions & Answers Guide (2026)
A hiring-manager’s interview kit for devops engineers — with specific “what to look for” notes on every answer, red flags to watch, and a practical test.
Key facts
- Role: DevOps Engineer
- Technical questions: 14
- Behavioral: 7
- Role-fit: 5
- Red flags: 8
- Practical test: included
How to use this guide
Pick 4-6 technical questions across difficulties, 2-3 behavioral, and 1-2 role-fit for a 45-minute interview. For senior roles, weight harder technical and role-fit higher. Always close with the practical test so you are hiring on evidence, not impressions. The “what to look for” notes are a scoring rubric: strong answers touch most points, weak answers miss them or replace them with platitudes.
Technical questions — Easy
1. Explain SLO vs SLA vs SLI. Give me an SLO for a user-facing API.
What to look for: SLI = measurement (e.g., fraction of requests returning 2xx/3xx under 300ms), SLO = target (99.5% over 30 days), SLA = contractual commitment to customers. Realistic numbers, not 100%. Discusses error budget math.
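The error-budget math a strong candidate should be able to do on a whiteboard can be sketched like this (the SLO numbers are the ones from the example above):

```python
# Error-budget math for the example SLO: 99.5% of requests succeed
# (2xx/3xx under 300 ms) over a rolling 30-day window.
slo_target = 0.995
window_days = 30

# The error budget is the fraction of requests allowed to fail.
error_budget = 1 - slo_target  # 0.5%

# Expressed as time: minutes of total unavailability the budget tolerates.
budget_minutes = window_days * 24 * 60 * error_budget

print(round(budget_minutes))  # prints 216
```

A 99.5%/30-day SLO therefore tolerates about 216 minutes of full downtime per window, which is why 100% is not a realistic target.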
Technical questions — Medium
1. Walk me through designing a CI/CD pipeline for a monorepo with 12 microservices. What matters?
What to look for: Change detection (paths filter, Turborepo, Nx, Bazel), per-service builds, shared base images with caching, parallel test sharding, artifact promotion rather than rebuild across envs, branch protection, required checks. Mentions build time budget.
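The change-detection piece of that rubric can be sketched with a path filter, one approach among the tools listed. This is a hypothetical workflow using the dorny/paths-filter action; the service name and directory layout are placeholders:

```yaml
# Build a service only when its files (or shared libs) changed.
name: ci
on: [pull_request]
jobs:
  changes:
    runs-on: ubuntu-latest
    outputs:
      payments: ${{ steps.filter.outputs.payments }}
    steps:
      - uses: actions/checkout@v4
      - uses: dorny/paths-filter@v3
        id: filter
        with:
          filters: |
            payments:
              - 'services/payments/**'
              - 'libs/shared/**'
  build-payments:
    needs: changes
    if: needs.changes.outputs.payments == 'true'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make -C services/payments build test
```

With 12 services this pattern keeps most PRs down to one or two builds instead of twelve.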
2. A pod is stuck in CrashLoopBackOff. Walk me through debugging.
What to look for: kubectl describe pod for events and exit code, kubectl logs --previous, kubectl get events, check resource limits (OOMKilled?), liveness/readiness probe config, init container failure, secret/configmap mount failure, image pull errors. Ordered triage, not random.
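The ordered triage above roughly maps to this command sequence (pod and namespace names are placeholders):

```shell
# 1. Events, last state, and exit code for the crashed container
kubectl describe pod api-7d9f -n prod

# 2. Logs from the previous (crashed) instance, not the restarting one
kubectl logs api-7d9f -n prod --previous

# 3. Cluster-level events in order (scheduling, mounts, image pulls)
kubectl get events -n prod --sort-by=.lastTimestamp

# 4. Was it OOMKilled? Check the terminated reason directly
kubectl get pod api-7d9f -n prod \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
```

A strong candidate runs something like this sequence from memory rather than guessing at manifests.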
3. Your GitHub Actions pipeline now takes 45 minutes. Walk me through cutting it in half.
What to look for: Profile first (per-step timing in the Actions UI, dependency install, test runtime). Wins: dependency caching (actions/cache), Docker layer caching via BuildKit, test sharding, matrix splits, removing unneeded steps, larger runners, self-hosted for heavy jobs.
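Two of those wins, dependency caching and BuildKit layer caching, can be sketched as workflow steps. Image name and Node version are assumptions:

```yaml
# Dependency caching: setup-node caches ~/.npm keyed on package-lock.json
- uses: actions/setup-node@v4
  with:
    node-version: 20
    cache: npm

# Docker layer caching: BuildKit reuses layers stored in the Actions cache
- uses: docker/setup-buildx-action@v3
- uses: docker/build-push-action@v6
  with:
    push: true
    tags: registry.example.com/app:${{ github.sha }}
    cache-from: type=gha
    cache-to: type=gha,mode=max
```

On many pipelines these two changes alone recover a large share of a 45-minute runtime, which is why profiling should point at install and build steps before anything exotic is tried.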
4. Explain the difference between a liveness probe and a readiness probe, and a real-world way each can go wrong.
What to look for: Liveness = restart if failing; readiness = remove from service endpoints. Liveness failing during startup causes infinite restart loop (fix: startupProbe). Aggressive readiness during a transient dep failure rips all pods out of service at once.
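A minimal probe configuration reflecting that rubric might look like this; the endpoint paths and port are assumptions:

```yaml
# startupProbe shields a slow boot from the liveness probe,
# avoiding the infinite-restart failure mode described above.
startupProbe:
  httpGet: {path: /healthz, port: 8080}
  failureThreshold: 30   # up to 30 * 5s = 150s to finish booting
  periodSeconds: 5
livenessProbe:
  httpGet: {path: /healthz, port: 8080}  # process health only
  periodSeconds: 10
readinessProbe:
  httpGet: {path: /ready, port: 8080}    # should NOT hard-fail on transient deps
  periodSeconds: 5
```

The readiness endpoint deciding what counts as "not ready" is the subtle part: if it fails whenever a shared dependency blips, every pod is removed from the Service at once.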
5. How do you manage secrets in a Kubernetes cluster so that they are not in git and not in plain YAML?
What to look for: External Secrets Operator pulling from Vault/Secrets Manager, or sealed-secrets with the Bitnami controller, or SOPS-encrypted YAML in git with age keys. NOT: kubectl create secret with values in CI env vars then checked in.
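As one of the acceptable patterns, an External Secrets Operator manifest can be sketched like this; the store name, secret key, and remote path are placeholders:

```yaml
# Only this reference lives in git; the secret value stays in the backing store.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: api-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager   # assumed ClusterSecretStore, configured separately
    kind: ClusterSecretStore
  target:
    name: api-credentials       # the Kubernetes Secret the operator materializes
  data:
    - secretKey: DB_PASSWORD
      remoteRef:
        key: prod/api/db-password
```

The point to listen for: git holds only a pointer, rotation happens in the backing store, and nothing secret ever passes through CI variables into a committed file.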
6. What is GitOps, and how is Argo CD different from just running kubectl apply in CI?
What to look for: Git as source of truth, pull-based reconciliation, drift detection, automatic sync, better audit trail, easier rollback (git revert), separation of CI and CD responsibilities. kubectl apply = push model, no drift detection, manual rollback.
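The pull-based model can be made concrete with an Argo CD Application; repo URL, path, and namespace here are placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/deploy-manifests
    targetRevision: main
    path: services/payments
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true      # delete resources removed from git
      selfHeal: true   # revert out-of-band kubectl edits (drift correction)
```

selfHeal is the line that distinguishes this from kubectl apply in CI: the controller continuously reconciles the cluster back to git, so rollback is a git revert rather than a manual re-apply.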
7. Walk me through writing an HPA for an API that has spiky traffic. What signals do you autoscale on?
What to look for: CPU is a lazy default — better: RPS per pod via custom metrics (Prometheus Adapter), or queue depth for workers. Mentions min/max replicas, scale-down stabilization, HPA v2 behavior section. Discusses why KEDA for event-driven workloads.
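An HPA v2 manifest touching those rubric points might be sketched as follows; the metric name assumes a Prometheus Adapter exposing per-pod request rate, and the replica bounds are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second  # served via Prometheus Adapter
        target:
          type: AverageValue
          averageValue: "100"             # target RPS per pod
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300     # damp flapping after a spike passes
```

The behavior section is what makes this survive spiky traffic: scale up fast, but wait five minutes of sustained low load before scaling down.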
Technical questions — Hard
1. Design a progressive rollout for a new service version using Argo Rollouts. What are the gates?
What to look for: Canary at 5% → 25% → 50% → 100% with pause steps, automated analysis via Prometheus queries (error rate, p95 latency), automatic rollback on SLO breach, manual gate before 100%. Explains why blue/green vs canary for different workloads.
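The canary progression above maps onto an Argo Rollouts strategy block like this; the AnalysisTemplate name is a placeholder for a Prometheus-backed check on error rate and p95 latency:

```yaml
strategy:
  canary:
    steps:
      - setWeight: 5
      - pause: {duration: 10m}
      - analysis:                 # automated gate: fails -> automatic rollback
          templates:
            - templateName: error-rate-and-p95
      - setWeight: 25
      - pause: {duration: 10m}
      - setWeight: 50
      - pause: {}                 # indefinite pause: the manual gate before 100%
```

The empty pause is the manual gate: the rollout holds at 50% until a human promotes it, while any failed analysis step before that triggers rollback without a page.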
2. Walk me through an incident response from page to postmortem for a Sev1.
What to look for: Acknowledge page, declare incident in Slack, assign incident commander (IC) vs responders, status page update, mitigation first (rollback > diagnose), customer comms, timeline captured live, postmortem within 72h with blameless framing, action items assigned with due dates and tracked.
3. How do you alert on "users are having a bad time" without paging on every CPU spike?
What to look for: Symptom-based SLO burn alerts, multi-window multi-burn-rate (e.g., 2% budget in 1h OR 5% in 6h), not threshold-based CPU/memory alerts for paging. Keeps those as dashboards only. Cites the burn-rate alerting chapter of the Google SRE Workbook.
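The burn-rate thresholds behind those example windows are simple arithmetic, worth checking a candidate can reproduce (assuming a 30-day SLO window, as in the SRE Workbook treatment):

```python
# burn_rate = fraction_of_budget_consumed * (slo_window / alert_window)
# i.e., how many times faster than "exactly on budget" we are burning.
def burn_rate_threshold(budget_fraction, slo_window_hours, alert_window_hours):
    return budget_fraction * slo_window_hours / alert_window_hours

SLO_WINDOW_H = 30 * 24  # 30-day SLO window, in hours

# Page: 2% of the monthly budget burned within 1 hour.
fast = burn_rate_threshold(0.02, SLO_WINDOW_H, 1)   # 14.4
# Page: 5% of the monthly budget burned within 6 hours.
slow = burn_rate_threshold(0.05, SLO_WINDOW_H, 6)   # 6.0

print(fast, slow)  # prints 14.4 6.0
```

Pairing a fast window (catches sharp outages) with a slow one (catches slow bleeds) is what keeps the alert both responsive and quiet; a single CPU threshold gives you neither property.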
4. Your deploys pass CI but fail 2% of the time in production with transient 503s right after rollout. Diagnose.
What to look for: Readiness probe too generous, missing preStop hook + terminationGracePeriodSeconds, in-flight requests killed on pod termination, SIGTERM handling in app, PDB not protecting during node drain. Fix: preStop sleep + graceful shutdown in app.
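The standard fix named in the rubric can be sketched as a pod-spec fragment; the container name and durations are illustrative:

```yaml
spec:
  # Must exceed preStop sleep + the app's own drain time.
  terminationGracePeriodSeconds: 45
  containers:
    - name: api
      lifecycle:
        preStop:
          exec:
            # Keep serving while endpoint removal propagates to load
            # balancers and kube-proxy; SIGTERM arrives only after this.
            command: ["sleep", "10"]
```

The app still has to handle SIGTERM by draining in-flight requests; the preStop sleep only buys time for traffic to stop arriving, which is exactly the race that produces transient 503s right after rollout.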
5. How do you sign and verify container images in your pipeline?
What to look for: Cosign for signing (keyless via Fulcio + Sigstore or key-based), policy controller or Kyverno admission policy to enforce signatures on cluster, SBOM generation via Syft, vulnerability scan via Trivy/Grype before promotion.
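The enforcement half can be sketched as a Kyverno admission policy; registry pattern and the elided public key are placeholders:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-signed-images
spec:
  validationFailureAction: Enforce   # reject, don't just audit
  rules:
    - name: verify-cosign-signature
      match:
        any:
          - resources:
              kinds: ["Pod"]
      verifyImages:
        - imageReferences: ["registry.example.com/*"]
          attestors:
            - entries:
                - keys:
                    publicKeys: |-
                      -----BEGIN PUBLIC KEY-----
                      ...
                      -----END PUBLIC KEY-----
```

Signing in CI without admission enforcement is the common half-answer: an unsigned image still runs. A strong candidate names both sides of the control.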
6. Tell me about the worst production incident you ran. Walk me through the postmortem.
What to look for: Specific war story: timeline, detection source, MTTR, root cause explained causally (not "human error"), concrete action items shipped and verified. Blameless framing. No finger-pointing.
Behavioral questions
1. Describe a postmortem where the action items were never completed. What did you do about it?
What to look for: Owned follow-through, escalated, reconsidered whether items were well-scoped, accepted some action items are not worth doing and closed them explicitly. Learned about organizational friction.
2. Tell me about pushback from product/engineering on a release gate you enforced. How did you handle it?
What to look for: Explained error-budget policy, offered trade-offs (hotfix path, feature flag), not a gatekeeper for sport. Lands on shared understanding.
3. How do you run a blameless postmortem when the human action was clearly a mistake?
What to look for: Focus on the system that allowed the mistake (why did the guardrail not exist?), separates person from action, avoids the word "just". Shows an understanding of SRE culture.
4. Describe a time you carried the pager and learned something about a service the hard way.
What to look for: Specific, reflective, updated runbook or dashboard after, humility about what was not understood before.
5. Walk me through a migration you led (e.g., Jenkins to GitHub Actions, or VMs to Kubernetes).
What to look for: Phased plan, parallel-run period, success metrics defined upfront, rollback plan, communication cadence, change management.
6. How do you balance reliability vs shipping velocity when product is under pressure?
What to look for: Error-budget framing: if budget is healthy, lean into velocity; if exhausted, gate releases. Not dogmatic either way.
7. Tell me about a time you were wrong about an infrastructure decision. How did you find out?
What to look for: Self-aware, concrete, data changed their mind (not politics), willing to reverse course.
Role-fit questions
1. Where do you see the line between DevOps and Cloud Engineering?
What to look for: DevOps = pipelines, deploys, reliability, observability, on-call culture. Cloud = platform architecture, IAM, cost, landing zone. Healthy collaboration, not turf war.
2. Our team ships 40 deploys a day to Kubernetes. What does your week look like in that environment?
What to look for: Pipeline health monitoring, dashboard review, on-call handoffs, pairing with product teams on deploy improvements, incident postmortems, runbook updates. Not firefighting 24/7.
3. How do you feel about carrying the pager in a follow-the-sun rotation with 4 hours of overlap?
What to look for: Realistic about coverage model, values runbooks and handoff quality over heroics, defines SLA in writing.
4. Our stack is GitHub Actions + Argo CD + EKS + Datadog. What have you used and where would you ramp?
What to look for: Honest gap assessment, concrete plan for gaps. Fakery is a red flag.
5. What does your first 30 days look like here?
What to look for: Infra audit, first pipeline fix week 1, shadow on-call week 2, first postmortem facilitated week 3-4, SLO review by end of month.
Red flags
Any one of these is usually reason to pass on its own, and more so when combined with weak answers elsewhere.
- Pages on CPU or memory thresholds and calls that "SRE monitoring".
- Does not know the difference between push-based (kubectl apply) and pull-based (Argo CD) deployment models.
- Writes postmortems that name individuals or use the phrase "human error" as root cause.
- Stores secrets in env vars committed to .env files in the repo.
- Runs deploys manually from a laptop in an emergency with no rollback plan.
- Cannot explain error budgets or confuses SLO with SLA.
- Uses latest image tags in production Deployments.
- Has never run a restore test or a game day.
Practical test
6-hour take-home. Given a small Go or Node service repo with a broken Dockerfile and no CI:
1. Write a GitHub Actions pipeline that builds, tests, scans with Trivy, signs with Cosign, and pushes to a registry.
2. Write a Helm chart and Argo CD Application manifests to deploy to a kind/minikube cluster with HPA, PDB, and readiness/liveness probes.
3. Wire up Prometheus scraping plus a Grafana dashboard for one SLO (p95 latency) and a burn-rate alert rule.
Graded on: pipeline correctness and speed (25%), Kubernetes manifest quality (25%), observability wiring (20%), supply-chain rigor (15%), and a written trade-offs README (15%).
Scoring rubric
Score each answer 1-4: (1) Misses most of the rubric or gives platitudes; (2) Hits some points but cannot go deep when pressed; (3) Covers the rubric and can defend the answer under follow-ups; (4) Adds unprompted nuance, trade-offs, or real examples beyond the rubric. Hire at an average of 3.0+ across technical, behavioral, and role-fit, with zero red flags, and a pass on the practical test.
Written by Syed Ali
Founder, Remoteria
Syed Ali founded Remoteria after a decade building distributed teams across 4 continents. He has helped 500+ companies source, vet, onboard, and scale pre-vetted offshore talent in engineering, design, marketing, and operations.
- 10+ years building distributed remote teams
- 500+ successful offshore placements across US, UK, EU, and APAC
- Specialist in offshore vetting and cross-timezone team integration
Last updated: April 12, 2026