DevOps Engineer Job Description Template (2026)
A free, copy-ready DevOps Engineer job description covering responsibilities, must-have skills, tools, seniority variants, and KPIs. Written for hiring managers, not for SEO filler.
Key facts
- Role: DevOps Engineer
- Reports to: Head of Engineering
- Must-have skills: 8 items
- Seniority tiers: Junior / Mid / Senior
- KPIs defined: 6 metrics
- Starting price (offshore): $3,200/month
Role summary
A DevOps Engineer owns the pipeline from commit to production: CI/CD on GitHub Actions, GitLab CI, CircleCI, or Argo CD; Kubernetes deployments, services, ingress, and HPAs; an observability stack across Prometheus, Grafana, and Datadog; SLOs with error budgets; a PagerDuty rotation with written runbooks; and blameless postmortems that produce real guardrails. The role focuses on automation, reliability, and deployment velocity rather than the underlying cloud platform itself.
Responsibilities
- • Build and maintain CI/CD pipelines on GitHub Actions, GitLab CI, CircleCI, or Jenkins — aim for commit-to-production under 15 minutes on standard services.
- • Run GitOps deployments via Argo CD or Flux with progressive delivery (Argo Rollouts, Flagger) and automated rollback on SLO regression.
- • Operate Kubernetes 1.28+ workloads: Deployments, Services, Ingress (NGINX, Traefik, or cloud LB), HPAs, PDBs, NetworkPolicies, and cluster upgrades.
- • Manage Helm 3 charts and Kustomize overlays with a disciplined release cadence and chart versioning.
- • Instrument services with Prometheus metrics, OpenTelemetry traces, and structured logs routed to Loki, Elasticsearch, or Datadog.
- • Build Grafana dashboards tied to SLOs (latency, error rate, saturation) and error-budget burn-rate alerts — page on user impact, not CPU spikes.
- • Define and track SLOs/SLIs with error budgets; drive release policy when the budget is exhausted.
- • Own PagerDuty or Opsgenie rotation with written runbooks; target MTTR under 30 minutes on tier-1 services.
- • Run incident response: declare incident, coordinate responders, drive mitigation, write blameless postmortem within 72 hours with action items tracked to completion.
- • Manage secrets through HashiCorp Vault, SOPS, sealed-secrets, or External Secrets Operator — zero hardcoded credentials, automated rotation where supported.
- • Harden the software supply chain: signed container images (Cosign/Sigstore), SBOM generation (Syft), image scanning (Trivy, Snyk), SLSA compliance targets.
- • Run blameless postmortems and ship the process (template, review cadence, action item tracking) — not just the artifact.
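The burn-rate alerting mentioned in the list above can be sketched in a few lines of Python. This is a minimal illustration, not a production alerting rule: the `SloWindow` shape, the function names, and the default threshold of 14.4 (a commonly cited fast-burn value for a 30-day SLO window) are all assumptions, and a real setup would normally express this as Prometheus or Datadog alert rules instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    """Request counts observed over one alerting window (hypothetical shape)."""
    total_requests: int
    failed_requests: int


def error_budget(slo_target: float) -> float:
    """Allowed failure ratio: a 99.9% SLO leaves a 0.1% budget."""
    return 1.0 - slo_target


def burn_rate(window: SloWindow, slo_target: float) -> float:
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    if window.total_requests == 0:
        return 0.0  # no traffic, so nothing was burned
    observed_error_ratio = window.failed_requests / window.total_requests
    return observed_error_ratio / error_budget(slo_target)


def should_page(short: SloWindow, long: SloWindow, slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only when both the short and long windows exceed the threshold.

    Requiring both windows filters out brief blips (short window only)
    and stale problems that have already recovered (long window only).
    The threshold is an assumed default, not a value from this template.
    """
    return (burn_rate(short, slo_target) >= threshold
            and burn_rate(long, slo_target) >= threshold)
```

Because the check is driven by the ratio of failed requests to the budget, it pages on user-visible errors rather than on resource signals such as CPU.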
Must-have skills
- • 4+ years shipping CI/CD pipelines for production services on GitHub Actions, GitLab CI, CircleCI, Jenkins, or Argo CD.
- • Production Kubernetes 1.28+ operation experience: deployments, services, ingress, HPAs, RBAC, troubleshooting pods that crash-loop.
- • Strong observability stack fluency: Prometheus + Grafana at minimum; bonus for Datadog, New Relic, or Honeycomb.
- • Terraform 1.5+ or Pulumi for the IaC slice DevOps owns (pipeline infra, cluster add-ons, observability stack).
- • Written SLOs and error budgets — has run a service by them, not just read the SRE book.
- • On-call experience with PagerDuty/Opsgenie and a track record of shipping runbooks.
- • Secrets management experience: Vault, SOPS, sealed-secrets, or External Secrets Operator — NOT secrets in env files.
- • Strong written English for incident docs, postmortems, and async review.
Nice-to-have skills
- • Progressive delivery: Argo Rollouts, Flagger, LaunchDarkly, or Unleash integration with SLO-driven rollback.
- • Supply chain security: Cosign, Sigstore, SLSA, Rekor, or in-toto.
- • Service mesh experience: Istio, Linkerd, or Consul Connect.
- • Chaos engineering: Chaos Mesh, Litmus, or Gremlin.
- • Experience migrating a team from Jenkins to modern CI (GitHub Actions, GitLab).
Tools and technology
- GitHub Actions / GitLab CI / CircleCI
- Argo CD / Flux (GitOps)
- Kubernetes 1.28+ / Helm 3 / Kustomize
- Terraform 1.5+
- Prometheus / Grafana / Loki
- Datadog / New Relic
- PagerDuty / Opsgenie
- HashiCorp Vault / SOPS / External Secrets
- Cosign / Trivy / Syft
- Argo Rollouts / Flagger
Reporting structure
Reports to the Head of Engineering, Head of Platform, or SRE Lead. Partners daily with product engineering teams on pipeline and deploy concerns, with Cloud Engineering on the platform / tooling boundary, with Security on supply chain and secrets, and with the on-call rotation across engineering.
Seniority variants
How responsibilities shift across junior, mid, and senior levels.
Junior
1-3 years
- • Ship pipeline fixes and small Helm chart changes under senior review.
- • Maintain Grafana dashboards and triage noisy alerts.
- • Shadow on-call and handle runbook-driven incidents with senior support.
- • Learn the SLO framework and error budget discipline.
Mid
3-5 years
- • Own the CI/CD pipeline for a product area including build time, caching, and test sharding.
- • Carry the pager for your services with clear MTTR targets.
- • Lead a postmortem for a real incident including action item follow-through.
- • Define SLOs with product engineering and enforce error-budget policy.
Senior
6+ years
- • Set deployment strategy: GitOps pattern, progressive delivery framework, supply chain policy.
- • Own the observability stack architecture and cost envelope.
- • Lead incident command on high-severity events and ship systemic guardrails after.
- • Mentor mid/junior engineers, run DevOps hiring loop, and represent DevOps in architecture reviews.
Success metrics (KPIs)
- • Deployment frequency: multiple deploys per day per service without an incident spike.
- • Lead time for changes (commit to production) under 60 minutes on standard services, under 15 minutes on the hotfix path.
- • Change failure rate under 15%.
- • MTTR on tier-1 incidents under 30 minutes from page to mitigation.
- • SLO attainment above target on all services owned, with error-budget policy enforced when breached.
- • Postmortem cadence: 100% of Sev1/Sev2 incidents have a blameless postmortem published within 72 hours with action items tracked to completion.
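As a rough sketch of how three of these KPIs could be computed from deploy and incident records, consider the Python below. The `Deploy` and `Incident` record shapes and all function names are hypothetical, and using the median for lead time and the mean for MTTR is an assumption, since the template does not say which statistic to use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean, median


@dataclass(frozen=True)
class Deploy:
    """One production deploy (hypothetical record shape)."""
    committed_at: datetime
    deployed_at: datetime
    caused_incident: bool


@dataclass(frozen=True)
class Incident:
    """One tier-1 incident, timed from page to mitigation."""
    paged_at: datetime
    mitigated_at: datetime


def _minutes(delta: timedelta) -> float:
    return delta.total_seconds() / 60


def lead_time_minutes(deploys: list[Deploy]) -> float:
    """Median commit-to-production time in minutes."""
    return median(_minutes(d.deployed_at - d.committed_at) for d in deploys)


def change_failure_rate(deploys: list[Deploy]) -> float:
    """Share of deploys that caused an incident, from 0.0 to 1.0."""
    return sum(d.caused_incident for d in deploys) / len(deploys)


def mttr_minutes(incidents: list[Incident]) -> float:
    """Mean page-to-mitigation time in minutes."""
    return mean(_minutes(i.mitigated_at - i.paged_at) for i in incidents)


def kpi_report(deploys: list[Deploy], incidents: list[Incident]) -> dict[str, bool]:
    """Compare each metric with the targets listed above."""
    return {
        "lead_time_under_60_min": lead_time_minutes(deploys) < 60,
        "change_failure_rate_under_15_pct": change_failure_rate(deploys) < 0.15,
        "mttr_under_30_min": mttr_minutes(incidents) < 30,
    }
```

The functions assume non-empty inputs; a reporting period with no deploys or no incidents would need explicit handling before calling them.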
Full JD (copy-ready)
Paste this into your ATS or careers page. Edit the company name and any bracketed placeholders.
# DevOps Engineer: Job Description

## Role summary

A DevOps Engineer owns the pipeline from commit to production: CI/CD on GitHub Actions, GitLab CI, CircleCI, or Argo CD; Kubernetes deployments, services, ingress, and HPAs; an observability stack across Prometheus, Grafana, and Datadog; SLOs with error budgets; a PagerDuty rotation with written runbooks; and blameless postmortems that produce real guardrails. The role focuses on automation, reliability, and deployment velocity rather than the underlying cloud platform itself.

## Responsibilities

- Build and maintain CI/CD pipelines on GitHub Actions, GitLab CI, CircleCI, or Jenkins; aim for commit-to-production under 15 minutes on standard services.
- Run GitOps deployments via Argo CD or Flux with progressive delivery (Argo Rollouts, Flagger) and automated rollback on SLO regression.
- Operate Kubernetes 1.28+ workloads: Deployments, Services, Ingress (NGINX, Traefik, or cloud LB), HPAs, PDBs, NetworkPolicies, and cluster upgrades.
- Manage Helm 3 charts and Kustomize overlays with a disciplined release cadence and chart versioning.
- Instrument services with Prometheus metrics, OpenTelemetry traces, and structured logs routed to Loki, Elasticsearch, or Datadog.
- Build Grafana dashboards tied to SLOs (latency, error rate, saturation) and error-budget burn-rate alerts; page on user impact, not CPU spikes.
- Define and track SLOs/SLIs with error budgets; drive release policy when the budget is exhausted.
- Own PagerDuty or Opsgenie rotation with written runbooks; target MTTR under 30 minutes on tier-1 services.
- Run incident response: declare incident, coordinate responders, drive mitigation, write blameless postmortem within 72 hours with action items tracked to completion.
- Manage secrets through HashiCorp Vault, SOPS, sealed-secrets, or External Secrets Operator, with zero hardcoded credentials and automated rotation where supported.
- Harden the software supply chain: signed container images (Cosign/Sigstore), SBOM generation (Syft), image scanning (Trivy, Snyk), SLSA compliance targets.
- Run blameless postmortems and ship the process (template, review cadence, action item tracking), not just the artifact.

## Must-have skills

- 4+ years shipping CI/CD pipelines for production services on GitHub Actions, GitLab CI, CircleCI, Jenkins, or Argo CD.
- Production Kubernetes 1.28+ operation experience: deployments, services, ingress, HPAs, RBAC, troubleshooting pods that crash-loop.
- Strong observability stack fluency: Prometheus + Grafana at minimum; bonus for Datadog, New Relic, or Honeycomb.
- Terraform 1.5+ or Pulumi for the IaC slice DevOps owns (pipeline infra, cluster add-ons, observability stack).
- Written SLOs and error budgets: has run a service by them, not just read the SRE book.
- On-call experience with PagerDuty/Opsgenie and a track record of shipping runbooks.
- Secrets management experience: Vault, SOPS, sealed-secrets, or External Secrets Operator, NOT secrets in env files.
- Strong written English for incident docs, postmortems, and async review.

## Nice-to-have skills

- Progressive delivery: Argo Rollouts, Flagger, LaunchDarkly, or Unleash integration with SLO-driven rollback.
- Supply chain security: Cosign, Sigstore, SLSA, Rekor, or in-toto.
- Service mesh experience: Istio, Linkerd, or Consul Connect.
- Chaos engineering: Chaos Mesh, Litmus, or Gremlin.
- Experience migrating a team from Jenkins to modern CI (GitHub Actions, GitLab).

## Tools and technology

- GitHub Actions / GitLab CI / CircleCI
- Argo CD / Flux (GitOps)
- Kubernetes 1.28+ / Helm 3 / Kustomize
- Terraform 1.5+
- Prometheus / Grafana / Loki
- Datadog / New Relic
- PagerDuty / Opsgenie
- HashiCorp Vault / SOPS / External Secrets
- Cosign / Trivy / Syft
- Argo Rollouts / Flagger

## Reporting structure

Reports to the Head of Engineering, Head of Platform, or SRE Lead. Partners daily with product engineering teams on pipeline and deploy concerns, with Cloud Engineering on the platform / tooling boundary, with Security on supply chain and secrets, and with the on-call rotation across engineering.

## Success metrics (KPIs)

- Deployment frequency: multiple deploys per day per service without an incident spike.
- Lead time for changes (commit to production) under 60 minutes on standard services, under 15 minutes on the hotfix path.
- Change failure rate under 15%.
- MTTR on tier-1 incidents under 30 minutes from page to mitigation.
- SLO attainment above target on all services owned, with error-budget policy enforced when breached.
- Postmortem cadence: 100% of Sev1/Sev2 incidents have a blameless postmortem published within 72 hours with action items tracked to completion.
Frequently asked questions
What does a DevOps Engineer do day-to-day?
A DevOps Engineer owns the pipeline from commit to production: CI/CD on GitHub Actions, GitLab CI, CircleCI, or Argo CD; Kubernetes deployments, services, ingress, and HPAs; an observability stack across Prometheus, Grafana, and Datadog; SLOs with error budgets; a PagerDuty rotation with written runbooks; and blameless postmortems that produce real guardrails. The role focuses on automation, reliability, and deployment velocity rather than the underlying cloud platform itself.
How many years of experience should a mid-level DevOps Engineer have?
A mid-level DevOps Engineer typically has 3-5 years of experience. At that level they should own the CI/CD pipeline for a product area, including build time, caching, and test sharding.
Which KPIs should I hold a DevOps Engineer accountable to?
The most important KPIs for a DevOps Engineer are deployment frequency (multiple deploys per day per service without an incident spike), lead time for changes (commit to production under 60 minutes on standard services, under 15 minutes on the hotfix path), change failure rate (under 15%), and MTTR on tier-1 incidents (under 30 minutes from page to mitigation).
What is your take on Terraform versus Pulumi versus CloudFormation?
Terraform is the default because the talent pool is deepest and it works across AWS, GCP, and Azure. Pulumi earns its cost when your team strongly prefers writing infrastructure in TypeScript or Python and you want real control flow and testing. CloudFormation only makes sense if you are all-in on AWS and value tighter native integration with Service Catalog and StackSets. A good DevOps engineer will not start a religious war. They will match whatever you already run and suggest a migration only when the pain of your current tool is higher than the cost of switching.
How do they approach cost optimization without breaking production?
Measure first, cut second. The standard approach is to run Cost Explorer and Kubecost baseline reports for two weeks to see where money actually goes, then target the top three line items. Common wins are right-sizing oversized EC2 and RDS instances based on CloudWatch metrics, moving non-production workloads to spot or preemptible instances, buying reserved instances or savings plans for steady-state workloads, adding S3 lifecycle rules that move logs older than 90 days to Glacier, and deleting abandoned snapshots, unattached EBS volumes, and idle load balancers. They will never touch production capacity without modeling load first.
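The "target the top three line items" step can be illustrated with a short Python sketch. The function name and the input shape (a mapping from line item to spend over the baseline period) are hypothetical; in practice the figures would come from a Cost Explorer or Kubecost export.

```python
def top_line_items(costs: dict[str, float], n: int = 3) -> list[tuple[str, float, float]]:
    """Return the n largest line items as (name, amount, share_of_total).

    `costs` maps a line item name to its spend over the baseline period.
    """
    total = sum(costs.values())
    if total == 0:
        return []  # nothing was spent, so there is nothing to rank
    ranked = sorted(costs.items(), key=lambda item: item[1], reverse=True)
    return [(name, amount, amount / total) for name, amount in ranked[:n]]
```

Reporting each item's share of the total alongside the raw amount makes it easier to see whether the top three cover most of the bill or whether spend is spread thinly across many services.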
Written by Syed Ali
Founder, Remoteria
Syed Ali founded Remoteria after a decade building distributed teams across 4 continents. He has helped 500+ companies source, vet, onboard, and scale pre-vetted offshore talent in engineering, design, marketing, and operations.
- • 10+ years building distributed remote teams
- • 500+ successful offshore placements across US, UK, EU, and APAC
- • Specialist in offshore vetting and cross-timezone team integration
Last updated: April 12, 2026