Data Engineer Job Description Template (2026)
A free, copy-ready Data Engineer job description covering responsibilities, must-have skills, tools, seniority variants, and KPIs. Written for hiring managers, not for SEO filler.
Key facts
- Role: Data Engineer
- Reports to: Head of Data
- Must-have skills: 8 items
- Seniority tiers: Junior / Mid / Senior
- KPIs defined: 6 metrics
- Starting price (offshore): $3,400/month
Role summary
A Data Engineer owns the pipelines and warehouse that everything else in the company runs on: ingesting from SaaS sources and production databases, modeling in dbt, orchestrating with Airflow or Dagster, landing curated data in Snowflake, BigQuery, or Redshift, and keeping it fresh, tested, and cheap. This is a backend engineering role with production on-call responsibility — not dashboard work, not ML modeling.
Responsibilities
- Build and maintain ELT pipelines using Fivetran, Airbyte, or custom Python connectors for sources like Salesforce, Stripe, HubSpot, and production Postgres.
- Model the warehouse in dbt with staging, intermediate, and mart layers; write incremental models on billion-row tables; keep full refresh runtime under cost budget.
- Orchestrate DAGs in Airflow, Dagster, or Prefect with proper retries, SLA alerts, and dependency-aware scheduling.
- Design dimensional models (Kimball star/snowflake) with slowly-changing dimensions where the business actually needs history.
- Build CDC pipelines with Debezium, Fivetran HVR, or native Snowflake streams for near-real-time replication from OLTP databases.
- Ship streaming ingestion through Kafka, Kinesis, or Pub/Sub into Snowpipe, BigQuery streaming inserts, or Redshift when 15-minute micro-batches are not enough.
- Write data quality tests with dbt tests, Great Expectations, or Soda Core on primary keys, referential integrity, null rates, and business rules.
- Monitor freshness, volume, and schema drift through Monte Carlo, Elementary, or Datafold; own the pager when pipelines break.
- Optimize warehouse cost — clustering keys, partitioning, materialized views, query profile review, and Snowflake warehouse right-sizing.
- Implement column-level access controls, PII tokenization, and row-level security for HIPAA, SOC 2, or GDPR scopes.
- Run data diffs on dbt refactors through Datafold or a SQL compare so changes to core models do not silently break downstream dashboards.
- Document lineage, ownership, and SLAs in dbt docs, DataHub, or Atlan so the analytics team knows who to page.
Must-have skills
- 4+ years building production data pipelines against a cloud warehouse (Snowflake, BigQuery, Redshift, or Databricks).
- Fluent in SQL and Python — can write window functions, recursive CTEs, and idempotent Python connectors without hand-holding.
- Production experience with dbt including incremental models, snapshots, tests, and macros.
- Hands-on orchestration with Airflow, Dagster, or Prefect in production — not a tutorial project.
- Dimensional modeling knowledge (star schema, SCD Type 1/2) and understanding of OLTP-to-OLAP modeling trade-offs.
- Data quality discipline: knows when to use dbt tests vs Great Expectations vs Datafold and has shipped all three.
- Working knowledge of at least one streaming system (Kafka, Kinesis, Pub/Sub, Flink, or Spark Streaming).
- Comfortable with Docker, Git, CI/CD, and Terraform or the equivalent infra-as-code for data platform resources.
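As a concrete bar for the SQL fluency above, the sketch below runs a window-function running total and a recursive-CTE date spine — two patterns that come up constantly in warehouse work — against an in-memory SQLite database with made-up data.

```python
import sqlite3

# Invented orders table, just to exercise the two SQL patterns.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 7, 10.0), (2, 7, 20.0), (3, 9, 5.0);
""")

# Window function: per-customer running total, a staple of mart models.
running = con.execute("""
    SELECT order_id, customer_id,
           SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_id)
               AS running_total
    FROM orders
    ORDER BY order_id
""").fetchall()
# [(1, 7, 10.0), (2, 7, 30.0), (3, 9, 5.0)]

# Recursive CTE: a date spine, the usual trick for gap-free calendar joins.
spine = [d for (d,) in con.execute("""
    WITH RECURSIVE dates(d) AS (
        SELECT '2026-01-01'
        UNION ALL
        SELECT date(d, '+1 day') FROM dates WHERE d < '2026-01-05'
    )
    SELECT d FROM dates
""")]
# ['2026-01-01', '2026-01-02', '2026-01-03', '2026-01-04', '2026-01-05']
```

A candidate who can write both from scratch, and explain the window frame the `ORDER BY` implies, clears this bar.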
Nice-to-have skills
- Spark on Databricks or EMR for heavy transforms that exceed warehouse SQL.
- CDC experience with Debezium, Fivetran HVR, or Snowflake streams.
- Observability tooling (Monte Carlo, Elementary, Bigeye) for freshness and anomaly detection.
- Experience with a data catalog (DataHub, Atlan, Alation, Collibra).
- FinOps for data — has actually cut a 6-figure warehouse bill.
- dbt Mesh or data mesh patterns for multi-domain org structures.
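The SCD Type 2 pattern named in the must-haves, and the usual target for the CDC feeds above, reduces to one rule: when a tracked attribute changes, close the current version and open a new one. A minimal, dialect-free sketch in plain Python; table and field names are invented.

```python
from datetime import date

# One current dimension row for an invented customer.
dim = [
    {"customer_id": 1, "plan": "free", "valid_from": date(2026, 1, 1),
     "valid_to": None, "is_current": True},
]

def apply_change(dim, customer_id, new_plan, changed_on):
    """Apply one upstream change as SCD Type 2 history."""
    for row in dim:
        if row["customer_id"] == customer_id and row["is_current"]:
            if row["plan"] == new_plan:
                return dim                 # no-op: attribute unchanged
            row["valid_to"] = changed_on   # close the old version
            row["is_current"] = False
    dim.append({"customer_id": customer_id, "plan": new_plan,
                "valid_from": changed_on, "valid_to": None,
                "is_current": True})       # open the new version
    return dim

apply_change(dim, 1, "pro", date(2026, 3, 1))
current = [r for r in dim if r["is_current"]]
# history row keeps plan "free" with valid_to set; current row has plan "pro"
```

In dbt this is what `snapshots` do for you; in a warehouse it is typically a `MERGE`, but the close-and-open logic is identical.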
Tools and technology
- Python / SQL
- dbt Core / Cloud
- Airflow / Dagster
- Snowflake / BigQuery / Redshift
- Fivetran / Airbyte
- Kafka / Kinesis
- Debezium
- Great Expectations
- Monte Carlo / Elementary
- Terraform / Docker
Reporting structure
Reports to the Head of Data, Data Platform Lead, or VP Engineering. Partners daily with data analysts (the primary consumers of marts), ML engineers (feature pipelines), software engineers on the source systems, and DevOps on platform infrastructure.
Seniority variants
How responsibilities shift across junior, mid, and senior levels.
Junior (1-3 years)
- Ship staging and intermediate dbt models under review from a senior engineer.
- Own scoped Fivetran or Airbyte source configurations and monitor their freshness.
- Write dbt tests and Great Expectations checks on existing models.
- Triage P2/P3 pipeline alerts and escalate P1s.
Mid (3-6 years)
- Own a domain of the warehouse end-to-end (e.g. revenue, product, or marketing marts).
- Design new Airflow / Dagster DAGs and the testing strategy for them.
- Review PRs from junior engineers on dbt and Python connector code.
- Participate in pipeline on-call rotation as primary responder.
Senior (6+ years)
- Set warehouse architecture, dbt project conventions, and orchestration patterns across the data org.
- Lead platform migrations (Redshift→Snowflake, Airflow→Dagster) and major cost-optimization projects.
- Mentor mid and junior engineers and run data engineering hiring loops.
- Partner with security, legal, and platform on PII handling, RBAC, and compliance controls.
Success metrics (KPIs)
- Pipeline SLA: greater than 99% of critical marts land on time per SLA, measured in Monte Carlo or Elementary.
- Data quality incidents: zero silent data quality bugs reaching a production dashboard per quarter.
- Warehouse cost per analytical model trending flat or down quarter-over-quarter.
- dbt test coverage on primary-key and not-null assertions greater than 90% on mart-layer models.
- On-call health: mean time to detection under 15 minutes, mean time to recovery under 2 hours for P1 pipeline incidents.
- Time-to-new-source: median under 5 business days from request to production ingestion.
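The on-call and SLA metrics above reduce to simple arithmetic over incident and run records. A minimal sketch, assuming a hypothetical incident log with started/detected/resolved fields:

```python
from datetime import datetime

# Invented incident log; field names are assumptions for illustration.
incidents = [
    {"started":  datetime(2026, 3, 1, 9, 0),
     "detected": datetime(2026, 3, 1, 9, 10),
     "resolved": datetime(2026, 3, 1, 10, 30)},
    {"started":  datetime(2026, 3, 8, 14, 0),
     "detected": datetime(2026, 3, 8, 14, 5),
     "resolved": datetime(2026, 3, 8, 15, 0)},
]

def mean_minutes(deltas):
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60

# MTTD: start of impact until an alert fired; MTTR: alert until recovery.
mttd = mean_minutes([i["detected"] - i["started"] for i in incidents])
mttr = mean_minutes([i["resolved"] - i["detected"] for i in incidents])

# Pipeline SLA: share of critical mart runs that landed before their deadline.
runs = [("fct_orders", True), ("dim_customer", True), ("fct_revenue", False)]
sla_pct = 100 * sum(on_time for _, on_time in runs) / len(runs)

print(f"MTTD {mttd:.1f} min, MTTR {mttr:.1f} min, SLA {sla_pct:.1f}%")
# MTTD 7.5 min, MTTR 67.5 min, SLA 66.7%
```

Observability tools compute these for you, but agreeing on the definitions (when does the clock start? which marts count as critical?) is the part worth doing before the first review cycle.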
Full JD (copy-ready)
Paste this into your ATS or careers page. Edit the company name and any bracketed placeholders.
# Data Engineer — Job Description

## Role summary

A Data Engineer owns the pipelines and warehouse that everything else in the company runs on: ingesting from SaaS sources and production databases, modeling in dbt, orchestrating with Airflow or Dagster, landing curated data in Snowflake, BigQuery, or Redshift, and keeping it fresh, tested, and cheap. This is a backend engineering role with production on-call responsibility — not dashboard work, not ML modeling.

## Responsibilities

- Build and maintain ELT pipelines using Fivetran, Airbyte, or custom Python connectors for sources like Salesforce, Stripe, HubSpot, and production Postgres.
- Model the warehouse in dbt with staging, intermediate, and mart layers; write incremental models on billion-row tables; keep full refresh runtime under cost budget.
- Orchestrate DAGs in Airflow, Dagster, or Prefect with proper retries, SLA alerts, and dependency-aware scheduling.
- Design dimensional models (Kimball star/snowflake) with slowly-changing dimensions where the business actually needs history.
- Build CDC pipelines with Debezium, Fivetran HVR, or native Snowflake streams for near-real-time replication from OLTP databases.
- Ship streaming ingestion through Kafka, Kinesis, or Pub/Sub into Snowpipe, BigQuery streaming inserts, or Redshift when 15-minute micro-batches are not enough.
- Write data quality tests with dbt tests, Great Expectations, or Soda Core on primary keys, referential integrity, null rates, and business rules.
- Monitor freshness, volume, and schema drift through Monte Carlo, Elementary, or Datafold; own the pager when pipelines break.
- Optimize warehouse cost — clustering keys, partitioning, materialized views, query profile review, and Snowflake warehouse right-sizing.
- Implement column-level access controls, PII tokenization, and row-level security for HIPAA, SOC 2, or GDPR scopes.
- Run data diffs on dbt refactors through Datafold or a SQL compare so changes to core models do not silently break downstream dashboards.
- Document lineage, ownership, and SLAs in dbt docs, DataHub, or Atlan so the analytics team knows who to page.

## Must-have skills

- 4+ years building production data pipelines against a cloud warehouse (Snowflake, BigQuery, Redshift, or Databricks).
- Fluent in SQL and Python — can write window functions, recursive CTEs, and idempotent Python connectors without hand-holding.
- Production experience with dbt including incremental models, snapshots, tests, and macros.
- Hands-on orchestration with Airflow, Dagster, or Prefect in production — not a tutorial project.
- Dimensional modeling knowledge (star schema, SCD Type 1/2) and understanding of OLTP-to-OLAP modeling trade-offs.
- Data quality discipline: knows when to use dbt tests vs Great Expectations vs Datafold and has shipped all three.
- Working knowledge of at least one streaming system (Kafka, Kinesis, Pub/Sub, Flink, or Spark Streaming).
- Comfortable with Docker, Git, CI/CD, and Terraform or the equivalent infra-as-code for data platform resources.

## Nice-to-have skills

- Spark on Databricks or EMR for heavy transforms that exceed warehouse SQL.
- CDC experience with Debezium, Fivetran HVR, or Snowflake streams.
- Observability tooling (Monte Carlo, Elementary, Bigeye) for freshness and anomaly detection.
- Experience with a data catalog (DataHub, Atlan, Alation, Collibra).
- FinOps for data — has actually cut a 6-figure warehouse bill.
- dbt Mesh or data mesh patterns for multi-domain org structures.

## Tools and technology

- Python / SQL
- dbt Core / Cloud
- Airflow / Dagster
- Snowflake / BigQuery / Redshift
- Fivetran / Airbyte
- Kafka / Kinesis
- Debezium
- Great Expectations
- Monte Carlo / Elementary
- Terraform / Docker

## Reporting structure

Reports to the Head of Data, Data Platform Lead, or VP Engineering. Partners daily with data analysts (the primary consumers of marts), ML engineers (feature pipelines), software engineers on the source systems, and DevOps on platform infrastructure.

## Success metrics (KPIs)

- Pipeline SLA: greater than 99% of critical marts land on time per SLA, measured in Monte Carlo or Elementary.
- Data quality incidents: zero silent data quality bugs reaching a production dashboard per quarter.
- Warehouse cost per analytical model trending flat or down quarter-over-quarter.
- dbt test coverage on primary-key and not-null assertions greater than 90% on mart-layer models.
- On-call health: mean time to detection under 15 minutes, mean time to recovery under 2 hours for P1 pipeline incidents.
- Time-to-new-source: median under 5 business days from request to production ingestion.
Frequently asked questions
What does a Data Engineer do day-to-day?
A Data Engineer owns the pipelines and warehouse that everything else in the company runs on: ingesting from SaaS sources and production databases, modeling in dbt, orchestrating with Airflow or Dagster, landing curated data in Snowflake, BigQuery, or Redshift, and keeping it fresh, tested, and cheap. This is a backend engineering role with production on-call responsibility — not dashboard work, not ML modeling.
How many years of experience should a mid-level Data Engineer have?
A mid-level Data Engineer typically has 3-6 years of experience. At that level they should own a domain of the warehouse end-to-end (e.g. revenue, product, or marketing marts).
Which KPIs should I hold a Data Engineer accountable to?
The most important KPIs for a Data Engineer are: pipeline SLA (greater than 99% of critical marts land on time, measured in Monte Carlo or Elementary); data quality incidents (zero silent bugs reaching a production dashboard per quarter); warehouse cost per analytical model trending flat or down quarter-over-quarter; and dbt test coverage above 90% on primary-key and not-null assertions for mart-layer models.
ELT or ETL — what is your take?
ELT in most modern stacks. Cheap compute and elastic storage in Snowflake, BigQuery, and Redshift mean it is almost always faster and cheaper to land raw data and transform in the warehouse than to run heavy ETL on a Python box. The exceptions are when source data contains PII that cannot leave a specific region, when the raw data is so large that filtering at extract saves real money, or when the source system cannot handle a full table scan. Your data engineer will ask about those constraints before picking a pattern.
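The land-raw-then-transform argument can be shown end to end in a few lines. This toy ELT sketch lands untouched JSON payloads in a raw table, then builds a typed staging model in SQL — SQLite's `json_extract` stands in for the warehouse's JSON functions, and the Stripe-like field names are invented.

```python
import json
import sqlite3

# Load step: land raw payloads exactly as received, one JSON blob per row.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw_stripe_charges (payload TEXT)")

charges = [
    {"id": "ch_1", "amount": 2000, "currency": "usd", "paid": True},
    {"id": "ch_2", "amount": 500,  "currency": "usd", "paid": False},
]
con.executemany(
    "INSERT INTO raw_stripe_charges VALUES (?)",
    [(json.dumps(c),) for c in charges],
)

# Transform step: shape the raw landing table into a typed staging model
# inside the database, rather than transforming before load (ETL).
con.execute("""
    CREATE TABLE stg_charges AS
    SELECT json_extract(payload, '$.id')     AS charge_id,
           json_extract(payload, '$.amount') AS amount_cents,
           json_extract(payload, '$.paid')   AS paid
    FROM raw_stripe_charges
""")

paid_charges = con.execute(
    "SELECT charge_id, amount_cents FROM stg_charges WHERE paid = 1"
).fetchall()
# [('ch_1', 2000)]
```

Because the raw table is untouched, a bad transform is a re-run of one SQL statement, not a re-extract from the source — which is most of the practical case for ELT.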
How do they keep data quality from degrading over time?
Tests, monitoring, and ownership. Every critical table gets dbt tests on primary keys, referential integrity, and null rates. Every SaaS source gets a Monte Carlo, Elementary, or Datafold freshness and volume monitor with alerts going to the right Slack channel. Every dbt mart gets a named owner in the model YAML so when something breaks the right person is paged. They also run data diffs on refactors through Datafold or a homegrown SQL compare so changes to core models do not silently break downstream dashboards.
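A homegrown SQL compare of the kind mentioned above can be as small as three set queries: keys only in production, keys only in the refactored model, and shared keys whose values changed. The table names and `revenue` column below are invented for illustration.

```python
import sqlite3

# Two invented versions of the same mart, keyed on `id`.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE prod_mart (id INTEGER PRIMARY KEY, revenue REAL);
    CREATE TABLE dev_mart  (id INTEGER PRIMARY KEY, revenue REAL);
    INSERT INTO prod_mart VALUES (1, 100.0), (2, 200.0), (3, 300.0);
    INSERT INTO dev_mart  VALUES (1, 100.0), (2, 250.0), (4, 400.0);
""")

# Keys the refactor dropped, keys it introduced, and rows whose value changed.
only_prod = con.execute(
    "SELECT id FROM prod_mart EXCEPT SELECT id FROM dev_mart").fetchall()
only_dev = con.execute(
    "SELECT id FROM dev_mart EXCEPT SELECT id FROM prod_mart").fetchall()
changed = con.execute("""
    SELECT p.id FROM prod_mart p JOIN dev_mart d ON p.id = d.id
    WHERE p.revenue IS NOT d.revenue  -- NULL-safe inequality in SQLite
""").fetchall()

print(only_prod, only_dev, changed)  # [(3,)] [(4,)] [(2,)]
```

Datafold automates exactly this comparison (plus sampling for large tables); the homegrown version is fine until table sizes make full scans expensive.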
Written by Syed Ali
Founder, Remoteria
Syed Ali founded Remoteria after a decade building distributed teams across 4 continents. He has helped 500+ companies source, vet, onboard, and scale pre-vetted offshore talent in engineering, design, marketing, and operations.
- 10+ years building distributed remote teams
- 500+ successful offshore placements across US, UK, EU, and APAC
- Specialist in offshore vetting and cross-timezone team integration
Last updated: April 12, 2026