Careers

Data Engineer / Analytics Engineer (Residual Datasets + Lineage)

Focus: build residual datasets (post co-timing + gating + cancellation), enforce lineage/receipts, and deliver clean feature tables.

About the Role

Our platform’s core idea is to align first, then cancel, then interpret. The most valuable data we produce is not raw telemetry but the residual that remains after co-timing, validity gating, and nuisance cancellation, together with the receipts that explain exactly how that residual was produced.

This role owns the data layer that makes the residual usable: dataset construction pipelines, feature views, schema discipline, and lineage that lets anyone answer “which data went into this output, under what gates, with what versions?”

What You’ll Own

  • Residual dataset pipelines: raw streams → co-timed streams → canceled residuals → windowed features, at scale and with clear contracts.
  • Lineage + receipts in data form: every dataset row carries pointers to inputs, versions, configs, calibration state, and validity decisions.
  • Feature tables / analytics views: stable feature sets for ML and product analytics (validity uptime, alarm quality, commissioning outcomes).
  • Data quality & validation: prevent silent corruption (missingness, skew, misaligned timestamps, schema drift) with tests and checks.
  • Backfills & schema evolution: reprocess data and publish versioned datasets when algorithms or configs change, without breaking consumers.
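
As a rough illustration of what a "clear contract" for the windowing step could look like, here is a minimal Python sketch. It assumes fixed-length windows aligned to the Unix epoch, so that a rerun or backfill assigns every sample to the same window regardless of arrival order. The names (Window, window_for, assign_windows) and the 10-second window length are hypothetical, not part of our actual codebase.

```python
from dataclasses import dataclass

WINDOW_MS = 10_000  # hypothetical window length (10 s), for illustration only


@dataclass(frozen=True)
class Window:
    stream_id: str
    start_ms: int  # inclusive
    end_ms: int    # exclusive


def window_for(stream_id: str, ts_ms: int, window_ms: int = WINDOW_MS) -> Window:
    """Map a timestamp to its epoch-aligned window.

    The boundaries depend only on the timestamp and the window length,
    never on when or in what order the sample arrived.
    """
    start = (ts_ms // window_ms) * window_ms
    return Window(stream_id, start, start + window_ms)


def assign_windows(stream_id, samples, window_ms=WINDOW_MS):
    """Group (ts_ms, value) samples by window, ordered by timestamp."""
    out = {}
    for ts_ms, value in sorted(samples, key=lambda s: s[0]):
        out.setdefault(window_for(stream_id, ts_ms, window_ms), []).append(value)
    return out
```

With this shape, late or out-of-order arrivals change when a window is complete, not which window a sample belongs to, which is what makes backfills comparable to the original run.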

What You’ll Do

  • Design canonical schemas for runs/windows, validity verdicts (valid/borderline/invalid + reasons), residual artifacts, and derived features.
  • Build reproducible pipelines (batch + incremental) with deterministic windowing and feature computation.
  • Implement dataset receipts: snapshot IDs/hashes, code + config versions, upstream algorithm versions, calibration state, gating rules used.
  • Create queryable analytics: validity uptime by site/zone/device, abstain reason distributions, commissioning pass rates, pipeline health.
  • Partner with ML + backend to support training-ready datasets, inference joins, and portal drilldowns (“show the evidence behind this alert”).
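
To make the idea of a dataset receipt concrete, here is a minimal Python sketch under stated assumptions: the receipt is a small immutable record of the fields listed above, and its ID is a SHA-256 hash of a canonical JSON encoding. The field names, the Receipt class, and the helper functions are hypothetical; the real schema is something this role would design.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Receipt:
    input_snapshot_ids: tuple   # sorted IDs of the upstream snapshots
    code_version: str           # e.g. a commit hash
    config_hash: str            # hash of the pipeline config
    algorithm_versions: tuple   # sorted (name, version) pairs
    calibration_id: str
    gating_rules_version: str


def canonical_hash(obj) -> str:
    """SHA-256 of a canonical JSON encoding (sorted keys, no extra whitespace).

    Assumes obj is JSON-serializable.
    """
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def make_receipt(input_snapshot_ids, code_version, config,
                 algorithm_versions, calibration_id, gating_rules_version):
    """Build a receipt whose content does not depend on input ordering."""
    return Receipt(
        input_snapshot_ids=tuple(sorted(input_snapshot_ids)),
        code_version=code_version,
        config_hash=canonical_hash(config),
        algorithm_versions=tuple(sorted(algorithm_versions.items())),
        calibration_id=calibration_id,
        gating_rules_version=gating_rules_version,
    )


def receipt_id(receipt: Receipt) -> str:
    """Stable ID that a dataset row can carry as its lineage pointer."""
    return canonical_hash(asdict(receipt))
```

Because the ID is derived from content, two runs with the same inputs, code, config, calibration, and gating rules share a receipt, and any change to one of them yields a new ID that downstream rows can be traced to.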

Concrete Deliverables

  • A Residual Dataset Spec: schemas, partitions, naming conventions, versioning strategy, lineage fields.
  • A working pipeline that produces window-level feature tables suitable for ML training plus validity gate tables with reason codes.
  • A data quality test suite (CI + scheduled): missingness, timestamp sanity, schema drift detection, outlier flags.
  • A backfill & migration playbook: recompute safely when co-timing/cancellation code changes while keeping old versions accessible.
  • A simple analytics dashboard: validity uptime, abstain reasons, pipeline freshness.
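
For a sense of scale, the checks in the data quality suite can start very small. The Python sketch below assumes window-level rows represented as plain dictionaries; the column names (window_start_ms, device_id, residual_rms, verdict) and function names are hypothetical and only illustrate schema drift detection, timestamp sanity, and missingness reporting.

```python
# Hypothetical expected schema for a window-level feature row.
EXPECTED_SCHEMA = {
    "window_start_ms": int,
    "device_id": str,
    "residual_rms": float,
    "verdict": str,
}


def schema_issues(row, expected=EXPECTED_SCHEMA):
    """Report missing columns, unexpected columns, and type drift."""
    issues = []
    for name, typ in expected.items():
        if name not in row:
            issues.append(f"missing column: {name}")
        elif row[name] is not None and not isinstance(row[name], typ):
            issues.append(f"type drift in {name}: got {type(row[name]).__name__}")
    for name in sorted(row.keys() - expected.keys()):
        issues.append(f"unexpected column: {name}")
    return issues


def timestamp_issues(starts_ms, window_ms):
    """Flag non-increasing window starts and gaps larger than one window."""
    issues = []
    for prev, cur in zip(starts_ms, starts_ms[1:]):
        if cur <= prev:
            issues.append(f"non-increasing timestamp: {prev} -> {cur}")
        elif cur - prev > window_ms:
            missing = (cur - prev) // window_ms - 1
            issues.append(f"gap of {missing} window(s) after {prev}")
    return issues


def missing_fraction(rows, column):
    """Fraction of rows where the column is absent or None."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(column) is None) / len(rows)
```

The same functions could run in CI against fixtures and on a schedule against fresh partitions, with the returned issue strings feeding the freshness and missingness reporting.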

Required Qualifications

  • Strong experience with data engineering: reliable pipelines, data models, production-grade datasets.
  • Proficiency with common stacks (choose what fits): SQL + warehouses (Postgres/BigQuery/Snowflake), object storage patterns, orchestration (Airflow/Dagster/Prefect), Python for transforms/validation.
  • Ability to design schema evolution and versioning strategies that don’t break downstream consumers.
  • Comfort with time-series data realities: ordering, late arrivals, missing segments, and clock drift or skew.

Preferred Qualifications

  • Experience with lineage/provenance systems (manifests, immutable logs, dataset versioning).
  • Familiarity with ML feature engineering workflows (feature tables, splits, leakage prevention).
  • Experience with IoT/device telemetry ingestion and fleet-scale data quality issues.
  • Basic DSP literacy (windowing, spectral summaries) helpful for feature correctness and sanity checks.

How You’ll Be Measured (First 60–90 Days)

  • You ship a training-ready residual dataset (and feature view) for at least one pilot type.
  • Lineage is queryable: given a feature row or alert, the team can trace it back to raw inputs + versions.
  • Pipeline quality improves: fewer unexplained gaps, clearer freshness and missingness reporting.
  • Backfills are safe and repeatable when co-timing/cancellation code changes.

Working Style

  • You treat every dataset as a product: documented, versioned, tested, reproducible.
  • You don’t tolerate silent schema drift or “unknown joins.”
  • You design for audits: the system can always answer “how was this computed?”

Title & Level

Data Engineer / Analytics Engineer (Residual Datasets + Lineage). Mid-to-senior level; the role can scale to Staff for someone who owns the data model and lineage architecture. You will partner with backend/data, Scientific ML, validation, and product/UI.

Apply

Send a short note and your resume.

