About the Role
Our platform’s core idea is: align first, then cancel, then interpret. The most valuable data we produce is not
raw telemetry—it’s the residual after co-timing, validity gating, and nuisance cancellation, plus the receipts
that explain exactly how that residual was produced.
This role owns the data layer that makes those residuals and receipts usable: dataset construction pipelines, feature views, schema
discipline, and lineage that lets anyone answer “which data went into this output, under what gates, with what
versions?”
What You’ll Own
- Residual dataset pipelines: raw streams → co-timed streams → canceled residuals → windowed features, at scale and with clear contracts.
- Lineage + receipts in data form: every dataset row carries pointers to inputs, versions, configs, calibration state, and validity decisions.
- Feature tables / analytics views: stable feature sets for ML and product analytics (validity uptime, alarm quality, commissioning outcomes).
- Data quality & validation: prevent silent corruption (missingness, skew, misaligned timestamps, schema drift) with tests and checks.
- Backfills & schema evolution: reprocess data and publish versioned datasets when algorithms/configs change, without breaking consumers.
What You’ll Do
- Design canonical schemas for runs/windows, validity verdicts (valid/borderline/invalid + reasons), residual artifacts, and derived features.
- Build reproducible pipelines (batch + incremental) with deterministic windowing and feature computation.
- Implement dataset receipts: snapshot IDs/hashes, code + config versions, upstream algorithm versions, calibration state, gating rules used.
- Create queryable analytics: validity uptime by site/zone/device, abstain reason distributions, commissioning pass rates, pipeline health.
- Partner with ML + backend to support training-ready datasets, inference joins, and portal drilldowns (“show the evidence behind this alert”).
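To give candidates a feel for what a dataset receipt could look like, here is a minimal sketch in Python. It is an illustration under stated assumptions, not our production schema: the class name `DatasetReceipt`, every field name, and the choice of SHA-256 over canonical JSON are hypothetical.

```python
from dataclasses import dataclass, asdict
from typing import Dict, Tuple
import hashlib
import json


@dataclass(frozen=True)
class DatasetReceipt:
    """Hypothetical receipt describing how one dataset snapshot was produced."""
    input_snapshot_ids: Tuple[str, ...]   # upstream snapshots that were read
    code_version: str                     # e.g. a commit hash of the pipeline code
    config_version: str                   # version of the pipeline config
    algorithm_versions: Dict[str, str]    # upstream algorithm name -> version
    calibration_state_id: str             # calibration state in effect
    gating_rules_id: str                  # validity gating rules that were applied

    def receipt_id(self) -> str:
        """Deterministic ID: SHA-256 over a canonical JSON form of the receipt."""
        payload = asdict(self)
        # Sort inputs so the ID does not depend on the order they were listed in.
        payload["input_snapshot_ids"] = sorted(payload["input_snapshot_ids"])
        canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Storing a `receipt_id()` value on each dataset row is one way a feature row could point back to its inputs and versions: two runs with identical inputs, code, config, calibration, and gating rules get the same ID, and any change to one of them yields a different ID.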
Concrete Deliverables
- A Residual Dataset Spec: schemas, partitions, naming conventions, versioning strategy, lineage fields.
- A working pipeline that produces window-level feature tables suitable for ML training plus validity gate tables with reason codes.
- A data quality test suite (CI + scheduled): missingness, timestamp sanity, schema drift detection, outlier flags.
- A backfill & migration playbook: recompute safely when co-timing/cancellation code changes while keeping old versions accessible.
- A simple analytics dashboard: validity uptime, abstain reasons, pipeline freshness.
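As a rough picture of the kind of checks the data quality test suite might contain, the sketch below covers two of the listed items, timestamp sanity and missingness, for a single window. The function names, the `max_gap_s` threshold, and the use of `None` to mark a missing sample are assumptions made for this example only.

```python
from typing import List, Optional, Sequence


def timestamp_issues(timestamps: Sequence[float], max_gap_s: float) -> List[str]:
    """Report out-of-order or duplicate timestamps and gaps longer than max_gap_s."""
    issues = []
    for i in range(1, len(timestamps)):
        delta = timestamps[i] - timestamps[i - 1]
        if delta <= 0:
            issues.append(f"non-increasing timestamp at index {i}")
        elif delta > max_gap_s:
            issues.append(f"gap of {delta:.3f}s before index {i}")
    return issues


def missing_fraction(values: Sequence[Optional[float]]) -> float:
    """Fraction of samples that are None; an empty window counts as 0.0."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)
```

Checks like these can run in CI against fixture data and on a schedule against fresh partitions, so that gaps and missingness are reported explicitly instead of being discovered downstream.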
Required Qualifications
- Strong experience with data engineering: reliable pipelines, data models, production-grade datasets.
- Proficiency with common stacks (choose what fits): SQL + warehouses (Postgres/BigQuery/Snowflake), object storage patterns, orchestration (Airflow/Dagster/Prefect), Python for transforms/validation.
- Ability to design schema evolution and versioning strategies that don’t break downstream consumers.
- Comfort with time-series data realities: ordering, late arrivals, missing segments, and clock weirdness.
Preferred Qualifications
- Experience with lineage/provenance systems (manifests, immutable logs, dataset versioning).
- Familiarity with ML feature engineering workflows (feature tables, splits, leakage prevention).
- Experience with IoT/device telemetry ingestion and fleet-scale data quality issues.
- Basic DSP literacy (windowing, spectral summaries) helpful for feature correctness and sanity checks.
How You’ll Be Measured (First 60–90 Days)
- You ship a training-ready residual dataset (and feature view) for at least one pilot type.
- Lineage is queryable: given a feature row or alert, the team can trace it back to raw inputs + versions.
- Pipeline quality improves: fewer mystery gaps, clearer freshness and missingness reporting.
- Backfills are safe and repeatable when co-timing/cancellation code changes.
Working Style
- You treat every dataset as a product: documented, versioned, tested, reproducible.
- You don’t tolerate silent schema drift or “unknown joins.”
- You design for audits: the system can always answer “how was this computed?”
Title & Level
Data Engineer / Analytics Engineer (Residual Datasets + Lineage). Mid-to-senior; can scale to Staff if owning the data model + lineage architecture.
You will partner with backend/data, Scientific ML, validation, and product/UI.
Apply
Send a short note and your resume.