Careers

ML Research Engineer (Optimization & Training)

Focus: hyperparameter search, training stability, ablation rigor, compute efficiency, and making structured dynamics models converge.

About the Role

We’re building structured ML on top of a contract-driven signal stack: co-timing → validity gating → nuisance cancellation → residual dynamics learning. This role lives in the middle: make the models train well.

You own the optimization and training craft needed to turn promising dynamics models into stable, high-performing systems: hyperparameter search, training diagnostics, ablations, compute efficiency, and evidence-based answers to “why didn’t it converge?”

This is not “try 1,000 random configs.” It’s disciplined experimentation with controlled comparisons and tight reporting.

What You’ll Own

  • Hyperparameter optimization: systematic searches (Bayesian optimization, bandits, or population-based training where appropriate) with reproducible configs.
  • Training stability: diagnose divergence, pathological gradients, stiffness, numerical instability, and non-identifiability.
  • Ablation discipline: prove which architectural/feature choices matter via controlled ablations.
  • Compute efficiency: profiling, batching, mixed precision/compile modes; keep training and inference costs bounded.
  • Model selection: ship-ready criteria that go beyond loss curves, including calibration, robustness across regimes, abstention behavior, and failure modes.

What You’ll Do

  • Build stable training recipes for structured dynamics models (and strong baselines) on residual datasets.
  • Develop training diagnostics: gradient norms, loss decomposition, sensitivity to window length/sampling rate/validity weighting.
  • Run hyperparameter studies with receipts (seeds, dataset hashes, gate definitions, code versions) and interpret results.
  • Stress-test generalization under regime shifts and across sites/zones; handle validity collapse scenarios where usable data shrinks.
  • Collaborate with Scientific ML, Applied Statistics, and MLOps on targets, evaluation, and pipeline integration.

Concrete Deliverables

  • A tuning and training framework (v1) integrated with receipts (configs, seeds, datasets, metrics).
  • Stable training baselines for residual dynamics (AR/VAR/state-space) plus at least one structured model that converges reliably.
  • An ablation report template and first ablation suite (which state features matter, which regularizers help, what’s brittle).
  • A compute profile + optimization plan with bottlenecks identified and mitigations implemented.
  • A model selection rubric tied to pilot value (lead time, false alarms, calibration, robustness thresholds).

Required Qualifications

  • Strong experience training ML models with nontrivial optimization challenges (time-series, dynamical systems, or physics-informed models).
  • Demonstrated skill in hyperparameter optimization and experiment design (interpreting sweeps, not just running them).
  • Strong Python + PyTorch/JAX (or equivalent) proficiency and ability to write clean, testable training code.
  • Practical numerical instincts: stability, step sizes, stiffness, normalization, failure-mode debugging.

Preferred Qualifications

  • Experience with Neural ODEs / continuous-depth models, stiff solvers, adjoint methods, or related numerical methods.
  • Familiarity with structured dynamics inductive biases (energy/Lagrangian/Hamiltonian styles) where practical.
  • Experience operating tuning infrastructure at scale (distributed training, scheduling, spot instances).
  • Experience with robustness evaluation and uncertainty calibration under nonstationarity.

How You’ll Be Measured (First 60–90 Days)

  • A structured model (or the closest practical equivalent) trains reliably on at least one pilot dataset and beats the baseline on a meaningful metric.
  • Hyperparameter studies become reproducible and interpretable (not “we tried a bunch of stuff”).
  • Training failures are diagnosable: you can explain why a run diverged and what fixes it.
  • Compute costs come under control (clear profiles, faster iterations, fewer wasted sweeps).

Working Style

  • You prefer controlled experiments over “try everything.”
  • You treat convergence as an engineering problem with instrumentation and receipts.
  • You like turning training folklore into repeatable playbooks.

Title & Level

ML Research Engineer (Optimization & Training). Senior IC; can scale to Staff if you own the experimentation/tuning platform. You will partner with Scientific ML, Applied Statistics, and MLOps.

Apply

Send a short note and your resume.


We use what you send only to respond to your application. No spam.