G2M Talent
Job Summary
We are a well‑funded early‑stage research team building the engineering foundations required to understand how modern neural networks learn and behave. We are seeking a Machine Learning Engineer to design, scale, and operate the infrastructure behind large‑scale training, experimentation, and mechanistic analysis of transformer‑based models.
Responsibilities
Design, build, and maintain end‑to‑end machine learning pipelines supporting large‑scale training and evaluation of deep neural networks
Optimize training and inference throughput for transformer‑based and novel model architectures
Develop and operate distributed training infrastructure across multi‑GPU and multi‑node environments
Collaborate closely with researchers to translate experimental goals into reliable, scalable engineering systems
Instrument training runs to surface meaningful signals around optimization behavior, representations, and model performance
Implement tooling for experiment tracking, reproducibility, and comparative analysis across runs
Debug training instabilities, performance bottlenecks, and systems‑level failures in complex ML workloads
Support rapid iteration on architectures, optimizers, and training regimes through robust infrastructure design
Required Qualifications
Strong experience in deep learning and large‑scale model training
Deep familiarity with PyTorch or JAX; working knowledge of lower‑level tooling (e.g., Triton or custom kernels) is a plus
Proven experience building or operating ML infrastructure for training, evaluation, or experimentation
Strong systems intuition around performance, memory, and distributed execution
Ability to design and orchestrate end‑to‑end ML workflows, from data ingestion to training and evaluation
Clear written and verbal communication skills, especially when working with research collaborators
Ability to learn quickly and operate effectively in an ambiguous, research‑driven environment
Pay: $150,000 – $400,000 per year
Benefits
Relocation assistance
Work Location: In person