Medium
ML Infrastructure Engineer (Staff / Principal)
Medium, California, Missouri, United States, 65018
This position is posted by Jobgether on behalf of a partner company. We are currently looking for an ML Infrastructure Engineer (Staff / Principal) in California (USA).
This role offers the opportunity to lead the development and optimization of cutting-edge ML infrastructure for large-scale generative and predictive AI models. You will work at the intersection of machine learning, physics, and computational chemistry, driving scalable, high-performance systems that accelerate AI research in molecular modeling. The position involves designing distributed training pipelines, optimizing GPU operations, and building robust MLOps frameworks that push the boundaries of AI performance. You will collaborate closely with researchers, engineers, and scientists, mentoring junior team members while contributing to long-term technical strategy. This is a hands‑on, high‑impact role where your work directly enables groundbreaking discoveries in molecular AI.
Accountabilities
Lead engineering efforts for building and scaling distributed ML training and inference infrastructure across GPU clusters and cloud environments.
Optimize model efficiency in terms of throughput, latency, memory, and GPU utilization, pushing hardware to its performance limits.
Design and implement MLOps tools and frameworks for automated, reliable deployment and evaluation of AI models.
Collaborate with researchers and cross‑functional teams to integrate infrastructure with generative and predictive AI workflows.
Drive long‑term platform vision, contributing to architectural decisions, tooling improvements, and best practices.
Mentor junior engineers and research interns, fostering a culture of technical excellence and innovation.
Requirements
Extensive experience in distributed ML training and inference on large‑scale GPU clusters.
Proficiency in PyTorch, PyTorch Lightning, PyTorch Geometric, Ray, or similar frameworks.
Strong engineering skills with the ability to design, implement, and maintain robust, scalable systems.
Experience optimizing GPU workloads and performance engineering for high‑throughput ML pipelines.
Independent thinker with a strong sense of ownership and ability to deliver from first principles to production‑quality systems.
Curiosity and problem‑solving mindset for working at the intersection of AI, physics, chemistry, and biology.
Nice to Have
Experience building and maintaining cluster infrastructure with Kubernetes and Terraform.
Expertise in GPU programming, XLA, Triton, CUDA, or deep learning compiler stacks.
Familiarity with molecular systems (proteins, small molecules, 3D structures), ML force fields, or point cloud data.
Experience contributing to highly collaborative, cross‑functional teams in research or production ML environments.
Benefits
Competitive salary and equity package.
Comprehensive health benefits: medical, dental, and vision fully covered for employees.
401(k) plan.
Open (unlimited) PTO policy and paid family leave (maternity and paternity).
Life, long‑term, and short‑term disability insurance.
Free meals at office locations and other employee perks.
Opportunities for growth, mentorship, and hands‑on impact in cutting‑edge molecular AI research.
Why Apply Through Jobgether?
We use an AI‑powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top‑fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.
We appreciate your interest and wish you the best!
Data Privacy Notice:
By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre‑contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.