Strativ Group

Senior ML Infrastructure / MLOps Engineer (Sunnyvale)

Strativ Group, Sunnyvale, California, United States, 94087

We are partnered with a cutting-edge AI research company building large-scale intelligent systems that operate in the physical world. The company is well-funded, operates at real-world scale, and is backed by major institutional partners. Their work sits at the intersection of machine learning, robotics, and systems engineering, pushing the boundaries of what AI systems can design, experiment with, and manufacture.

The business is developing advanced AI platforms with broad physical capability, enabling models to reason, act, and learn across complex real-world environments. With deep technical roots and proven progress on large, mission-critical programs, they are building end-to-end systems that integrate large models, reinforcement learning, and real-world data at scale. The team is composed of engineers and researchers with backgrounds spanning ML research, systems engineering, and large-scale infrastructure. This is a highly technical environment focused on first-principles thinking, ownership of real systems, and building capabilities that do not yet exist.

They are hiring a Senior ML Infrastructure / MLOps Engineer to own and scale the infrastructure that underpins model training, fine-tuning, evaluation, and deployment. You will work closely with ML researchers, RL scientists, and systems engineers to enable rapid, reliable experimentation across large language models, RL agents, and surrogate models. This role sits at the core of the ML stack, with significant scope for technical ownership and long-term progression.

What You'll Work On

Building and operating scalable infrastructure for large-scale training, fine-tuning, and RLHF/DPO workflows
Designing distributed training systems and high-performance experimentation pipelines
Developing data pipelines, dataset versioning, experiment tracking, and reproducibility frameworks
Operating containerized training and inference environments, CI/CD for models, and automated evaluation pipelines
Collaborating cross-functionally to support fast iteration and robust deployment of research models into production

Key Experience Required

Strong experience in ML infrastructure, MLOps, or production ML systems
Hands-on experience with distributed training, experiment management, and model deployment workflows
Familiarity with containerization, orchestration, data/version governance, and evaluation pipelines
Ability to design reliable, scalable systems that support high-throughput experimentation
Comfortable operating across research, infrastructure, and engineering in fast-paced environments

Above all, they are looking for engineers who demonstrate technical excellence, strong ownership, and the ability to build foundational systems from first principles.

Please apply ASAP if interested.