Remote: This role is remote. However, if you live within 50 miles of Mountain View, you are expected to report to that location at least three times a week.
The Role:
We are seeking an experienced, technically oriented, impact-driven expert in ML Training Infrastructure with a strong ability to execute hands-on technical work. In this role, you will be responsible for designing and building scalable, reliable, and high-performance AI/ML platform infrastructure to support advanced AI research and model development initiatives. As a Staff ML System Engineer, you will collaborate closely with machine learning engineers, research scientists, and other partners to develop state-of-the-art AI solutions that enable the future of intelligent driving technologies across General Motors vehicles.
What You'll Do:
Lead the design and development of scalable, reliable, high-performance ML frameworks to support model training at scale.
Lead performance analysis and optimization of model training to scale distributed training workflows, maximize resource utilization across heterogeneous hardware environments, and reduce costs.
Enhance system observability, debuggability, operational excellence, and user experience.
Collaborate with cross-functional teams to integrate new features and technologies into the platform.