ZipRecruiter

Machine Learning Engineer, Training Infrastructure

ZipRecruiter, San Francisco, California, United States, 94199

Overview

Job Title:

Machine Learning Engineer, Training Infrastructure

Position Type:

Full time

Location:

San Francisco, CA, USA

Salary Range:

$150,000 - $250,000 (USD)

Job ID#:

158135

Job Description:

We are looking for an ML Engineer with 3+ years of experience in high-performance computing systems to manage and optimize our computational infrastructure for training and deploying our machine learning models. The ideal candidate has diverse experience managing ML workloads at scale, supporting our 3DVAE and video diffusion models. We encourage you to apply even if you don't meet every requirement; we value curiosity, creativity, and the drive to solve hard problems.

Responsibilities

Design, implement, and maintain scalable computing solutions for training and deploying ML models, ensuring infrastructure can handle large video datasets.

Manage and optimize the performance of computing clusters or cloud instances, for example on AWS or Google Cloud, to support distributed training.

Ensure that infrastructure can handle the resource-intensive tasks associated with training large generative models.

Monitor system performance and implement improvements to maximize efficiency and utilization, using tools like Airflow for orchestration.

Collaborate across research teams to understand their computational needs and provide appropriate solutions, facilitating seamless model deployment.

Requirements

Bachelor’s degree in Computer Science, Information Technology, or a related field, with a focus on system administration.

Experience with cloud computing platforms such as Amazon Web Services, Google Cloud, or Microsoft Azure, essential for managing large-scale ML workloads.

Focus on deployment and scalability to ensure the computational backbone supports the company’s ML efforts.

Appreciation for sound engineering processes, including version control and CI/CD.

Knowledge of containerization technologies like Docker and Kubernetes required for deployments at scale.

Understanding of distributed training techniques and how to scale models across multi-node clusters to meet the demands of video model training.

Strong problem-solving and communication skills, given the need to collaborate with diverse teams.

About Us

Founded in 2009, IntelliPro is a global leader in talent acquisition and HR solutions. Our commitment to delivering service, fostering employee growth, and building enduring partnerships sets us apart. We operate in over 160 countries, including the USA, China, Canada, Singapore, Japan, Philippines, UK, India, Netherlands, and the EU. IntelliPro is an Equal Opportunity Employer and values diversity and inclusivity. Our hiring and interview processes accommodate the needs of all applicants.

Compensation: The pay offered to a successful candidate will be determined by factors including education, work experience, location, job responsibilities, and certifications. IntelliPro provides a comprehensive benefits package, all subject to eligibility.
