The Walt Disney Company

Sr ML Ops Engineer

The Walt Disney Company, Nicasio, California, United States, 94946

Overview

The Skywalker Sound Development Group is seeking a highly skilled Sr ML Ops Engineer to build and maintain the infrastructure powering our machine learning and AI frameworks. This position enables seamless workflows for model training, retraining, and deployment, ensuring that cutting-edge AI solutions operate reliably at scale.

This role is considered Hybrid, which means the employee will work 2-3 days onsite at our Nicasio, CA office and occasionally from home.

What You’ll Do

Develop, deploy, and maintain scalable infrastructure for machine learning model training, retraining, and inference.
Design and optimize CI/CD pipelines specifically tailored for machine learning workflows, ensuring efficient delivery from research to production.
Implement robust monitoring and logging systems to track model performance and identify potential issues in production environments.
Collaborate with AI researchers and data scientists to ensure infrastructure aligns with project requirements and supports iterative experimentation.
Manage compute resources (cloud and on-premises) to enable large-scale distributed training and inference tasks.
Containerize machine learning models and applications using Docker and deploy them via Kubernetes or equivalent orchestration systems.
Automate deployment workflows for serving ML models using frameworks such as TorchServe, TensorFlow Serving, and FastAPI.
Implement model versioning, rollback strategies, and governance for maintaining production stability.
Optimize cost efficiency and performance of machine learning workflows in cloud environments such as AWS, GCP, or Azure.
Stay updated with emerging ML Ops tools and practices, integrating them into existing workflows to improve performance and reliability.

What We’re Looking For

Bachelor’s degree in Computer Science, Engineering, or a related field. Master’s degree is preferred.
5+ years of experience in DevOps, Site Reliability Engineering, or a related role, with at least 2 years focusing on ML Ops.
Expertise in building and maintaining CI/CD pipelines for machine learning applications.
Strong proficiency with containerization (Docker) and orchestration tools (Kubernetes).
Proficiency in deploying machine learning models using frameworks such as TensorFlow Serving, TorchServe, or custom APIs.
Deep understanding of cloud infrastructure and services (AWS, GCP, or Azure) for ML workloads, including GPU and TPU utilization.
Experience managing large-scale distributed training workflows and optimizing resource allocation.
Familiarity with tools like MLflow, DVC, Weights & Biases, or similar for data and model tracking and versioning.
Solid understanding of security best practices for machine learning systems and sensitive data handling.
Strong scripting and programming skills in Python, Bash, or Go.

Preferred Qualifications

Experience with data orchestration tools like Weights & Biases, DataChain, etc., for managing ML workflows.
Hands-on experience with automated hyperparameter tuning and optimization frameworks.
Familiarity with model monitoring tools like Prometheus, Grafana, or custom solutions for model drift and data quality checks.
Experience integrating pre-trained foundation models and managing their deployment at scale.
Contributions to open-source ML Ops projects or relevant research publications.

Compensation and Benefits

The hiring range for this position in San Francisco, CA is $152,100 to $203,900 per year. The base pay actually offered will take into account internal equity and may vary depending on geographic region, job-related knowledge, skills, and experience, among other factors. A bonus and/or long-term incentive units may be provided as part of the compensation package, in addition to the full range of medical, financial, and other benefits, dependent on the level and position offered.

Job Details

Seniority level: Mid-Senior level
Employment type: Full-time
Job function: Information Technology
Industries: Entertainment Providers

Note: This job description reflects the duties and responsibilities of the position at the time of posting and is subject to change without notice.
