Disney
Overview
The Skywalker Sound Development Group is seeking a highly skilled Sr ML Ops Engineer to build and maintain the infrastructure powering our machine learning and AI frameworks. This position is crucial in enabling seamless workflows for model training, retraining, and deployment, ensuring that cutting-edge AI solutions operate reliably at scale. As a Sr ML Ops Engineer, you will act as the backbone of our AI/ML efforts, bridging the gap between data science, research, and production engineering. Your expertise in DevOps principles, model deployment strategies, and scalable infrastructure will support the development of transformative audio solutions for speech processing, style transfer, and source separation in media production workflows. This role is considered Hybrid, which means the employee will work 2-3 days onsite at our Nicasio, CA office and occasionally from home.
What You’ll Do
Develop, deploy, and maintain scalable infrastructure for machine learning model training, retraining, and inference.
Design and optimize CI/CD pipelines specifically tailored for machine learning workflows, ensuring efficient delivery from research to production.
Implement robust monitoring and logging systems to track model performance and identify potential issues in production environments.
Collaborate with AI researchers and data scientists to ensure infrastructure aligns with project requirements and supports iterative experimentation.
Manage compute resources (cloud and on-premises) to enable large-scale distributed training and inference tasks.
Containerize machine learning models and applications using Docker and deploy them via Kubernetes or equivalent orchestration systems.
Automate deployment workflows for serving ML models using frameworks such as TorchServe, TensorFlow Serving, and FastAPI.
Implement model versioning, rollback strategies, and governance for maintaining production stability.
Optimize cost efficiency and performance of machine learning workflows in cloud environments such as AWS, GCP, or Azure.
Stay updated with emerging ML Ops tools and practices, integrating them into existing workflows to improve performance and reliability.
What We’re Looking For
Bachelor’s degree in Computer Science, Engineering, or a related field. Master’s degree is preferred.
5+ years of experience in DevOps, Site Reliability Engineering, or a related role, with at least 2 years focusing on ML Ops.
Expertise in building and maintaining CI/CD pipelines for machine learning applications.
Strong proficiency with containerization (Docker) and orchestration tools (Kubernetes).
Proficiency in deploying machine learning models using frameworks such as TensorFlow Serving, TorchServe, or custom APIs.
Deep understanding of cloud infrastructure and services (AWS, GCP, or Azure) for ML workloads, including GPU and TPU utilization.
Experience managing large-scale distributed training workflows and optimizing resource allocation.
Familiarity with tools like MLflow, DVC, Weights & Biases, or similar for data and model tracking and versioning.
Solid understanding of security best practices for machine learning systems and sensitive data handling.
Strong scripting and programming skills in Python, Bash, or Go.
Preferred Qualifications
Experience with data orchestration tools like DataChain, Weights & Biases, etc., for managing ML workflows.
Hands-on experience with automated hyperparameter tuning and optimization frameworks.
Familiarity with model monitoring tools like Prometheus, Grafana, or custom solutions for model drift and data quality checks.
Experience integrating pre-trained foundation models and managing their deployment at scale.
Contributions to open-source ML Ops projects or relevant research publications.
Compensation
The hiring range for this position in San Francisco, CA is $152,100 to $203,900 per year. The base pay actually offered will take into account internal equity and also may vary depending on the candidate’s geographic region, job-related knowledge, skills, and experience among other factors. A bonus and/or long-term incentive units may be provided as part of the compensation package, in addition to the full range of medical, financial, and/or other benefits, dependent on the level and position offered.
Disability Accommodation for Employment Applications
The Walt Disney Company and its Affiliated Companies are Equal Employment Opportunity employers and welcome all job seekers including individuals with disabilities and veterans with disabilities. If you have a disability and believe you need a reasonable accommodation in order to search for a job opening or apply for a position, visit the Disney candidate disability accommodations FAQs. We will only respond to those requests that are related to the accessibility of the online application system due to a disability.