Disneyland Hong Kong

Sr ML Ops Engineer

Disneyland Hong Kong, Nicasio, California, United States, 94946

Overview

The Skywalker Sound Development Group is seeking a highly skilled Sr ML Ops Engineer to build and maintain the infrastructure powering our machine learning and AI frameworks. This role supports seamless workflows for model training, retraining, and deployment, ensuring cutting-edge AI solutions operate reliably at scale. This role is hybrid: 2-3 days onsite at the Nicasio, CA office, with occasional work from home.

What You’ll Do

Develop, deploy, and maintain scalable infrastructure for machine learning model training, retraining, and inference.
Design and optimize CI/CD pipelines tailored for machine learning workflows to enable efficient delivery from research to production.
Implement robust monitoring and logging systems to track model performance and identify issues in production.
Collaborate with AI researchers and data scientists to ensure infrastructure aligns with project requirements and supports iterative experimentation.
Manage compute resources (cloud and on-premises) to enable large-scale distributed training and inference.
Containerize ML models and applications using Docker and deploy via Kubernetes or equivalent orchestration systems.
Automate deployment workflows for serving ML models using frameworks such as TorchServe, TensorFlow Serving, and FastAPI.
Implement model versioning, rollback strategies, and governance to maintain production stability.
Optimize cost efficiency and performance of ML workflows in cloud environments (AWS, GCP, Azure).
Stay current with emerging ML Ops tools and practices and integrate them to improve performance and reliability.

What We’re Looking For

Bachelor’s degree in Computer Science, Engineering, or a related field; Master’s degree preferred.
5+ years in DevOps, Site Reliability Engineering, or related roles, with at least 2 years focused on ML Ops.
Expertise in building and maintaining CI/CD pipelines for ML applications.
Strong proficiency with Docker and Kubernetes.
Experience deploying ML models with TensorFlow Serving, TorchServe, or custom APIs.
Deep understanding of cloud infrastructure for ML workloads (AWS, GCP, Azure), including GPUs/TPUs.
Experience managing large-scale distributed training and optimizing resource allocation.
Familiarity with ML tracking/versioning tools (MLflow, DVC, Weights & Biases, or similar).
Knowledge of security best practices for ML systems and sensitive data handling.
Strong scripting/programming skills in Python, Bash, or Go.

Preferred Qualifications

Experience with data orchestration tools for ML workflows.
Hands-on experience with automated hyperparameter tuning and optimization frameworks.
Familiarity with model monitoring tools for drift and data quality checks.
Experience integrating pre-trained foundational models at scale.
Contributions to open-source ML Ops projects or relevant research publications.

Compensation

The hiring range for this position in San Francisco, CA is $152,100 to $203,900 per year. The base pay offered will reflect internal equity and may vary by geographic region, knowledge, skills, and experience. A bonus and/or long-term incentive units may be provided, in addition to a full range of benefits.

Disability Accommodation for Employment Applications

The Walt Disney Company and its Affiliated Companies are Equal Employment Opportunity employers and welcome all job seekers including individuals with disabilities and veterans with disabilities. If you need a reasonable accommodation to search for or apply to a job opening, visit the Disney candidate disability accommodations FAQs. We will respond to requests related to the accessibility of the online application system due to a disability.
