Adobe

ML Ops Engineer

Adobe, San Jose, California, United States, 95112


The Opportunity

Join Adobe as a skilled and proactive Machine Learning Ops Engineer to drive the operational reliability, scalability, and performance of our AI systems! This role is foundational in ensuring those systems operate seamlessly across environments while meeting the needs of both developers and end users. You will lead efforts to automate and optimize the full machine learning lifecycle, from data pipelines and model deployment to monitoring, governance, and incident response.

What You'll Do

Model Lifecycle Management: Manage model versioning, deployment strategies, rollback mechanisms, and A/B testing frameworks for LLM agents and RAG systems. Coordinate model registries, artifacts, and promotion workflows in collaboration with ML Engineers.
Monitoring & Observability: Implement real-time monitoring of model performance (accuracy, latency, drift, degradation). Track conversation quality metrics and user feedback loops for production agents.
CI/CD for AI: Develop automated pipelines for timely agent testing, validation, and deployment. Integrate unit/integration tests into model and workflow updates for safe rollouts (an illustrative sketch follows this list).
Infrastructure Automation: Provision and manage scalable infrastructure (Kubernetes, Terraform, serverless stacks). Enable auto-scaling, resource optimization, and load balancing for AI workloads.
Data Pipeline Management: Craft and maintain data ingestion pipelines for both structured and unstructured sources. Ensure reliable feature extraction, transformation, and data validation workflows.
Performance Optimization: Monitor and optimize AI stack performance (model latency, API efficiency, GPU/compute utilization). Drive cost-aware engineering across inference, retrieval, and orchestration layers.
Incident Response & Reliability: Build alerting and triage systems to identify and resolve production issues. Maintain SLAs and develop rollback/recovery strategies for AI services.
Compliance & Governance: Enforce model governance, audit trails, and explainability standards. Support documentation and regulatory frameworks (e.g., GDPR, SOC 2, internal policy alignment).
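
By way of illustration only (not part of Adobe's stack or this role's actual tooling): a minimal Python sketch of the kind of promotion gate a CI/CD pipeline for AI might run, checking a candidate model's evaluation metrics against assumed accuracy, latency, and drift thresholds before a rollout is allowed. All file names, metric keys, and threshold values below are hypothetical.

"""Illustrative CI promotion gate: block model rollout if eval metrics
fail assumed thresholds. Paths, keys, and values are hypothetical."""
import json
import sys
from pathlib import Path

# Illustrative thresholds; real values would come from the team's SLAs.
THRESHOLDS = {
    "accuracy_min": 0.92,       # minimum offline evaluation accuracy
    "p95_latency_ms_max": 250,  # maximum acceptable p95 inference latency
    "drift_score_max": 0.15,    # maximum allowed feature-drift score
}

def gate(metrics_path: str) -> int:
    """Return 0 (pass) or 1 (fail) so a CI job can block unsafe rollouts."""
    metrics = json.loads(Path(metrics_path).read_text())
    failures = []
    if metrics["accuracy"] < THRESHOLDS["accuracy_min"]:
        failures.append(f"accuracy {metrics['accuracy']:.3f} below minimum")
    if metrics["p95_latency_ms"] > THRESHOLDS["p95_latency_ms_max"]:
        failures.append(f"p95 latency {metrics['p95_latency_ms']} ms too high")
    if metrics["drift_score"] > THRESHOLDS["drift_score_max"]:
        failures.append(f"drift score {metrics['drift_score']:.2f} too high")
    for failure in failures:
        print(f"PROMOTION BLOCKED: {failure}")
    if not failures:
        print("All checks passed; model is eligible for promotion.")
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(gate(sys.argv[1]))

In practice a check like this would run as a step in a pipeline tool such as GitHub Actions, Jenkins, or Argo Workflows, where a non-zero exit code fails the job and prevents promotion.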

What You Need To Succeed

3-5+ years in MLOps, DevOps, or ML platform engineering.
Strong experience with cloud infrastructure (AWS/GCP/Azure), container orchestration (Kubernetes), and IaC tools (Terraform, Helm).
Familiarity with ML model serving tools (e.g., MLflow, Seldon, TorchServe, BentoML).
Proficiency in Python and CI/CD automation (e.g., GitHub Actions, Jenkins, Argo Workflows).
Experience with monitoring tools (Prometheus, Grafana, Datadog, ELK, Arize AI, etc.).

Preferred Qualifications

Experience supporting LLM applications, RAG pipelines, or AI agent orchestration.
Understanding of vector databases, embedding workflows, and model retraining triggers.
Exposure to privacy, safety, and responsible AI principles in operational contexts.
Bachelor's degree or equivalent experience in Computer Science, Engineering, or a related technical field.