Qualcomm

MLOps Engineer - ML Platform

Qualcomm, San Diego, California, United States, 92189

Overview

We are seeking a highly skilled and experienced Staff MLOps Engineer to join our team and contribute to the development and maintenance of our ML platform, both on premises and in AWS Cloud. As a Staff MLOps Engineer, you will architect, deploy, and optimize the ML platform that supports training of machine learning models using NVIDIA DGX clusters and Kubernetes, including technologies like Helm, ArgoCD, Argo Workflows, Prometheus, and Grafana. Your expertise in AWS services such as EKS, EC2, VPC, IAM, S3, and EFS will be crucial for ensuring the smooth operation and scalability of our ML infrastructure. You will work closely with data scientists, software engineers, and infrastructure specialists to enable efficient training and deployment of ML models.

Responsibilities

Architect, develop, and maintain the ML platform to support training and inference of ML models.

Design and implement scalable and reliable infrastructure solutions for NVIDIA clusters, both on premises and in AWS Cloud.

Collaborate with data scientists and software engineers to define requirements and ensure seamless integration of ML and data workflows into the platform.

Optimize the platform’s performance and scalability, considering factors such as GPU resource utilization, data ingestion, model training, and deployment.

Monitor and troubleshoot system performance, identifying and resolving issues to ensure availability and reliability of the ML platform.

Implement and maintain CI/CD pipelines for automated model training, evaluation, and deployment using technologies like ArgoCD and Argo Workflows.

Implement and maintain a monitoring stack using Prometheus and Grafana to ensure the health and performance of the platform.

Manage AWS services including EKS, EC2, VPC, IAM, S3, and EFS to support the platform.

Implement logging and monitoring solutions using AWS CloudWatch and other relevant tools.

Stay updated with the latest advancements in MLOps, distributed computing, and GPU acceleration technologies, and proactively propose improvements to enhance the ML platform.

Qualifications

Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.

Proven experience as an MLOps Engineer or in a similar role, with a focus on large‑scale ML and data infrastructure and GPU clusters.

Strong expertise in configuring and optimizing NVIDIA DGX clusters for deep learning workloads.

Proficiency with the Kubernetes platform, including technologies like Helm, ArgoCD, Argo Workflows, Prometheus, and Grafana.

Solid programming skills in languages like Python and Go, and experience with relevant ML frameworks (e.g., TensorFlow, PyTorch).

In‑depth understanding of distributed computing, parallel computing, and GPU acceleration techniques.

Familiarity with containerization technologies such as Docker and orchestration tools.

Experience with CI/CD pipelines and automation tools for ML workflows (e.g., Jenkins, GitHub, ArgoCD).

Experience with AWS services such as EKS, EC2, VPC, IAM, S3, and EFS.

Experience with AWS logging and monitoring tools.

Strong problem‑solving skills and the ability to troubleshoot complex technical issues.

Excellent communication and collaboration skills to work effectively within a cross‑functional team.

Additional Skills We Would Love To See

Experience with training and deploying models.

Knowledge of ML model optimization techniques and memory management on GPUs.

Familiarity with ML‑specific data storage and retrieval systems.

Understanding of security and compliance requirements in ML infrastructure.

Minimum Qualifications

4+ years of Software Engineering or related work experience and Bachelor’s degree in Engineering, Information Systems, Computer Science, or related field; or

3+ years of Software Engineering or related work experience and Master’s degree in Engineering, Information Systems, Computer Science, or related field; or

2+ years of Software Engineering or related work experience and PhD in Engineering, Information Systems, Computer Science, or related field.

2+ years of work experience with programming languages such as C, C++, Java, Python, etc.

Pay Range

$134,800.00 - $202,200.00

EEO Statement

Qualcomm is an equal opportunity employer; all qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, Veteran status, or any other protected classification.
