Virtue AI
Overview
ML Infrastructure Engineer at Virtue AI
About the Company: Virtue AI is at the forefront of AI security. As enterprises increasingly adopt Large Language Models, the need for robust, trustworthy, and safe AI has never been greater. Our mission is to build the essential guardrails and red-teaming tools that enable organizations to deploy multi-modal AI applications confidently and responsibly. We are a well-funded, early-stage startup founded by industry veterans, and we're looking for passionate builders to join our core team.
About the Role: As a foundational ML Infrastructure Engineer at Virtue AI, you will be the architect of the engine that drives our research and product development. Reporting directly to the Head of Engineering, you will have the autonomy and resources to build a world-class MLOps platform from the ground up. Your work will directly support our growing team of Machine Learning Engineers and Scientists, enabling them to train, tune, and benchmark the sophisticated models at the core of our security products. You will also play a critical role in architecting not just the internal platform, but also how our final product is delivered, deployed, and integrated into customer environments.
Responsibilities
Architect our core ML platform with a cloud-agnostic approach, spanning managed Kubernetes services and managed training and fine-tuning runs on platforms such as Vertex AI and SageMaker.
Implement and manage scalable distributed computing workflows that empower our ML engineers and scientists to train and tune models across hundreds of GPUs.
Build and own the observability stack for our entire platform, implementing comprehensive monitoring, logging, and alerting to ensure system health and performance.
Develop and automate CI/CD pipelines for the full machine learning lifecycle—from data processing and model training to packaging and artifact versioning.
Engineer our product for deep interoperability, ensuring our platform can seamlessly integrate with customers' existing enterprise stack, including their monitoring systems (Prometheus, Grafana), data stores, identity providers, and CI/CD pipelines.
Develop a sophisticated, cost-effective compute strategy, leveraging a mix of on-demand and spot (short-lived) VMs for GPU-intensive workloads.
Collaborate with Forward Deployed Engineers to create standardized, reliable software and model packages that are abstracted from the underlying infrastructure.
Design and enforce a secure cloud environment, implementing best practices for networking, IAM, and data security from the ground up.
Automate robust AI/ML model benchmarking and test suites, helping to build and maintain a QA infrastructure for model performance, accuracy, and reliability.
Collaborate with engineering and leadership to implement infrastructure, security controls, and processes required to achieve and maintain ISO 27001 and/or SOC 2.
Uphold operational excellence and site reliability, lead incident response, conduct blameless post-mortems, and drive improvements to system resiliency.
Qualifications
Proven experience building and managing infrastructure on at least one major public cloud (AWS, GCP, or Azure).
Hands-on experience with Kubernetes in a production environment, including Helm charts and ArgoCD.
A strong, principles-based understanding of Infrastructure as Code (IaC) and CI/CD best practices, including tools such as Terraform and GitHub Actions.
Hands-on experience with modern observability tools (Prometheus, Grafana, Datadog).
Knowledge of secure cloud architecture principles, including network security, IAM, and secrets management.
Proficiency in a high-level programming language (Python) and shell scripting.
A builder's mindset with a passion for creating tools and platforms that make other engineers more productive.
Enterprise customer-facing experience is a bonus.
Required Skills
Architecting a distributed training platform (e.g., deploying Ray on Kubernetes with autoscaling and fault-tolerance).
Diving deep into model optimization (e.g., automating compilation and quantization with TensorRT for low latency and high throughput).
Building a world-class monitoring setup (Prometheus and Grafana for GPU utilization, training performance, and cluster health).
Designing a deployment kit (standardized Helm charts and Terraform modules for repeatable deployments).
Connecting the MLOps components (model registry integration into CI/CD to trigger benchmarking and security scans).
Designing for diverse enterprise stacks (integration with customers' databases and dashboards).
Solving scheduling puzzles (cost-effective orchestration across hybrid on-demand and Spot GPU clusters).
Making the deployment process GitOps-native (Kubernetes manifests managed by customer CI/CD and ArgoCD).
Seniority level Mid-Senior level
Employment type Full-time
Job function
Engineering and Information Technology
Location: San Francisco, CA