Largeton Group

HPC/AI Systems Architect

Largeton Group, Milpitas, California, United States, 95035


Overview

Lead the design and prototyping of a scalable, automated AI/HPC cluster for multi-division use.

Responsibilities

Architect and deploy cluster infrastructure using Kubernetes, SaltStack, and Ansible.
Automate the end-to-end cluster lifecycle, from hardware setup and OS imaging to application deployment and monitoring.
Standardize node provisioning with SUSE-based golden images and SaltStack role-based configuration.
Implement and manage container orchestration (Kubernetes via Kubespray) with AppTainer/Docker support.
Design persistent storage using CSI for efficient data management.
Configure advanced networking (multi-VLAN, IPMI, storage, RDMA).
Integrate observability and monitoring tools (Grafana, Prometheus, CheckMK) for system health.
Develop strategies for identity and access management, including Active Directory integration.
Support deployment of ML stacks (e.g., ClearML) and GPU-aware scheduling.
Define and document automation processes, image boundaries, and integration points with legacy systems.
Collaborate with infrastructure and engineering teams to ensure clear role delineation and smooth project handoffs.
Ensure standardization to minimize site-specific customization and accelerate cluster deployments.
Maintain excellent documentation and communication throughout the project.
Maintain a security focus, with attention to image scanning and compliance processes.

Qualifications

Experience designing and prototyping scalable AI/HPC clusters.
Strong background in Kubernetes, SaltStack, Ansible, and container orchestration (Kubespray).
Experience with SUSE-based provisioning, image management, and CI/CD for clusters.
Knowledge of CSI-based storage, multi-VLAN networking, IPMI, RDMA.
Experience with observability tools (Grafana, Prometheus, CheckMK).
Familiarity with identity management (Active Directory) and security/compliance practices.
Experience deploying ML stacks and GPU-aware scheduling.
Excellent documentation and cross-functional collaboration skills.

Location

Hybrid (Milpitas, CA; onsite for the first few weeks, then remote)

Employment type

Full-time
