ExecutivePlacements.com

Site Reliability Engineer (Sonoma)

ExecutivePlacements.com, Sonoma, California, United States, 95476

Senior Platform Engineer / Site Reliability Engineer (AI Infrastructure)

Join a stealth‑mode startup building out its AI and cloud platform, powered by thousands of H100s, H200s, and B200s and ready for experimentation, full‑scale model training, or inference. As a Senior Platform Engineer / Site Reliability Engineer, you will own the reliability, performance, and automation of this GPU‑powered infrastructure, ensuring seamless orchestration across environments managed by Slurm, Kubernetes, or direct SSH access, and supporting new products.

Responsibilities

Design, deploy, and maintain large‑scale GPU clusters (H100/H200/B200) for training and inference workloads.

Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.

Develop observability, alerting, and auto‑healing systems for high‑availability GPU workloads.

Collaborate with ML, networking, and platform teams to optimise resource scheduling, GPU utilisation, and data flow.

Implement infrastructure‑as‑code, CI/CD pipelines, and reliability standards across thousands of nodes.

Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.

Required Skills & Experience

Customer‑facing experience and a willingness to be a Swiss army knife.

Strong hands‑on experience with Kubernetes and Slurm for cluster orchestration and workload management.

Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).

Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.

Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.

Familiarity with high‑performance computing (HPC) or AI/ML training infrastructure at scale.

Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.
