Senior Site Reliability Engineer — GPU Infrastructure
We are Genmo, a research lab dedicated to building open, state-of-the-art models for video generation towards unlocking the right brain of AGI. Join us in shaping the future of AI and pushing the boundaries of what's possible in video generation.
What You’ll Do
- Own the design and day‑to‑day operation of GPU clusters that train and serve frontier generative models.
- Lead production Kubernetes operations: GPU scheduling, cluster upgrades, multi‑cluster federation.
- Define and implement Infrastructure‑as‑Code (Terraform, Helm, Ansible) and GitOps workflows with Argo CD or Flux.
- Build CI/CD pipelines, automated testing, and rollout strategies for infra changes.
- Develop an observability stack (Prometheus, Grafana, OpenTelemetry, eBPF) plus GPU telemetry with NVIDIA DCGM.
- Optimize high‑performance networking (InfiniBand/RDMA) and debug performance bottlenecks.
- Run and continuously improve the 24×7 on‑call rotation; lead post‑incident reviews.
- Partner with researchers and engineers, communicate crisply, and ship with a high‑ownership mindset.
Qualifications
- BS/MS/PhD in CS, EE, or related field.
- 3+ years of SRE/DevOps experience in production; 2+ years managing large Kubernetes fleets.
- Expert‑level Kubernetes experience.
- Proficient in Python, Bash, and IaC tools (Terraform, Helm, Ansible).
- Track record of shipping and operating large‑scale infrastructure with high reliability and clear communication.
- Multi‑cluster / multi‑cloud (AWS, GCP, Azure, bare‑metal) production experience.
- Hands‑on with containerized GPU stacks (nvidia‑container‑toolkit, GPU Operator).
- GPU schedulers such as Slurm or Kueue.
- Familiarity with CI/CD tooling (GitHub Actions, BuildKit).
- Prior work with distributed training, model‑serving patterns, or other ML/GPU workloads.
Genmo is an Equal Opportunity Employer. Candidates are evaluated without regard to age, race, color, religion, sex, disability, national origin, sexual orientation, veteran status, or any other characteristic protected by federal or state law. Genmo, Inc. is an E-Verify company and you may review the Notice of E-Verify Participation and the Right to Work posters in English and Spanish.
Seniority level
Not Applicable
Employment type
Full-time
Job function
Engineering and Information Technology
Industries
Software Development