Senior Site Reliability Engineer - Managed Kubernetes

Lambda, Seattle

Join Lambda’s mission to make compute as ubiquitous as electricity. The Senior Site Reliability Engineer role focuses on operating production Kubernetes clusters for AI/ML workloads.

What You’ll Do

  • Operate and maintain bare‑metal Kubernetes clusters scaling to thousands of nodes.
  • Handle cluster degradation, recovery, resizing, and incident response using fleet management tools.
  • Participate in a well‑managed on‑call rotation for critical incidents.
  • Assist customers with Kubernetes questions, workload integration, storage, and authentication.
  • Work closely with HPC Ops and Datacenter Ops teams for low‑level or cross‑functional issues.
  • Use Python and Go to create tooling and automate validation of platform quality.
  • Design, build, and maintain scalable control plane services, operators, and custom controllers for Kubernetes.
  • Develop automation for cluster lifecycle management: provisioning, upgrades, patching, and deletion.
  • Define and implement SLOs and SLIs for Kubernetes services, workloads, and platform reliability.

About You – Must‑Have

  • 6+ years of experience in SRE, operations engineering, or similar roles with deep Linux cluster knowledge.
  • Strong programming skills in Go and Python; experience with GitOps (ArgoCD), Helm, and Kubernetes operators.
  • Proven experience operating Kubernetes clusters in production environments (on‑prem, EKS, GKE, or similar).
  • Ability to work independently or as part of a team, handling incidents via tickets or live messaging.
  • Familiarity with observability tools such as Prometheus, Grafana, and Fluent Bit, as well as with CI/CD pipelines.
  • Proven experience provisioning Kubernetes using tools such as kubeadm, Cluster API, or similar.

Nice‑to‑Have

  • Deep Kubernetes expertise: CRDs, CSI, CNI, and operator coding.
  • Experience with HPC clusters, AI/ML workloads, or large‑scale GPU clusters.
  • Hybrid or multi‑cloud Kubernetes environment experience.
  • Contributions to CNCF projects or Kubernetes SIGs.

Benefits

  • Generous cash and equity compensation.
  • Health, dental, and vision coverage.
  • 401(k) with company match.
  • Paid time off, including flexible PTO plans.
  • Wellness and commuter stipends for select roles.

Equal Opportunity Employer

Lambda is an Equal Opportunity Employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation, veteran status, citizenship, or any other characteristic protected by law.