Amiri Recruiting

Site Reliability Engineer

Amiri Recruiting, Mountain View, California, US, 94039

Onsite - Bay Area, CA

Relevant Skills and Experience

What You’ll Do (Day-to-Day):

Own and manage our infrastructure (GCP or AWS, plus on-prem).

Build, maintain, and optimize Kubernetes clusters (including GPU-backed clusters).

Implement and improve CI/CD pipelines (GitHub Actions).

Write and maintain Infrastructure as Code (Terraform).

Monitor system health and performance using Grafana and other observability tools.

Ensure high availability, reliability, and uptime across platforms.

Handle infrastructure maintenance, upgrades, and scaling.

Administer and improve our platform architecture and apply general security best practices across the stack.

Note: This is an internal-facing role — no customer interaction.

Must-Have:

4+ years in SRE, DevOps, or Infrastructure Engineering

Solid experience with GCP or AWS (hybrid/on-prem a plus)

Experience with Kubernetes cluster management (GPU experience a bonus)

Hands-on experience with Terraform and CI/CD (GitHub Actions)

Experience with monitoring/observability (Grafana, etc.)

Strong understanding of high availability and infrastructure reliability

Familiarity with platform/cluster architecture and administration

Security mindset and ability to apply best practices

Nice-to-Have:

Startup experience (you enjoy building, not just maintaining)

Experience with scalable GPU infrastructure for AI/ML
