xAI

Site Reliability Engineer - US Government

xAI, Palo Alto, California, United States, 94306

Overview

We are seeking a highly skilled Senior Infrastructure Engineer to join our US Government Team, focused on designing, building, and operating secure, scalable infrastructure for critical government projects. The role involves developing and managing training and inference clusters, as well as highly reliable applications across bare metal, classified cloud, and hybrid cloud architectures. You will leverage Kubernetes and GPU hardware to deliver robust, secure systems that support large-scale AI workloads while meeting stringent federal compliance requirements. This role emphasizes automation, observability, and maintaining system integrity in a fast-paced, high-security environment.

Responsibilities

Develop and optimize software to provision and manage infrastructure across on-premises, virtual machine, and classified cloud environments to enable efficient scaling for US government initiatives.

Enhance reliability, performance, and cost-effectiveness of infrastructure to support large-scale AI and application workloads in secure settings.

Collaborate with xAI engineers to understand workload requirements and design solutions that meet government-specific needs and compliance standards.

Implement robust observability, monitoring, and security practices to ensure integrity, availability, and confidentiality of critical systems in line with federal protocols.

Manage storage infrastructure using IaC tools (Pulumi, Terraform, Ansible) with a focus on secure data handling.

Drive reliability through incident management, postmortems, and defined SLAs/SLOs while maintaining security and compliance.

This is an in-person role based in Palo Alto, CA or Washington, DC, with up to 50% travel.

Required Qualifications

Active Top Secret (TS) security clearance.

5+ years of experience as an Infrastructure Engineer, Site Reliability Engineer, or similar role in secure or government environments.

Proficiency in managing storage infrastructure with IaC tools (Pulumi, Terraform, Ansible).

Deep understanding of the Kubernetes stack (CNI, CRI, CSI, and related components).

Proven ability to improve system reliability through incident management, postmortems, and SLAs/SLOs.

Excellent communication and documentation skills for handling sensitive information.

Preferred Qualifications

Experience with GPU hardware installation and reliability.

Experience optimizing Kubernetes for large-scale deployments in secure or federal settings.

Familiarity with chaos engineering, capacity planning, or similar resilience practices.

Proficiency with tools such as Kyverno, ArgoCD, or Go for infrastructure automation.

Security-focused certifications (e.g., CISSP) or experience in secure federal environments.

Interview Process

After submitting your application, our team will review your CV. If your application advances, you will be invited to a 15-minute phone interview to discuss basic qualifications. The main process includes:

Technical deep-dive on infrastructure and secure systems experience.

Hands-on challenge focused on designing or troubleshooting infrastructure for secure environments.

Meet-and-greet with the wider team.

We aim to complete the main interview process within one week.

Compensation and Benefits

Annual salary range: $180,000 - $440,000 USD.

Benefits include equity, comprehensive medical, vision, and dental coverage, a 401(k) retirement plan, short- and long-term disability insurance, life insurance, and other employee discounts and perks.

xAI is an equal opportunity employer.

Job Details

Location: Palo Alto, CA or Washington, DC (in-person)

Seniority level: Mid-Senior level

Employment type: Full-time

Job function: Engineering and Information Technology

Industries: Technology, Information and Internet
