Andiamo
Senior Site Reliability Engineer / HPC - Pre-IPO Tech Leader
Andiamo, San Francisco, California, United States, 94199
About The Role
We are seeking a highly skilled Senior Site Reliability Engineer (SRE) / High-Performance Computing (HPC) Engineer to design, build, and operate the large-scale infrastructure that powers a $2.5B pre-IPO technology company. Our systems run on massive distributed clusters, handling some of the most demanding workloads in cloud, AI, and data-driven computing. In this role, you will be responsible for ensuring the reliability, scalability, and performance of mission-critical platforms. You will optimize HPC workloads, streamline CI/CD for large-scale clusters, and enable research and product teams to deliver innovations with speed and confidence. This is a hands-on position with the opportunity to influence architecture, lead reliability initiatives, and solve some of the hardest problems in distributed systems and performance engineering.
What You’ll Do
Design Reliable Infrastructure: Architect and maintain large-scale, distributed HPC and cloud-native systems with a focus on uptime, scalability, and resilience.
Optimize HPC Workloads: Tune scheduling, job orchestration, and performance for compute- and memory-intensive workloads (AI/ML, simulations, large-scale analytics).
Build Observability: Implement monitoring, logging, and alerting systems that provide full visibility into cluster and service health.
Automate Everything: Develop tooling and automation for provisioning, scaling, and recovery of critical systems.
Ensure Security & Compliance: Implement best practices for access control, encryption, and governance across HPC and cloud environments.
Collaborate Cross-Functionally: Work with engineering, research, and product teams to deliver reliable infrastructure for next-gen applications.
Incident Response: Lead troubleshooting, root cause analysis, and postmortems for high-severity incidents.
What We’re Looking For
Professional Experience: 7+ years in SRE, infrastructure engineering, or HPC roles with a proven track record of supporting large-scale distributed systems.
Technical Skills: Expertise in Linux systems, Python or Go, and infrastructure-as-code (Terraform, Ansible, or similar).
HPC Expertise: Strong knowledge of job schedulers (Slurm, Kubernetes, or Mesos), workload managers, and parallel/distributed computing.
Cloud & Hybrid: Hands-on experience with AWS, GCP, or Azure in combination with on-premises HPC clusters.
Observability: Proficiency with monitoring and logging frameworks (Prometheus, Grafana, ELK, OpenTelemetry).
Resilience Engineering: Experience with chaos engineering, failure testing, and disaster recovery planning.
Collaboration: Strong communication skills and the ability to work with research scientists, engineers, and operations teams.
Education: Bachelor’s or Master’s degree in Computer Science, Engineering, or related field.
Why Join
This is an opportunity to join a pre-IPO technology leader valued at $2.5B, at a time of rapid growth and innovation. As a Senior SRE / HPC Engineer, you will shape the infrastructure that powers next-generation AI, analytics, and large-scale computing. You’ll solve some of the most complex reliability and performance challenges, collaborate with world-class teams, and play a key role in preparing the company for IPO and beyond.