Super Micro Computer Spain, S.L.
Sr. Reliability Engineer
Super Micro Computer Spain, S.L., San Jose, California, United States, 95199
Overview
Sr. Reliability Engineer role at Super Micro Computer Spain, S.L.
Location: San Jose, California, United States
Company: Super Micro Computer
Job Req ID: 26861
Posting date: Oct 3, 2025
Responsibilities
Cloud Infra Automation: Design and provision cloud infrastructure using Infrastructure as Code (Terraform, Ansible, or Helm) on bare metal or cloud platforms. Develop custom automation and tooling in Python or Go to extend deployment workflows and streamline operations.
Platform Reliability: Deploy, scale, maintain, and optimize uptime for AI cloud services including GPU clusters, Kubernetes (K8s), and storage systems (e.g., Ceph, BeeGFS, or Weka). Ensure consistent application performance.
Monitoring & Alerting: Implement observability tools (e.g., Prometheus, Grafana, ELK, Loki, Fluentd) to monitor system health and alert on anomalies or performance degradation.
Capacity Planning: Analyze usage trends and forecast infrastructure needs to support AI workloads and large-scale model training/inference.
Incident Management: Lead root cause analysis and resolution for system outages or degraded performance. Define and maintain service level objectives (SLOs), indicators (SLIs), and agreements (SLAs) aligned with uptime and performance goals.
CI/CD Integration: Collaborate with DevOps and MLOps teams to ensure reliable delivery pipelines using GitLab CI/CD, ArgoCD, or similar tools.
Security & Compliance: Harden Linux systems, manage TLS certificates, and enforce secure access controls via RBAC, LDAP-integrated SSO, TLS, and network segmentation policies.
Documentation & Playbooks: Maintain clear, version-controlled documentation, including architecture diagrams, runbooks, and incident response playbooks to support cross-team knowledge transfer and rapid onboarding.
Qualifications
Minimum qualifications
Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent experience, with 8 years of experience in the areas below
Proficiency in Linux (Ubuntu, RHEL/CentOS), containers (Docker, Podman), and orchestration (Kubernetes)
Experience managing GPU compute clusters (NVIDIA / CUDA, AMD / ROCm)
Hands-on experience with observability tools (Prometheus, Grafana, Loki, ELK, etc.)
Strong scripting and coding skills (Bash, Python, or Go)
Exposure to secure multi-tenant environments and zero trust architectures
Familiarity with network protocols, DNS, DHCP, BGP, RoCEv2, and InfiniBand or high-throughput Ethernet fabrics
Excellent collaboration and communication skills for cross-team, partner, and customer initiatives
Preferred qualifications
Understanding of AI/ML reference architectures and experience with workflows, MLflow, or Kubeflow
Familiarity with storage backends optimized for AI (CephFS, BeeGFS, WekaFS)
Prior experience in bare-metal provisioning via PXE, Ironic, or Foreman
Understanding of NVIDIA GPU telemetry and NCCL testing for performance benchmarking
Familiarity with ITIL processes or structured change management in production systems
Certifications: CKA, CKAD, Linux+, or related credentials
Salary
$145,000 - $165,000
The salary offered will depend on several factors, including location, level, education, training, specific skills, years of experience, and comparison to other employees. A comprehensive benefits package and potential participation in bonus and equity programs may apply.
EEO Statement
Supermicro is an Equal Opportunity Employer and embraces diversity in our employee population. It is the policy of Supermicro to provide equal opportunity to all qualified applicants and employees without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, age, disability, veteran status, marital status, pregnancy, genetic information, or any other legally protected status.