Mirantis
About Mirantis
Mirantis is a Kubernetes-native AI infrastructure company that enables organizations to build and operate scalable, secure, and sovereign infrastructure for modern AI, machine learning, and data‑intensive applications. By combining open‑source innovation with deep expertise in Kubernetes orchestration, Mirantis empowers platform engineering teams to deliver composable, production‑ready developer platforms across any environment: on‑premises, in the cloud, at the edge, or in sovereign data centres. As enterprises navigate the growing complexity of AI‑driven workloads, Mirantis delivers the automation, GPU orchestration, and policy‑driven control needed to manage infrastructure with confidence and agility. Mirantis is committed to open standards and freedom from lock‑in, ensuring that customers retain full control of their infrastructure strategy.
Position Overview
The Site Reliability Engineer (SRE) ensures the reliability, scalability, and performance of complex distributed systems deployed across private, public, and hybrid cloud environments. This hands‑on technical leadership role combines deep infrastructure knowledge with software engineering expertise to build systems that are automated, observable, and operationally sustainable. The SRE works within a DevIntegration team, integrating multiple layers of the product stack to enable automated, Kubernetes‑based GPU workload provisioning using the Cluster API framework. The role contributes strategically to architectural direction and tactically to implementation, troubleshooting, and mentorship. It shapes the resilience of mission‑critical systems deployed at customer premises and influences the long‑term trust and satisfaction of enterprise clients.
Key Responsibilities
Design, deploy, and maintain highly available, fault‑tolerant systems running on Kubernetes and bare‑metal infrastructure.
Define and manage Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets to balance innovation velocity with operational stability.
Lead system reliability initiatives, ensuring uptime and performance targets are consistently achieved.
Work within the DevIntegration team to integrate diverse components of the product stack, enabling end‑to‑end cluster provisioning and management.
Build automation pipelines using Infrastructure as Code (IaC) and CI/CD frameworks to ensure consistent, repeatable deployments.
Develop scripts, frameworks, and tools to eliminate manual interventions and improve system resilience.
Participate in and lead architectural discussions, from high‑level design to low‑level implementation, to ensure alignment with reliability, security, and scalability goals.
Collaborate with development and product teams to address functional gaps and propose sustainable technical solutions in a fast‑paced environment.
Ensure long‑term operational sustainability of the deployed product, including updates, incident management, and integration with third‑party enterprise systems such as PKI, IAM, and SIEM.
Conduct performance optimisation, capacity planning, and root cause analysis to maintain system health.
Champion automation of day‑2 operations, such as monitoring, scaling, patching, and recovery.
Take ownership beyond engineering scope when needed, leading planning, coordination, and execution activities with an end‑to‑end accountability mindset.
Mentor and support team members, sharing deep expertise in reliability engineering, infrastructure design, and troubleshooting best practices.
Actively contribute to defining and refining SRE standards and processes across the organization.
Required Qualifications
8+ years of hands‑on experience managing mission‑critical, high‑availability production environments.
Proven background in Site Reliability Engineering, DevOps, or Infrastructure Engineering.
Strong understanding of cloud infrastructure (AWS, GCP, Azure) and private clouds; experience with bare‑metal environments is a plus.
Proficiency in at least one general‑purpose programming language (Python or Go preferred).
Solid grasp of Infrastructure as Code principles and modern deployment methodologies (Terraform, Ansible, Helm, ArgoCD, or similar).
Expertise in containerisation and orchestration technologies (Docker, Kubernetes, Cluster API).
Demonstrated experience with scalable, distributed systems and building high‑availability architectures.
Deep understanding of Linux systems and networking fundamentals (TCP/IP, DNS, routing, firewalls); experience in a network provider or ISP environment is a plus.
Strong knowledge of modern observability stacks (Prometheus, VictoriaMetrics, ClickHouse, OpenSearch/Elasticsearch) and root‑cause analysis techniques.
Familiarity with security and compliance frameworks such as OWASP, ISO 27001, CSA, and PCI DSS.
Exceptional analytical, problem‑solving, and debugging abilities.
Proven experience working effectively in distributed teams, fostering collaboration across multiple functions.
A mindset of extreme ownership, driving continuous improvement, accountability, and operational excellence.
Preferred Qualifications
Experience in GPU‑based workload orchestration and performance optimisation.
Familiarity with chaos engineering and proactive reliability testing.
Experience contributing to or leading incident response frameworks and on‑call rotations.
Exposure to edge computing, AI/ML infrastructure, or data‑intensive systems.
Prior experience mentoring engineers and influencing technical direction across teams.
Key Competencies
Systemic thinking and architectural foresight.
Proactive automation and continuous improvement mindset.
Advanced troubleshooting and observability orientation.
Ability to operate effectively across multiple technical domains.
Leadership presence with strong communication and mentoring skills.
Additional Information
Work with an established Silicon Valley leader in the cloud infrastructure industry.
Collaborate with passionate, talented, and engaging colleagues to help Fortune 500 and Global 2000 customers implement next‑generation cloud technologies.
Be part of cutting‑edge, open‑source innovation.
Thrive in the high‑energy environment of a young company where openness, collaboration, risk‑taking, and continuous growth are valued.
Professional development and training opportunities, including conferences and working groups.
Customised workstation (macOS, Windows).
A competitive compensation package with a strong benefits plan and stock options.
Seniority level
Mid‑Senior level
Employment type
Full‑time
Job function
Software Development
By submitting your resume, you consent to the processing and storage of your personal data in accordance with applicable data protection laws, for the purposes of considering your application for current and future job opportunities. It is understood that Mirantis, Inc. may use automated decision‑making technology (ADMT) for specific employment‑related decisions. Opting out of ADMT use is requested for decisions about evaluation and review connected with the specific employment decision for the position applied for. You also have the right to appeal any decisions made by ADMT by sending your request to isamoylova@mirantis.com.