Optomi
Sr. Director, Site Reliability and Platform Engineering
Optomi, Tacoma, Washington, US, 98417
Optomi, in partnership with our premier client in the Technology industry, is seeking a Senior Director to join our client’s SRE and DevOps team in Tacoma, WA, reporting to the Vice President, Engineering. In this pivotal role, you will foster a culture of product reliability across all of Engineering, drive and support the SRE team in conducting risk analyses, and work with Engineering leadership to ensure operational excellence of cloud-scale, high-availability systems. Collaborating closely with Product Management, Engineering, IT, and Product Security Engineering, you will manage the SRE and DevOps teams, using your abilities to incorporate roadmap objectives. This is an essential position in the Engineering organization with executive-level visibility, driving change with other senior leaders to achieve departmental and corporate goals. You are the ideal candidate if you are a visionary who lives and breathes reliability at scale.
Be a Contributor - What You'll Do
Lead and mentor a team of reliability and platform engineers, championing a culture of reliability, scalability, and continuous improvement across all customer products, both on-prem and SaaS
Establish a charter for best-in-class site reliability engineering, and drive Engineering teams toward achieving these best practices
Institute a set of tools and processes that ensure monitoring, observability, capacity planning, disaster recovery, and incident management systems can support 99.999% availability for critical services
Manage large-scale infrastructure and applications across multiple cloud providers using a mix of native cloud, open-source, and commercial off-the-shelf tools
Work with stakeholders, including Engineering, IT, Product Management, and Customer Support, to define and ensure customer-driven SLIs/SLOs exist for both new and existing functionality
Communicate progress by highlighting the accomplishments, risks, mitigation, and other pertinent key performance indicators that feed into our overarching business strategy
Facilitate continuous training programs for Engineering that reduce risk, including completion of annual reliability training for Engineering staff
Drive product reliability, operational, and efficiency metrics with automation, allowing management to understand the maturity and risk levels in various product areas
Be Prepared - What You Bring
15+ years of experience in SRE, platform engineering, or related roles with at least 5 years of this time in a director-level role
10+ years of experience with cloud infrastructure, such as AWS, GCP, and Azure, and DevOps practices
Proven experience managing large-scale, high-availability systems with an emphasis on containers and Kubernetes environments
Experience with CI/CD pipelines, monitoring tools, and incident management processes
Experience with automation and scripting in languages such as Python and Go, and with monitoring and observability tools such as Prometheus and Grafana
Experience maintaining SOC2, FedRAMP, or ISO 27001 certifications
Experience working within a global team structure
Excellent leadership, communication, and interpersonal skills
Solid business analysis or financial modeling skills for evaluating projects, and a good understanding of product and software development principles and industry frameworks
Excellent problem-solving and analysis skills combined with impeccable business judgment