Optomi

Sr. Director, Site Reliability and Platform Engineering

Optomi, Tacoma, Washington, US, 98417

Optomi, in partnership with our premier client in the Technology industry, is seeking a Senior Director to join our client’s SRE and DevOps team in Tacoma, WA, reporting to the Vice President, Engineering. In this pivotal role, you will foster a culture of product reliability across all of Engineering, drive and support the SRE team in conducting risk analyses, and work with Engineering leadership to ensure operational excellence of cloud-scale, high-availability systems. Collaborating closely with Product Management, Engineering, IT, and Product Security Engineering, you will manage the SRE and DevOps teams and incorporate roadmap objectives into their work. This is an essential position in the Engineering organization with executive-level visibility, driving change with other senior leaders to achieve departmental and corporate goals. You are the ideal candidate if you are a visionary who lives and breathes reliability at scale.

Be a Contributor - What You'll Do

Lead and mentor a team of reliability and platform engineers, championing a culture of reliability, scalability, and continuous improvement across all customer products, both on-prem and SaaS

Establish a charter for best-in-class site reliability engineering, and drive Engineering teams toward achieving these best practices

Institute a set of tools and processes that ensure monitoring, observability, capacity planning, disaster recovery, and incident management systems can support 99.999% availability for critical services

Manage large-scale infrastructure and applications across multiple cloud providers using a mix of native cloud, open-source, and commercial off-the-shelf tools

Work with stakeholders, including Engineering, IT, Product Management, and Customer Support, to define and ensure customer-driven SLIs/SLOs exist for both new and existing functionality

Communicate progress by highlighting the accomplishments, risks, mitigation, and other pertinent key performance indicators that feed into our overarching business strategy

Facilitate continuous training programs for Engineering that reduce risk, including completion of annual reliability training for Engineering staff

Drive product reliability, operational, and efficiency metrics with automation, allowing management to understand the maturity and risk levels in various product areas

Be Prepared - What You Bring

15+ years of experience in SRE, platform engineering, or related roles with at least 5 years of this time in a director-level role

10+ years of experience with cloud infrastructure, such as AWS, GCP, and Azure, and DevOps practices

Proven experience managing large-scale, high-availability systems with an emphasis on containers and Kubernetes environments

Experience with CI/CD pipelines, monitoring tools, and incident management processes

Experience with automation and scripting in languages such as Python and Go, and with monitoring and observability tools such as Prometheus and Grafana

Experience maintaining SOC 2, FedRAMP, or ISO 27001 certifications

Experience working within a global team structure

Excellent leadership, communication, and interpersonal skills

Solid business analysis or financial modeling skills for evaluating projects, and a good understanding of product and software development principles and industry frameworks

Excellent problem-solving and analysis skills combined with impeccable business judgment
