ZipRecruiter
Director of Site Reliability Engineering
ZipRecruiter, Palo Alto, California, United States, 94306
Overview
Job Description
Founded in 2017, Obsidian Security was created to close a critical gap: securing the SaaS applications where modern business happens, on platforms like Microsoft 365, Salesforce, and hundreds more. Backed by top investors including Greylock, Norwest Venture Partners, and IVP, we've built a complete SaaS security platform to reduce risk, detect and respond to threats, and prevent breaches at the source. Our team includes leaders who helped define the categories of endpoint and security at CrowdStrike, Okta, Cylance, and Carbon Black. Now, we're transforming how SaaS is secured in the era of agentic AI.
Today, Obsidian is trusted by global enterprises like Snowflake, T-Mobile, and Pure Storage. We protect more than 200 organizations across North America, Europe, the Middle East, Southeast Asia, Australia, and New Zealand, including many of the world's largest Fortune 1000 and Global 2000 companies. With strong global momentum, a growing partner ecosystem including SentinelOne, Databricks, and Google Cloud, and a major fundraise on the horizon, we're scaling quickly toward long-term growth and IPO readiness. Join us as we define the future of SaaS security!
About the Role
We are looking for a Director of Site Reliability Engineering (SRE) to lead the evolution of our production reliability function. This role will partner closely with the Director of DevOps, who owns developer enablement, platform scalability and reliability, infrastructure security and compliance, and automation. You will be responsible for operationalizing reliability at the service/module level, embedding SREs into core product teams, and ensuring that reliability is built into every layer of our product as we scale.
You will collaborate with Engineering, Security, and Support to define and enforce service-level objectives (SLOs), lead the incident management program, and help establish a culture of accountability and operational excellence.
Responsibilities
Reliability Ownership
Define and drive the adoption of service-level indicators (SLIs) and objectives (SLOs) for all core services and modules in partnership with product and engineering leads
Establish error budget policies and guide trade-off decisions between velocity and reliability
Team Structure & Collaboration
Embed SREs directly into core engineering teams to ensure tight alignment on architecture, operations, and reliability goals
Coach and mentor embedded SREs and partner engineers on operational readiness and best practices
Operational Readiness
Lead the buildout of reliability playbooks, service verification strategies, and production readiness checklists
Own incident response process: drive blameless postmortems, escalation practices, and reliability-focused retrospectives
Cross-Functional Leadership
Serve as a thought partner to the Director of DevOps — jointly ensure smooth CI/CD, release engineering, and platform reliability practices
Work with Support, Security, and Customer Success to ensure SLAs are met and reliability concerns are addressed proactively
Observability & Insights
Oversee system-wide monitoring, alerting, and observability infrastructure, enabling teams to make data-driven reliability decisions
Maintain visibility into key production metrics and advocate for improvements to reduce toil, latency, and risk
What We're Looking For
8+ years of experience in SRE, DevOps, or Infrastructure roles, with 3+ years in technical leadership or director-level roles
Deep experience with SaaS production systems running in cloud environments (AWS, GCP)
Proven success defining and managing SLOs/SLIs, embedding SREs in product teams, and driving incident management programs
Strong technical background in observability, distributed systems, and reliability engineering principles
Working knowledge of Kubernetes, Helm, Prometheus, Grafana, and CI/CD systems (GitLab CI/CD, ArgoCD)
Programming/scripting proficiency in Go, Python, or equivalent
Collaborative leadership style with the ability to influence across Engineering, Product, and GTM organizations
Employee Benefits
Our competitive benefits packages are designed to support our employees' well-being, both at work and at home. Our US-based employees enjoy:
Competitive compensation with equity and 401k
Comprehensive healthcare with dental and vision coverage
Flexible paid time off and paid holiday time off
12 weeks of new parent or family leave
Personal and professional development resources
For more details on our US benefits, or for information on our international benefits, please see here.
Pay Transparency
Please note that the base pay range is a guideline. For candidates who receive an offer, the base pay will vary based on factors such as work location, as well as the knowledge, skills, and experience of the candidate. In addition to a competitive base salary, this position is eligible for equity awards and may be eligible for sales commission or incentive compensation based on the role or function within the company.
At Obsidian, we are proud to be an equal-opportunity employer. We value and hire for talent, passion, and compassion. In compliance with federal law, all persons hired will be required to submit satisfactory proof of and legal authorization. If you have a need that requires accommodation, please contact accommodations@obsidiansecurity.com
Information collected and processed as part of any job applications you choose to submit is subject to Obsidian's Applicant Privacy Policy.
Base Salary Range: $235,000 to $279,000 USD