Jobot

Director of Site Reliability Engineering

Jobot, Work From Home

Base pay: $220,000 - $260,000 per year

Location: Fully remote, United States

Reporting to: VP of Engineering

Company Overview

We are a mission-driven organization dedicated to making AI adoption safe and secure for enterprises worldwide. As the leading provider of Security for AI, our platform protects agentic, generative, and predictive AI applications across the entire lifecycle—safeguarding intellectual property, ensuring compliance, and enabling organizations to innovate with confidence.

Founded by cybersecurity and machine learning veterans, we have secured backing from strategic investors including Microsoft’s Venture Fund (M12), Moore Strategic Ventures, Booz Allen Ventures, IBM Ventures, and Capital One Ventures.

Mission & Impact

Our work strengthens the security posture of organizations protecting their AI assets—from financial institutions and healthcare providers to government and Fortune 500 enterprises.

Responsibilities

  • Build and Lead the SRE Function
    • Define and execute the SRE strategy and roadmap, positioning reliability as a core product feature.
    • Build, mentor, and scale a high-performing SRE and Platform Engineering team.
    • Establish SRE principles, culture, and best practices across engineering.
    • Create clear career development paths and raise the hiring bar.
  • Drive Platform Reliability & Operational Excellence
    • Own reliability, availability, latency, and performance across multi‑cloud, multi‑region deployments (AWS, Azure, GCP).
    • Define SLIs and set and meet SLOs aligned with business objectives.
    • Architect multi‑region resiliency, automated failover, graceful degradation, and disaster recovery.
    • Build robust observability: distributed tracing, metrics, logging, and actionable alerting.
    • Lead incident management, on‑call processes, incident command, blameless post‑mortems, and systematic remediation.
  • Enable Developer Velocity & Platform Excellence
    • Own CI/CD pipelines and deployment infrastructure for safe, fast, reliable delivery.
    • Build internal developer platforms and tooling that reduce toil and improve productivity.
    • Implement progressive delivery (canaries, feature flags, automated rollbacks).
    • Partner with engineering teams to embed reliability requirements and design patterns early in development.
  • Security, Compliance & Enterprise Requirements
    • Ensure alignment with standards such as FedRAMP, SOC 2, ISO 27001, and other regulatory requirements.
    • Build and support air‑gapped and on‑premises deployment capabilities.
    • Implement infrastructure security controls, secrets management, and audit logging.
    • Support customer‑facing SLAs and maintain trust with enterprise and government clients.
  • Scale & Optimize the Platform
    • Lead capacity planning and performance engineering for platform growth.
    • Drive chaos engineering and resilience testing to validate system behavior under failure.
    • Optimize cost while maintaining reliability and performance.
    • Automate operational workflows to eliminate toil and improve efficiency.

Qualifications

  • Leadership & Experience
    • 8+ years in infrastructure, platform engineering, or SRE roles.
    • 4+ years in engineering leadership.
    • Experience supporting mission‑critical, always‑on systems at enterprise scale.
    • Strong people leadership and a track record of building high‑performing teams.
  • Technical Expertise
    • Deep knowledge of cloud infrastructure (AWS, Azure, GCP) and multi‑region systems.
    • Strong experience with Kubernetes, Docker, and infrastructure‑as‑code (Terraform, Pulumi, CloudFormation).
    • Proven ability to build and operate large‑scale distributed systems.
    • Expertise in observability tooling (Prometheus, Grafana, Datadog, New Relic, ELK/EFK, distributed tracing).
    • Proficiency in Python, Go, or similar languages.
    • Understanding of databases, data pipelines, message queues, and caching systems.
  • Strategic & Operational Skills
    • Experience driving SRE strategy, SLOs/SLIs, error budgets, and incident management.
    • Ability to partner across engineering, product, security, and customer success.
    • Strong communication skills across technical and non‑technical audiences.
    • Pragmatic problem‑solving and sound decision‑making.
  • Bonus Experience
    • Background in cybersecurity or AI/ML infrastructure.
    • Familiarity with compliance frameworks (FedRAMP, SOC 2, ISO 27001, NIST).
    • Experience supporting air‑gapped or on‑premises deployments.
    • Hands‑on experience with chaos engineering and game day exercises.
    • Open‑source contributions or SRE community leadership.

Benefits and Why Join

  • Be part of a new, fast‑growing category.
  • High‑impact mission protecting AI systems.
  • Cutting‑edge engineering in AI/ML security and distributed systems.
  • Backed by top‑tier investors: Microsoft, IBM, and others.
  • Build and shape the SRE and platform culture from the ground up.
  • High autonomy and ownership, with direct influence on roadmap and architecture.
  • Fully remote (U.S.‑based) with flexible work‑life balance.
  • Competitive pay + equity upside.
  • Elite team and steep career growth in a hyper‑growth environment.

Equal Opportunity

We are an equal opportunity employer and do not discriminate based on race, color, religion, sex, sexual orientation, gender identity, national origin, age, disability, veteran status, or any legally protected status. We are committed to fostering an inclusive environment where all team members can thrive. If you need accommodations during the application or interview process, please let us know.