Elios Talent

Staff Site Reliability Engineer

Elios Talent, Seattle, Washington, US, 98127


Ensure reliability and performance of large-scale distributed systems.
Lead incident response and disaster recovery initiatives.
Build automation tools to streamline system operations.

Job Information

Title: Staff Site Reliability Engineer
Location: Flexible / Remote
Employment Type: Full-Time
Compensation: $135,000 – $220,000

Role Summary

We are seeking a Staff Site Reliability Engineer (SRE) to ensure the availability, scalability, and performance of mission-critical systems. You will design disaster recovery processes, implement observability and alerting frameworks, and lead incident response efforts. This role combines system design expertise with a focus on automation, empowering teams to operate large-scale distributed environments efficiently and securely.

Key Responsibilities

Design and maintain highly available, large-scale distributed systems.
Lead disaster recovery planning, execution, and continuous improvement.
Implement observability, monitoring, and alerting solutions.
Drive incident response, root cause analysis, and post-mortem reviews.
Build automation tools to optimize system operations and reduce manual tasks.
Collaborate with engineering teams to embed reliability best practices.

Requirements

6+ years of experience in Site Reliability Engineering or related roles.
Expertise in system design and distributed system architecture.
Proficiency in Go and Python for automation and tooling.
Strong knowledge of Kubernetes and container orchestration.
Experience with observability tools (monitoring, logging, and tracing).
Proven ability to lead incident response and drive reliability culture.

About the Opportunity

This role is ideal for an experienced engineer who thrives on ensuring reliability at scale. You will lead critical system initiatives, mentor teams, and implement automation to support resilient operations.

Why Join

High-impact role at the intersection of reliability and scalability.
Competitive compensation and leadership visibility.
Opportunity to shape operational excellence and system resiliency.
