Providence Partners, LLC
Senior Site Reliability Engineer (Sr. SRE)
Location: Hybrid (1-2 days/week)
We are looking for a Senior Site Reliability Engineer (SRE) to help scale and operate highly available, cloud-based systems. In this role, you'll sit at the intersection of software engineering, DevOps, and platform reliability, ensuring our systems are resilient, observable, and built to perform at scale.
You'll lead incident response, drive automation, and partner closely with engineering teams to embed reliability into everything we build.
What You'll Do:
Own the reliability, availability, and performance of production systems
Lead incident response, on-call operations, and blameless post-mortems
Build and improve monitoring, alerting, logging, and observability
Define and manage SLIs, SLOs, and error budgets
Design and build automation and self-service tools to reduce toil
Support cloud infrastructure (AWS, Azure, GCP) using Infrastructure as Code
Improve CI/CD pipelines and deployment reliability
Partner with engineers on system design and architecture
Create runbooks and operational documentation
Mentor team members and promote SRE and DevOps best practices
What We're Looking For:
5+ years of experience in Site Reliability Engineering, DevOps, Platform, or Cloud Engineering
Strong Linux and production troubleshooting skills
Hands-on experience with AWS, Azure, or GCP
Proficiency in Python, Go, Java, Bash, or similar languages
Experience with Terraform, Ansible, or other Infrastructure as Code tools
Experience supporting CI/CD pipelines and production deployments
Strong communication skills and a reliability-first mindset
Nice to Have:
Kubernetes and container orchestration experience
Experience with observability tools such as Prometheus, Grafana, Datadog, Splunk, or ELK
Experience with high-traffic, highly available systems
Knowledge of chaos engineering, error budgets, or AIOps
Cloud or Kubernetes certifications
Why Join Us:
Work on scalable, mission-critical platforms
Influence reliability and engineering best practices
Collaborative, blameless culture
Competitive compensation, benefits, and growth opportunities