Highbrow LLC

Site Reliability Engineer

Highbrow LLC, Bellevue, Washington, US, 98009

Job Title

Site Reliability Engineer

Location

Bellevue, WA

Overview

We’re seeking a skilled Site Reliability Engineer (SRE) to join our growing engineering team. As an SRE, you will be responsible for building, maintaining, and scaling our production systems while improving reliability, availability, and performance. You’ll work at the intersection of software engineering and infrastructure, automating everything from deployment to monitoring and incident response.

This role is ideal for someone with a passion for operational excellence and infrastructure as code, and a deep understanding of distributed systems.

Key Responsibilities

Design, implement, and maintain scalable and reliable infrastructure using automation tools.

Develop and manage monitoring, alerting, and incident response systems to ensure high availability and performance of services.

Collaborate with development teams to ensure production readiness and enforce best practices for CI/CD, observability, and fault tolerance.

Troubleshoot and resolve production issues, conducting root cause analysis and implementing postmortem processes.

Continuously improve deployment pipelines, configuration management, and system orchestration tools.

Manage cloud infrastructure (e.g. AWS, GCP, Azure), Kubernetes clusters, and containerized applications.

Define and enforce SLOs/SLIs/SLAs and work proactively to maintain service health and uptime.

Participate in an on-call rotation, working to minimize pager fatigue through proactive systems improvements.

Support security, compliance, and audit readiness efforts through automation and monitoring.

Required Qualifications

3–7 years of experience in SRE, DevOps, or backend infrastructure roles.

Strong understanding of Linux systems administration, networking, and performance tuning.

Proficiency in scripting and automation using Python, Go, Bash, or similar.

Experience with CI/CD pipelines (e.g. GitLab CI, Jenkins, ArgoCD).

Expertise in monitoring and observability tools (e.g. Prometheus, Grafana, ELK/EFK, Datadog).

Hands-on experience with cloud providers like AWS, GCP, or Azure.

Strong knowledge of Kubernetes, Docker, and container orchestration best practices.

Familiarity with infrastructure as code (IaC) using Terraform, Pulumi, or CloudFormation.

Excellent communication and collaboration skills; ability to work cross-functionally.

Preferred Qualifications

Experience in high-scale, high-availability environments.

Background in incident management, chaos engineering, or resilience testing.

Familiarity with service mesh technologies (e.g. Istio, Linkerd).

Experience working in regulated industries (e.g., fintech, healthcare, telecom).

Contributions to open-source SRE, DevOps, or cloud-native projects.
