Highbrow LLC
Job Title
Site Reliability Engineer
Location
Bellevue, WA
Overview
We’re seeking a skilled Site Reliability Engineer (SRE) to join our growing engineering team. As an SRE, you will be responsible for building, maintaining, and scaling our production systems while improving reliability, availability, and performance. You’ll work at the intersection of software engineering and infrastructure, automating everything from deployment to monitoring and incident response.
This role is ideal for someone with a passion for operational excellence, infrastructure as code, and a deep understanding of distributed systems.
Key Responsibilities
Design, implement, and maintain scalable and reliable infrastructure using automation tools.
Develop and manage monitoring, alerting, and incident response systems to ensure high availability and performance of services.
Collaborate with development teams to ensure production readiness and enforce best practices for CI/CD, observability, and fault tolerance.
Troubleshoot and resolve production issues, conducting root cause analysis and implementing postmortem processes.
Continuously improve deployment pipelines, configuration management, and system orchestration tools.
Manage cloud infrastructure (e.g., AWS, GCP, Azure), Kubernetes clusters, and containerized applications.
Define and enforce SLOs/SLIs/SLAs and work proactively to maintain service health and uptime.
Participate in an on-call rotation, working to minimize pager fatigue through proactive systems improvements.
Support security, compliance, and audit readiness efforts through automation and monitoring.
Required Qualifications
3–7 years of experience in SRE, DevOps, or backend infrastructure roles.
Strong understanding of Linux systems administration, networking, and performance tuning.
Proficiency in scripting and automation using Python, Go, Bash, or similar.
Experience with CI/CD pipelines (e.g., GitLab CI, Jenkins, ArgoCD).
Expertise in monitoring and observability tools (e.g., Prometheus, Grafana, ELK/EFK, Datadog).
Hands-on experience with cloud providers like AWS, GCP, or Azure.
Strong knowledge of Kubernetes, Docker, and container orchestration best practices.
Familiarity with infrastructure as code (IaC) using Terraform, Pulumi, or CloudFormation.
Excellent communication and collaboration skills; ability to work cross-functionally.
Preferred Qualifications
Experience in high-scale, high-availability
Background in incident management, chaos engineering, or resilience testing.
Familiarity with service mesh technologies (e.g., Istio, Linkerd).
Experience working in regulated industries (e.g., fintech, healthcare, telecom).
Contributions to open-source SRE, DevOps, or cloud-native projects.