InStride

Principal Site Reliability Engineer (SRE)

InStride, Los Angeles, California, United States, 90079

Overview

We are looking for a Principal Site Reliability Engineer (SRE) to join InStride's growing engineering team. This is a highly technical individual contributor role at the intersection of cloud architecture, automation, and reliability engineering. You will be the go-to AWS expert for complex initiatives, setting technical direction and raising the bar for operational excellence across our platform. Every system you design, every automation you implement, and every safeguard you put in place will directly support our mission of expanding access to life-changing education for working adults around the globe.

Responsibilities

Elevate platform reliability: Design and operate multi-region, fault-tolerant systems to ensure the learning platform is highly available.
Advance automation at scale: Deliver infrastructure as code libraries, CI/CD pipelines, and self-service capabilities that reduce operational toil and accelerate developer productivity.
Champion security and compliance: Implement defense-in-depth strategies, policy-as-code guardrails, and proactive monitoring to protect data and meet regulatory standards.
Drive observability maturity: Define and enforce SLIs/SLOs, manage error budgets, and build monitoring frameworks.
Enable seamless service connectivity: Deploy and manage service mesh and AWS networking to secure and optimize service-to-service communication.
Influence technical direction: Shape AWS strategy for scalability, resilience, and cost efficiency in collaboration with engineering and security stakeholders.
Mentor and uplift engineers: Lead design reviews and uplift the team’s expertise in modern DevOps and SRE practices.

Qualifications

10+ years of experience in SRE, DevOps, or Platform Engineering roles operating production AWS workloads.
Hands-on expertise with AWS EKS, Kubernetes networking, Helm, autoscaling (Karpenter/Cluster Autoscaler), serverless architectures, and API Gateways.
Proven delivery of service mesh solutions (Istio, Linkerd, or AWS App Mesh) for secure, observable service-to-service communication.
Proficiency with Infrastructure as Code using AWS CDK (TypeScript or Python), Terraform, or CloudFormation.
Strong programming and automation skills in Go, Python, or TypeScript, with Bash experience.
Experience implementing policy-as-code with OPA/Rego or similar tooling in CI/CD pipelines.
Solid understanding of SLI/SLO/error-budget methodologies and hands-on experience with monitoring and alerting stacks (Prometheus, Grafana, CloudWatch, etc.).
Deep knowledge of AWS security best practices, including IAM, encryption, OS hardening, and compliance enforcement.
Excellent communication skills to translate reliability metrics into business impact and guide incident/post-mortem discussions.
Experience mentoring engineers and influencing enterprise AWS and DevOps strategies.
Familiarity with Internal Developer Portals (Backstage, Port, Cortex) and self-service automation is a strong plus.

Compensation

Compensation range: $165,000 to $185,000 USD. The final offer depends on location, depth of experience, interview performance, and pay equity with other team members. We encourage you to talk with your recruiter to learn more about total compensation and benefits for this role.

Benefits

401(k) plan with company match
Flexible vacation policy
Paid family leave
Best-in-class health care benefits
And more

Inclusive workplace

InStride is a diverse, inclusive employer that encourages applicants from all backgrounds. If you have a disability or special need that requires accommodation, please let your recruiter know.
