hackajob

Principal Site Reliability Engineer

hackajob, Minneapolis, Minnesota, United States, 55447

Base pay range: $132,200.00/yr - $226,600.00/yr

This range is provided by hackajob. Your actual pay will be based on your skills and experience; talk with your recruiter to learn more.

Optum is a global organization that delivers care, aided by technology, to help millions of people live healthier lives. The work you do with our team will directly improve health outcomes by connecting people with the care, pharmacy benefits, data and resources they need to feel their best. Here, you will find a culture guided by inclusion, talented peers, comprehensive benefits and career development opportunities. Come make an impact on the communities we serve as you help us advance health optimization on a global scale. Join us to start Caring. Connecting. Growing together.

We are seeking a Principal Site Reliability Engineer (SRE) to lead the design and implementation of resilient, observable, and high-performing systems across our organization. This role is ideal for a strategic thinker and hands-on technologist who thrives in complex environments and is passionate about reliability, automation, and innovation, especially at the intersection of SRE and AI.

You’ll enjoy the flexibility to work remotely from anywhere within the U.S. Hires in the Minneapolis or Washington, D.C. area will be required to work in the office a minimum of four days per week.

Primary Responsibilities

Observability & Monitoring
- Lead the implementation and standardization of OpenTelemetry across services to enhance observability and traceability
- Define and enforce SLIs, SLOs, and error budgets in collaboration with engineering teams

Resiliency Engineering
- Design and execute resiliency tests, disaster recovery (DR) exercises, and chaos engineering game days to proactively identify and mitigate system weaknesses
- Develop automated failure injection and recovery validation tools

CI/CD & Performance Engineering
- Enhance CI/CD pipelines with automated performance and load testing to ensure reliability and scalability before production deployment
- Collaborate with DevOps and QA to integrate performance benchmarks into release gates

Cloud Architecture & Reliability
- Drive cloud adoption strategies with a focus on resiliency patterns, multi-region failover, and cost-effective scaling
- Partner with cloud architects to design fault-tolerant infrastructure and services

AI & Innovation in SRE
- Explore and implement AI-driven solutions for anomaly detection, incident prediction, and intelligent alerting
- Innovate with AI agents to automate routine SRE tasks and improve incident response efficiency

Leadership & Mentorship
- Serve as a thought leader and mentor for SRE best practices across the organization
- Lead cross-functional initiatives to improve system reliability, developer productivity, and customer experience

What are the reasons to consider working for UnitedHealth Group? We offer competitive base pay, a comprehensive benefits program, performance rewards, and a management team committed to your success. Benefits include Paid Time Off, medical, dental, vision, life insurance, disability coverage, 401(k) and more.
Some offerings include:
- Paid Time Off and 8 paid holidays
- Medical plans with Health Spending or Health Savings accounts
- Dental, Vision, Life & AD&D Insurance, Short-term and Long-Term Disability
- 401(k) Savings Plan and Employee Stock Purchase Plan
- Education Reimbursement
- Employee Discounts
- Employee Assistance Program
- Employee Referral Bonus Program
- Voluntary Benefits (pet insurance, legal insurance, LTC Insurance, etc.)

Required Qualifications
- Bachelor’s degree in Computer Science, Information Technology or related field
- 10+ years of experience in software engineering, DevOps, or SRE roles, with at least 3 years in a principal or lead capacity
- 5+ years of experience with CI/CD tooling (e.g., Jenkins, GitHub Actions, ArgoCD)
- 5+ years of experience with container orchestration in cloud platforms (Azure or OCI preferred)
- 3+ years of deep experience in observability and monitoring tools (e.g., OpenTelemetry, Prometheus, Grafana, Datadog)
- 3+ years of experience with chaos engineering, DR planning, and performance testing

Preferred Qualifications
- Hands-on experience with infrastructure as code (Terraform, Pulumi) and automation tools such as Ansible, Helm
- Experience with service mesh technologies (e.g., Istio, Linkerd)
- Familiarity with AI/ML concepts and experience applying them in operational contexts
- Proven excellent communication and leadership skills

All Telecommuters will be required to adhere to UnitedHealth Group’s Telecommuter Policy.

Pay is based on several factors including but not limited to local labor markets, education, work experience, certifications, etc. The salary for this role will range from $132,200 to $226,600 annually based on full-time employment. We comply with all minimum wage laws as applicable.

UnitedHealth Group is an Equal Employment Opportunity employer and drug-free workplace. Candidates are required to pass a drug test before beginning employment.

Seniority level

Mid-Senior level

Employment type

Full-time

Job function

Engineering and Information Technology

Industries

Software Development
