UnitedHealth Group
Principal Site Reliability Engineer
UnitedHealth Group, Minneapolis, Minnesota, United States, 55447
Overview
Optum is a global organization that delivers care, aided by technology to help millions of people live healthier lives. The work you do with our team will directly improve health outcomes by connecting people with the care, pharmacy benefits, data and resources they need to feel their best. Here, you will find a culture guided by inclusion, talented peers, comprehensive benefits and career development opportunities. Come make an impact on the communities we serve as you help us advance health optimization on a global scale. Join us to start Caring. Connecting. Growing together.
We are seeking a Principal Site Reliability Engineer (SRE) to lead the design and implementation of resilient, observable, and high-performing systems across our organization. This role is ideal for a strategic thinker and hands-on technologist who thrives in complex environments and is passionate about reliability, automation, and innovation, especially at the intersection of SRE and AI.
You'll enjoy the flexibility to work remotely from anywhere within the U.S. as you take on some tough challenges.
Primary Responsibilities
Observability & Monitoring
Lead the implementation and standardization of OpenTelemetry across services to enhance observability and traceability
Define and enforce SLIs, SLOs, and error budgets in collaboration with engineering teams
Resiliency Engineering
Design and execute resiliency tests, disaster recovery (DR) exercises, and chaos engineering game days to proactively identify and mitigate system weaknesses
Develop automated failure injection and recovery validation tools
CI/CD & Performance Engineering
Enhance CI/CD pipelines with automated performance and load testing to ensure reliability and scalability before production deployment
Collaborate with DevOps and QA to integrate performance benchmarks into release gates
Cloud Architecture & Reliability
Drive cloud adoption strategies with a focus on resiliency patterns, multi-region failover, and cost-effective scaling
Partner with cloud architects to design fault-tolerant infrastructure and services
AI & Innovation in SRE
Explore and implement AI-driven solutions for anomaly detection, incident prediction, and intelligent alerting
Innovate with AI agents to automate routine SRE tasks and improve incident response efficiency
Leadership & Mentorship
Serve as a thought leader and mentor for SRE best practices across the organization
Lead cross-functional initiatives to improve system reliability, developer productivity, and customer experience
Benefits and Why UnitedHealth Group
Reasons to consider working for UnitedHealth Group include competitive base pay, a full and comprehensive benefit program, performance rewards, and a management team that demonstrates commitment to your success. Some offerings include:
Paid Time Off accrued from your first pay period plus 8 paid holidays
Medical plan options with Health Spending Account or Health Savings Account
Dental, vision, life and AD&D insurance, plus short-term and long-term disability
401(k) Savings Plan and Employee Stock Purchase Plan
Education reimbursement
Employee discounts
Employee Assistance Program
Employee referral bonus program
Voluntary benefits (pet insurance, legal insurance, long-term care insurance, etc.)
More information can be downloaded at: applicable benefits page
You'll be rewarded and recognized for your performance in an environment that will challenge you and give you clear direction on what it takes to succeed in your role as well as provide development for other roles you may be interested in.
Required Qualifications
Bachelor's degree in Computer Science, Information Technology or related field
10+ years of experience in software engineering, DevOps, or SRE roles, with at least 3 years in a principal or lead capacity
5+ years of experience with CI/CD tooling (e.g., Jenkins, GitHub Actions, ArgoCD)
5+ years of experience with container orchestration in cloud platforms (Azure or OCI preferred)
3+ years of deep experience in observability and monitoring tools (e.g., OpenTelemetry, Prometheus, Grafana, Datadog)
3+ years of experience with chaos engineering, DR planning, and performance testing
Preferred Qualifications
Hands-on experience with infrastructure as code (e.g., Terraform, Pulumi) and automation tools such as Ansible and Helm
Experience with service mesh technologies (e.g., Istio, Linkerd)
Familiarity with AI/ML concepts and experience applying them in operational contexts
Proven excellent communication and leadership skills
All telecommuters will be required to adhere to UnitedHealth Group's Telecommuter Policy.
Compensation and Compliance
Pay is based on factors including local labor markets, education, work experience and certifications. In addition to salary, we offer a comprehensive benefits package, incentive programs, equity stock purchase and 401k contributions (eligibility applies). The salary range for this role is $132,200 to $226,600 annually based on full-time employment. We comply with all applicable minimum wage laws.
Pursuant to the San Francisco Fair Chance Ordinance, we will consider qualified applicants with arrest and conviction records.
Application Deadline: This posting will be available for a minimum of 2 business days or until a sufficient candidate pool is collected. The posting may end early due to volume of applications.
UnitedHealth Group is an Equal Employment Opportunity employer. We are committed to mitigating our impact on the environment and delivering equitable care that addresses health disparities and improves health outcomes.