Herbalife
Overview
THE ROLE:
We are seeking a highly experienced Principal II, Site Reliability Engineer (SRE) to lead the strategy and execution of reliability engineering across Herbalife’s global platforms. This role focuses on building and scaling resilient, observable systems, advancing multi-cloud operations, and embedding reliability, automation, and guidelines across engineering teams. You will define standards, drive adoption of modern infrastructure practices, and ensure that our services deliver performance, availability, and reliability at scale.
HOW YOU WOULD CONTRIBUTE:
Architect resilient platforms and tooling across Azure and GCP, using Kubernetes, serverless technologies, and infrastructure as code.
Drive observability and monitoring practices with Dynatrace, Splunk, and OpenTelemetry, establishing metrics, tracing, alerting, and actionable dashboards.
Design and implement GitOps workflows for consistent, auditable, and secure infrastructure and application deployments.
Lead infrastructure automation with Terraform and related tooling to enable scalable, self-service provisioning and governance.
Define and enforce SLOs, SLIs, and error budgets to measure and improve system reliability and customer experience.
Develop operational standards and runbooks for incident response, disaster recovery, and performance management.
Partner with application and infrastructure teams to ensure reliability, scalability, and cost-efficiency are built into every layer of the stack.
Mentor and influence engineering teams to adopt modern SRE practices and drive a culture of operational excellence.
WHAT’S SPECIAL ABOUT THE TEAM:
The SRE team is evolving to expand its scope beyond traditional operations, embedding observability, automation, and cloud-native practices across Herbalife’s platform. Our mission is to ensure production systems are resilient, observable, and scalable, while enabling application teams to move quickly with confidence in Azure, GCP, and hybrid environments.
Qualifications
SKILLS AND BACKGROUND REQUIRED TO BE SUCCESSFUL:
7+ years of engineering or SRE experience with modern distributed systems.
Proficiency in at least one modern programming language (Python, Go, Java, etc.).
Deep knowledge of observability and monitoring with Dynatrace, Splunk, and log/metrics pipelines.
Strong hands-on experience with multi-cloud environments (Azure + GCP), Kubernetes, and serverless platforms.
Proven expertise with GitOps practices and Terraform (IaC) for automation, scalability, and governance.
Experience defining SLOs, SLIs, and error budgets and embedding them into production systems.
Strong background in incident response, postmortems, and operational excellence.
Ability to mentor, guide, and influence technical and business collaborators.
Education
Bachelor’s Degree in Computer Science, Engineering, or related field required.
US Benefits Statement
Herbalife offers a variety of benefits to eligible employees in the U.S. (limited to the 50 states and the District of Columbia), which include Group Health Programs, other Voluntary Benefit Programs, and Paid Time Off. Group Health Programs include Medical, Dental, Vision, Health Savings Account (HSA), Flexible Spending Accounts (FSA), Basic Life/AD&D, Short-Term and Long-Term Disability, and an Employee Assistance Program (EAP). Other Voluntary Benefit Programs include a 401(k) plan, Wellness Incentive Program, Employee Stock Purchase Plan (ESPP), Supplemental Life/Critical Illness/Hospitalization/Accident Insurance, and Pet Insurance. Paid Time Off includes Company-observed U.S. Holidays, Floating Holidays, Vacation, Sick Time, a Volunteer Program, Paid Maternity and Paternity Leave, Bereavement Leave, Personal Leave, and time off for voting.