UKG (Ultimate Kronos Group)

Principal Site Reliability Engineer

UKG (Ultimate Kronos Group), Alpharetta, Georgia, United States, 30239

Why UKG:

At UKG, the work you do matters. The code you ship, the decisions you make, and the care you show a customer all add up to real impact. Today, tens of millions of workers start and end their days with our workforce operating platform. Helping people get paid, grow in their careers, and shape the future of their industries. That’s what we do. We never stop learning. We never stop challenging the norm. We push for better, and we celebrate the wins along the way. Here, you’ll get flexibility that’s real, benefits you can count on, and a team that succeeds together. Because at UKG, your work matters, and so do you.

About the Team:

Site Reliability Engineers at UKG are critical team members who have a breadth of knowledge encompassing all aspects of service delivery. They develop software solutions to enhance, harden, and support our service delivery processes. This can include building and managing CI/CD deployment pipelines, automated testing, capacity planning, performance analysis, monitoring, alerting, chaos engineering, and auto-remediation. Site Reliability Engineers must be passionate about learning and evolving with current technology trends. They strive to innovate and are relentless in pursuing a flawless customer experience. They have an “automate everything” mindset, helping us bring value to our customers by deploying services with incredible speed, consistency, and availability.

About the Role

Site Reliability Engineers (SREs) at UKG play a critical role in delivering scalable, reliable, and secure services to our customers. As Principal SRE, you will be a force multiplier, combining deep software engineering expertise with systems knowledge to build robust automation, drive operational excellence, and elevate the overall reliability of our services. This role is highly technical and hands-on. You will design and implement solutions that eliminate toil and optimize performance, including developing automated testing frameworks, intelligent alerting systems, and self-healing mechanisms.

Responsibilities

Architect, develop, and maintain scalable automation, internal tools, health checks, monitoring, and auto-remediation to improve service availability, reliability, latency, scalability, and system resiliency, ensuring services withstand failures and recover gracefully to maintain high availability.
Lead incident response efforts to minimize customer impact and reduce MTTx, including leading post-incident reviews to identify root causes and implement long-term solutions.
Provide strategic guidance and design consultation throughout the full service lifecycle, from architecture and capacity planning to production readiness, while establishing and enforcing SRE standards for system architecture, observability, incident response, and reliability metrics.
Partner closely with product, infrastructure, and engineering teams to integrate reliability goals into the development process.
Mentor and guide engineers across the organization on reliability principles and best practices, and serve as a reliability evangelist to drive cultural and operational changes that improve engineering velocity.
Leverage generative AI agents and automation tools to enhance operational efficiency; automate health checks, incident detection, and resolution; and drive innovative solutions in site reliability engineering.
Define, implement, and measure SLIs and SLOs to guide reliability-focused engineering decisions.

Basic Qualifications

Minimum 8 years of engineering experience, including 5+ years in Site Reliability, DevOps, or Production Engineering roles.
Advanced proficiency in one or more programming languages (e.g., Python, Go, Java, or C++) with the ability to write production-grade software.
Strong Linux systems expertise, including scripting, performance tuning, and debugging.
Hands-on experience operating large-scale distributed systems in public cloud environments, preferably GCP.
Deep knowledge of Kubernetes and container orchestration patterns in production environments.
Experience with GitHub Actions and modern CI/CD practices.
Deep experience with SLI/SLO design, service health instrumentation, and production telemetry.
Proven ability to build dashboards and alerts using Splunk and Grafana.
Strong understanding of observability systems, including metrics pipelines, distributed tracing, log aggregation, and alerting strategies and incident triage.
Familiarity with infrastructure-as-code tools (e.g., Terraform, Ansible).

Preferred Qualifications

Experience implementing chaos engineering, load testing, and resilience modeling.
Google Cloud Professional Cloud Architect certification is a plus.
Understanding of OpenTelemetry (metrics, tracing, logs) and its integration into observability pipelines.

Equal Opportunity Employer

UKG is an equal opportunity employer. We evaluate qualified applicants without regard to race, color, disability, religion, sex, age, national origin, veteran status, genetic information, and other legally protected categories.