Capgemini
Metrics Platform Site Reliability Engineer
Choosing Capgemini means choosing a company where you will be empowered to shape your career in the way you’d like, where you’ll be supported and inspired by a collaborative community of colleagues around the world, and where you’ll be able to reimagine what’s possible. Join us and help the world’s leading organizations unlock the value of technology and build a more sustainable, more inclusive world.
Job Location – Atlanta, GA
Job Description
We are looking for a Metrics Platform Site Reliability Engineer to manage and mentor a team of Site Reliability Engineers and to define and implement SRE strategies and best practices in alignment with organizational objectives. The role monitors clients’ service level agreements (SLAs), service level objectives (SLOs) and service level indicators (SLIs), and leads initiatives to improve system reliability, availability, scalability and performance. You will collaborate with development and operations teams to ensure reliability and resiliency goals are met, implement and improve incident management processes to minimize downtime and ensure timely resolution, review and contribute to the architecture of critical systems so that they meet reliability and performance goals, and drive observability practices by implementing robust monitoring, logging and alerting systems.
Key Responsibilities
- Manage and mentor a team of Site Reliability Engineers
- Define and implement SRE strategies and best practices in alignment with organizational objectives
- Monitor clients’ service level agreements (SLAs), service level objectives (SLOs) and service level indicators (SLIs)
- Lead initiatives to improve system reliability, availability, scalability and performance
- Collaborate with development and operations teams to ensure reliability and resiliency goals are met
- Implement and improve incident management processes to minimize downtime and ensure timely resolutions
- Review and contribute to the architecture of critical systems ensuring they meet reliability and performance goals
- Drive observability practices by implementing robust monitoring, logging and alerting systems
Skills Required
- Proficiency in writing Splunk queries and alerts
- Hands‑on experience with at least one APM tool (New Relic, AppDynamics, Honeycomb, Datadog)
- Expertise in automation tools and scripting languages (Python or JavaScript)
- Proficiency in scripting with Python or Node.js
- Proficiency in at least one cloud platform (AWS, GCP, Azure)
- Strong understanding of distributed systems, microservices architecture and container orchestration tools (e.g., Kubernetes)
- Experience with monitoring tools like Prometheus and Grafana
Additional Responsibilities
- Monitoring and alerting: Implement and maintain monitoring systems to proactively identify potential issues and alert engineers to problems before they impact users.
- Incident response: Respond to incidents and outages, diagnose problems and implement solutions to minimize downtime and restore service.
- Automation: Automate repetitive tasks and processes to improve efficiency and reduce manual effort.
- Performance optimization: Identify and address performance bottlenecks to ensure systems run efficiently and effectively.
- Infrastructure management: Manage and maintain the underlying infrastructure including servers, networks and cloud resources.
- Capacity planning: Plan for future capacity needs to ensure systems can handle anticipated workloads.
- Release engineering: Develop and maintain processes for deploying software updates and releases.
- Collaboration: Work closely with developers, operations teams and other stakeholders to ensure system reliability and availability.
- Documentation: Maintain clear and concise documentation of systems, processes and procedures.
- Continuous improvement: Identify areas for improvement and implement changes to enhance system reliability and performance.
Life at Capgemini
- Flexible work
- Healthcare including dental, vision, mental health, and well‑being programs
- Financial well‑being programs such as 401(k) and Employee Share Ownership Plan
- Paid time off and paid holidays
- Paid parental leave
- Family building benefits like adoption assistance, surrogacy, and cryopreservation
- Social well‑being benefits like subsidized back‑up child/elder care and tutoring
- Mentoring, coaching and learning programs
- Employee Resource Groups
- Disaster Relief
Equal Opportunity Statement
Capgemini is an Equal Opportunity Employer encouraging diversity in the workplace. All qualified applicants will receive consideration for employment without regard to race, national origin, gender identity/expression, age, religion, disability, sexual orientation, genetics, veteran status, marital status or any other characteristic protected by law. This role may be eligible for other compensation, including variable compensation, bonus, or commission. Full-time regular employees are eligible for paid time off, medical/dental/vision insurance, 401(k), and other benefits.
Compensation
Salary range for this role: $100,000 – $130,000 per year. Additional compensation may include bonus or commission.