TEKsystems c/o Allegis Group
Overview
This customer is establishing a Site Reliability Engineering (SRE) practice/platform. The goal is to build solid observability of the platform and evaluate the tool stack to implement. They have a code team in place and are looking to augment the team with a staff engineer. They need someone with industry knowledge in SRE: experience working with vendors, hands-on technical ability, the ability to give technical guidance to junior team members, and the leadership to migrate from one tool to another. They are currently focused on dashboarding and setting up tools rather than SE or SA roles.

Key Functions/Duties of Position
Define and track reliability and observability OKRs, including Service Level Objectives (SLOs) and Service Level Indicators (SLIs).
Implement robust monitoring and alerting systems to proactively monitor health, identify potential issues, analyze system performance, and facilitate quick incident response.
Implement AIOps functionality to enable auto-response, self-healing, and anomaly trend analysis.
Drive automation to remove toil, streamline processes, reduce manual interventions, and enhance efficiency of product engineering and SRE teams.
Identify and address performance bottlenecks in applications and infrastructure to improve efficiency and user experience.
Collaborate with incident management to quickly address and resolve outages or performance issues to minimize downtime and user impact.
Work with development and operations teams to implement observability and resiliency requirements for smooth deployment and operation of software systems.
Lead coordination with product, development, infrastructure, and architecture teams for capacity planning to handle current and future demand and anticipate growth.
Improve reliability by identifying and addressing gaps in architecture, services, and tooling.
Modernize disaster recovery programs for both on-premises and cloud-based Berkley solutions.
Provide technical leadership and mentorship to other engineers, fostering a culture of learning and continuous improvement.

Education Requirement
Bachelor’s degree in Computer Science, Information Technology, or a related field (or a combination of education and equivalent experience).

Qualifications
7+ years of IT experience with infrastructure support and development.
7+ years of experience in Site Reliability Engineering and DevOps.
Proficient in scripting languages such as Python, Go, Bash, and/or JavaScript, with shell scripting experience.
Strong expertise in observability, monitoring, alerting, and logging tools (Dynatrace, Datadog, ELK Stack).
Practical expertise in creating and implementing logging and monitoring architectures.
Experience designing and implementing on-premises, cloud, and hybrid resiliency solutions (HA, AA, AP), disaster recovery, and business continuity planning.
Deep understanding of cloud computing (IaaS, PaaS, SaaS).
Experience with Kubernetes and auto-scaling tools; proficiency with Helm and Prometheus for deployment and monitoring.
Proficient in GitOps with containerization and CI/CD pipelines.
Experience with automation/configuration management tools (GitHub Actions, Terraform, Ansible, Chef, Puppet).
Strong security practices for on-premises, cloud, and hybrid environments; understanding of security frameworks relevant to Berkley environments.
Ability to drive critical issues and lead system design discussions across multiple technology teams.
Demonstrated leadership, mentoring, and project leadership experience.
Excellent problem-solving and cross-functional communication skills.

Behavioral Core Competencies
Strategic
Influential
Organizational Navigation
Balanced Approach
Commandership Skills
Composure

Pay and Benefits
The pay range for this position is $80.00 – $80.00/hr. Eligibility requirements apply to some benefits and may depend on job classification and length of employment. Benefits may include medical, dental, and vision; critical illness, accident, and hospital coverage; 401(k) retirement plan; life insurance; short- and long-term disability; Health Spending Account (HSA); transportation benefits; Employee Assistance Program; and time off/leave (PTO/vacation/sick leave).

Workplace Type
This is a hybrid position in Atlanta, GA.

Application Deadline
This position is anticipated to close on Oct 2, 2025.

About TEKsystems
TEKsystems is an equal opportunity employer and will consider all applications without regard to race, sex, age, color, religion, national origin, veteran status, disability, sexual orientation, gender identity, or any characteristic protected by law. TEKsystems is an Allegis Group company.