TekLeaders, Inc
Role Name:
Site Reliability Engineer – Lead
Cincinnati, OH – Hybrid, W2 only
Role Description:
As a Site Reliability Engineer – Lead, you will drive the reliability, scalability, and performance of mission-critical systems and services while leading a team of SREs. This role combines deep technical expertise with leadership, mentoring, and strategic planning. You will set standards for operational excellence, guide incident response, and foster a culture of automation and continuous improvement. Collaboration with engineering, operations, and product teams is essential to align reliability initiatives with business objectives and ensure seamless service delivery.
REQUIRED SKILLS:
Proven experience in site reliability, DevOps, or systems engineering, with prior leadership or team lead responsibilities
Strong programming/scripting skills (e.g., Python, Go, Bash, or similar)
Deep expertise in Linux/Unix system administration and networking
Experience architecting and operating cloud platforms (AWS, Azure, Google Cloud Platform)
Proficiency with infrastructure-as-code and automation tools (e.g., Terraform, Ansible, CloudFormation)
Advanced knowledge of monitoring, logging, and alerting solutions (e.g., Prometheus, Grafana, ELK, Datadog)
Demonstrated incident management and root cause analysis skills
Experience designing and implementing CI/CD pipelines
Strong understanding of containerization and orchestration (Docker, Kubernetes)
Ability to define and enforce reliability, scalability, and security best practices
Excellent communication, stakeholder management, and collaboration skills
Experience mentoring, coaching, and developing SRE or engineering teams
Strong hands-on experience defining business process dashboards in APM tools such as Dynatrace, including the definition, design, and implementation of SLAs, SLOs, and SLIs as part of observability
Experience with devices such as scanners, POS devices, and peripheral devices (including devices with on-device memory)
Experience with hardcoded device protocols and software, including the ability to decode and run them and help integrate them with other modules
Experience in edge computing, Google Distributed Cloud, and hybrid cloud environments
Experience leading SRE teams in high-growth or regulated environments
Advanced database administration and optimization skills (both SQL, e.g., MySQL, and NoSQL, e.g., MongoDB)
Key Responsibilities:
Team Leadership & Development:
Apply technical expertise and hands-on experience to lead the development team
Mentor team members and guide them on the right approach to SRE-related work
Foster a culture of operational excellence, automation, and continuous learning
Conduct regular team meetings, 1:1s, and performance reviews
Reliability Strategy & Architecture:
Define and implement reliability, scalability, and performance strategies for critical systems
Set standards for monitoring, alerting, and incident response
Guide architectural decisions to ensure robust, resilient infrastructure
Incident & Problem Management:
Oversee incident response, root cause analysis, and post-mortem processes
Coordinate with cross-functional teams to resolve complex issues and prevent recurrence
Drive improvements based on incident learnings
Process Improvement & Automation:
Identify and eliminate manual operational tasks through automation
Optimize CI/CD pipelines and deployment processes
Continuously enhance system reliability and efficiency
Stakeholder Collaboration:
Partner with engineering, operations, and product teams to align reliability goals with business objectives
Communicate reliability metrics, risks, and progress to leadership and stakeholders
Security & Compliance:
Ensure infrastructure and processes adhere to security best practices and compliance requirements
Experience with chaos and resilience engineering