Inspire Brands
Senior Manager, Reliability Engineering and Observability Platforms
Inspire Brands, Cartersville, Georgia, United States, 30120
Overview
We are seeking an experienced and dynamic Senior Manager, Reliability Engineering & Observability Platforms to lead our observability initiatives and reliability engineering efforts. This role is accountable for designing and managing platforms that ensure visibility, uptime, performance, and seamless operation of critical systems and services. The ideal candidate will bring deep technical expertise in observability tools and reliability engineering practices, along with proven leadership experience. They will lead a team responsible for enabling high availability, incident response, performance monitoring, and operational resilience through automation, process improvement, and cross-functional collaboration. This leader will partner closely with IT, DevOps, Infrastructure, and Business Units to deliver scalable and reliable services with a focus on proactive issue detection and resolution. This is an in-office position based in Atlanta (80% onsite).
RESPONSIBILITIES
Own and evolve observability platforms (monitoring, logging, tracing) to meet organizational needs for performance and availability.
Improve observability maturity by driving adoption of best practices and platform-wide instrumentation.
Lead the reliability engineering function focused on ensuring system uptime, operability, and resilience.
Define and track SLOs/SLIs/SLAs, partnering with product and infrastructure teams to uphold service quality standards.
Drive adoption of reliability best practices into application design, deployments, and operations.
Develop and mature incident management processes, including alerting, triage, resolution, and post-incident reviews.
Oversee and continuously improve on-call strategies, ensuring the team is prepared for high-impact production events.
Champion automation of monitoring, diagnostics, deployment validation, and platform operations to reduce manual effort.
Integrate observability and reliability engineering practices into CI/CD pipelines and deployment workflows.
Mentor and lead a team of engineers with a focus on operational excellence, continuous learning, and accountability.
Build a high-performing team culture aligned to business outcomes and platform stability.
Collaborate with cross-functional teams, including application developers, DevOps, cloud infrastructure, and security, to ensure reliable and observable service delivery.
Partner with architecture and engineering teams to ensure new systems are designed with reliability in mind.
Provide regular reporting and insights on system health, incidents, and reliability trends to leadership.
Use telemetry data to identify system bottlenecks, recurring issues, and areas for proactive improvement.
Manage observability and reliability tool vendors, including evaluation, contracts, renewals, and integrations.
EDUCATION AND EXPERIENCE QUALIFICATIONS
Bachelor’s degree in Computer Science, Information Technology, Engineering, or a related field.
Master’s degree is a plus.
5 – 7 years of experience managing and leading engineering teams.
7+ years in IT operations, DevOps, or software/platform engineering.
3+ years in a leadership role focused on observability or reliability engineering.
KNOWLEDGE, SKILLS, AND ABILITIES
Technical Expertise
Hands-on experience with observability platforms (e.g., Splunk, Prometheus, Grafana, ELK, New Relic, Dynatrace).
Deep understanding of logging, monitoring, alerting, and tracing technologies.
Strong knowledge of public cloud (Azure, AWS, or GCP) and container platforms (e.g., Kubernetes).
Familiarity with infrastructure as code (e.g., Terraform, Ansible).
Reliability Engineering Competency
Experience implementing and supporting highly available, scalable systems.
Understanding of SLOs, SLIs, incident lifecycle, and post-incident analysis.
Ability to embed reliability practices into SDLC and CI/CD workflows.
Leadership & Communication
Demonstrated ability to build, grow, and lead high-performing teams.
Strong analytical, communication, and cross-functional collaboration skills.