Inspire Brands, Inc.
Senior Manager, Reliability Engineering and Observability Platforms
Inspire Brands, Inc., Atlanta, Georgia, United States, 30383
Overview
Inspire Brands is seeking a Senior Manager, Reliability Engineering & Observability Platforms to lead observability initiatives and reliability engineering efforts. This role designs and manages platforms that ensure visibility, uptime, performance, and seamless operation of critical systems and services. The position is office-based in Atlanta (80% onsite).
The ideal candidate will bring deep technical expertise in observability tools and reliability engineering practices, with proven leadership experience. They will lead a team responsible for enabling high availability, incident response, performance monitoring, and operational resilience through automation, process improvement, and cross-functional collaboration.
RESPONSIBILITIES
Own and evolve observability platforms (monitoring, logging, tracing) to meet organizational needs for performance and availability.
Improve observability maturity by driving adoption of best practices and platform-wide instrumentation.
Lead the reliability engineering function focused on ensuring system uptime, operability, and resilience.
Define and track SLOs/SLIs/SLAs, partnering with product and infrastructure teams to uphold service quality standards.
Drive adoption of reliability best practices into application design, deployments, and operations.
Develop and mature incident management processes including alerting, triage, resolution, and post-incident reviews.
Oversee and continuously improve on-call strategies, ensuring the team is prepared for high-impact production events.
Champion automation of monitoring, diagnostics, deployment validation, and platform operations to reduce manual effort.
Integrate observability and reliability engineering practices into CI/CD pipelines and deployment workflows.
Mentor and lead a team of engineers with a focus on operational excellence, continuous learning, and accountability.
Build a high-performing team culture aligned to business outcomes and platform stability.
Collaborate with cross-functional teams including application developers, DevOps, cloud infrastructure, and security to ensure reliable and observable service delivery.
Partner with architecture and engineering teams to ensure new systems are designed with reliability in mind.
Provide regular reporting and insights on system health, incidents, and reliability trends to leadership.
Use telemetry data to identify system bottlenecks, recurring issues, and areas for proactive improvement.
Manage observability and reliability tool vendors, including evaluation, contracts, renewals, and integrations.
EDUCATION AND EXPERIENCE QUALIFICATIONS
Bachelor’s degree in Computer Science, Information Technology, Engineering, or a related field.
Master’s degree is a plus.
5–7 years of experience managing and leading engineering teams.
7+ years in IT operations, DevOps, or software/platform engineering.
3+ years in a leadership role focused on observability or reliability engineering.
KNOWLEDGE, SKILLS, AND ABILITIES
Technical Expertise
Hands-on experience with observability platforms (e.g., Splunk, Prometheus, Grafana, ELK, New Relic, Dynatrace).
Deep understanding of logging, monitoring, alerting, and tracing technologies.
Strong knowledge of public cloud (Azure, AWS, or GCP) and container platforms (Kubernetes).
Familiarity with infrastructure as code (Terraform, Ansible).
Reliability Engineering Competency
Experience implementing and supporting highly available, scalable systems.
Understanding of SLOs, SLIs, incident lifecycle, and post-incident analysis.
Ability to embed reliability practices into SDLC and CI/CD workflows.
Leadership & Communication
Demonstrated ability to build, grow, and lead high-performing teams.
Strong analytical, communication, and cross-functional collaboration skills.