Inspire Brands

Senior Manager, Reliability Engineering & Observability Platforms

Inspire Brands, Cartersville, Georgia, United States, 30120

Overview

We are seeking an experienced and dynamic Senior Manager, Reliability Engineering & Observability Platforms to lead our observability initiatives and reliability engineering efforts. This role is accountable for designing and managing platforms that ensure visibility, uptime, performance, and seamless operation of critical systems and services. The ideal candidate will bring deep technical expertise in observability tools and reliability engineering practices, along with proven leadership experience. They will lead a team responsible for enabling high availability, incident response, performance monitoring, and operational resilience through automation, process improvement, and cross-functional collaboration. This leader will partner closely with IT, DevOps, Infrastructure, and Business Units to deliver scalable and reliable services with a focus on proactive issue detection and resolution. This is an in-office position based in Atlanta (80% onsite).

RESPONSIBILITIES

Own and evolve observability platforms (monitoring, logging, tracing) to meet organizational needs for performance and availability.

Improve observability maturity by driving adoption of best practices and platform-wide instrumentation.

Lead the reliability engineering function focused on ensuring system uptime, operability, and resilience.

Define and track SLOs/SLIs/SLAs, partnering with product and infrastructure teams to uphold service quality standards.

Drive adoption of reliability best practices into application design, deployments, and operations.

Develop and mature incident management processes, including alerting, triage, resolution, and post-incident reviews.

Oversee and continuously improve on-call strategies, ensuring the team is prepared for high-impact production events.

Champion automation of monitoring, diagnostics, deployment validation, and platform operations to reduce manual effort.

Integrate observability and reliability engineering practices into CI/CD pipelines and deployment workflows.

Mentor and lead a team of engineers with a focus on operational excellence, continuous learning, and accountability.

Build a high-performing team culture aligned to business outcomes and platform stability.

Collaborate with cross-functional teams, including application developers, DevOps, cloud infrastructure, and security, to ensure reliable and observable service delivery.

Partner with architecture and engineering teams to ensure new systems are designed with reliability in mind.

Provide regular reporting and insights on system health, incidents, and reliability trends to leadership.

Use telemetry data to identify system bottlenecks, recurring issues, and areas for proactive improvement.

Manage observability and reliability tool vendors, including evaluation, contracts, renewals, and integrations.

EDUCATION AND EXPERIENCE QUALIFICATIONS

Bachelor’s degree in Computer Science, Information Technology, Engineering, or a related field.

Master’s degree is a plus.

5 – 7 years of experience managing and leading engineering teams.

7+ years in IT operations, DevOps, or software/platform engineering.

3+ years in a leadership role focused on observability or reliability engineering.

KNOWLEDGE, SKILLS, AND ABILITIES

Technical Expertise

Hands-on experience with observability platforms (e.g., Splunk, Prometheus, Grafana, ELK, New Relic, Dynatrace).

Deep understanding of logging, monitoring, alerting, and tracing technologies.

Strong knowledge of public cloud (Azure, AWS, or GCP) and container platforms (e.g., Kubernetes).

Familiarity with infrastructure as code (e.g., Terraform, Ansible).

Reliability Engineering Competency

Experience implementing and supporting highly available, scalable systems.

Understanding of SLOs, SLIs, incident lifecycle, and post-incident analysis.

Ability to embed reliability practices into SDLC and CI/CD workflows.

Leadership & Communication

Demonstrated ability to build, grow, and lead high-performing teams.

Strong analytical, communication, and cross-functional collaboration skills.
