Empyrean

IT Monitoring & Observability Engineer/Architect

Empyrean, Minneapolis, Minnesota, United States, 55400

The Staff Observability Engineer serves as the senior technical authority responsible for designing, implementing, and owning Empyrean’s observability strategy across all product teams, with direct accountability for the Environmental Health platform.

This strategic individual contributor role integrates architecture, platform engineering, and organizational enablement to ensure every Empyrean system is measurable, reliable, and actionable. The Staff Observability Engineer will architect scalable telemetry systems, guide engineering teams, and foster a culture of metric ownership—empowering developers and service owners to proactively monitor, detect, and resolve issues before they impact end users.

This role directly impacts Empyrean’s ability to deliver reliable outcomes in high-stakes environments such as public health and environmental monitoring, ensuring trust, transparency, and resilience in every system.

Essential Duties And Responsibilities

Architectural & Platform Ownership

Own the observability platform end-to-end—architecture, design, implementation, and continuous improvement.

Define unified telemetry pipelines for metrics, logs, and traces across distributed cloud and edge systems.

Build and operate platforms using Grafana, Prometheus, Loki, Tempo, and OpenTelemetry.

Standardize data collection and retention across applications, infrastructure, and sensor ecosystems.

Collaborate with enterprise architecture, data, and security teams to ensure observability aligns with governance and compliance standards.

Detect symptoms before outages by continuously refining monitoring and alerting systems.

Roadmap & Strategic Leadership

Set and maintain the observability roadmap across all Empyrean product teams in partnership with engineering and product leadership.

Partner with executives to align observability metrics with business reliability goals and environmental outcomes.

Establish organizational success metrics for observability adoption and maturity.

Evaluate and integrate observability tools, balancing open-source flexibility with managed solutions.

Enablement & Process Design

Lead strategic participation in incident response and postmortems—focusing on enablement through process design, not tactical firefighting.

Mentor and coach teams to create actionable SLIs/SLOs, define trace schemas, and embed observability into development lifecycles.

Facilitate adoption of observability-as-code using Terraform, Helm, and CI/CD pipelines.

Lead blameless RCAs and drive long-term corrective actions; ensure outcomes are documented and measured.

Reliability & Continuous Improvement

Improve alert quality through signal-to-noise reduction and contextual alert design.

Partner with development and support teams to improve MTTR, MTTD, and overall reliability posture.

Translate observability goals into actionable tasks to maintain transparency and accountability.

Monitor and evaluate emerging trends in AIOps, anomaly detection, and telemetry analytics.

Documentation & Knowledge Sharing

Develop and maintain high-quality documentation, playbooks, and runbooks for observability standards and best practices.

Promote an internal culture of continuous learning by leading workshops and publishing internal guides and RCAs.

Influence & Mentorship

Act as a technical mentor to senior engineers and peers, guiding design decisions and fostering a learning environment.

Champion cross-functional collaboration, ensuring observability informs decisions across development, product, and infrastructure.

Demonstrate emotional intelligence, self-awareness, and the ability to de-escalate conflict while aligning stakeholders around shared goals.

Required Skills And Abilities

Deep expertise in distributed systems monitoring and observability frameworks.

Proficiency with Infrastructure as Code tools (Terraform, Ansible, Helm).

Strong written and verbal communication; effective at influencing cross-functional teams.

Demonstrated leadership in designing scalable observability platforms and organizational frameworks.

Balance of strategic vision and hands-on execution; can move from design to delivery confidently.

Deep understanding of incident management, root cause analysis, and performance optimization.

Knowledge, Experience, and/or Education Requirements

10+ years of experience in systems engineering, SRE, or DevOps roles.

3+ years in an observability-focused leadership or platform ownership role.

Proven experience building and scaling observability systems in AWS, Azure, or GCP.

Hands-on expertise with Prometheus, Grafana, OpenTelemetry, Loki, Tempo, or equivalent stacks.

Experience implementing SLIs/SLOs and leveraging telemetry to improve reliability and decision-making.

Demonstrated ability to guide post-incident learning and institutionalize observability practices.

Preferred

Background in Environmental Health, IoT, or sensor-based systems monitoring.

Experience applying AIOps or ML to telemetry analytics and proactive detection.

Familiarity with HIPAA, SOC 2, ISO 27001, or similar compliance frameworks.

Proven ability to influence observability strategy across multiple engineering organizations.

Advanced knowledge of Kubernetes operations and telemetry instrumentation.

Compensation Range: $134,400 to $201,600 (midpoint $168,000)

Bonus: 15%, with eligibility after one year of employment

Disclaimer: This job description is not intended to be an exhaustive list of all duties, responsibilities, or qualifications associated with the job. Management reserves the right to modify or reassign job duties as business needs evolve.

Empyrean is an Equal Opportunity Employer, including disability and veterans.
