Voltage Park

Manager of Infrastructure Engineering (Observability)

Voltage Park, Seattle, Washington, US, 98127

Voltage Park is your enterprise AI factory. We offer scalable compute power, including on-demand and reserved bare-metal AI infrastructure built on NVIDIA GPUs, with world-class service, performance, and value. Founded with the mission of making AI computing accessible to all, we provide flexible, affordable GPU solutions that power everyone from builders to enterprises.

Voltage Park is looking for a Manager of Infrastructure Engineering to lead observability within our Infrastructure Engineering team. Our team is responsible for building automation, tooling, and API-driven systems that bridge the gap between our physical infrastructure and the systems our customers depend on for AI/ML training, inference, and HPC workloads at scale.

In this role, you’ll design and implement systems that enable humans and software to interact programmatically with thousands of bare-metal servers, storage clusters, and high-performance networks. You will work closely with teams across Voltage Park to drive new infrastructure rollouts and improve the lifecycle management of existing resources. Observability is not a nice-to-have—it is foundational to how we operate safely, efficiently, and at scale.

Qualifications

7+ years in infrastructure engineering, SRE, or platform roles

2+ years managing technical teams

Deep experience designing and operating observability systems at scale

Strong background in Linux, distributed systems, and production operations

Comfort operating in environments with hardware dependencies and tight SLAs

Strong Technical Background

Metrics systems (Prometheus, VictoriaMetrics)

Logging systems (ELK, OpenSearch)

Distributed tracing (OpenTelemetry, Jaeger, Tempo)

Kubernetes observability (nodes, clusters, workloads, control plane)

Alerting strategy, SLOs, SLIs, and error budgets

High-cardinality, high-volume telemetry tradeoffs

Nice to Have

Experience in GPU, HPC, or AI infrastructure environments

Familiarity with hardware, power, thermal, or network telemetry

Experience with bare-metal provisioning and monitoring

Multi-datacenter or edge-style infrastructure experience

Prior SRE leadership experience

What You’ll Do

Technical Ownership & Strategy

Own Voltage Park’s observability strategy across infrastructure and platform layers

Define standards for metrics, logs, traces, alerts, dashboards, and SLOs

Drive architecture decisions for telemetry pipelines, storage, and retention

Balance signal quality, system performance, and cost at scale

Team Leadership

Build, manage, and mentor a team of infrastructure engineers focused on observability

Set clear technical direction, priorities, and expectations

Review designs, guide implementation, and raise the bar on operational rigor

Partner closely with SRE, platform, networking, storage, and hardware teams

Platform Engineering

Design and operate high-throughput observability pipelines (metrics, logs, traces)

Ensure observability platforms are reliable, scalable, and resilient

Improve alert quality and reduce noise across production systems

Enable self-service observability for internal engineering teams

Reliability & Operations

Lead and participate in infrastructure incident response

Use observability data to drive root-cause analysis and systemic improvements

Build feedback loops from incidents into better tooling, alerts, and runbooks

Help establish a culture of measurement-driven reliability

Voltage Park is an equal opportunity employer and makes employment decisions on the basis of merit. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, protected veteran status, or any other characteristic protected under federal, state, or local law. If you require an accommodation during the job application process, please notify your recruiter.
