Voltage Park
Manager of Infrastructure Engineering (Observability)
Voltage Park, Seattle, Washington, US, 98127
Voltage Park is your enterprise AI factory. We offer scalable compute power and on-demand and reserved bare-metal AI infrastructure using NVIDIA GPUs, with world-class service, performance, and value. Founded with the mission of making AI computing accessible to all, we provide flexible, affordable GPU solutions that power everyone from builders to enterprises.
Voltage Park is looking for a Manager of Infrastructure Engineering for our Infrastructure Engineering team. Our team is responsible for building automation, tooling, and API-driven systems to bridge the gap between our physical infrastructure and the systems that our customers depend on for AI/ML training, inference, and HPC workloads at scale.
In this role, you’ll design and implement systems that enable humans and software to interact programmatically with thousands of bare-metal servers, storage clusters, and high-performance networks. You will work closely with teams across Voltage Park to drive new infrastructure rollouts and improve the lifecycle management of existing resources. Observability is not a nice-to-have—it is foundational to how we operate safely, efficiently, and at scale.
Qualifications
7+ years in infrastructure engineering, SRE, or platform roles
2+ years managing technical teams
Deep experience designing and operating observability systems at scale
Strong background in Linux, distributed systems, and production operations
Comfort operating in environments with hardware dependencies and tight SLAs
Strong Technical Background
Metrics systems (Prometheus, VictoriaMetrics)
Logging systems (ELK, OpenSearch)
Distributed tracing (OpenTelemetry, Jaeger, Tempo)
Kubernetes observability (nodes, clusters, workloads, control plane)
Alerting strategy, SLOs, SLIs, and error budgets
High-cardinality, high-volume telemetry tradeoffs
Nice to Have
Experience in GPU, HPC, or AI infrastructure environments
Familiarity with hardware, power, thermal, or network telemetry
Experience with bare-metal provisioning and monitoring
Multi-datacenter or edge-style infrastructure experience
Prior SRE leadership experience
What You’ll Do
Technical Ownership & Strategy
Own Voltage Park’s observability strategy across infrastructure and platform layers
Define standards for metrics, logs, traces, alerts, dashboards, and SLOs
Drive architecture decisions for telemetry pipelines, storage, and retention
Balance signal quality, system performance, and cost at scale
Team Leadership
Build, manage, and mentor a team of infrastructure engineers focused on observability
Set clear technical direction, priorities, and expectations
Review designs, guide implementation, and raise the bar on operational rigor
Partner closely with SRE, platform, networking, storage, and hardware teams
Platform Engineering
Design and operate high-throughput observability pipelines (metrics, logs, traces)
Ensure observability platforms are reliable, scalable, and resilient
Improve alert quality and reduce noise across production systems
Enable self-service observability for internal engineering teams
Reliability & Operations
Participate in and lead infrastructure incident response
Use observability data to drive root-cause analysis and systemic improvements
Build feedback loops from incidents into better tooling, alerts, and runbooks
Help establish a culture of measurement-driven reliability
Voltage Park is an equal opportunity employer and makes employment decisions on the basis of merit. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, protected veteran status, or any other characteristic under federal, state, or local law. If you require an accommodation during the job application process, please notify your recruiter.