ZipRecruiter
Job Description
Infrastructure Engineer
Division: DATUM, Impac Exploration Services
Location: Oklahoma City (OK), Houston (TX)
Type: Full-Time
Build Infrastructure for the Next Generation of Industrial AI
We're looking for an infrastructure engineer who gets excited about making AI work in the real world—not just in pristine data centers.
You'll architect and build infrastructure that bridges the gap between cutting-edge ML research and production deployments. This isn't your typical DevOps role—you'll be creating novel architectures and solving challenges that sit at the intersection of high-performance computing, distributed systems, and industrial operations.
The Real Environment
You'll be designing and building from first principles, iterating rapidly based on what our researchers need and what reality demands. If you thrive when given a complex problem and the freedom to solve it your way, you'll love this.
We move fast. Ship fast. Learn fast. Your architecture sketch from Monday might be in production by Friday.
What You'll Own
Novel infrastructure architectures that don't exist elsewhere
Systems design from whiteboard to production deployment
Platform decisions that shape how we scale
Infrastructure that makes our data scientists dangerously productive
The technical foundation for AI that works where others can't
Building the playbook others will eventually copy
Technical Stack & Expertise
Hardware/Compute:
NVIDIA GPUs (A100, H100, A6000) and their quirks
GPU interconnects (NVLink, InfiniBand)
Server platforms (Dell PowerEdge, HPE Apollo, Supermicro)
Understanding of CUDA, memory hierarchies, and GPU optimization
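The "memory hierarchies and GPU optimization" point above typically comes down to roofline-style reasoning: comparing a kernel's arithmetic intensity against the ratio of a GPU's peak compute to its memory bandwidth. A minimal sketch, using approximate published A100 spec figures (the exact numbers vary by SKU and clocks, so treat them as assumptions):

```python
# Roofline-style classification of a GPU kernel as compute- or memory-bound.
# Peak figures below are approximate public specs for an NVIDIA A100
# (FP16 tensor-core throughput, HBM2e bandwidth) and are assumptions here.
PEAK_FLOPS = 312e12      # ~312 TFLOP/s FP16 with tensor cores
PEAK_BW = 2.0e12         # ~2.0 TB/s HBM2e bandwidth

def classify_kernel(flops: float, bytes_moved: float) -> str:
    """Return 'memory-bound' or 'compute-bound' for a kernel performing
    `flops` floating-point operations while moving `bytes_moved` bytes."""
    intensity = flops / bytes_moved    # FLOPs per byte
    ridge = PEAK_FLOPS / PEAK_BW       # ~156 FLOP/byte for these specs
    return "memory-bound" if intensity < ridge else "compute-bound"

# Element-wise fp32 add: ~1 FLOP per 12 bytes (two reads + one write)
print(classify_kernel(flops=1e9, bytes_moved=12e9))    # memory-bound
# Large dense matmul: intensity grows with matrix size
print(classify_kernel(flops=1e15, bytes_moved=1e12))   # compute-bound
```

The takeaway is that most element-wise GPU work is bandwidth-limited, which is why memory hierarchy knowledge matters as much as raw FLOPS.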
Orchestration & Containers:
Kubernetes in anger (not just tutorials)
Container runtimes (Docker, containerd, CRI-O)
Service mesh (Istio, Linkerd)
Helm, Kustomize, or similar for deployment management
Infrastructure & Networking:
Terraform, Ansible, or Pulumi for IaC
BGP, VXLAN, and software-defined networking
Load balancing at layers 4 and 7
Storage solutions (Ceph, MinIO, NetApp)
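On the load-balancing point, the layer-4/layer-7 distinction can be shown with a toy sketch: an L4 balancer picks a backend from connection-level state alone, while an L7 balancer can route on request content such as an HTTP path. A dependency-free illustration (backend addresses and pool names are invented):

```python
import itertools

# Layer-4 style: round-robin over backends, blind to request content.
backends = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]
l4_picker = itertools.cycle(backends)

def pick_l4() -> str:
    """Select the next backend in rotation (connection-level decision)."""
    return next(l4_picker)

# Layer-7 style: route on request content (here, an HTTP path prefix).
routes = {"/inference": "triton-pool", "/metrics": "monitoring-pool"}

def pick_l7(path: str, default: str = "web-pool") -> str:
    """Select a backend pool based on the request path (application-level)."""
    for prefix, pool in routes.items():
        if path.startswith(prefix):
            return pool
    return default

print(pick_l4())                     # one of the three backends, in rotation
print(pick_l7("/inference/resnet"))  # triton-pool
```

Real balancers (HAProxy, Envoy, kube-proxy) add health checks, connection draining, and weighting on top of this core split.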
ML Infrastructure:
Kubeflow, MLflow, or similar ML platforms
GPU scheduling (NVIDIA GPU Operator, MIG)
Distributed training frameworks
Model serving infrastructure (Triton, TorchServe)
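GPU scheduling (whether through the NVIDIA GPU Operator, MIG partitions, or a custom layer) ultimately reduces to placing jobs onto nodes with enough free devices. A deliberately simplified first-fit sketch of that decision, with invented node and job names, to show the shape of the problem:

```python
# First-fit GPU placement: assign each job to the first node with enough
# free GPUs. Real schedulers (Kubernetes + GPU Operator, Slurm, etc.) layer
# topology awareness, preemption, and MIG slicing on top of this core idea.
def place_jobs(free_gpus: dict[str, int],
               jobs: list[tuple[str, int]]) -> dict[str, str]:
    """free_gpus maps node -> free GPU count; jobs are (name, gpus_needed).
    Returns name -> node, or 'unschedulable' when nothing fits."""
    placement = {}
    for job, need in jobs:
        for node, free in free_gpus.items():
            if free >= need:
                free_gpus[node] = free - need
                placement[job] = node
                break
        else:
            placement[job] = "unschedulable"
    return placement

nodes = {"node-a": 8, "node-b": 4}   # hypothetical two-node cluster
jobs = [("train-llm", 8), ("serve-triton", 2), ("etl", 4)]
print(place_jobs(nodes, jobs))
# {'train-llm': 'node-a', 'serve-triton': 'node-b', 'etl': 'unschedulable'}
```

The "etl" job stranding on 2 leftover GPUs is exactly the fragmentation problem that MIG and gang scheduling exist to address.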
You're Our Person If
You see undefined requirements as creative freedom
You've built infrastructure without Stack Overflow because no one's solved it yet
"It's never been done" sounds like a challenge, not a warning
You can move from architecture diagrams to kubectl commands
Complex distributed systems are your canvas
You can explain your choices without defaulting to "best practices"
Especially If
You've built GPU clusters that actually stayed up
You've created systems that surprised even you with what they could do
You understand when to build vs. buy vs. fork
You've made infrastructure decisions with incomplete information—and been right
You can prototype in the morning and production-harden in the afternoon
You've worked where "good enough" isn't
The Opportunity
This is a chance to build without bureaucracy. You'll:
Define architectures that become the standard for industrial AI
Work directly with ML researchers who push your systems to their limits
Make decisions that would take months of committees elsewhere
Build infrastructure that enables entirely new capabilities
Create systems that work where cloud providers fear to tread
Why This Hits Different
No legacy systems to maintain or migrate
Budget to build right, not just cheap
Direct line from your ideas to production
Team that understands infrastructure enables everything else
Problems that haven't been solved before
Freedom to define how industrial AI infrastructure should work
Ready?
Show us infrastructure you've built that others said was impossible. Tell us about a time you threw out the playbook and built something better. Share your thoughts on where ML infrastructure is heading.
We're looking for builders who see constraints as design inspiration, not limitations.
We are not currently sponsoring visas or participating in CPT programs.