ZipRecruiter

Infrastructure Engineer

ZipRecruiter, Houston, Texas, United States, 77246

Job Description

Infrastructure Engineer

Division: DATUM, Impac Exploration Services
Location: Oklahoma City (OK), Houston (TX)
Type: Full-Time

Build Infrastructure for the Next of Industrial AI

We're looking for an infrastructure engineer who gets excited about making AI work in the real world, not just in pristine data centers. You'll architect and build infrastructure that bridges the gap between cutting-edge ML research and production deployments. This isn't your typical DevOps role: you'll be creating novel architectures and solving challenges that sit at the intersection of high-performance computing, distributed systems, and industrial operations.

The Real Environment

You'll be designing and building from first principles, iterating rapidly based on what our researchers need and what reality demands. If you thrive when given a complex problem and the freedom to solve it your way, you'll love this. We move fast. Ship fast. Learn fast. Your architecture sketch from Monday might be in production by Friday.

What You'll Own

- Novel infrastructure architectures that don't exist elsewhere
- Systems design from whiteboard to production deployment
- Platform decisions that shape how we scale
- Infrastructure that makes our data scientists dangerously productive
- The technical foundation for AI that works where others can't
- Building the playbook others will eventually copy

Technical Stack & Expertise

Hardware/Compute:
- NVIDIA GPUs (A100, H100, A6000) and their quirks
- GPU interconnects (NVLink, InfiniBand)
- Server platforms (Dell PowerEdge, HPE Apollo, Supermicro)
- Understanding of CUDA, memory hierarchies, and GPU optimization

Orchestration & Containers:
- Kubernetes in anger (not just tutorials)
- Container runtimes (Docker, containerd, CRI-O)
- Service mesh (Istio, Linkerd)
- Helm, Kustomize, or similar for deployment management

Infrastructure & Networking:
- Terraform, Ansible, or Pulumi for IaC
- BGP, VXLAN, and software-defined networking
- Load balancing at layer 4 and 7
- Storage solutions (Ceph, MinIO, NetApp)

ML Infrastructure:
- Kubeflow, MLflow, or similar ML platforms
- GPU scheduling (NVIDIA GPU Operator, MIG)
- Distributed training frameworks
- Model serving infrastructure (Triton, TorchServe)

You're Our Person If

- You see undefined requirements as creative freedom
- You've built infrastructure without Stack Overflow because no one's solved it yet
- "It's never been done" sounds like a challenge, not a warning
- You can move from architecture diagrams to kubectl commands
- Complex distributed systems are your canvas
- You can explain your choices without defaulting to "best practices"

Especially If

- You've built GPU clusters that actually stayed up
- You've created systems that surprised even you with what they could do
- You understand when to build vs. buy vs. fork
- You've made infrastructure decisions with incomplete information, and been right
- You can prototype in the morning and production-harden in the afternoon
- You've worked where "good enough" isn't

The Opportunity

This is a chance to build without bureaucracy. You'll:
- Define architectures that become the standard for industrial AI
- Work directly with ML researchers who push your systems to their limits
- Make decisions that would take months of committees elsewhere
- Build infrastructure that enables entirely new capabilities
- Create systems that work where cloud providers fear to tread

Why This Hits Different

- No legacy systems to maintain or migrate
- Budget to build right, not just cheap
- Direct line from your ideas to production
- Team that understands infrastructure enables everything else
- Problems that haven't been solved before
- Freedom to define how industrial AI infrastructure should work

Ready?

Show us infrastructure you've built that others said was impossible. Tell us about a time you threw out the playbook and built something better. Share your thoughts on where ML infrastructure is heading. We're looking for builders who see constraints as design inspiration, not limitations.

We are not currently sponsoring visas or participating in CPT programs.