Luma AI

HPC Engineer - Research Infrastructure

Luma AI, Palo Alto, California, United States, 94306


About Luma AI

Luma's mission is to build multimodal AI by pushing the boundaries of what is possible with large-scale supercomputing. We are building some of the biggest and fastest AI clusters in the world, and this role is at the very heart of that effort. This requires a deep, first-principles understanding of how hardware and software intersect to unlock maximum performance.

Where You Come In

This is a rare, foundational role for a hybrid SRE/HPC engineer with elite, low-level expertise in GPUs, high-performance networking, and Linux. You will be responsible for the absolute performance and stability of our massive GPU supercomputing infrastructure. This role demands the ability to design, debug, and optimize at every level of the stack. You will manage our training clusters from provisioning to performance tuning, ensuring our researchers have the most powerful and efficient platform possible.

What You'll Do

Architect & Optimize Supercomputers: Design, build, and tune systems that combine CPUs, GPUs (NVIDIA and AMD), and high-performance networking into world-class clusters.

Master Low-Level Performance: Dive deep into the Linux OS, device drivers, and user-space code to optimize performance at every level of the stack.

Debug Complex Hardware/Software Failures: Serve as the final escalation point for the most challenging GPU, networking (InfiniBand/RDMA), and system-level issues, often collaborating directly with hardware vendors like NVIDIA.

Manage HPC Schedulers: Architect and manage modern HPC job management frameworks like Kubernetes, designing queue and partition setups that maximize throughput and utilization for mixed research workloads.

Build Automation for Scale: Write code to automate the monitoring, diagnostics, and healing of thousands of servers, enabling a massive infrastructure footprint with a small, elite team.

Who You Are

8+ years of experience as an Infrastructure, DevOps, or HPC engineer working on large, complex distributed systems.

You have deep, hands-on experience managing and troubleshooting large GPU clusters, from provisioning to monitoring.

You are an expert in high-performance networking, with practical experience in InfiniBand, RDMA, or RoCE.

You possess extensive knowledge of Linux systems, including performance tuning, debugging, and configuration.

You have a deep understanding of modern HPC job management systems based on Kubernetes, and are familiar with workflow orchestration frameworks like Ray or Flyte.

You have architected, built, and maintained large-scale Kubernetes clusters from first principles, including managing the control plane and node components in a production environment.

You are an independently driven, tenacious problem-solver who can own issues end to end.

What Sets You Apart (Bonus Points)

Experience at national labs, research universities, or companies known for their large-scale, on-prem supercomputing infrastructure, with a focus on containerized applications.

Deep expertise with GPU tooling for NVIDIA and AMD GPUs, such as DCGM or ROCm.