Luma AI
Reliability Engineer | High-Performance AI
Luma AI, Palo Alto, California, United States, 94306
This range is provided by Luma AI. Your actual pay will be based on your skills and experience — talk with your recruiter to learn more.
Base pay range $170,000.00/yr - $360,000.00/yr
The Opportunity
At Luma AI, "full-stack" has a distinct meaning. It means understanding everything from the generative model down to the silicon it runs on. We are pushing the physical limits of current hardware to train Omni models that understand the world. This requires a level of engineering rigor that standard cloud environments simply do not demand. We are looking for engineers who are tired of high-level abstractions and want to work on the metal that powers the AI revolution.
Where You Come In
You will operate at the jagged edge where software meets hardware. Standard cloud providers abstract away the complexity; we embrace it. You will be responsible for maximizing the efficiency of our heterogeneous fleet of NVIDIA and AMD accelerators. This role is about precision, performance, and the relentless pursuit of system optimization in a multi-vendor supercomputing environment.
What You Will Build
The Bare Metal Stack: Manage and optimize the lifecycle of bare-metal servers, ensuring that our OS, drivers, and firmware are tuned for peak AI performance.
High-Throughput Interconnects: Engineer the software configurations for our InfiniBand and RoCE fabrics, solving the intricate data movement challenges that define modern distributed training.
Performance Diagnostics: Build the tooling to visualize what is happening inside the cluster, turning opaque hardware counters into actionable signals for debugging latency and throughput.
The Profile We Are Looking For
Low-Level Fluency: You are not afraid of the kernel. You understand interrupts, memory management, and how the OS interacts with peripheral devices.
Hardware Curiosity: You understand that software doesn't run in a vacuum. You are interested in the physical constraints of GPUs, networking cards, and storage subsystems.
First-Principles Reasoning: When a system behaves unexpectedly, you don't just restart it; you investigate the physics of the failure to ensure it is solved permanently.
Seniority level: Mid-Senior level
Employment type: Full-time
Job function: Engineering and Information Technology
Industries: Software Development