NVIDIA
Senior AI Infrastructure Engineer - DGX Cloud
NVIDIA, California, Missouri, United States, 65018
Overview
NVIDIA is looking for an outstanding, passionate, and talented Senior AI Infrastructure Engineer to join our DGX Cloud group. In this role, you will design, build, and maintain large-scale production systems with high efficiency and availability, applying software and systems engineering practices. The role requires knowledge across systems, networking, coding, databases, capacity management, continuous delivery and deployment, and open-source cloud technologies such as Kubernetes and OpenStack. The DGX Cloud SRE team at NVIDIA ensures that internal and external-facing GPU cloud services run with maximum reliability and uptime while carefully planning changes and managing capacity and performance. NVIDIA values diversity, curiosity, problem solving, and openness, and fosters collaboration, risk-taking in a blame-free environment, self-direction to work on meaningful projects, and mentorship for learning and growth.
What You’ll Be Doing
Design, build, deploy, and run internal tooling for large-scale AI training and inference platforms built on top of cloud infrastructure
Conduct in-depth performance characterization and analysis on large multi-GPU and multi-node clusters
Engage in and improve the full lifecycle of services, from inception and design through deployment, operation, and refinement
Support services before they go live through activities such as system design consulting, developing software tools, platforms, and frameworks, capacity management, and launch reviews
Maintain services once they are live by measuring and monitoring availability, latency, and overall system health
Scale systems sustainably through automation, and evolve systems to improve reliability and velocity
Practice sustainable incident response and blameless postmortems
Be part of an on-call rotation to support production systems
What We Need To See
BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience
6+ years of experience
A track record of initiative, collaboration, and the ability to work with others on projects
Experience with infrastructure automation and distributed systems design for large-scale private or public cloud systems in production
Experience in one or more of Python, Go, C/C++, or Java
In-depth knowledge of Linux, networking, storage, and container technologies
Experience with public cloud, Infrastructure as Code (IaC), and Terraform
Distributed systems experience
Ways To Stand Out From The Crowd
Interest in crafting, analyzing, and fixing large-scale distributed systems
A systematic problem-solving approach with strong communication, ownership, and drive
Ability to debug and optimize code and to automate routine tasks; experience with large private and public cloud systems based on Kubernetes or Slurm
Salary and benefits information is provided to eligible applicants. Applications for this job will be accepted at least until September 24, 2025.
NVIDIA is committed to fostering a diverse work environment and is an equal opportunity employer. We do not discriminate on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status, or any other characteristic protected by law.