NVIDIA

Manager, Engineering - Data Center Management

NVIDIA, Santa Clara

NVIDIA’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI, the next era of computing, with the GPU acting as the brain of computers, robots, and self-driving cars that can perceive and understand the world. Today, we are increasingly known as “the AI computing company.” We're looking to grow our company and establish teams with the most thoughtful people in the world.

The NVIDIA GH200 superchip provides the performance and productivity required for strong scaling of HPC and generative AI workloads. Scale-out is inherent to the design of this massive superchip. We are seeking expert engineers to help design rack-level solutions for next-generation scalable AI supercomputing platforms.

We are looking for a strong technical architect to own end-to-end manageability architecture for these products in data centers. You will collaborate with various component leads internally and externally, drive customer use cases, align architecture with customer requirements, and release top-quality products to market. Join us at the forefront of technological advancement.

What you’ll be doing:

Drive server management for large clusters and data centers deploying GPUs and Grace solutions from NVIDIA.
Collaborate with data center architects and cloud customers to refine requirements for implementation to ensure rapid product development.
Work closely with hardware teams to define low-level requirements and architecture for all data center management products.
Own and deliver firmware for low-level management components; manage teams to ensure high-quality firmware delivery.
Coordinate with internal teams to ensure requirements are properly designed and implemented across firmware and software modules. Collaborate on designing and building data center health management workflows.
Enhance reliability and optimize firmware architecture from a data center perspective. Work with cluster bring-up teams to resolve issues swiftly. Ensure firmware quality, reliability, and telemetry performance in data centers.

What we need to see:

10+ years of relevant experience in server firmware (BMC) and platform software development.
BS, MS, or PhD in EE/CS or related fields, or equivalent experience.
Hands-on experience with data center health management workflows and a proven record of delivering server firmware for large data centers.
Strong knowledge of data center management, server architecture, and manageability.
At least 4 years of experience managing engineering teams.
Proficiency in C/C++ and Python, with experience in programming and debugging server platforms.
Experience with SCM tools (e.g., Git, Perforce) and project management tools such as Jira.
Excellent communication skills, a strong work ethic, teamwork, and dedication to quality work.
Self-starter with a passion for solving complex problems and hands-on coding experience.

Ways to stand out from the crowd:

Experience with data center health management and server manageability.
Proven leadership in driving large-scale projects with teams of 25+ engineers.

NVIDIA is widely regarded as one of the most desirable employers in the tech industry. We have some of the most innovative and dedicated people working with us. If you're creative and autonomous, we want to hear from you!

The base salary range is $224,000 - $425,500, determined by location, experience, and market standards. You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and is proud to be an equal opportunity employer. We do not discriminate based on race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability, or any other characteristic protected by law.
