NVIDIA

GPU and HPC Infrastructure Engineer - New College Grad 2025

NVIDIA, Santa Clara, California, US, 95053

What You Will Be Doing

We have built a comprehensive platform that automates GPU asset provisioning, configuration, and lifecycle management across cloud providers. You'll contribute to this platform to build end-to-end automation of datacenter operations, break/fix, and lifecycle management for large-scale Machine Learning systems.

Implement monitoring and health management capabilities that enable reliability, availability, and scalability of GPU assets by harnessing multiple data streams, including GPU hardware diagnostics, cluster telemetry, and network telemetry.

Work on software that manages NVLink topology across GPU clusters.

Build automated test infrastructure to qualify distributed systems for operation.

Collaborate with engineering teams across NVIDIA to ensure software integrates across the full stack, from hardware to AI training applications.

Continuously innovate, identify new problems, and develop solutions.

What We Need To See

Pursuing or recently completed a BS or MS in Computer Science, Engineering, Physics, Mathematics, or a related field, or equivalent experience.

Software engineering experience on large-scale production systems.

Experience working with multi-functional teams and coordinating across organizational boundaries and geographies.

Strong knowledge of a systems programming language (Go, Python) and understanding of data structures and algorithms.

High-level knowledge of Linux system administration and management.

Understanding of cluster management systems (Kubernetes, SLURM).

Understanding of performance, security, and reliability in complex distributed systems, including data synchronization and fault tolerance.

Ways To Stand Out From The Crowd

Proficiency in architecting and managing large-scale distributed systems, independent of cloud providers, with deep knowledge of datacenter operations and GPU hardware. Hands-on experience with RDMA networking.

Advanced hands-on experience with cluster management systems (Kubernetes, SLURM). Hands-on experience in Machine Learning Operations and with Bright Cluster Manager.

Hands-on experience developing and/or operating hardware fleet management systems and a track record of operational excellence in AI infrastructure.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 104,000 USD - 172,500 USD for Level 1, and 120,000 USD - 189,750 USD for Level 2.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until October 5, 2025.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

JR2002464

Seniority level

Entry level

Employment type

Full-time

Job function

Industries: Computer Hardware Manufacturing, Software Development, Computers and Electronics Manufacturing
