NVIDIA
GPU and HPC Infrastructure Engineer - New College Grad 2025
NVIDIA, Santa Clara, California, US, 95053
What You Will Be Doing
We have built a comprehensive platform that automates GPU asset provisioning, configuration, and lifecycle management across cloud providers. You'll contribute to this platform to build end-to-end automation of datacenter operations, break/fix, and lifecycle management for large-scale Machine Learning systems.
Implement monitoring and health management capabilities that enable reliability, availability, and scalability of GPU assets by harnessing multiple data streams, including GPU hardware diagnostics, cluster telemetry, and network telemetry.
Work on software that manages NVLink topology across GPU clusters.
Build automated test infrastructure to qualify distributed systems for operation.
Collaborate with engineering teams across NVIDIA to ensure software integrates from hardware to AI training applications.
Continuously innovate, identify new problems, and develop solutions.
What We Need To See
Pursuing or recently completed a BS or MS in Computer Science/Engineering/Physics/Mathematics or equivalent experience.
Software engineering experience on large-scale production systems.
Experience working with multi-functional teams and coordinating across organizational boundaries and geographies.
Strong knowledge of a systems programming language (Go, Python) and understanding of data structures and algorithms.
High-level knowledge of Linux system administration and management.
Understanding of cluster management systems (Kubernetes, SLURM).
Understanding of performance, security, and reliability in complex distributed systems, including data synchronization and fault tolerance.
Ways To Stand Out From The Crowd
Proficiency in architecting and managing large-scale distributed systems, independent of cloud providers, with deep knowledge of datacenter operations and GPU hardware. Hands-on experience with RDMA networking.
Advanced hands-on experience with cluster management systems (Kubernetes, SLURM). Hands-on experience in Machine Learning Operations and with Bright Cluster Manager.
Hands-on experience developing and/or operating hardware fleet management systems and a track record of operational excellence in AI infrastructure.
Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 104,000 USD - 172,500 USD for Level 1, and 120,000 USD - 189,750 USD for Level 2.
You will also be eligible for equity and benefits.
Applications for this job will be accepted at least until October 5, 2025.
NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
JR2002464
Seniority level: Entry level
Employment type: Full-time
Job function
Industries: Computer Hardware Manufacturing, Software Development, Computers and Electronics Manufacturing