NVIDIA
Overview
NVIDIA is the world leader in GPU computing. We are passionate about markets including gaming, automotive, vision, HPC, datacenters, and networking, in addition to our traditional OEM business. NVIDIA is also positioned as the ‘AI Computing Company’, with GPUs powering deep learning software frameworks, analytics, data centers, and autonomous vehicles. We seek dedicated, forward-thinking, and hardworking technical professionals across countries who are excited by this opportunity. This role is for an individual who thrives in a diverse work environment, has outstanding interpersonal skills, and demonstrates engagement and continuous process improvement. To join our platform SWQA team, the candidate must have strong Linux experience; experience in enterprise server integration, reliability testing with various telemetries, scale-out clusters, test plan development, and DevOps/CI/CD; and a track record in developing AI tools and NLP.
What you’ll be doing
Develop and execute NVIDIA HGX/DGX/MGX platform test plans covering servers, OS, firmware, and the CUDA SW stack, based on design documents.
Install and test various systems: OS, server firmware, and SW stack.
Drive root cause analysis of reliability and validation test failures and see them through to mitigation.
Build, develop, and debug server and OS level automation front-end and back-end framework and tests.
Review partner and supplier test results and prescribe additional reliability testing on components, servers, and packaging as needed.
Work in an agile software development team with very high production quality standards.
Manage the bug lifecycle and collaborate across groups to drive solutions.
What we need to see
Bachelor’s Degree (or equivalent experience) in a STEM field (Science, Technology, Engineering, Math or Physics)
5+ years of proven experience, or a master’s degree
Proven experience in OS and server-level automation, CI/CD processes, and DevOps using Python, shell, Ansible, Jenkins, C/C++, Java, and JavaScript
Strong server and Linux troubleshooting and debugging experience in bare-metal and KVM/VMware/Hyper-V environments
Good knowledge and hands-on experience in model testing, AI tools/frameworks (TensorFlow, PyTorch, etc.), NLP and LLM benchmarking
Experience in using AI development tools for test plan creation, test case development, and test automation
Strong experience in firmware, BMC/OpenBMC, network protocols, enterprise storage devices, PCIe buses and devices, IO sub-devices, CPU and memory, ACPI, the UEFI spec, and Redfish; experience in related areas is a plus
Proven experience with GitHub/GitLab/Gerrit, PXE, SLURM, Stack/Kubernetes/Docker; additional experience is a plus
Ways to stand out from the crowd
AI-related tools, LLM and NLP experience
Experience working with NVIDIA GPU hardware is a strong plus
Strong understanding of virtualization in Linux (KVM, Docker, Kubernetes)
Background in parallel programming (CUDA/OpenCL) is a plus
With competitive salaries and a generous benefits package, we are widely considered to be one of the technology world’s most desirable employers. We have forward-thinking and hardworking people, and due to unprecedented growth, our exclusive engineering teams are rapidly expanding. If you are a creative and autonomous engineer with a real passion for technology, we want to hear from you.
Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 136,000 USD - 212,750 USD for Level 3, and 168,000 USD - 264,500 USD for Level 4.
You will also be eligible for equity and benefits.
Applications for this job will be accepted at least until September 28, 2025. NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. We value diversity in our current and future employees and do not discriminate on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status, or any other characteristic protected by law.