NVIDIA

Senior DevOps Service Reliability Operations Engineer - DGX Cloud

NVIDIA, Santa Clara, California, US, 95053

NVIDIA's NGC team is looking for highly motivated System Administrator/DevOps engineers to design, develop, and implement a global, dynamic, innovative Service Reliability Operations Center that provides extraordinary levels of support for our Cloud products and services. As a key member of the CIS (Compute Infrastructure Support) team, you will partner with Site Reliability Engineering, the Security Operations Center, DevOps teams, and other partners to help our services achieve near 100% availability. On the rare occasion that an incident occurs, you will be our front line, working to reduce the frequency and duration of issues. Working in partnership with the development community, the CIS team will develop monitors, alarms, and alerts to make the service more reliable and improve our customer experience.

What you will be doing:

The team will provide its services 24/7 in a follow-the-sun model that spans continents. You will report directly to a manager in the United States.

Some CIS shifts require working either a Saturday or a Sunday each week. Hours may include an early or late start (a schedule of 10 hours per day, 4 days per week) to ensure that the US and India teams together provide 24/7 coverage.

Every CIS team member will use alerts and alarms to help prevent issues and incidents when possible. You may also work with the developer community to develop and implement predictive support or diagnostic routines.

Perform systems administration, network administration, and security incident monitoring tasks to drive our actions.

CIS team members will work with developers to learn how the service works, then translate that understanding into runbooks which the entire team will use. As new features and functionality are added, you will also update and evolve the runbooks as needed.

Help discover incidents and issues, including initiating the incident management procedure.

Bring in subject matter authorities or service owners as needed to resolve issues. Feedback will help us continually improve our service.

Your interpersonal skills will help keep the team engaged through resolution and ensure our clients believe we value their time and effort. May perform other tasks that will help us provide extraordinary service levels for our customers.

What we need to see:

You are highly motivated, with strong communication skills and the ability to work successfully with multi-functional teams, principals, and architects, coordinating effectively across organizational boundaries and geographies.

5+ years of experience administering large-scale production systems. 3+ years of experience in high-availability Internet, Cloud, or Data Center environments (Systems Administration, SRE, or NOC).

BS in Computer Science, Engineering, Physics, Mathematics, or equivalent experience.

Expert-level knowledge of Linux system administration and automation using Ansible and/or Python.

Strong experience with shell scripting, DNS, DHCP, storage systems, and core networking (iptables, routing, firewalls).

Experience with at least one workload manager (Slurm preferred) or job scheduling system in a production environment.

Strong experience troubleshooting and maintaining large-scale bare-metal infrastructure. Strong cross-team collaboration, documentation, and mentoring skills.

Experience improving processes for automation, reliability, and operational excellence.

Expertise using monitoring tools and problem ticketing systems. Strong problem-solving, analytical, and troubleshooting abilities.

Ways to Stand Out from the Crowd:

Advanced hands-on experience with Kubernetes, Slurm, and large-scale cluster management.

Familiarity with GPU hardware and high-performance computing environments.

Experience with observability and incident management tools (Grafana, OpenTelemetry, PagerDuty, Jira). Cloud experience (AWS, Azure, GCP) is a plus; strong preference for on-prem expertise.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 144,000 USD - 230,000 USD for Level 3, and 168,000 USD - 270,250 USD for Level 4.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until November 18, 2025.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
