NVIDIA

Senior Site Reliability Engineer - DGX Cloud

NVIDIA, Santa Clara, California, US, 95053

Site Reliability Engineering (SRE) at NVIDIA is an engineering discipline focused on designing, building, and maintaining large-scale production systems with high efficiency and availability. This involves a combination of software and systems engineering practices. The discipline demands knowledge across various systems, networking, coding, databases, capacity management, continuous delivery, deployment, and open-source cloud technologies like Kubernetes and OpenStack.

SRE at NVIDIA ensures that our GPU cloud services, both internal and external, operate with maximum reliability and uptime. We enable developers to make changes through careful planning while monitoring capacity, latency, and performance. SRE also embodies a mindset and engineering approach to optimize production systems, emphasizing automation, performance tuning, and efficiency.

Our culture values diversity, curiosity, problem-solving, and openness. We foster collaboration, big thinking, risk-taking, and self-direction, supported by mentorship and growth opportunities.

What you'll be doing:

Design, implement, and support the operational and reliability aspects of large-scale Kubernetes clusters, focusing on performance, real-time monitoring, logging, and alerting.
Engage in and improve the entire service lifecycle, from design and deployment to operation and refinement.
Support services before launch through system design consulting, tool development, capacity management, and review processes.
Maintain live services by monitoring availability, latency, and system health.
Scale systems sustainably via automation and drive improvements in reliability and velocity.
Practice sustainable incident response and conduct blameless postmortems.
Participate in on-call rotations to support production systems.

What we need to see:

BS degree in Computer Science or a related technical field involving coding, or equivalent experience.
5+ years of relevant experience.
Experience with infrastructure automation, distributed systems design, and developing tools for large-scale cloud systems.
Proficiency in one or more of the following: Python, Go, Perl, or Ruby.
In-depth knowledge of Linux, networking, and containers.

Ways to stand out from the crowd:

Interest in large-scale distributed systems analysis and troubleshooting.
Strong problem-solving and communication skills, ownership, and initiative.
Ability to debug, optimize, and automate tasks effectively.
Experience operating large private/public cloud systems based on Kubernetes, OpenStack, and Docker.

NVIDIA is considered one of the most desirable employers in the tech industry, with forward-thinking and dedicated professionals. If you're creative, autonomous, and love challenges, we want to hear from you.

The base salary range is $144,000 - $270,250, determined by location, experience, and current market rates. You may also be eligible for equity and benefits.

NVIDIA accepts applications on an ongoing basis. We are committed to diversity and equal opportunity, welcoming applicants regardless of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, disability, or other protected characteristics.
