NVIDIA
Principal Architect, Site Reliability Engineering - GeForce Now
NVIDIA, California, Missouri, United States, 65018
Overview
NVIDIA is the world leader in accelerated computing, tackling challenges across gaming, data centers, AI, and robotics. We seek an expert and transformative Principal Architect for Site Reliability Engineering (SRE) to define the architecture and strategic direction for NVIDIA’s highly available, scalable, and secure systems powering critical services and platforms. You will collaborate with product, platform, and infrastructure teams to establish best practices, improve reliability, and drive the evolution of our SRE function. This role requires knowledge across systems, networking, coding, databases, capacity management, continuous delivery and deployment, and open source cloud technologies such as Kubernetes.

What You Will Be Doing
Design and architect scalable, resilient infrastructure for cloud-native and hybrid services.
Define and implement SRE principles, SLAs, SLOs, and error budgets across teams and services.
Collaborate with multi-functional teams to ensure reliability, observability, performance, and security.
Lead architecture reviews, disaster recovery planning, incident response strategies, and postmortems.
Develop automation frameworks for deployment, monitoring, and remediation of systems.
Champion a culture of reliability, continuous improvement, and operational excellence.
Mentor SREs and DevOps engineers, sharing knowledge and standard methodologies across the organization.

What We Need To See
Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field (or equivalent experience).
15+ years of experience in infrastructure, cloud, or SRE roles, including at least 5 years in an architectural or technical leadership position.
Expertise in cloud platforms (e.g., AWS, Azure, GCP) and container orchestration (Kubernetes).
Deep understanding of distributed systems, microservices architecture, and CI/CD pipelines.
Proficiency with observability tools (Prometheus, Grafana, ELK/EFK, Datadog) and infrastructure as code (Terraform, Ansible).
Strong programming/scripting skills (Python, Go, Bash, etc.).
Ability to communicate ideas and code clearly through documents and presentations.

Ways To Stand Out From The Crowd
AWS, GCP, or Azure Professional Solution Architect Certification.
Familiarity with parallel programming and distributed computing platforms.
Experience in developing large-scale and complex applications.
Cross-platform development experience.

NVIDIA is committed to fostering a diverse work environment and is an equal opportunity employer. We value diversity in our current and future employees and do not discriminate on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status, or any other characteristic protected by law.

Your base salary will be determined based on location, experience, and the pay of employees in similar positions. The base salary range is 248,000 USD - 391,000 USD. You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until August 2, 2025.

JR1999812