Sustainable Talent
Overview
Sustainable Talent is partnering with NVIDIA, a global leader that has been transforming computer graphics, PC gaming, and accelerated computing for over 25 years. We are looking for an SRE & DevOps Engineer to support NVIDIA's Infrastructure, Planning and Processes organization. This is a full-time, onsite W-2 contract position based in Santa Clara, CA. We offer competitive pay based on factors such as experience, education, and location, and provide full benefits, PTO, and a strong company culture.
What you’ll be doing
Work on systems deployed in NVIDIA's internal infrastructure products and ensure they are available and reliable for our end users.
Monitor system performance and troubleshoot issues related to the NVIDIA hardware and software stack.
Provide high-quality user support.
Monitor KPIs and ensure that the team’s SLAs are met.
Manage and maintain production Kubernetes clusters and Jenkins pipelines.
Drive automation of monitoring to gain more insight into applications and system health.
What we need to see
Experience maintaining cloud and on-prem infrastructure and highly available production environments.
Expert-level proficiency in CI/CD systems such as ArgoCD, Jenkins, GitLab CI, and GitHub Actions.
Background in SQL databases (such as MySQL) and time-series databases such as Prometheus.
Experience with data analytics/visualization tools (ELK, Grafana, Splunk) and alerting tools (Zabbix, Alertmanager, PagerDuty).
Proficiency with Ansible, Kubernetes, containers, and virtualization platforms.
5+ years of proven experience and a Bachelor's degree in Computer Science, Information Technology, or a related field, or equivalent experience.
Ways to stand out from the crowd
Previous experience with SRE teams managing on-prem infrastructure.
Experience managing NVIDIA hardware like GPUs and Tegra devices.
Thrives in a multi-tasking environment with evolving priorities.
Prior experience with a large-scale operations team.
Sustainable Talent is a M/F+, disabled, and veteran equal employment opportunity and affirmative action employer.