Berkeley Lab

DevOps Engineer

Berkeley Lab, San Francisco, California, United States, 94199

Lawrence Berkeley National Lab's (LBNL) NERSC Division has an opening for a DevOps Engineer to join the team. In this exciting role, you will serve as a DevOps-oriented System Administrator/Software Engineer (Computer Systems Engineer 3/4) at the National Energy Research Scientific Computing Center (NERSC) to help architect, deploy, configure, and operate large-scale, leading-edge high-performance computing (HPC) systems. You will work collaboratively to develop and operate large-scale compute and storage platforms to support NERSC's mission of accelerating scientific discovery through high-performance computing and data analysis. Working with teams at NERSC, other national laboratories, HPC vendors, and open-source communities, you will develop innovative solutions that enable science as well as improve the state of HPC practice on an international stage. Your focus will be to improve and operate NERSC's largest HPC resources, Perlmutter and Doudna, and to work with the rest of the HPC community to develop and maintain world-class system software. The selected candidate(s) will be hired at the Computer Systems Engineer 3 or 4 (CSE3 or CSE4) level, depending on their level of skills and experience.
What You Will Do if hired at a Level 3:
- Participate in a team-oriented agile development and management process for HPC systems using languages like Go, Rust, and Python
- Develop and maintain APIs to securely expose system functionality to end users
- Automate common tasks and processes to continuously improve HPC systems management
- Analyze system issues and develop solutions to improve end-user experience
- Be part of a team that installs, tests, maintains, and manages HPC systems
- Assist with technology evaluation of systems and system architecture
- Work with vendors to prioritize, develop, and enhance their technologies in order to better meet the needs of our users
- Be part of the team providing on-call rotation for 24x7 HPC system support
- Work on and resolve complex issues where analysis of situations or data requires an in-depth evaluation of variable factors
- Exercise judgment in selecting methods, techniques, and evaluation criteria for obtaining results
- Determine methods and procedures on new assignments; may coordinate activities of other personnel
- Network with key contacts outside own area of expertise

Additional Responsibilities if hired at a Level 4:
- Provide leadership and technical guidance to group members and members of other groups at NERSC
- Recommend and lead implementation and deployment efforts for system improvements that enhance reliability, stability, usability, performance, and security
- Identify and evaluate emerging HPC technologies and explore new features that would create new capabilities and enhance system performance and usability
- Participate in working/user/advocacy groups and represent NERSC and its interests to the broader HPC community
- Work at a higher level of independence while carrying out work assignments
- Work on and solve significant and unique issues where analysis of situations or data requires an in-depth evaluation of variable factors

What is Required to be hired at a Level 3:
- Typically requires a minimum of 8 years of related experience with a Bachelor's degree; or 6 years and a Master's degree; or equivalent experience
- Minimum of 2 years of experience with systems programming in the Linux environment or management of large-scale Linux-based systems in a high-performance computing, cloud computing, or hyper-scale environment
- Experience with C, Bourne shell, and Python 3 programming languages

Additional Requirements to be hired at a Level 4:
- Typically requires a minimum of 12 years of related experience with a Bachelor's degree; or 8 years and a Master's degree; or equivalent experience
- Demonstrated excellent systems programming skills and strong knowledge of Linux internals
- Demonstrated ability to work independently as well as collaboratively in large projects, and contribute to an active and respectful intellectual environment
- Excellent oral and written communication skills
- Ability to resolve complex issues in creative and effective ways and derive technical solutions in a collaborative environment to meet end-user requirements or needs
- Ability to network and collaborate with key contacts outside own area of expertise
- Ability to work on and resolve significant and unique issues where analysis of situations or data requires an evaluation of intangibles
- Ability to exercise independent judgment in methods, techniques, and evaluation criteria for obtaining results

Desired Qualifications:
- Development of Kubernetes microservices using technologies like Helm or Loftsman for deployment
- Operations of Kubernetes, etcd
- Infrastructure as code solutions like Argo, Terraform, Ansible, Puppet, Salt
- Rust or Go programming language
- GitLab or GitHub Continuous Integration and Project Management
- Agile process, Scrum
- Linux kernel interfaces, cgroups, eBPF
- Installation, configuration, monitoring, and tuning of workload management systems such as Slurm, PBSPro, or GridEngine
- Monitoring solutions such as Grafana, Prometheus, LDMS
- HPC systems administration
- HPC applications analysis, MPI
- Specialized networking (InfiniBand, Slingshot, or other high-speed networks)
- Lustre, Spectrum Scale (GPFS), or other parallel file systems

Notes:
- This is a (full-time/part-time) career appointment, exempt (monthly paid) from overtime pay
- This position will involve access to hardware, commodities, and technical information subject to export control regulations including, but not limited to, the Export Administration Regulations ("EAR") and/or International Traffic in Arms Regulations ("ITAR"). Accordingly, any hiring decision may depend in part on Berkeley Lab's ability to obtain or rely on federal government authorizations as required, if you are not a U.S. citizen, lawful permanent resident of the U.S. ("green card holder"), asylee, refugee, or other qualifying protected individual as defined by 8 U.S.C. 1324b(a)(3).
- This position will be hired at a level commensurate with the business needs and the skills, knowledge, and abilities of the successful candidate.

Level 3:

The full salary range of this position is $129,948.00 - $219,276.00 per year, and it is expected to pay within a targeted range of $146,184.00 - $178,668.00 per year, depending upon the candidate's full skills, knowledge, and abilities, including education, certifications, and years of experience.

Level 4:

The full salary range of this position is $147,984.00 - $249,732.00 per year, and it is expected to pay within a targeted range of $166,476.00 - $203,484.00 per year, depending upon the candidate's full skills, knowledge, and abilities, including education, certifications, and years of experience.

This position is subject to a background check. Any convictions will be evaluated to determine if they directly relate to the responsibilities and requirements of the position. Having a conviction history will not automatically disqualify an applicant from being considered for employment.

This position requires substantial on-site presence, but is eligible for a flexible work mode, and hybrid schedules may be considered. Hybrid work is a combination of performing work on-site at Lawrence Berkeley National Lab, 1 Cyclotron Road, Berkeley, CA and some telework. Individuals working a hybrid schedule must reside within 150 miles of Berkeley Lab. Work schedules are dependent on business needs. In rare cases, full-time telework or remote work modes may be considered.

Want to learn more about working at Berkeley Lab? Please visit careers.lbl.gov

Equal Employment Opportunity Employer: The foundation of Berkeley Lab is our Stewardship