Leidos

Lead High-Performance Computing System Administrator

Leidos, Atlanta, Georgia, United States, 30383

Description

The Public Health and Human Services Operation of Leidos is in search of a Lead High-Performance Computing System Administrator to guide a team of skilled system administrators in overseeing a cutting-edge high-performance computing (HPC) infrastructure utilized by public health researchers and scientists. This pivotal role demands substantial Linux expertise alongside a thorough comprehension of specialized hardware, software, and networking essential for scientific inquiry and expansive data analysis.

Candidate Must:

Be located in the Atlanta, GA area for partial onsite work

Be a US Citizen with the capability to secure a Public Trust Clearance

The candidate will ensure secure, reliable infrastructure services that researchers can use to access customer-sponsored data managed both on premises and in the cloud. Responsibilities include providing secure access to high-performance computing resources for scientific research endeavors.

HPC Infrastructure Management:

Deploy, administer, and monitor HPC clusters while managing vast multi-petabyte datasets using Pure Storage flash memory and AWS S3 Glacier.

Software and Resource Management:

Install, maintain, and upgrade scientific software, libraries, and batch schedulers like GridEngine and Slurm. Develop effective processes for resource sharing across various research teams.

VMware Administration:

Manage VMware vSphere Foundation for the provisioning, deployment, and configuration of virtual servers, along with hardware and software implementation and maintenance.

System Operations:

Execute system monitoring, routine security patch management, troubleshooting, and performance tuning.

Project Planning and Coordination:

Advise clients and project managers in the design and documentation of technical solutions. Support infrastructure projects, from planning and coordinating activities to executing plans and providing status updates while collaborating with internal and client teams.

Automation and Scripting:

Drive automation initiatives to streamline system management tasks utilizing scripting languages (Bash, Python) and configuration management tools (Puppet, Ansible).

Research Collaboration:

Collaborate with scientists, bioinformatics developers, and principal investigators to comprehend their computational needs, translating scientific goals into technical configurations while offering technical support for workflow optimization.

System Architecture and Deployment:

Lead the technical design, integration, and optimization of both on-site HPC and cloud resources.

Mentorship and Team Coordination:

Mentor other system administrators in best practices. The role may involve managing a team of system administrators.

Security and Compliance:

Implement robust security measures, manage access controls, and design architectures that comply with standards such as HIPAA or NIST. Support Security Assessment and Authorization (SA&A) processes.

Disaster Recovery and Monitoring:

Design and implement backup and disaster recovery plans, and integrate monitoring and alerting systems to ensure high availability and reliability.

Required Education and Experience

A Bachelor's degree in computer science or a related field, complemented by 10 years of system administration experience.

At least 7 years of extensive experience designing and operating HPC infrastructure.

Proficient in Linux systems and administration, including troubleshooting, security, performance monitoring, and familiarity with distributions such as Red Hat and Ubuntu.

Soft Skills:

Strong problem-solving and communication skills are essential for collaborating effectively with customers, bioinformatics developers, and researchers, and for leading a team. Experience integrating new technologies and processes into existing production environments is preferred.

Networking:

Proficient with network devices including routers, switches, gateways, and hubs.

Security Knowledge:

Capable of developing infrastructure deliverables, offering continuous diagnostics and mitigation, security architecture support, and vulnerability management.

VMware Experience:

Experienced in managing virtual machine infrastructure.

Leadership:

Proven track record in planning and coordinating infrastructure support activities, alongside mentoring system administrators.

HPC and Cluster Management:

Demonstrated experience with HPC clusters, job schedulers (Slurm), and high-speed networking (10/40/100Gb).

Technical Skills:

Proficient in Bash and Python scripting; experience with cloud technologies (hybrid-cloud integration) and container environments (Docker, Singularity, Kubernetes).

Desired Qualifications:

A Master's Degree in IT, engineering, or related fields.

Experience with a federal government agency or research organization.

Expertise in large-scale infrastructure design and implementation projects.

Certifications such as Red Hat Certified Engineer (RHCE) or Red Hat Certified Architect (RHCA).

Knowledge of computer networking protocols including TCP, IP, UDP, HTTP, DHCP, and DNS, with understanding of LAN, WAN, and VPN design.

Experience in optimizing cloud utilization patterns and migrating from on-premises to hybrid models.

AWS or Azure Cloud engineer certification.

At Leidos, we are not just looking for employees; we seek innovators who challenge the status quo. If you are ready to take your career to new heights and join a dynamic team, we encourage you to apply.

Original Posting:

September 29, 2025

The anticipated pay range for this position is $89,700.00 to $162,150.00. This is a guideline and does not guarantee a specific salary. Various factors, such as job responsibilities, education, experience, skills, and internal equity, will be considered in the offer process.