HPC System Administrator
Illinois Staffing - Chicago, Illinois, United States, 60290
Overview
The University of Chicago Research Computing Center (RCC) is seeking a skilled HPC System Administrator to join its Systems and Operations Team. This position will support the deployment, maintenance, and automation of RCC's HPC systems, including CPU/GPU clusters, storage, and networking infrastructure. The HPC System Administrator will assist in system-level administration, troubleshooting, performance tuning, and automation while collaborating with faculty and researchers to enable cutting-edge computational science. This is a hybrid position requiring 3 days onsite.

Responsibilities:
- Administer, install, monitor, and maintain HPC systems, including compute nodes, storage, networking, and software stacks.
- Develop and maintain automation tools for system provisioning, configuration management, and monitoring.
- Assist in the implementation and management of distributed file systems (e.g., Lustre, BeeGFS, GPFS).
- Install, configure, and optimize job scheduling and resource management tools (e.g., Slurm, LSF, PBS).
- Assist in system security, patch management, and troubleshooting operational issues.
- Contribute to performance benchmarking, system tuning, and capacity planning.
- Deploy and maintain commonly used HPC applications and software stacks.
- Document system administration procedures and contribute to knowledge-sharing initiatives.
- Support researchers by providing technical expertise and resolving escalated support tickets.
- Participate in vendor coordination, system procurement, and hardware/software lifecycle management.
- Install, configure, and maintain operating system workstations and servers.
- Perform software installations and upgrades to operating systems and layered software packages.
- Monitor and tune the system to achieve optimum performance levels, acquiring higher-level skills in the process.
- Maintain all supporting documentation for comprehensive operating system, hardware, and software configuration.
- Monitor primary responses for information technology related security incidents and violations.
- Keep current with new security and network monitoring technologies, applicable laws, and regulations.

Minimum Qualifications:
- Education: Minimum requirements include a college or university degree in a related field.
- Work Experience: Minimum requirements include knowledge and skills developed through 2-5 years of work experience in a related job discipline.

Preferred Qualifications:
Technical Skills or Knowledge:
- Experience administering Linux-based HPC clusters, including job schedulers (e.g., Slurm, LSF, PBS).
- Familiarity with high-speed networking (e.g., InfiniBand, Ethernet).
- Scripting/programming skills (Python, Bash, or Perl).
- Experience configuring, installing, and troubleshooting MPI and OpenMP applications.
- Experience configuring, installing, tuning, and maintaining scientific applications on large-scale systems.
- Experience with system automation tools (e.g., Ansible, Puppet).
- Experience with system provisioning tools (e.g., xCAT, Confluent, Warewulf).
- Knowledge of distributed storage systems (e.g., Lustre, BeeGFS, GPFS).
- Experience with containerization (Docker, Singularity, Apptainer).
- Experience configuring, installing, maintaining, and/or using infrastructure and performance monitoring and optimization tools (such as CheckMK, Grafana, Prometheus, Icinga).
- Experience in setting up and executing benchmarks in an HPC environment and analyzing