DeWinter Group

HPC Engineer

DeWinter Group, Somerville, Massachusetts, US, 02145

Title:

HPC Engineer

Job Type:

Contract

Contract Length:

6-7 Month Contract (with potential for extension)

Target Start Date:

January

Work Location/Structure:

Remote (local to the Northeast or Midwest preferred)

About the Opportunity

Our client, a leader in Academic Research and Higher Education, is looking for a skilled HPC Engineer to join their team for a 6-7 month contract engagement. This project involves scaling and maintaining a critical High-Performance Computing (HPC) ecosystem used by university researchers for parallel processing, AI/ML applications, and massive data transfers. This is a high-impact role that requires a self‑motivated, tenured professional who can immediately contribute to the stability and efficiency of a complex, large‑scale research computing environment.

Key Responsibilities & Deliverables

Maintain the entire HPC ecosystem, including system specification, provisioning, OS installation (Rocky Linux), and managing updates/changes to approximately 200 Linux systems. This includes login/file transfer nodes, compute nodes, job schedulers (Slurm), and virtualization (VMware).

Utilize configuration management and security best practices to maintain all systems using Ansible and the Warewulf cluster management system.

Manage the Globus data transfer software and support the storage team with VAST and TrueNAS storage maintenance. Provide support for data indexing tools like Starburst.

Maintain and support user‑facing HPC web gateways and research tools (e.g., Open OnDemand, Jupyter Notebook/Lab/Hub, FastX, Open XDMoD).

Respond to outage/urgent system issues and develop/document continual operational improvements in the HPC system administration service. Assist with vendor management as needed.

Required Skills & Experience

5+ years of experience in a similar role within a large‑scale enterprise or research environment, with a "tenured" approach to system administration.

Deep expertise in Linux Systems Administration, Ansible, and HPC cluster management tools like Warewulf and the Slurm job scheduler. This isn't a learning role; you need to be a subject matter expert.

Demonstrated ability to work autonomously and manage your own time effectively to meet project goals and handle critical system issues.

Experience installing and maintaining common research computing frameworks and software, particularly AI/ML/DL libraries (TensorFlow, PyTorch) and container platforms.

Familiarity with high‑performance storage solutions like VAST and TrueNAS, and experience with Globus or a strong willingness to learn it quickly.

Strong communication skills to provide clear, concise status updates to the project team, along with technical expertise in network, storage administration, and data center issues.

Scripting proficiency in Shell or Python is a plus.
