Conexess
We are seeking a GPU Systems Engineer with hands-on experience designing, configuring, and supporting GPU-based infrastructure. This role will focus on managing and optimizing a 4-node H100 cluster to support large language model (LLM) workloads and other high-performance computing (HPC) initiatives. The ideal candidate will have strong expertise in GPU hardware, cluster management, and workload optimization for AI/ML environments.
Duties & Responsibilities
- Deploy, configure, and maintain a 4-node NVIDIA H100 GPU cluster.
- Support large language model (LLM) training, fine-tuning, and inference workloads.
- Monitor and optimize cluster performance, resource allocation, and utilization.
- Implement best practices for scalability, reliability, and high availability of GPU workloads.
- Collaborate with data scientists, ML engineers, and researchers to align infrastructure with AI/ML requirements.
- Troubleshoot hardware, software, and networking issues across the GPU cluster.
- Maintain documentation for system architecture, configurations, and procedures.
- Stay current with emerging GPU, HPC, and AI infrastructure technologies.
Qualifications
- Bachelor's degree in Computer Science, Engineering, or related field (or equivalent experience).
- Hands-on experience with NVIDIA GPUs (H100, A100, or similar).
- Strong knowledge of Linux system administration and GPU driver/toolchain setup.
- Experience managing multi-node GPU clusters for HPC or AI/ML workloads.
- Familiarity with LLM training frameworks (PyTorch, TensorFlow, DeepSpeed, or similar).
- Understanding of job scheduling and resource management (e.g., Slurm, Kubernetes, or equivalent).
- Strong troubleshooting and performance tuning skills.
- Experience with the NVIDIA software stack (CUDA, NCCL, Triton Inference Server) and GPU interconnects such as NVLink.
- Knowledge of distributed training for large language models.
- Familiarity with cloud-based GPU infrastructure (AWS, Azure, GCP).
- Exposure to networking and storage systems for HPC workloads.
#LI-REMOTE