Conexess
We are seeking a GPU Systems Engineer with hands-on experience designing, configuring, and supporting GPU-based infrastructure. This role will focus on managing and optimizing a 4-node H100 cluster to support large language model (LLM) workloads and other high-performance computing (HPC) initiatives. The ideal candidate will have strong expertise in GPU hardware, cluster management, and workload optimization for AI/ML environments.
Duties & Responsibilities
- Deploy, configure, and maintain a 4-node NVIDIA H100 GPU cluster.
- Support large language model (LLM) training, fine-tuning, and inference workloads.
- Monitor and optimize cluster performance, resource allocation, and utilization.
- Implement best practices for scalability, reliability, and high availability of GPU workloads.
- Collaborate with data scientists, ML engineers, and researchers to align infrastructure with AI/ML requirements.
- Troubleshoot hardware, software, and networking issues across the GPU cluster.
- Maintain documentation for system architecture, configurations, and procedures.
- Stay current with emerging GPU, HPC, and AI infrastructure technologies.
Qualifications
- Bachelor's degree in Computer Science, Engineering, or related field (or equivalent experience).
- Hands-on experience with NVIDIA GPUs (H100, A100, or similar).
- Strong knowledge of Linux system administration and GPU driver/toolchain setup.
- Experience managing multi-node GPU clusters for HPC or AI/ML workloads.
- Familiarity with LLM training frameworks (PyTorch, TensorFlow, DeepSpeed, or similar).
- Understanding of job scheduling and resource management (e.g., Slurm, Kubernetes, or equivalent).
- Strong troubleshooting and performance tuning skills.
- Experience with the NVIDIA software stack (CUDA, NCCL, Triton Inference Server) and GPU interconnects such as NVLink.
- Knowledge of distributed training for large language models.
- Familiarity with cloud-based GPU infrastructure (AWS, Azure, GCP).
- Exposure to networking and storage systems for HPC workloads.
#LI-REMOTE