Genesis AI

Staff Software Engineer, Training (Bay Area / Paris / Remote)

Genesis AI, San Carlos, California, United States, 94071

What You’ll Do

Drive down wall-clock time to convergence by profiling and eliminating bottlenecks across the foundation model training stack, from data pipelines to GPU kernels

Design, build, and optimize distributed training systems (PyTorch) for multi-node GPU clusters, ensuring scalability, robustness, and high utilization

Implement efficient low-level code (CUDA, cuDNN, Triton, custom kernels) and integrate it seamlessly into high-level training frameworks

Optimize workloads for hardware efficiency: CPU/GPU compute balance, memory management, data throughput, and networking

Develop monitoring and debugging tools for large-scale runs, enabling rapid diagnosis of performance regressions and failures

What You’ll Bring

8+ years of deep experience in distributed systems, ML infrastructure, or high-performance computing

Production-grade expertise in Python

Low-level performance mastery: CUDA/cuDNN/Triton, CPU–GPU interactions, data movement, and kernel optimization

Scaling at the frontier: experience with PyTorch and training jobs using data, context, pipeline, and model parallelism

System-level mindset with a track record of tuning hardware–software interactions for maximum utilization
