Black Forest Labs
Member of Technical Staff - Training Cluster Engineer
Black Forest Labs, San Francisco, California, United States, 94199
Black Forest Labs is a cutting-edge startup pioneering generative image and video models. Our team, which invented Stable Diffusion, Stable Video Diffusion, and FLUX.1, is currently looking for a strong candidate to join us in developing and maintaining our large GPU training clusters.
Role & Responsibilities
Design, deploy, and maintain large-scale ML training clusters running SLURM for distributed workload orchestration
Implement comprehensive node health monitoring systems with automated failure detection and recovery workflows
Partner with cloud and colocation providers to ensure cluster availability and performance
Establish and enforce security best practices across the ML infrastructure stack (network, storage, compute)
Build and maintain developer-facing tools and APIs that streamline ML workflows and improve researcher productivity
Collaborate directly with ML research teams to translate computational requirements into infrastructure capabilities and capacity planning
Required Experience
Production experience managing SLURM clusters at scale, including job scheduling policies, resource allocation, and federation
Hands-on experience with Docker, Enroot/Pyxis, or similar container runtimes in HPC environments
Proven track record managing GPU clusters, including driver management and DCGM monitoring
Preferred Qualifications
Understanding of distributed training patterns, checkpointing strategies, and data pipeline optimization
Experience with Kubernetes for containerized workloads, particularly for inference or mixed compute environments
Experience with high-performance interconnects (InfiniBand, RoCE) and NCCL optimization for multi-node training
Track record of managing 1000+ GPU training runs, with deep understanding of failure modes and recovery patterns
Familiarity with high-performance storage solutions (VAST, blob storage) and their performance characteristics for ML workloads
Experience running hybrid training/inference infrastructure with appropriate resource isolation
Strong scripting skills (Python, Bash) and infrastructure-as-code experience