Black Forest Labs
Member of Technical Staff - Pretraining / Inference Optimization
Black Forest Labs, San Francisco, California, United States, 94199
Black Forest Labs is a cutting-edge startup pioneering generative image and video models. Our team invented Stable Diffusion, Stable Video Diffusion, and FLUX.1, and we are currently seeking a strong researcher/engineer to work closely with our research team on pretraining and inference optimization.
Role:
Finding ideal training strategies (parallelism, precision trade-offs) for a variety of model sizes and compute loads
Profiling, debugging, and optimizing single- and multi-GPU operations using tools such as Nsight or stack trace viewers
Reasoning about the speed and quality trade-offs of quantization for model inference (see the sketch after this list)
Developing and improving low-level kernel optimizations for state-of-the-art inference and training
Innovating new ideas that bring us closer to the limits of a GPU
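As a minimal illustration of the quantization trade-off reasoning mentioned above (a sketch assuming plain PyTorch; none of the names below come from the posting), one could compare a full-precision matrix multiply against a per-channel int8 weight-quantized version and measure the output error that trades off against the speed gain:

import torch

torch.manual_seed(0)

# Illustrative only: symmetric per-output-channel int8 quantization of a weight matrix.
w = torch.randn(4096, 4096)          # full-precision weights
x = torch.randn(8, 4096)             # a small batch of activations

# One scale per output channel (row of the weight matrix).
scale = w.abs().amax(dim=1, keepdim=True) / 127.0
w_q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)

# Dequantize and compare against the full-precision reference output.
y_ref = x @ w.t()
y_deq = x @ (w_q.float() * scale).t()

rel_err = (y_ref - y_deq).norm() / y_ref.norm()
print(f"relative output error from int8 weights: {rel_err:.2e}")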
Ideal Experiences:
Familiarity with the latest and most effective techniques for optimizing inference and training workloads
Optimizing for both memory-bound and compute-bound operations
Understanding GPU memory hierarchy and computation capabilities
Deep understanding of efficient attention algorithms
Implementing both forward and backward Triton kernels and ensuring their correctness while considering floating point errors (a minimal forward-kernel sketch follows this list)
Using, for example, pybind to integrate custom-written kernels into PyTorch
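As a rough sketch of the Triton point above (illustrative only; the kernel, function names, and tolerances are assumptions rather than anything from the posting), the forward half of such a kernel and a tolerance-based correctness check against a PyTorch reference might look like this; a backward kernel and pybind-based C++/CUDA integration would follow the same pattern:

import torch
import triton
import triton.language as tl

@triton.jit
def silu_kernel(x_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    # Compute SiLU (x * sigmoid(x)) in fp32, then store back in the input dtype.
    x_f32 = x.to(tl.float32)
    y = x_f32 * tl.sigmoid(x_f32)
    tl.store(out_ptr + offsets, y.to(x.dtype), mask=mask)

def silu(x: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    silu_kernel[grid](x, out, n, BLOCK_SIZE=1024)
    return out

if __name__ == "__main__":
    x = torch.randn(1 << 20, device="cuda", dtype=torch.float16)
    # Exact equality is the wrong test under floating point error; compare with tolerances.
    torch.testing.assert_close(silu(x), torch.nn.functional.silu(x), rtol=1e-3, atol=1e-3)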
Nice to have:
Experience with Diffusion and Autoregressive models
Experience in low-level CUDA kernel optimizations