Black Forest Labs
Member of Technical Staff - Large Model Data
Black Forest Labs, San Francisco, California, United States, 94199
At Black Forest Labs, we’re on a mission to advance the state of the art in generative deep learning for media, building powerful, creative, and open models that push what’s possible.
Born from foundational research, we continuously create advanced infrastructure to transform ideas into images and videos.
Our team pioneered Latent Diffusion, Stable Diffusion, and FLUX.1 – milestones in the evolution of generative AI. Today, these foundations power millions of creations worldwide, from individual artists to enterprise applications.
We are looking for a Data Engineer to help create large-scale datasets that power the next generation of generative models.
Role and Responsibilities:
Develop and maintain scalable infrastructure for large-scale image and video data acquisition
Manage and coordinate data transfers from various licensing partners
Implement and deploy state‑of‑the‑art ML models for data cleaning, processing, and preparation
Implement scalable and efficient tools to visualize, cluster, and deeply understand the data
Optimize and parallelize data processing workflows to handle billion‑scale datasets efficiently
Ensure data quality, diversity, and proper annotation (including captioning) for training readiness
Convert training data from alternative sources, such as user preferences, into a trainable format
Work closely within the model development loop, updating data as the training trajectory requires
What we look for:
Proficiency in Python and experience with a variety of file systems for data-intensive manipulation and analysis
Familiarity with cloud computing platforms (AWS, GCP, or Azure) and Slurm/HPC environments for distributed data processing
Experience with image and video processing libraries (e.g., OpenCV, FFmpeg)
Demonstrated ability to optimize and parallelize data processing workflows across CPUs and GPUs
Familiarity with data annotation and captioning processes for ML training datasets
Knowledge of machine learning techniques for data cleaning and preprocessing
Nice to have:
Background or keen interest in developing large‑scale data acquisition systems
Experience with natural language processing for image/video captioning
Experience with data deduplication techniques at scale
Experience with big data processing frameworks (e.g., Apache Spark, Hadoop)
Experience shipping a SOTA model
Understanding of ethical considerations in data collection and usage