Midjourney
Overview
We're the data team behind Midjourney's image generation models. We handle the dataset side: processing, filtering, scoring, captioning, and all the distributed compute that makes high-quality training data possible.
Responsibilities
- Large-scale dataset processing and filtering pipelines
- Training classifiers for content moderation and quality assessment
- Models for data quality and aesthetic evaluation
- Data visualization tools for experimenting on dataset samples
- Testing/simulating distributed inference pipelines
- Monitoring dashboards for data quality and pipeline health
- Performance optimization and infrastructure scaling
- Occasionally jumping into inference optimization and other cross-team projects
Our current stack
PySpark, Slurm, and distributed batch processing across a hybrid cloud setup. We're pragmatic about tools: if there's something better, we'll switch.
Who we're looking for
- Experience building data engineering/ML pipelines at scale, or
- Cloud/infrastructure experience with distributed systems
We don't need exact tech matches; comfort with adjacent technologies and willingness to learn matter more. We work with our own hardware plus GCP and other providers, so adaptability across different environments is valuable.
Location
In our SF office a few times per week (we may make exceptions on location for truly exceptional candidates).
About the role
The role offers variety: our team members often get pulled into different projects across the company, from dataset work to inference optimization. If you're interested in the intersection of large-scale data processing and cutting-edge generative AI, we'd love to hear from you.
Seniority level
Mid-Senior level
Employment type
Full-time
Job function
Information Technology
Industries
Research Services