Storm3

Research Scientist - Vision Data Infrastructure

Storm3, San Francisco, California, United States, 94199

This role is offered by Storm3. Your actual pay will be based on your skills and experience; talk with your recruiter for details.

Base pay range

$250,000 – $600,000 per year.

⚡ Research Scientists/Engineers (all levels)

Focus on Vision Data Infrastructure

Join one of the few research institutions worldwide with the resources to compete with top AI companies: tens of thousands of GPUs for state‑of‑the‑art research in LLMs, multimodal, and agentic AI.

We are seeking AI talent with expertise in building scalable pipelines for vision data to support image/video generative training and multimodal alignment. You’ll design high-performance pipelines for large-scale image and video datasets, enabling efficient pretraining, alignment, and simulation-based data generation.

Responsibilities:

Vision Data Sourcing & Curation

Collect and organize image and video data from open datasets and the web.

Handle data cleaning, filtering, deduplication, and metadata generation.

Ensure ethical and compliant data collection at scale.

Processing & Augmentation

Build high-throughput pipelines for vision data preprocessing (frame extraction, resolution normalization, format conversion, latent caching).

Implement GPU-accelerated augmentation and distributed data loading (WebDataset, TFRecords, Parquet).

Synthetic & Simulation‑Based Data Generation

Use simulation tools (Unreal Engine 5, Isaac Sim, Unity) to generate high-quality synthetic vision data.

Create specialized datasets for VLM training, visual reasoning, and agent interaction.

Requirements:

Strong experience with data engineering, computer vision, or machine learning infrastructure.

Expertise in building and scaling ETL/data pipelines for large unstructured datasets.

Proficiency with Python, PyTorch, and distributed data frameworks (Ray, Spark, Dask).

Experience with WebDataset, TFRecords, Parquet, or similar high-throughput data formats.

Familiarity with GPU‑accelerated preprocessing, NVIDIA DALI, or equivalent systems.

Understanding of image/video codecs, data compression, and cloud storage optimization.

Preferred Experience:

Prior work with simulation-based or synthetic data generation using Unreal Engine, Isaac Sim, or Unity.

Experience curating datasets for multimodal or vision‑language model training.

Knowledge of data ethics, privacy, and compliance frameworks for large-scale AI datasets.

Experience contributing to open datasets or data-centric AI research.

Why apply:

Opportunity to join a fast-growing core team that is already delivering AI breakthroughs.

Highly competitive salary package.

Work alongside ambitious and bright superstars from tech and academia.

Medical, Dental and Vision Insurance.

Interested? Please contact stefani.lukic@storm3.com.

Seniority level:

Mid‑Senior level

Employment type:

Full‑time

Job function:

Research and Engineering

Industries:

Software Development and Research Services
