Storm3

Research Scientist - Data

Storm3, San Francisco, California, United States, 94199

Overview

AI for Science - Connecting Tech talent into innovative Healthtechs and Biotechs

Come join a revolutionary AI research lab in the SF Bay Area that is poised to develop and publish high-impact breakthroughs in GenAI across LLMs and multimodal AI. As part of the team, you’ll work at the intersection of data, large-scale training, and foundation model innovation. You will collaborate with world-class researchers, data scientists, and engineers to solve critical challenges in creating robust, scalable, and reasoning-capable LLMs. Your research will shape the way data is curated, processed, and leveraged to train the next generation of intelligent systems.

Base pay range

$200,000.00/yr - $350,000.00/yr

Responsibilities

Lead research on data-centric approaches for LLMs, including pretraining corpus design, data valuation, and speculative decoding strategies.

Develop pipelines to process challenging data sources into structured and reproducible training datasets.

Build and optimize agentic data pipelines, integrating retrieval, self-curation, and multi-agent feedback for high-quality training and evaluation data.

Collaborate with researchers on alignment and reasoning-focused training that leverages data-driven approaches for improving LLM capabilities.

Prototype and deploy evaluation frameworks to measure data quality, coverage, and downstream impact on LLM reasoning.

Publish findings at top-tier venues (e.g., NeurIPS, ICLR, ACL, EMNLP) and represent the institute at international conferences.

Contribute to open-source tools, datasets, and benchmarks that advance the global foundation model research community.

Requirements

Master’s degree in Computer Science, Data Science, or a related technical field (PhD strongly preferred).

Experience collecting and curating high-quality text data, including multilingual data.

Hands-on experience with large-scale dataset curation and preprocessing for ML/LLM training.

Prior work synthesizing complex datasets. Code, math, and agentic data are higher priority.

Experience with ML infrastructure for scalable training, evaluation, and debugging.

Experience at the intersection of data and post-training (RL/SFT).

Proven ability to independently drive research questions related to data quality, scaling, or reasoning.

Preferred Experience

Experience with retrieval-augmented generation (RAG), agentic data pipelines, or reasoning benchmarks.

Contributions to speculative decoding, self-curation, or reinforcement learning from synthetic data.

Background in knowledge graphs, semantic search, or indexing systems.

Strong publication record in leading AI conferences.

Prior contributions to open-source ML data tools or benchmarks.

Prior work on speculative decoding or contributions to LLM serving engines.

Prior work on training LLM-as-a-judge models.

Deep expertise with tokenization and training tokenizers.

Why apply

Opportunity to build out a new division at the forefront of AI innovation.

FAANG-competitive salary and package.

Work alongside superstars from FAANG labs and leading AI companies.

Medical, dental, and vision insurance.

Interested in applying? Please click on the ‘Easy Apply’ button or alternatively email me your resume at stefani.lukic@storm3.com

Seniority level

Mid-Senior level

Employment type

Full-time

Job function

Information Technology, Research, and Engineering

Industries

Research Services and Software Development
