Storm3
Overview
AI for Science - Connecting Tech talent into innovative Healthtechs and Biotechs
Come join a revolutionary AI research lab in the SF Bay Area that is poised to develop and publish high-impact breakthroughs in GenAI, across LLMs and Multimodal AI. As part of the team, you’ll work at the intersection of data, large-scale training, and foundation model innovation. You will collaborate with world-class researchers, data scientists, and engineers to solve critical challenges in creating robust, scalable, and reasoning-capable LLMs. Your research will shape the way data is curated, processed, and leveraged to train the next generation of intelligent systems.
Base pay range
$200,000.00/yr - $350,000.00/yr
Responsibilities
Lead research on data-centric approaches for LLMs, including pretraining corpus design, data valuation, and speculative decoding strategies.
Develop pipelines to process challenging data sources into structured and reproducible training datasets.
Build and optimize agentic data pipelines, integrating retrieval, self-curation, and multi-agent feedback for high-quality training and evaluation data.
Collaborate with researchers on alignment and reasoning-focused training that leverages data-driven approaches for improving LLM capabilities.
Prototype and deploy evaluation frameworks to measure data quality, coverage, and downstream impact on LLM reasoning.
Publish findings at top-tier venues (e.g., NeurIPS, ICLR, ACL, EMNLP) and represent the institute at international conferences.
Contribute to open-source tools, datasets, and benchmarks that advance the global foundation model research community.
Requirements
Master’s degree in Computer Science, Data Science, or a related technical field (PhD strongly preferred).
Experience collecting and curating high-quality text data, including multilingual data.
Hands-on experience with large-scale dataset curation and preprocessing for ML/LLM training.
Prior work synthesizing complex datasets; code, math, and agentic data are higher priority.
Experience with ML infrastructure for scalable training, evaluation, and debugging.
Experience at the intersection of data and post-training (RL/SFT).
Proven ability to independently drive research questions related to data quality, scaling, or reasoning.
Preferred Experience
Experience with retrieval-augmented generation (RAG), agentic data pipelines, or reasoning benchmarks.
Contributions to speculative decoding, self-curation, or reinforcement learning from synthetic data.
Background in knowledge graphs, semantic search, or indexing systems.
Strong publication record in leading AI conferences.
Prior contributions to open-source ML data tools or benchmarks.
Prior work on speculative decoding or contributions to LLM serving engines.
Prior work on training LLM-as-a-judge.
Deep expertise with tokenization and training tokenizers.
Why apply
Opportunity to build out a new division at the forefront of AI innovation
FAANG-competitive salary and package
Work alongside superstars from FAANG labs and leading AI companies
Medical, Dental and Vision Insurance
Interested in applying? Please click on the ‘Easy Apply’ button or alternatively email me your resume at stefani.lukic@storm3.com
Seniority level
Mid-Senior level
Employment type
Full-time
Job function
Information Technology, Research, and Engineering
Industries
Research Services and Software Development