Storm3
Overview AI for Science - Connecting Tech talent into innovative Healthtechs and Biotechs
Come join a revolutionary AI research lab in the SF Bay Area that is poised to develop and publish high-impact breakthroughs in GenAI, across LLMs and multimodal AI. As part of the team, you’ll work at the intersection of data, large-scale training, and foundation model innovation. You will collaborate with world-class researchers, data scientists, and engineers to solve critical challenges in creating robust, scalable, and reasoning-capable LLMs. Your research will shape the way data is curated, processed, and leveraged to train the next generation of intelligent systems.

Base pay range: $200,000.00/yr - $350,000.00/yr

Responsibilities
Lead research on data-centric approaches for LLMs, including pretraining corpus design, data valuation, and speculative decoding strategies.
Develop pipelines to process challenging data sources into structured and reproducible training datasets.
Build and optimize agentic data pipelines, integrating retrieval, self-curation, and multi-agent feedback for high-quality training and evaluation data.
Collaborate with researchers on alignment and reasoning-focused training that leverages data-driven approaches for improving LLM capabilities.
Prototype and deploy evaluation frameworks to measure data quality, coverage, and downstream impact on LLM reasoning.
Publish findings at top-tier venues (e.g., NeurIPS, ICLR, ACL, EMNLP) and represent the institute at international conferences.
Contribute to open-source tools, datasets, and benchmarks that advance the global foundation model research community.

Requirements
Master’s degree in Computer Science, Data Science, or a related technical field (PhD strongly preferred).
Experience collecting and curating high-quality text data, including multilingual data.
Hands-on experience with large-scale dataset curation and preprocessing for ML/LLM training.
Prior work synthesizing complex datasets; code, math, and agentic data are higher priority.
Experience with ML infrastructure for scalable training, evaluation, and debugging.
Experience at the intersection of data and post-training (RL/SFT).
Proven ability to independently drive research questions related to data quality, scaling, or reasoning.

Preferred Experience
Experience with retrieval-augmented generation (RAG), agentic data pipelines, or reasoning benchmarks.
Contributions to speculative decoding, self-curation, or reinforcement learning from synthetic data.
Background in knowledge graphs, semantic search, or indexing systems.
Strong publication record in leading AI conferences.
Prior contributions to open-source ML data tools or benchmarks.
Prior work on speculative decoding or contributions to LLM serving engines.
Prior work on training LLM-as-a-judge.
Deep expertise in tokenization and training tokenizers.

Why apply
Opportunity to build out a new division at the forefront of AI innovation.
FAANG-competitive salary and package.
Work alongside superstars from FAANG labs and leading AI companies.
Medical, dental, and vision insurance.

Interested in applying? Please click on the ‘Easy Apply’ button or alternatively email me your resume at stefani.lukic@storm3.com

Seniority level: Mid-Senior level
Employment type: Full-time
Job function: Information Technology, Research, and Engineering
Industries: Research Services and Software Development