Cerebro
Supporting the USA's leading startups with world class AI & Robotics Talent | Co-Founder of Mentors in Machine Learning | Recruitment like a 5* hotel
Join the Frontier: Research Engineer, AI Benchmarking
Are you passionate about shaping how the world measures and trusts AI? We're seeking exceptional AI researchers and engineers to architect the next generation of LLM benchmarks, influencing how foundation models evolve and are adopted globally.
Your work will define the standards by which LLMs are judged, from everyday applications to breakthroughs in finance, healthcare, and beyond. You'll design, build, and analyze cutting-edge evaluation pipelines, collaborating with leading model labs and enterprises. If you thrive at the intersection of deep research and real-world impact, this is your stage.
What You’ll Do
- Invent and build new benchmarks that test the boundaries of LLMs in real-world scenarios
- Conduct rigorous research to ensure benchmarks are robust, valid, and actionable
- Collaborate with AI labs and enterprise partners to identify emerging evaluation needs
- Analyze and interpret model performance, communicating insights to diverse audiences
- Publish and present research findings in top venues, contributing to the evaluation community
- Work closely with infra engineers to scale your benchmark designs
- Stay ahead of the curve on LLM capabilities and evaluation methodologies
Your Background
- Advanced research experience: MS/PhD in CS, NLP, ML, or a related field (exceptional undergraduates considered)
- Publication record: papers at NeurIPS, ICML, ACL, EMNLP, etc., especially on NLP, ML evaluation, or benchmarking
- Python proficiency for prototyping and experimentation
- Excellent communication skills, with the ability to synthesize complex ideas for all audiences
- Collaborative spirit: experience working in research teams and openness to feedback
- Portfolio: evidence of impactful research
Location: In-person in San Francisco. Relocation/transportation support provided.
Bonus Points
- Experience with LLM evaluation, benchmarking, or foundation models
- Collaboration with industry or applied research partners
- Background in HCI, psychology, or domain-specific evaluation
- Startup or early-stage lab experience
- Contributions to open-source evaluation tools/datasets
What’s in It for You?
- Competitive salary & meaningful equity
- Relocation and transit support
- Unlimited PTO
- Opportunities to publish, present, and shape the field
Who We Are
Our founding team brings together leading experience from top research institutions and industry giants. The platform's core is rooted in advanced NLP evaluation research and is backed by premier investors. Our collective work is highly cited, and we're committed to setting the gold standard for AI benchmarking.
Tech stack: React (TSX) frontend, Django backend, AWS infra.
What Matters Most
- Raw intelligence and research ability trump pedigree: we care about what you can build and discover.
- Ownership: we move fast and expect initiative. You'll have autonomy and a chance to make a visible impact.
- Intensity: the LLM landscape evolves at breakneck speed. We need researchers who thrive in a dynamic, high-execution environment.
- Solution focus: every evaluation challenge is an opportunity to innovate.
Seniority level: Mid-Senior level
Employment type: Full-time
Job function: Information Technology
Industries: Technology, Information & Media and Research Services