Cerebro

Member of Technical Staff

Cerebro, San Francisco, California, United States, 94199

Join the Frontier: Research Engineer, AI Benchmarking

Are you passionate about shaping how the world measures and trusts AI? We’re seeking exceptional AI researchers and engineers to architect the next generation of LLM benchmarks, impacting how foundation models evolve and are adopted globally.

Your work will define the standards by which LLMs are judged, from everyday applications to breakthroughs in finance, healthcare, and beyond. You’ll design, build, and analyze cutting‑edge evaluation pipelines, collaborating with leading model labs and enterprises. If you thrive at the intersection of deep research and real‑world impact, this is your stage.

What You’ll Do

Invent and build new benchmarks that test the boundaries of LLMs in real‑world scenarios

Conduct rigorous research to ensure benchmarks are robust, valid, and actionable

Collaborate with AI labs and enterprise partners to identify emerging evaluation needs

Analyze and interpret model performance, communicating insights to diverse audiences

Publish and present research findings in top venues, contributing to the evaluation community

Work closely with infra engineers to scale your benchmark designs

Stay ahead of the curve on LLM capabilities and evaluation methodologies

Your Background

Advanced research experience: MS/PhD in CS, NLP, ML, or related field (exceptional undergrads considered)

Publication record: Papers at NeurIPS, ICML, ACL, EMNLP, etc., especially on NLP, ML evaluation, or benchmarking

Python proficiency for prototyping and experimentation

Excellent communicator, able to synthesize complex ideas for all audiences

Collaborative spirit: Experience working in research teams, open to feedback

Portfolio: Evidence of impactful research

Location: In‑person in San Francisco. Relocation/transportation support provided.

Bonus Points

Experience with LLM evaluation, benchmarking, or foundation models

Collaboration with industry or applied research partners

Background in HCI, psychology, or domain‑specific evaluation

Startup or early‑stage lab experience

Contributions to open‑source evaluation tools/datasets

What’s in It for You?

Competitive salary & meaningful equity

Relocation and transit support

Unlimited PTO

Opportunities to publish, present, and shape the field

Who We Are

Our founding team brings together leading experience from top research institutions and industry giants. The platform’s core is rooted in advanced NLP evaluation research and is backed by premier investors. Our collective work is highly cited, and we’re committed to setting the gold standard for AI benchmarking.

Tech stack: React (TSX) frontend, Django backend, AWS infra.

What Matters Most

Raw intelligence and research ability trump pedigree. We care about what you can build and discover.

Ownership: We move fast and expect initiative. You’ll have autonomy and a chance to make a visible impact.

Intensity: The LLM landscape evolves at breakneck speed. We need researchers who thrive in a dynamic, high‑execution environment.

Solution focus: Every evaluation challenge is an opportunity to innovate.

Seniority level: Mid‑Senior level

Employment type: Full‑time

Job function: Information Technology

Industries: Technology, Information & Media and Research Services
