RecruitSeq
This range is provided by RecruitSeq. Your actual pay will be based on your skills and experience — talk with your recruiter to learn more.
Base pay range $240,000.00/yr - $300,000.00/yr
Founding ML Research Scientist
Seattle, WA (On-Site)
Our client is a stealth-stage AI startup developing a real-time human foundation model that brings social and emotional intelligence to voice, face, and body for next-generation interactive experiences. The team is well-funded, early-stage, and focused on building core foundation models rather than application-layer features.
This role leads the end-to-end training of large multimodal, autoregressive models that jointly reason over text, speech, facial expression, and body language in real time. You will own research, data strategy, and large-scale training pipelines to power lifelike interactive avatars that respond with nuanced expressions, gestures, and tone frame by frame.
Responsibilities
Design and train large multimodal autoregressive models across text, audio, and video (face and body) for real-time interaction.
Develop model architectures, objectives, and optimization strategies to capture fine-grained human signals (e.g., prosody, micro-expressions, body pose dynamics).
Build scalable training, evaluation, and deployment pipelines for low-latency inference in production environments.
Define data collection, curation, and labeling strategies for multimodal human interaction datasets, including safety and privacy guardrails.
Establish rigorous offline and online evaluation frameworks for social/emotional intelligence, realism, and responsiveness.
Collaborate closely with founding researchers and leadership to translate research breakthroughs into product-ready capabilities.
Mentor junior researchers/engineers and help set technical standards, coding practices, and research culture in a small, high-ownership team.
Qualifications
3+ years of experience training large-scale multimodal or language models, autoregressive architectures, or closely related foundation model work (industry or post-PhD).
Strong background in deep learning for one or more of: speech, audio, computer vision (face/body), or sequence modeling.
Hands-on experience implementing and training transformer-based or similar architectures with modern ML frameworks (e.g., PyTorch, JAX, or TensorFlow).
Proven track record of end-to-end model development: problem formulation, experimentation, training, evaluation, and iteration.
Solid software engineering skills, including writing production-quality Python and working with large-scale training infrastructure on GPUs/TPUs.
Comfortable working onsite five days per week in the Seattle area in a fast-paced, highly collaborative environment.
Preferred Skills
PhD in Computer Science, Electrical Engineering, Robotics, or related field with research in multimodal learning, generative models, or human–computer interaction.
Experience building or training MLLMs, conversational agents, or avatar/embodiment systems that combine vision, audio, and language.
Publications at top-tier ML or vision/graphics conferences (e.g., NeurIPS, ICML, ICLR, CVPR, ICCV, SIGGRAPH).
Prior experience in early-stage startups or small, fast-moving research teams with high ownership.
Familiarity with real-time inference optimization (quantization, distillation, on-device deployment) and streaming architectures.
Seniority level: Mid-Senior level
Employment type: Full-time
Job function: Research and Engineering
Industries: Staffing and Recruiting; Software Development