Walmart

Distinguished Data Scientist - Quality & LLM Judging Systems in Conversational Commerce

Walmart, Sunnyvale, California, United States, 94087


Position Summary

Walmart’s Next Gen Commerce team is shaping the future of conversational shopping by building intelligent agents that not only respond, but reason, recommend, and proactively assist customers. As a Distinguished Data Scientist for Quality & LLM Judging Systems in Conversational Commerce, you will serve as the key IC partner to the Director of Data Science for this space. You will lead the technical vision and model development for cutting-edge evaluation methodologies to measure and improve the quality of AI-powered conversations and tool outputs.

You’ll help define how we evaluate our agents and their dependent tools using a combination of human-labeled benchmarks, LLM-as-a-judge systems, and scalable automated pipelines. You’ll design prompts, validate agreement with human judgment, and develop LLM distillation strategies to replicate high-quality judgment cost-effectively.

This is a high-impact, hands-on technical role requiring deep expertise in LLM prompting, evaluation frameworks, and structured experimentation. You will work closely with modeling, product, and platform teams to ensure that measurement drives improvement, and that the agent’s behaviors align with quality, safety, and relevance at every step.

Responsibilities

Design evaluation pipelines for conversational agents and their tool outputs using LLM-as-a-judge, human annotation, and hybrid methods

Develop high-quality prompts for structured evaluation tasks and iterate based on inter-rater reliability with human judges

Develop novel techniques to assess non-textual or subjective outputs—such as recommendations, summaries, and agent-driven actions—where standard metrics fall short

Guide the modeling team to distill or fine-tune smaller LLMs to act as scalable evaluation proxies

Work with engineering partners to integrate evaluation hooks into model training, validation, and production workflows

Conduct in-depth failure mode analysis and define actionable quality signals that inform model and production iteration

Uphold statistical rigor in metric design, validation, and experimental analysis to ensure reliable and interpretable results

Foster a culture of principled measurement and trustworthy AI throughout the organization

Minimum Qualifications

7+ years of experience in data science or machine learning, preferably in evaluation, NLP, or conversational AI

Hands-on experience with large language models, including prompt engineering, response grading, and structured generation tasks

Familiarity with both human annotation workflows and automated evaluation strategies using LLMs

Deep understanding of metric design, evaluation reliability, and statistical validity

Strong software engineering fundamentals and ability to own end-to-end pipelines

Excellent communication skills and the ability to influence without authority across functions

Preferred Qualifications

Graduate degree (M.S./Ph.D.) in Computer Science, Machine Learning, NLP, or a related field

Experience with conversational AI, summarization, retrieval-augmented generation, or recommendation evaluation

Knowledge of model distillation, LoRA, instruction tuning, or parameter-efficient adaptation techniques

Familiarity with evaluating open-ended outputs where ground truth is subjective or contextual

Publications, patents, or open-source contributions in LLM evaluation or applied AI

Why Join Us?

This is a rare opportunity to shape the science behind how intelligent agents are judged, literally. Your work will directly define what “quality” means in conversational commerce and enable AI systems that are not only functional but truly helpful, engaging, and aligned with human expectations.

At Walmart, we offer competitive pay as well as performance-based bonuses and other benefits for a healthier mind, body, and wallet. Health benefits include medical, vision, and dental coverage. Financial benefits include 401(k), stock purchase, and company-paid life insurance. Paid time off includes PTO (including sick leave), parental leave, family care leave, bereavement, jury duty, and voting. Additional benefits include short-term and long-term disability, company discounts, and other programs.

Compensation and Location

Primary Location: 1375 Crossman Ave, Sunnyvale, CA 94089-1114, United States of America

Salary ranges: Sunnyvale, CA: $169,000.00 – $338,000.00; Bentonville, AR: $130,000.00 – $260,000.00. Additional compensation includes annual or quarterly performance bonuses and, for certain roles, stock options.
