Walmart
Distinguished Data Scientist - Quality & LLM Judging Systems in Conversational Commerce
Walmart, Sunnyvale, California, United States, 94087
Position Summary
Walmart’s Next Gen Commerce team is shaping the future of conversational shopping by building intelligent agents that not only respond, but reason, recommend, and proactively assist customers. As a Distinguished Data Scientist for Quality & LLM Judging Systems in Conversational Commerce, you will serve as the key IC partner to the Director of Data Science for this space. You will lead the technical vision and model development for cutting-edge evaluation methodologies to measure and improve the quality of AI-powered conversations and tool outputs.
You’ll help define how we evaluate our agents and their dependent tools using a combination of human-labeled benchmarks, LLM-as-a-judge systems, and scalable automated pipelines. You’ll design prompts, validate agreement with human judgment, and develop LLM distillation strategies to replicate high-quality judgment cost-effectively.
This is a high-impact, hands-on technical role requiring deep expertise in LLM prompting, evaluation frameworks, and structured experimentation. You will work closely with modeling, product, and platform teams to ensure that measurement drives improvement, and that the agent’s behaviors align with quality, safety, and relevance at every step.
Responsibilities
Design evaluation pipelines for conversational agents and their tool outputs using LLM-as-a-judge, human annotation, and hybrid methods
Develop high-quality prompts for structured evaluation tasks and iterate based on inter-rater reliability with human judges
Develop novel techniques to assess non-textual or subjective outputs—such as recommendations, summaries, and agent-driven actions—where standard metrics fall short
Guide the modeling team to distill or fine-tune smaller LLMs to act as scalable evaluation proxies
Work with engineering partners to integrate evaluation hooks into model training, validation, and production workflows
Conduct in-depth failure mode analysis and define actionable quality signals that inform model and production iteration
Uphold statistical rigor in metric design, validation, and experimental analysis to ensure reliable and interpretable results
Foster a culture of principled measurement and trustworthy AI throughout the organization
Minimum Qualifications
7+ years of experience in data science or machine learning, preferably in evaluation, NLP, or conversational AI
Hands-on experience with large language models, including prompt engineering, response grading, and structured generation tasks
Familiarity with both human annotation workflows and automated evaluation strategies using LLMs
Deep understanding of metric design, evaluation reliability, and statistical validity
Strong software engineering fundamentals and ability to own end-to-end pipelines
Excellent communication skills and the ability to influence without authority across functions
Preferred Qualifications
Graduate degree (M.S./Ph.D.) in Computer Science, Machine Learning, NLP, or a related field
Experience with conversational AI, summarization, retrieval-augmented generation, or recommendation evaluation
Knowledge of model distillation, LoRA, instruction tuning, or parameter-efficient adaptation techniques
Familiarity with evaluating open-ended outputs where ground truth is subjective or contextual
Publications, patents, or open-source contributions in LLM evaluation or applied AI
Why Join Us?
This is a rare opportunity to shape the science behind how intelligent agents are judged, literally. Your work will directly define what “quality” means in conversational commerce and enable AI systems that are not only functional but truly helpful, engaging, and aligned with human expectations.
At Walmart, we offer competitive pay as well as performance-based bonuses and other benefits for a healthier mind, body, and wallet. Health benefits include medical, vision, and dental coverage. Financial benefits include 401(k), stock purchase, and company-paid life insurance. Paid time off includes PTO (including sick leave), parental leave, family care leave, bereavement, jury duty, and voting. Additional benefits include short-term and long-term disability, company discounts, and other programs.
Compensation and Location
Primary Location: 1375 Crossman Ave, Sunnyvale, CA 94089-1114, United States of America
Salary ranges: Sunnyvale, CA: $169,000.00 – $338,000.00; Bentonville, AR: $130,000.00 – $260,000.00. Additional compensation includes annual or quarterly performance bonuses and, for certain roles, stock options.