Walmart

Walmart, Sunnyvale, California, United States, 94086

Director, Data Science - Quality & LLM Judging Systems for Conversational Commerce

Walmart's Next Gen Commerce team is building intelligent, agentic systems that transform how customers shop through conversation. As Director, Data Science - Quality & LLM Judging Systems for Conversational Commerce, you will lead a critical pillar under the Senior Director of Data Science - Agentic AI for Conversational Commerce. Your mission is to define how we measure the effectiveness of the conversational shopping agent and the tools it invokes, ensuring we evaluate quality with both rigor and scale.

You will lead a team responsible for defining evaluation metrics, designing measurement methodologies, and executing cost-efficient evaluations. This includes combining traditional human-labeled approaches with advanced "LLM-as-a-judge" techniques. You will design prompt-based evaluation tasks, identify when human oversight is needed, and explore how to distill smaller LLMs that replicate human-like evaluation at scale. Beyond conversation quality, your scope includes evaluating the outputs of tools invoked by the agent, such as personalized recommendations, summary generation, or proactive suggestions, where traditional metric-based evaluations fall short and human judgment is required.

This is a hands-on leadership role requiring sharp judgment, strong experimental thinking, and fluency in both LLM prompting and applied ML. You will work closely with modeling, product, and platform teams to ensure that measurement drives improvement and that the agent's behavior aligns with quality, safety, and relevance at every step.

Responsibilities

Grow and lead a high-performing team of data scientists, fostering a culture of technical excellence, fast execution, and clear accountability

Define evaluation strategy and success metrics for the conversational shopping agent and its tool outputs

Develop scalable measurement methodologies combining human-labeled benchmarks, LLM-as-a-judge prompts, and automated pipelines

Design and iterate on prompts that enable LLMs to perform structured evaluation tasks in close agreement with human judgment (see the prompt sketch after this list)

Explore cost-effective alternatives by generating synthetic training data and distilling smaller LLMs to perform specific judging tasks (see the data-preparation sketch after this list)

Establish quality review loops and integrate feedback from evaluations into model and product development

Partner with engineering and product teams to ensure metrics are well-instrumented and align with long-term objectives

Drive tooling and process development to support reliable, reproducible, and efficient evaluation at scale
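
To make the LLM-as-a-judge responsibilities above concrete, here is a minimal sketch of a structured judging task in Python. It assumes an OpenAI-compatible chat API; the model name, rubric, and JSON score schema are illustrative placeholders, not a description of Walmart's production setup.

    import json
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    JUDGE_PROMPT = """You are grading one turn of a conversational shopping agent.

    Conversation so far:
    {conversation}

    Agent response to grade:
    {response}

    Rate the response on a 1-5 scale for each criterion and reply with JSON only:
    {{"relevance": <int>, "helpfulness": <int>, "safety": <int>, "rationale": "<short text>"}}"""

    def judge_turn(conversation: str, response: str) -> dict:
        """Ask the judge model for structured scores on a single agent turn."""
        completion = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder judge model
            temperature=0,        # deterministic grading aids reproducibility
            response_format={"type": "json_object"},
            messages=[{"role": "user", "content": JUDGE_PROMPT.format(
                conversation=conversation, response=response)}],
        )
        return json.loads(completion.choices[0].message.content)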

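In the same hedged spirit, a sketch of the synthetic-data step behind distilling a smaller judge: collect labels from a large judge model on sampled traffic and write them as chat-format JSONL for supervised fine-tuning. The big_judge stub and the file format are hypothetical stand-ins.

    import json

    def big_judge(conversation: str, response: str) -> dict:
        # Hypothetical stand-in for a call to the large judge model
        # (e.g., the judge_turn function sketched above).
        return {"relevance": 5, "helpfulness": 4, "safety": 5}

    sampled_turns = [
        ("User: any trail running shoes under $80?",
         "Here are three options in stock..."),
    ]

    # One training example per sampled turn: the grading request as the
    # user message, the large judge's scores as the target completion.
    with open("judge_distill_train.jsonl", "w") as f:
        for conversation, response in sampled_turns:
            label = big_judge(conversation, response)
            f.write(json.dumps({
                "messages": [
                    {"role": "user",
                     "content": f"Grade this turn.\n{conversation}\n{response}"},
                    {"role": "assistant", "content": json.dumps(label)},
                ],
            }) + "\n")
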
Minimum Qualifications

8+ years of experience in data science or applied machine learning

5+ years leading teams focused on model evaluation, experimentation, or NLP applications

Deep experience with large language models, including prompt engineering, structured evaluation, and response grading

Familiarity with both human annotation workflows and LLM-based evaluators

Strong understanding of metric design, statistical evaluation methods, and A/B testing (a brief agreement-check sketch follows this list)

Ability to translate ambiguous quality goals into concrete, testable evaluation frameworks

Excellent communication and cross-functional collaboration skills
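
On the statistical-evaluation point above: a common sanity check before trusting an LLM judge is its chance-corrected agreement with human labels on a shared benchmark. A minimal sketch using scikit-learn's Cohen's kappa; the label arrays are made-up stand-ins.

    from sklearn.metrics import cohen_kappa_score

    human_labels = ["pass", "fail", "pass", "pass", "fail", "pass", "fail", "pass"]
    judge_labels = ["pass", "fail", "pass", "fail", "fail", "pass", "fail", "pass"]

    kappa = cohen_kappa_score(human_labels, judge_labels)
    print(f"judge-human agreement (Cohen's kappa): {kappa:.2f}")
    # Rule of thumb: kappa above roughly 0.6 suggests the judge can handle
    # routine cases, with humans auditing a sample of the remainder.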

Preferred Qualifications

Advanced degree in Computer Science, Machine Learning, or a related field

Experience with conversational AI, tool-augmented agents, or retrieval-augmented generation

Knowledge of efficient LLM adaptation techniques such as distillation, LoRA, or instruction tuning

Familiarity with evaluating outputs where objective ground truth is undefined (e.g., personalization, summarization, recommendation)

Track record of influencing product quality through principled evaluation and measurement