Walmart

Distinguished, Architect - AI/ML

Walmart, Sunnyvale, California, United States, 94087

Position Summary

Walmart Global Tech’s Site Reliability Engineering organization is seeking a Distinguished AI/ML Engineer to architect revolutionary agentic AI systems that autonomously monitor, predict, and resolve issues across the technology ecosystem of the world’s largest retailer. The role impacts millions of customers and associates globally by transforming traditional SRE practices into self‑healing, intelligent platforms.

What You'll Do

Architect and develop advanced agentic AI systems that can autonomously handle complex reliability engineering workflows, predictive failure analysis, and self‑optimization across all Walmart technology systems.

Design and implement multi‑agent orchestration platforms coordinating AI agents for automated incident response, capacity planning, and performance optimization across e‑commerce, supply chain, and in‑store systems.

Build intelligent observability and monitoring systems using ML‑driven anomaly detection, predictive analytics, and autonomous incident resolution capabilities spanning the entire Walmart technology ecosystem.

Develop self‑healing infrastructure platforms that leverage AI to predict, prevent, and automatically resolve system issues before they impact customers, associates, or business operations across any Walmart system.

Design and build advanced tools that improve the reliability, latency, availability, and scalability of all Walmart Tech systems, including scaling solutions, automation, and enhanced instrumentation.

Architect and implement fault‑tolerant systems and services across Walmart’s hybrid cloud infrastructure with a focus on autonomous recovery and intelligent failure prediction for e‑commerce, supply chain, financial services, and in‑store technology.

Collaborate with engineering teams and leadership across all Walmart technology organizations to establish technical strategies that improve MTTR and MTTD through intelligent automation and predictive capabilities.

Partner with E‑commerce, Supply Chain, Store Technology, Fintech, and Data Platform teams to deliver autonomous reliability solutions via advanced ML, NLP, and CV technologies.

Drive the development of MLOps and AIOps platforms enabling continuous learning, model deployment, monitoring, and autonomous optimization of reliability engineering systems.

Implement advanced CI/CD pipelines for reliability systems, including automated deployment, validation, and rollback mechanisms with built‑in observability and performance monitoring.

Establish platform engineering excellence by building reusable SRE infrastructure, intelligent monitoring platforms, and developer productivity tools serving all Walmart engineering teams.

Provide technical mentorship and guidance on advanced SRE concepts, AI/ML for reliability, platform engineering best practices, and autonomous system design.

What You'll Bring

Advanced AI/ML & Agentic Systems Expertise

12+ years of expert‑level experience with machine learning algorithms, deep learning frameworks (TensorFlow, PyTorch), and production ML deployment at enterprise scale.

Hands‑on experience building agentic AI systems, multi‑agent frameworks, LLM‑based agents, and autonomous decision‑making platforms.

Proven ability to architect and implement AI‑driven solutions for complex technical challenges.

Enterprise‑Scale Site Reliability Engineering Mastery

Comprehensive SRE expertise including Incident, Problem & Change Management, Performance Engineering, and capacity planning for mission‑critical systems.

Deep understanding of reliability KPIs (MTTD, MTTR, availability) with a track record of improving system reliability at scale.

Experience with chaos engineering, fault injection, and building self‑healing systems across diverse technology stacks.

Cloud‑Native Platform Engineering at Scale

Expert‑level cloud engineering experience (Azure, GCP, AWS) with containerization (Kubernetes, Docker) and serverless architectures.

Strong platform engineering skills: Infrastructure as Code (Terraform, CloudFormation), service mesh, developer productivity tools.

Experience designing and implementing self‑service ML deployment platforms and API gateways for enterprise environments.

Advanced Observability & Monitoring Excellence

Expertise with distributed tracing (OpenTelemetry, Jaeger), metrics collection (Prometheus, Grafana), and log aggregation (ELK, Splunk).

Hands‑on experience building AI‑driven anomaly detection, predictive monitoring systems, and ML‑specific dashboards.

Proven ability to implement comprehensive observability solutions for complex AI/ML pipelines and distributed systems.

About Walmart Global Tech

Imagine working in an environment where one line of code can make life easier for hundreds of millions of people. Walmart Global Tech is a team of software engineers, data scientists, cybersecurity experts, and service professionals driving the next retail disruption.

Benefits

Beyond competitive compensation, we offer 401(k) match, stock purchase plan, paid maternity and parental leave, PTO, multiple health plans, and more. Additional perks include incentive awards for performance, education benefits, and company‑paid life insurance.

Equal Opportunity Employer

Walmart, Inc. is an Equal Opportunity Employer – By Choice. We believe we are best equipped to help our associates, customers, and the communities we serve live better when we really know them. That means understanding, respecting, and valuing unique styles, experiences, identities, ideas, and opinions – while being inclusive of all people.

Minimum Qualifications

Bachelor’s degree in computer science, computer engineering, software engineering, or related field plus 6 years’ experience in software engineering or architecture, or

8 years’ experience in software engineering or architecture.

Preferred Qualifications

Master’s degree in computer science, computer engineering, software engineering, or related field and 4 years’ experience in software engineering or architecture.

Knowledge of accessibility best practices, WCAG 2.2 AA standards, assistive technologies, and digital accessibility integration.

Primary Location

1345 Crossman Ave, Sunnyvale, CA 94089‑1114, United States
