BJ's Wholesale Club
SR Director of IT Operations & Service Excellence
BJ's Wholesale Club, Marlborough, Massachusetts, US, 01752
Overview
SR Director of IT Operations & Service Excellence – BJ's Wholesale Club. The role leads uptime and resiliency across BJ's digital and enterprise technology landscape, spanning applications, infrastructure, and security. It is responsible for defining service reliability standards, publishing SLOs/SLIs, and building the organizational capability to deliver them with data‑driven decisions. Reports to the VP of Infrastructure & Operations.
Responsibilities
Define and execute the multi‑year IT Service Excellence maturity roadmap aligned to business objectives, cloud migration plans, and uptime and resiliency requirements. Craft multi‑year resiliency and cost‑optimization roadmaps aligned to company growth goals. Implement IT operations best practices and collaborate with product teams to ensure reliability and scalability from design. Partner with Enterprise Architecture to define standards for building reliable, highly available systems. Define Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for all critical services. Foster a high‑trust, blameless culture emphasizing learning, experimentation, and excellence. Own the IT Operations & Service Excellence budget; optimize OpEx through automation, self‑service, and vendor management.
IT Operations & Incident Management (24×7 Command Center, NOC & Service Desk)
Oversee real‑time monitoring, incident triage, and major‑incident management with attention to MTTR and SLA adherence. Maintain a high‑performing L1 Service Desk; drive call deflection via knowledge, AI chatbots, and self‑service password reset. Publish operational metrics (MTTA, MTTR, FCR, abandon rate) and provide actionable insights. Lead major incident management (MIM), including escalation paths and cross‑functional coordination with timely stakeholder communication. Oversee the end‑to‑end incident lifecycle from identification to post‑incident analysis and continuous improvement. Manage on‑call rotations and 24×7 coverage with major incident managers. Develop and enforce robust playbooks for MIM processes, with defined roles and triage procedures. Matrix‑manage people, processes, and resources, including third parties, to move incidents toward resolution.
Change & Release Governance
Chair the Change Advisory Board (CAB); aim for 99%+ change success while accelerating deployment velocity. Implement risk‑based change classification; ensure end‑to‑end testing, automated pre‑deployment checks, rollback processes, and post‑implementation reviews.
Site Reliability Engineering (SRE) & Observability
Develop and implement SRE policies, standards, and best practices for enterprise‑wide systems. Lead SRE squads covering AWS, data centers, network/edge, and SaaS platforms. Set error budgets, reliability targets, and chaos‑engineering practices; ensure RTO/RPO meet DR objectives. Collaborate with Service Managers for Digital, Membership, Enterprise, and Club & Fuel systems to deliver integrated SRE. Drive end‑to‑end service design including service maps and dependency graphs; enhance observability tooling. Lead the roadmap for logging, metrics, tracing, and AIOps platforms with actionable insights and predictive alerting.
Engineering Excellence & Practices
Assess the impact of system requirements across cloud and on‑premises technologies; promote reliability and engineering excellence. Advance problem detection and service restoration processes; promote automated telemetry within DevOps. Implement Site Reliability Engineering methods with self‑healing and automation to strengthen the digital infrastructure.
Process Ownership & Continuous Improvement
Codify SOPs and RACI matrices across Ops, SRE, Service Desk, and engineering partners. Lead Lean/Kaizen initiatives to reduce toil and boost productivity. Track OKRs and drive data‑driven course corrections; lead RCA and systemic problem management.
Compliance, Security & Risk
Partner with Cybersecurity and Compliance teams to meet PCI‑DSS, SOX, and data‑privacy obligations. Ensure internal and external audits are supported by operational controls.
People Development
Demonstrate robust technical leadership in Site Reliability Engineering; foster psychological safety and continuous learning. Coach and develop managers; build and retain a cross‑functional team spanning Service Desk, Command Center, SRE, Change Governance, Problem Management, and Analytics.
Required Qualifications
Bachelor’s degree in Computer Science, Engineering, or related discipline (Master’s preferred). 15+ years of progressive IT Operations leadership with 5+ years at a Director/Head level in large‑scale, distributed environments (Retail preferred). Proven track record guiding teams through outages and scalability challenges. 5+ years of oversight of 24×7 operations (NOC, Service Desk) and SRE/DevOps. Cloud‑oriented system design and architecture experience across hybrid cloud (AWS) and on‑prem data centers. ITIL v4/Service Management expertise; ITIL certification strongly desired. Experience with observability, AIOps, and automation platforms (e.g., ServiceNow, OpsRamp, SolarWinds, New Relic, PagerDuty). Excellent communication and executive presence; able to brief C‑suite on risk and performance.
Preferred Qualifications
Retail industry experience managing store, fuel, and distribution center technologies. ServiceNow certifications; Lean Six Sigma or Continuous Improvement accreditation.
Work Environment & Travel
Hybrid work model (Westborough, MA HQ) with periodic visits to data centers, distribution centers, and club locations. Occasional travel. Hybrid role: in‑office Tue–Thu at Marlborough, MA; Mon & Fri remote.
Compensation
Pay transparency: pay starts at $179,000, with consideration of location, education, experience, and qualifications. Note: This description reflects current expectations and may be adjusted at BJ's Wholesale Club's discretion.
Application Details
Referrals increase your chances of interviewing at BJ's Wholesale Club by 2x.