Toshiba Global Commerce Solutions
Director of Production Engineering (Reliability Platform Engineering)
Toshiba Global Commerce Solutions, Durham, North Carolina, United States, 27703
Toshiba Global Commerce Solutions is seeking a Director of Production Engineering to lead the reliability backbone of our global POS, cloud, and middleware platform. This strategic role owns system availability, resilience, performance, observability, and release reliability across a distributed, mission‑critical commerce ecosystem.
This leader will unify Site Reliability Engineering (SRE), Resilience & Performance Engineering, Observability, and AI‑driven Reliability Automation into one cohesive function. As AI accelerates development velocity, verification and reliability become the core bottlenecks—making this role a cornerstone of our engineering organization.
You will partner closely with Architecture, Cloud Operations, Functional Quality Engineering, and Software Development to ensure predictable reliability, smooth releases, and dramatically fewer Sev‑1/Sev‑2 incidents.
Responsibilities
System Reliability & Uptime
Define and enforce SLO/SLA frameworks, error budgets, and release criteria.
Lead availability, resilience, and performance strategy across all services.
Own MTTR, MTBF, incident prevention, and rollback strategies at scale.
Unified Reliability Engineering Organization
Lead teams across SRE & L3 Engineering, Resilience & Performance Engineering, Observability & Telemetry, and AI Reliability Automation.
Build a culture focused on prevention over firefighting.
Architecture‑Level Reliability
Collaborate with Principal Engineers and Architects to define system guardrails, resilience patterns, and failure modes.
Ensure high‑quality Production Readiness Reviews (PRRs) and architectural consistency.
Resilience & Performance Engineering
Own chaos, failover, load, stress, and soak testing strategies.
Validate store‑mode behavior, payment workflows, edge‑device dependencies, and multi‑service interactions.
Observability & Telemetry
Ensure complete, accurate signals across logs, traces, metrics, and business health.
Partner with AI systems to build intelligent anomaly detection pipelines.
AI‑Driven Release Reliability
Integrate AI‑based reliability scoring, resiliency prediction, automated gating, regression analysis, and incident pattern detection.
Define the path toward autonomous release reliability pipelines.
Cross‑Org Leadership
Partner with Software Development, Functional Quality Engineering, Cloud Operations, Architecture, and TPM/TPO teams.
Drive multi‑team initiatives and ensure readiness across complex release trains.
Required Experience
Bachelor’s degree in Computer Science or Engineering, or 10‑15 years of direct experience.
10–15+ years in SRE, Reliability Engineering, Production Engineering, Distributed Systems, and Performance/Resilience Engineering.
Proven ownership of uptime and system reliability in complex distributed architectures.
Expertise in distributed systems, cloud platforms (AKS, Kubernetes), observability stacks (OpenTelemetry, Grafana, App Insights, Datadog), performance tuning, fault tolerance, network fundamentals, DB/service scaling, and chaos testing.
Architectural Leadership: Experience designing resilience patterns (timeouts, retries, hedging, circuit breakers). Strong partnership with architects and senior engineers.
Operational Maturity: Led SRE/on‑call organizations. Defined SLOs, SLIs, and error budgets at scale. Track record of driving incident prevention culture.
Leadership & Communication: Builds strong engineering teams and hires top talent. Influential communicator with executives and cross‑functional teams. Highly collaborative and low‑ego.
Preferred Requirements
AI‑driven anomaly detection, regression analysis, incident clustering, reliability scoring.
Experience with retail POS, payments, edge devices, or store environments. Hybrid cloud + edge architectures.
Leading reliability transformations and scaling engineering organizations (200→500+).
Why This Role Matters
Uptime becomes engineered, not reactive.
Development and QA operate at AI‑enabled speed.
Our platform grows safely while delivering stability and performance.
We match or surpass best‑in‑class tech organizations (Google, Amazon, Azure, Stripe).
You will build the production engineering foundation that powers our next decade of innovation.
Benefits
Group health coverage (medical, dental, & vision)
Employee Assistance Programs
Pre‑tax spending accounts
401(k) plan (with company match)
Company‑provided life insurance
Pet insurance
Employee discounts
Generous paid holiday schedule, paid vacation & sick/personal days
EEO
Toshiba Global Commerce Solutions is an equal opportunity/affirmative action employer that evaluates qualified applicants without regard to age, ancestry, color, religious creed, disability, marital status, medical condition, genetic information, military or veteran status, national origin, race, sex, gender, gender identity, gender expression, sexual orientation, or any other protected factor. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements.
Individuals who need a reasonable accommodation because of a disability for any part of the employment process should email benefits@toshibagcs.com to request an accommodation.
Diversity, Equity & Inclusion
We at Toshiba Global Commerce Solutions firmly believe that our people are integral to the success of our customers. We are committed to Diversity, Equity, and Inclusion for all our people, as highlighted by our 5 Core Principles (Create Outreach, Foster Belonging, Unleash Opportunity, Diverse Cultural Engagement, and Culture of Transparency). We are passionate about our customers in the retail industry and about becoming a more responsible company as we help create a brighter future.
About Toshiba Global Commerce Solutions
Toshiba Global Commerce Solutions is a dynamic billion‑dollar global company based in Research Triangle Park, NC, providing retail store solutions to your favorite brands. We power self‑checkout at Lowe’s Foods and earned fuel rewards at Kroger, and we facilitate payments at retailers such as Walmart, Michaels, Carrefour, The Gap, Calvin Klein, Boots, Cencosud, BJ’s, and Costco, our installed market share leader.
The nature of retail is changing quickly, so if you share our “Together Commerce” vision of a seamless two‑way, participatory shopping experience, let’s get together to drive the new economy.