Bayone

Lead Site Reliability Engineer

Bayone, San Ramon

Job Description:

As a Senior/Lead Site Reliability Engineer, you'll take ownership of the reliability, performance, and scalability of high-traffic retail platforms.
This role demands deep experience in cloud-native environments, a strong observability mindset (with New Relic as a must), and the ability to lead both incident response and system design discussions with client teams.
You'll serve as a technical leader and mentor, partnering with engineering, DevOps, and product teams to build resilient systems for real-time retail operations, including eCommerce platforms like Shopify (a bonus).

Key Responsibilities:

Lead reliability and observability strategy for large-scale retail systems.
Architect and implement robust monitoring using New Relic, including dashboards, SLOs, alerts, and synthetic monitoring.
Guide incident response processes and run blameless postmortems.
Own availability, performance, and scalability of customer-facing apps and services.
Design infrastructure for high availability using Kubernetes, Docker, and IaC tools (Terraform, CloudFormation).
Collaborate with client engineering teams to optimize system behavior during retail surges (e.g., Black Friday).
Mentor junior SREs and set operational best practices.
Partner with dev and QA to integrate performance testing and failure injection into CI/CD workflows.
Advocate for DevOps/SRE best practices (shift-left monitoring, chaos testing, performance budgets).

Required Qualifications:

8+ years in Site Reliability Engineering, DevOps, or Platform Engineering.
Expertise with New Relic; must be able to architect observability end to end.
Proven experience supporting retail or eCommerce platforms at scale.
Strong coding/scripting (Python, Bash, or Go).
Production experience with AWS/GCP/Azure and Kubernetes.
Deep understanding of infrastructure automation (Terraform, Ansible, or Pulumi).
Strong communication skills, client-facing presence, and leadership ability.

Nice to Have:

Experience with Shopify or headless commerce stacks.
Experience leading distributed teams.
Familiarity with traffic-heavy retail events and strategies (caching, autoscaling, edge optimization).
Experience integrating monitoring into microservices, APIs, and frontend apps.