Bayone
Job Description:
- As a Senior/Lead Site Reliability Engineer, you'll take ownership of the reliability, performance, and scalability of high-traffic retail platforms.
- This role demands deep experience in cloud-native environments, a strong observability mindset (with New Relic as a must), and the ability to lead both incident response and system design discussions with client teams.
- You'll serve as a technical leader and mentor, partnering with engineering, DevOps, and product teams to build resilient systems for real-time retail operations, including eCommerce platforms like Shopify (bonus).
- Lead reliability and observability strategy for large-scale retail systems.
- Architect and implement robust monitoring using New Relic: dashboards, SLOs, alerts, synthetic monitoring, etc.
- Guide incident response processes and run blameless postmortems.
- Own availability, performance, and scalability of customer-facing apps and services.
- Design infrastructure for high availability using Kubernetes, Docker, and IaC tools (Terraform, CloudFormation).
- Collaborate with client engineering teams to optimize system behavior during retail surges (e.g., Black Friday).
- Mentor junior SREs and set operational best practices.
- Partner with dev and QA to integrate performance testing and failure injection into CI/CD workflows.
- Advocate for DevOps/SRE best practices (shift-left monitoring, chaos testing, performance budgets).
- 8+ years in Site Reliability Engineering, DevOps, or Platform Engineering.
- Expertise with New Relic; must be able to architect observability end-to-end.
- Proven experience supporting retail or eCommerce platforms at scale.
- Strong coding/scripting (Python, Bash, or Go).
- Production experience with AWS/GCP/Azure and Kubernetes.
- Deep understanding of infrastructure automation (Terraform, Ansible, or Pulumi).
- Strong communication skills, client-facing presence, and leadership ability.
Nice to Have:
- Experience with Shopify or headless commerce stacks.
- Experience leading distributed teams.
- Familiarity with traffic-heavy retail events and strategies (caching, autoscaling, edge optimization).
- Experience integrating monitoring into microservices, APIs, and frontend apps.