Shutterfly
Senior Site Reliability Engineer II
Shutterfly, Fort Mill, South Carolina, United States, 29715
At Shutterfly, we make life’s experiences unforgettable. We believe there is extraordinary power in self‑expression, which is why our family of brands helps customers create products and capture moments that reflect who they uniquely are.
What You’ll Do
Perform advanced performance analysis and troubleshooting across distributed systems to ensure optimal availability, scalability, and cost efficiency.
Implement and maintain monitoring, alerting, and observability solutions to provide proactive visibility into application and infrastructure health.
Partner with development teams to influence service design and architecture so that new features meet high standards for reliability and scalability.
Participate in incident response, including root cause analysis and long‑term reliability improvements.
Contribute to capacity planning, cost optimization, and performance tuning of large‑scale systems.
Build and maintain automation and tooling that reduces manual effort, accelerates delivery, and minimizes human error.
Explore and apply AI/ML technologies (e.g., anomaly detection, predictive scaling, automated alerting) to enhance SRE practices.
Share expertise with peers by documenting best practices, solutions, and troubleshooting methodologies.
Collaborate across infrastructure, development, and business teams to align on standards and reliability goals.
Provide technical depth and decisive action during critical incidents.
The Skills You’ll Bring
5‑7+ years of experience in software engineering, SRE, or DevOps roles supporting large‑scale, highly available systems.
Strong skills in performance troubleshooting, root cause analysis, and distributed system optimization.
Proficiency in at least one programming language (Python, Go, Java, or similar) with ability to write production‑quality code.
Hands‑on experience with observability platforms (e.g., Splunk, Datadog, SignalFx, Prometheus, OpenTelemetry).
Strong knowledge of AWS services, cloud deployment models, and cost optimization strategies.
Experience with Infrastructure as Code (Terraform, CloudFormation) and configuration management (Ansible, Chef, Puppet).
Solid understanding of distributed systems concepts (scalability, high availability, fault tolerance).
Experience in incident management and driving operational improvements.
Exposure to AI/ML or AIOps tools for anomaly detection, predictive analytics, or automated incident response (preferred but not required).
Effective communication skills with ability to work across engineering and business teams.
Bachelor’s degree in Computer Science, Engineering, or equivalent experience.
Benefits & Compensation
Compensation ranges (specific to locations) include:
California: $106,000‑151,000
Connecticut and New York: $106,000‑138,250
Colorado, Illinois, Minnesota and Washington: $106,000‑128,000
Nevada: $99,750‑138,250
Maryland and New Jersey: up to $138,250
Hawaii: $99,750‑112,750
This position may be eligible for a bonus incentive, health benefits, a 401(k) program, and other employee perks. More details are available at https://shutterflyinc.com/benefits/.
The role can be remote, but candidates must reside in any U.S. state where Shutterfly is registered to do business, except the District of Columbia, North Dakota, Mississippi, Rhode Island, Vermont, and Wyoming.
Equal Opportunity
Supporting a diverse and inclusive workforce is important to Shutterfly because it reflects our value of embracing our differences and is the right thing for our business and people. We welcome all applicants and evaluate them based on their qualifications, without regard to age, race, creed, color, national origin, ancestry, marital status, affectional or sexual orientation, gender identity or expression, disability, nationality, sex, or any other characteristic protected by law.