Salesforce, Inc.

Lead Site Reliability Engineer (San Francisco)

Salesforce, Inc., San Francisco

About Salesforce

We're Salesforce, the Customer Company, inspiring the future of business with AI + Data + CRM. Leading with our core values, we help companies across every industry blaze new trails and connect with customers in a whole new way. And we empower you to be a Trailblazer, too, driving your performance and career growth, charting new paths, and improving the state of the world. If you believe in business as the greatest platform for change and in companies doing well and doing good, you've come to the right place.

The Marketing Automation Platform & Data Operations team operates within the Marketing Technology organization and is instrumental in providing the trusted, transparent, and reliable platforms that empower our company to achieve its innovation goals. We focus on proactively addressing challenges in platform reliability and operational efficiency, particularly across our critical Marketing Technology ecosystem.
Given the importance of incident management and the criticality of our technology, our team requires an experienced and self-motivated Site Reliability Engineer to ensure the highest standards of Trust and Security. This role will collaborate with Platform Operations and Platform Engineering to improve the reliability, performance, and scalability of our systems by implementing and maintaining automated solutions for monitoring, incident response, and system optimization, as well as contributing to strategic planning and technology decisions.

What Are We Looking For?

Role Overview: As a Lead Site Reliability Engineer, you will play a pivotal role in ensuring the reliability, performance, and scalability of our critical software systems and infrastructure within an enterprise IT environment. You will serve as a technical leader, bridging software engineering and system administration, with a particular emphasis on monitoring, visualization, and alerting tools such as Datadog, Splunk, Grafana, New Relic, Tableau, and PagerDuty. You will take ownership of service reliability, lead incident investigations, and drive automation initiatives to enhance system stability and operational efficiency.

Technical Expertise:

Monitoring and Visualization Platforms: Deep expertise in Datadog, Splunk, Grafana, New Relic, and Tableau for proactive monitoring, alerting, and comprehensive visualization of system performance and reliability metrics.
Salesforce Ecosystem: Experience managing reliability and performance within the Salesforce ecosystem, including the Salesforce Platform, Slack, Data Cloud, Tableau and Heroku.
Cloud Infrastructure: Extensive experience with cloud platforms (AWS, Azure, Google Cloud) for infrastructure management and monitoring.
Coding & Scripting: Advanced proficiency in programming and scripting languages such as Python, Go, Java, or equivalent, focused on automation and monitoring integration.
Infrastructure as Code (IaC): Proven capability using tools such as Terraform, Ansible, and Kubernetes for infrastructure automation and provisioning.
CI/CD Pipelines: Comprehensive knowledge of CI/CD processes to ensure reliable and efficient software deployment (Jenkins, Copado, Gearset).

Operational Skills:

Incident Response: Leadership in incident investigations, driving swift resolutions using incident management tools, particularly PagerDuty.
Service Level Objectives (SLOs): Expertise in defining and managing SLOs and SLAs, utilizing monitoring tools (Datadog, Splunk, New Relic) for accurate tracking and reporting.
Documentation and Knowledge Sharing: Strong skills in documenting best practices, incident responses, and operational procedures using tools such as Wikis, Notion, and Tableau dashboards.

Problem Solving:

Root Cause Analysis: Expertise in conducting detailed root cause analyses using data from monitoring and visualization tools like Splunk, Datadog, Grafana, and New Relic.
Troubleshooting: Advanced troubleshooting capabilities across infrastructure, leveraging insights from comprehensive monitoring systems.
Process Improvement: Proven ability to identify and implement automation and process improvements to enhance reliability and reduce manual efforts.

Vendor and Relationship Management:

Collaboration: Excellent ability to collaborate across teams including developers, platform engineers, architects, QA, and operations, maintaining alignment and effective communication.
Stakeholder Engagement: Act as liaison among product, engineering, and operations teams, emphasizing reliability insights derived from platforms like Tableau.

Disaster Recovery and Incident Management:

Escalation Management: Primary point of contact for escalations related to reliability, utilizing PagerDuty to ensure rapid and structured incident responses.
Disaster Recovery: Active participation in developing and executing disaster recovery plans with continuous monitoring and alerting using the above-mentioned tools.

Communication and Leadership:

Effective Communication: Strong verbal and written communication, especially in presenting complex technical insights through platforms such as Tableau.
Mentorship: Demonstrated capability in mentoring junior engineers, fostering a high-performance culture focused on proactive monitoring and reliability.

Innovation and Continuous Learning:

Industry Trends: Continuous learning on industry advancements, especially relating to monitoring, observability, and visualization technologies.
Knowledge Management: Contribution to internal training, documentation, and knowledge-sharing practices, leveraging detailed analytics from monitoring and visualization platforms.

Flexibility:

Adaptability: Capability to manage shifting priorities in fast-paced development cycles while maintaining operational excellence and composure.

Minimum Qualifications:

8+ years of relevant industry experience, with an emphasis on monitoring, alerting, and visualization systems.
Advanced expertise with Datadog, Splunk, Grafana, Tableau, New Relic, and PagerDuty.
Deep knowledge of cloud infrastructures (AWS, Azure, GCP).
Experience managing reliability within the Salesforce ecosystem.
Proven ability in incident escalation and disaster recovery management.
Strong relationship-building skills across technical and business teams.
Excellent verbal, written, and interpersonal skills.

Preferred Qualifications:

Experience in Enterprise-scale environments, particularly with Salesforce technology (Heroku, SF Platform, Data Cloud).
Familiarity with configuration management tools (Ansible, Puppet, Chef) and log management (Elastic, Logstash, Kibana).
Relevant industry certifications (AWS, GCP, Kubernetes).
