General Motors

Senior Staff Software Engineer - Site Reliability and Observability

General Motors, Warren, Michigan, United States, 48091

As a Senior Staff Software Engineer in Site Reliability and Observability, you will play a crucial role in ensuring the performance, reliability, and scalability of our software systems. Join our team and contribute to our mission to deliver exceptional software experiences.

Key Responsibilities:

System Monitoring and Troubleshooting: Monitor software systems for performance and availability, quickly resolving issues and implementing proactive measures to avert future incidents.

Automation and Infrastructure: Develop and maintain automation tools and infrastructure to enhance software deployment and system monitoring.

Performance Optimization: Analyze system performance, identify bottlenecks, and execute optimizations effectively.

Incident Response: Manage incident response and conduct root cause analyses to prevent recurrence.

Collaboration: Work closely with development teams to incorporate reliability and scalability into the design and implementation of software.

Continuous Improvement: Identify areas for process improvement and drive initiatives to enhance system reliability and performance.

What You'll Do:

Implement a scalable and reliable SRE and Observability platform to monitor production health comprehensively.
Deliver tools to improve the reliability, scalability, and operability of our services.
Collaborate with engineering teams for production readiness reviews and deployments.
Partner with stakeholders to effectively integrate observability tools with other systems.
Participate in on-call support, providing engineering assistance for production issues.
Create run books and tooling to aid in production support activities.
Engage in detailed technical discussions with architectural teams.

Required Qualifications:

7+ years of hands-on SRE experience with software development and systems monitoring, preferably in Azure.
Experience operating high-availability, fault-tolerant software in production.
Proficient with monitoring frameworks such as Datadog, Dynatrace, or Azure Monitor.
Strong knowledge of containerization (Docker, Kubernetes) and Infrastructure as Code (Terraform).
Skilled in scripting languages like Python, Java, Go, and Bash.
Familiar with CI/CD automation frameworks (Jenkins/Azure DevOps).
Understanding of public cloud networking components.

This position requires reporting to our innovation centers in Austin, TX or Atlanta, GA three times per week.

Relocation benefits are available for eligible candidates.