Work Arrangement:
This role is categorized as hybrid. The successful candidate is expected to report to the innovation center in either Austin, TX or Atlanta, GA three times per week.
The Role:
- Ensure the reliability, scalability, and performance of software systems through diligent monitoring and troubleshooting of system performance and availability.
- Develop and maintain robust automation tools and infrastructure that streamline software deployment, configuration management, and system monitoring.
- Analyze system performance to identify bottlenecks and implement optimizations for enhanced efficiency and scalability.
- Conduct timely incident response and root cause analysis, implementing corrective actions to avert future issues.
- Collaborate effectively with software development teams to embed reliability and scalability considerations into the software design and implementation.
- Drive continuous improvement by identifying opportunities to enhance system reliability and performance through process improvements and best practices.
What You'll Do:
- Implement a scalable and secure SRE and Observability platform to monitor our production system's health and provide a holistic view of the environment.
- Deliver innovative tools/software that enhance the reliability, scalability, and operability of services.
- Collaborate with engineering teams to assess architecture, infrastructure resources, and observability to achieve reliability and scalability goals.
- Participate in production readiness reviews alongside engineering teams during deployment, operation, and refinement stages.
- Work closely with stakeholders to ensure effective integration of data and observability tools with existing systems and processes.
- Help define and monitor metrics of availability, latency, and overall service health in partnership with stakeholders.
- Engage in on-call engineering duty to provide support for production incidents.
- Instill Site Reliability best practices through automation, data insights, and real-time observability.
- Conduct initial incident root cause analysis with engineering teams and facilitate incident postmortems.
- Create and maintain runbooks and tooling necessary for effective production support activities.
- Actively engage in technical discussions and deep dives with the Architectural group.
Your Skills & Abilities (Required Qualifications):
- 7+ years of hands-on experience as an SRE with expertise in software development and systems monitoring, preferably with Azure.
- Proven experience managing high-availability, fault-tolerant, scalable, distributed software in production, including creating monitoring strategies and alert definitions.
- Proficiency in using monitoring and log aggregation frameworks such as Datadog, Azure Monitor/Sentinel, Elasticsearch, and Kibana.
- Strong working knowledge of Docker, Kubernetes, Terraform, Chef, or Ansible.
- Experience troubleshooting JVM-based applications, along with a solid understanding of chaos engineering principles.
- Extensive familiarity with Infrastructure as Code using Terraform, and with trace monitoring using OpenTelemetry.
- Strong programming and scripting abilities in languages like Python, Java, Go, PowerShell, and Bash.
- Knowledge of configuration management, SSO, and managing Big Data/NoSQL frameworks in cloud infrastructure.
- Familiarity with CI/CD automation frameworks like Jenkins or Azure DevOps.
- Comprehensive knowledge of public cloud networking components.
- A track record of leading cross-organizational efforts to raise uptime to at least 99.99%.
- Experience with source control management tools, ideally GitHub.
- Exposure to IoT technology is a plus.
- A degree in Computer Science or Engineering is preferred.
About GM:
Our vision is a world with Zero Crashes, Zero Emissions, and Zero Congestion. We are committed to leading change that will make our world better, safer, and more equitable for all.
Why Join Us:
At GM, we strive for a workplace that fosters inclusion and belonging, where everyone can thrive individually and collectively. We believe in meaningful change through our actions and culture.
Benefits Overview:
From day one, we're focused on your wellness at work and home to help you realize your career ambitions. Explore how GM offers a rewarding career that goes beyond just work.