Gravity IT Resources

Site Reliability Engineer Managing Director (Charlotte)

Gravity IT Resources, Charlotte, North Carolina, United States, 28245

Job Title: SRE Manager Location: Charlotte, NC (onsite) 3 days a week hybrid Direct Hire Key Responsibilities: Lead the expansion of SRE practices from a small and high performing team to a larger global function incorporating on-premise infrastructure technologies. Evaluate current operational workflows and RACIs, identify toil and complete assessment of skills across the global team. Execute a comprehensive roadmap to transition reactive operational day to day activities into proactive, SRE-aligned processes with a focus on reliability, automation, observability, and incident management. Upskill team members through tailored training programs on SRE principles, cloud operations and automation tools. Collaborate with architects, platform engineering, ServiceNow developers and application teams to define and implement an observability framework in order to enhance proactive incident detection and reduce MTTR. Define and implement an automation framework to ensure sustainable, responsible, and effective use of automation to reduce toil and risk. Define and regularly review SLIs, SLOs, SLAs, error budgets, and incident response processes. Oversee recruitment, orientation, and professional development of the global SRE team. Foster a high-performing team culture. Build strong relationships with internal and external stakeholders. Prepare and present reports on operational performance. Oversee incident response and post-incident analysis processes and drive a culture of blameless post-mortems across multiple teams.

Key Requirements: Proven experience in building and leading Operational and Engineering teams. Adept at fostering collaboration between SRE and application development teams to drive operational excellence, reduce downtime, and help application teams accelerate delivery cycles. Have defined and monitored SRE principles including SLIs, SLOs, SLAs, error budgets, and incident response strategies. Has overseen incident response processes, skilled in post-incident analysis and conducting blameless post-mortems with multiple teams, driving proactive measures to prevent future incidents. Experience of spearheading automation initiatives using Terraform, and significantly reducing infrastructure provisioning time. Experience of Monitoring & Observability tools such as Logic Monitor, Azure Monitor, Prometheus, Grafana, Dynatrace and Splunk. Experience with ServiceNow and Azure DevOps and solid understanding of Agile, ITIL and ITSM frameworks. Strong expertise in Azure technologies. Experience with other CSPs highly beneficial. Proficiency in IaC tools including Terraform. Experience with Sharepoint administration highly beneficial. Experience with container orchestration. Strong scripting or programming skills (e.g., Python, Powershell). Excellent communication skills. Experience in managing other managers highly beneficial.