Manager Site Reliability Engineering
Diversity Resource Staffing, Inc. - Sandy Springs, Georgia, United States
Overview
Manager in the Consumer Site Reliability Engineering (SRE) Team at IMT. IMT is a division of our client, which operates numerous financial and commodity marketplaces and exchanges, including the New York Stock Exchange (NYSE).
This position is for a hands-on technical manager to lead a team of SRE engineers focused on providing resilient, secure, scalable, and supportable services for mortgage borrowers and lenders. You will contribute to the team's strategy and delivery and manage its day-to-day workload. This role requires building close relationships with our customer support, operations, engineering, database, and product organizations.
You will be involved in designing resilient systems, defining and monitoring SLIs and SLOs, creating proactive, actionable alerts, and driving production incidents to resolution. We operate in a hybrid multi-cloud environment supporting Windows, Linux, and container-based applications.
Responsibilities
- Provide thought leadership; set the technical direction for the SRE Team
- Define and manage projects to meet team objectives
- Set individual goals and manage personal growth of team members
- Manage and troubleshoot a diverse set of SaaS applications and internal services
- Serve as the face of a team responsible for the overall health, performance, and capacity of our business applications
- Develop sustainable SRE practices around simplification and standardization
- Drive the cultural standards for SRE including defining ways of working, runbooks, and accountability across people, processes, and technology
- Lead incident response and root cause analysis
- Partner with other SRE teams and lead by example
Knowledge and Experience
- 3+ years of managing high-performance teams
- 10+ years of application/systems engineering in 24x7 production services environments
- Bachelor's degree in Computer Science, Computer Engineering, Math, or equivalent professional experience
- Experience in designing, deploying, and operating SaaS applications, cloud infrastructure (AWS or equivalent), and on-premises virtualized environments
- Excellent troubleshooting skills spanning systems, networks, and code, utilizing a systematic problem-solving approach
- Proven track record of decreasing MTTR, increasing MTTF, and improving overall service quality
- Ability to lead incident response and root cause analysis (RCA)
- Fluency with scripting languages used by SRE/DevOps professionals (PowerShell, Python, Perl, PHP, Ruby), plus Java/.NET development
- Strong communication skills