At RH, we believe deeply that the "right" people are our greatest asset. We value people with high energy, who possess the ability to energize others. People who are smart, creative, and have a point of view. People who see the answer in every problem, rather than those who see the problem in every answer. People who are driven, determined, and won't take "no" for an answer. We value team players, people who are more concerned with what's right than with who's right.
RESPONSIBILITIES:
We are looking for a Principal Site Reliability Engineer (SRE) to provide strategic support and carry out infrastructure, security, continuous integration, deployment, scaling, metrics, and IT operations practices, and to manage day-to-day operations of production and development infrastructure for a cloud-based commerce/enterprise platform.
If you possess a "can do" attitude, are driven by research and problem-solving, and thrive on challenges, this opportunity will interest you.
You’ll work closely with the Development and QA teams to continuously improve existing features and roll out new services, ensuring the high availability of our platform.
You’re comfortable with infrastructure and configuration, but also happy to roll up your sleeves, fix code, write tests, debug, and ship features.
REQUIREMENTS:
- Obsess about site reliability and performance, and ways to continuously improve these areas.
- Own and lead initiatives to define, design, and implement solutions that help prevent issues impacting availability/performance and reduce resolution time.
- Understand the overall e-commerce architecture and identify opportunities to optimize with an eye on availability/performance.
- Identify and execute automation opportunities in code deployment, problem identification, and resolution.
- Act as a subject matter expert on SRE/DevOps best practices with CloudFormation, Auto Scaling Groups, Build tools, Monitoring, and Configuration Management.
- Perform analysis of best practices and emerging concepts in DevOps, Infrastructure Automation, Akamai configuration management, and Enterprise Security.
- Continuously improve observability capabilities (e.g., Prometheus, Grafana, Splunk) to ensure the right leading indicators are monitored and response workflows are set up.
- Review and audit existing solutions, designs, and system architecture.
- Profile, troubleshoot, and improve the performance of systems under coverage.
- Create technical documentation and maintain CI/CD pipelines (Jenkins).
JOB QUALIFICATIONS:
- BS/MS (MS preferred) in Computer Science or equivalent work experience.
- 4+ years of experience supporting mission-critical workloads like e-commerce in a distributed architecture environment.
- Solid technical know-how and proven problem-solving record in a distributed architecture setting.
- Excellent critical thinking skills with a strong work ethic.
- Solid team player with the ability to collaborate cross-functionally with tech and business teams.
- Excellent communication skills; ability to explain complex technical issues to both technical and non-technical audiences; collaborative and partnership-oriented mentality.