Fidelity Investments

Principal Site Reliability Engineer

Fidelity Investments, Durham, North Carolina, United States

Automates the running, building, and development of applications using scripting languages such as Python and shell. Coordinates systems using Infrastructure as Code (IaC) tools (IAM, ARM, Terraform, and Chef). Deploys applications in a DevOps environment using cloud computing and DevOps concepts (CI/CD pipelines). Utilizes modern monitoring tools such as Datadog, Prometheus, and Splunk. Confers with systems analysts, engineers, programmers, and others to design systems and to obtain information on project limitations and capabilities, performance requirements, and interfaces.

Primary Responsibilities:
• Provides high-scale, highly available, and resilient delivery services using automation and infrastructure code.
• Builds reliability using resiliency engineering, automation, observability, and chaos tests.
• Implements advanced observability practices and techniques at scale.
• Maintains and interprets large datasets using query languages and visualization tools.
• Troubleshoots new software, methods, and practices and brings them to developers.
• Defines and executes a comprehensive reliability and observability strategy available to customers.
• Brings together technical, procedural, and financial data to reduce toil and increase efficiency.
• Executes plans for technical standardization and process refinement within the engineering organization.
• Troubleshoots stack-wide engineering issues related to hardware, software, network, applications, and cloud service providers.
• Analyzes user needs and software requirements to determine feasibility of design within time and cost constraints.
• Confers with data processing or project managers to obtain information on limitations or capabilities for data processing projects.
• Consults with customers or other departments on project status, proposals, or technical issues, such as software system design or maintenance.
• Confers with systems analysts and other software engineers/developers to design systems and to obtain information on project limitations and capabilities, performance requirements, and interfaces.
• Develops and coordinates software system tests and validation procedures, programs, and documentation.

Education and Experience:
Bachelor’s degree (or foreign education equivalent) in Computer Science, Engineering, Information Technology, Information Systems, Mathematics, Physics, or a closely related field and five (5) years of experience as a Principal Site Reliability Engineer (or closely related occupation) designing and developing reliability, performance, and scalability of enterprise-wide full stack applications (ensuring seamless integration and high availability) using Datadog, ELK, and Prometheus in a financial services environment.

Or, alternatively, Master’s degree (or foreign education equivalent) in Computer Science, Engineering, Information Technology, Information Systems, Mathematics, Physics, or a closely related field and three (3) years of experience as a Principal Site Reliability Engineer (or closely related occupation) designing and developing reliability, performance, and scalability of enterprise-wide full stack applications (ensuring seamless integration and high availability) using Datadog, ELK, and Prometheus in a financial services environment.

Skills and Knowledge:
Candidate must also possess:
• Demonstrated Expertise (“DE”) designing, architecting, and building scalable and resilient N-tier software solutions, and creating E2E plans for critical services according to DevOps practices, using .NET, Java, Python, Docker, and Kubernetes.
• DE delivering high-scale, highly available, and resilient services according to automation and Infrastructure-as-Code (IaC) methodologies, using OpenTelemetry (OTel), Datadog, Splunk, Prometheus, and ELK.
• DE building cloud-based platforms for consumption at an enterprise level, using AWS services (EKS, Lambda, EMR, and CloudFormation) and Azure services.
• DE developing microservices on the EKS platform, and maintaining CI/CD pipelines using DevOps technologies (GitHub, Artifactory, Sonar, Jenkins/Jenkins Core, and Terraform).