JPMorganChase

Lead Site Reliability Engineer

JPMorganChase, Palo Alto, California, United States, 94306

Description

Assume a critical role in defining the future of a globally recognized firm and have a direct and significant impact in a domain tailored for top achievers in site reliability.

As a Lead Site Reliability Engineer at JPMorgan Chase within the AI4Tech team under the Chief Technology Office organization, you hold a leadership role in your team. You demonstrate strong knowledge across multiple technical domains and advise others on technical and business issues. You lead resiliency design reviews, break down complex problems for other engineers, act as a technical lead for medium to large-sized products, and provide mentorship to team members.

Job responsibilities

Lead all SRE activities supporting the AI4Tech enterprise AI Agent platform.
Promote and exemplify site reliability culture and practices, exerting technical influence within your team.
Lead initiatives to enhance the reliability and stability of applications and platforms, using data-driven analytics to improve service levels.
Collaborate with team members and stakeholders to define service level indicators and to establish reasonable service level objectives and error budgets.
Demonstrate high technical expertise in one or more domains, proactively resolving technology-related bottlenecks.
Serve as the main contact during major incidents, quickly identifying and resolving issues to prevent financial losses.
Document and share knowledge through internal forums and communities of practice.

Required qualifications, capabilities, and skills

Extensive experience in operating and provisioning Kubernetes clusters.
Deep proficiency in AWS services such as EKS, RDS, VPC, EC2, and Bedrock.
Strong expertise in reliability, scalability, performance, security, enterprise system architecture, and best practices in site reliability engineering.
Proficiency in at least one programming language such as Python or Java, or a framework such as Spring.
Deep understanding of software applications and technical processes, with emerging expertise in specific technical disciplines.
Experience with observability tools like Grafana, Dynatrace, Prometheus, Datadog, and Splunk, including monitoring, SLOs, alerting, and telemetry collection.
Experience with CI/CD tools such as Jenkins, GitLab, and Terraform.
Hands-on experience with containers and orchestration tools like Docker, ECS, and Kubernetes.
Ability to troubleshoot networking issues effectively.
Strong problem-solving skills related to complex data structures and algorithms.

Preferred qualifications, capabilities, and skills

Self-motivated to learn and evaluate new technologies.
Ability to teach programming languages and technical skills to team members.
Effective at collaborating across different stakeholder groups and organizational levels.

Key Skills

Kubernetes, FMEA, Continuous Improvement, Elasticsearch, Go, Root Cause Analysis, Maximo, CMMS, Maintenance, Mechanical Engineering, Manufacturing, Troubleshooting

Employment Type: Full-Time

Experience: [Specify years]

Vacancy: 1

Monthly Salary: 152000 - 215000
