BNY
Overview
Vice President, Site Reliability Engineer. This role is located in Jersey City, NJ. We’re seeking a future team member for the role of SRE / Site Reliability Engineer to join our Technology team.
Responsibilities
Drive reliability and performance by defining SLOs/SLIs, improving observability, and proactively identifying and addressing system bottlenecks across cloud environments.
Automate infrastructure and operations using Terraform, Kubernetes, and CI/CD tools to eliminate toil and enable scalable, fault-tolerant deployments.
Collaborate cross-functionally with product, infrastructure, and DevOps teams to reduce incidents, build resilient services, and ensure architectural clarity.
Lead incident management by participating in on-call rotations, conducting postmortems, and implementing automated recovery to minimize downtime.
Build and maintain monitoring systems with tools like Prometheus, Grafana, AppDynamics, and Splunk to support real-time alerting and root cause analysis.
Develop platform tooling and pipelines for container orchestration, third-party integrations, and cloud-native operations to improve system efficiency and reliability.
Maintain and improve live services by measuring and monitoring latency and overall system health, working closely with tech support and operations teams.
Leverage and define KPIs to understand service performance and identify corrective actions.
Create, manage, and use dashboards for continuous monitoring and health checks of applications and underlying infrastructure.
Design and implement solutions to customer friction points and improve the entire lifecycle of services from inception through sustainment.
Assist in creating and maintaining automation to improve reliability and velocity in addressing issues during regular maintenance tasks.
Mentor engineers and champion SRE best practices, embedding a reliability-first culture and ensuring technical excellence across engineering teams.
Qualifications
Bachelor’s degree in computer science or a related discipline, or equivalent work experience required; advanced degree preferred.
5-8 years of related experience; experience in the securities or financial services industry is a plus.
Strong expertise in cloud infrastructure (Azure, AWS, or GCP), containerization (Docker, Kubernetes), and Infrastructure as Code (Terraform, Helm).
Proficiency in observability and monitoring tools such as Prometheus, Grafana, AppDynamics, Datadog, and Splunk, plus experience with incident response and on-call support.
Solid programming and scripting skills in languages like Python, Go, or Java, with a focus on automation, tooling, and system integration.
Deep understanding of SRE principles, including SLAs, SLOs, error budgets, postmortems, and reliability-focused system design.
Familiarity with automated testing, DevSecOps practices, CI/CD methods, performance engineering, and security controls.
Strong collaboration and communication skills, with experience working in Agile environments and partnering with cross-functional engineering, product, and operations teams.
Previous success in technical engineering, with coding experience beyond simple scripts.
Additional Information
BNY is an Equal Employment Opportunity/Affirmative Action Employer - Underrepresented racial and ethnic groups/Females/Individuals with Disabilities/Protected Veterans.
Base salary is between $83,000 and $155,000 per year at commencement, with the final offer determined based on experience and location, and may include additional compensation components.
This position is at-will and the Company reserves the right to modify compensation at any time.
Job Details
Location: Jersey City, NJ