Birlasoft

Program Director

Birlasoft, Jersey City, New Jersey, United States, 07390

About the Role

We are seeking an SRE Leader to architect and drive the Site Reliability Engineering strategy across our financial services customers.

Key Responsibilities

Strategic Leadership

Define and execute the SRE strategy aligned with business goals and engineering priorities.

Establish and evangelize SRE principles, best practices, and culture across engineering and product teams.

Drive adoption of reliability-focused design patterns, automation, and observability across the organization.

Work with stakeholders across development, architecture, operations, and business teams to define and adopt reliability engineering practices.

Build‑Side Reliability Initiatives

Embed SRE practices early in the software development lifecycle to ensure reliability is designed into systems from the start.

Partner with development teams to implement shift‑left reliability testing, including automated resilience and chaos tests in CI/CD pipelines.

Define golden paths for developers with pre-built reliability patterns, templates, and infrastructure‑as‑code modules.

Drive build‑time observability by integrating telemetry, logging, and tracing into application code during development.

Champion performance benchmarking and capacity modeling during build phases to prevent scalability issues post‑deployment.

Collaborate with architects to enforce reliability‑driven design reviews before major releases.

Operational Excellence

Own the reliability, availability, and performance of critical services and infrastructure.

Lead incident management, root cause analysis, and post‑mortem processes with a focus on continuous improvement.

Define and monitor SLAs, SLOs, and SLIs to ensure service health and customer satisfaction.

Team Building & Mentorship

Build, mentor, and scale a world‑class SRE team with a focus on diversity, inclusion, and growth.

Foster a culture of ownership, accountability, and innovation within the team.

Collaborate with engineering, product, and business stakeholders to align reliability goals with product roadmaps.

Tooling & Automation

Drive automation of operational tasks, deployments, and incident response.

Lead efforts in observability, monitoring, alerting, and capacity planning.

Evaluate and implement modern SRE tools and platforms to improve efficiency and reduce toil.

Governance & Compliance

Ensure compliance with security, privacy, and regulatory requirements in all reliability practices.

Establish governance frameworks for change management, risk mitigation, and service continuity.

Qualifications

15‑20 years of experience.

8+ years in leadership roles managing large‑scale SRE programs.

Deep understanding of cloud‑native architectures (AWS, Azure, GCP), microservices, and distributed systems.

Proficiency with Application Performance Monitoring (APM) tools such as New Relic or Dynatrace for monitoring, logging, and tracing, and with Splunk for log monitoring.

Expertise in observability tools (e.g., Prometheus, Grafana, Datadog), CI/CD pipelines, and infrastructure as code (Terraform, Ansible).

Strong experience with incident response, chaos engineering, and reliability testing.

Proven ability to influence cross‑functional teams and drive organizational change.
