Birlasoft
Program Director
About the Role
We are seeking an SRE Leader to architect and drive the Site Reliability Engineering strategy across our financial services customers.
Key Responsibilities
Strategic Leadership
Define and execute the SRE strategy aligned with business goals and engineering priorities.
Establish and evangelize SRE principles, best practices, and culture across engineering and product teams.
Drive adoption of reliability-focused design patterns, automation, and observability across the organization.
Work with multiple stakeholders (developers, architects, operations, and business teams) to define and adopt reliability engineering practices.
Build‑Side Reliability Initiatives
Embed SRE practices early in the software development lifecycle to ensure reliability is designed into systems from the start.
Partner with development teams to implement shift‑left reliability testing, including automated resilience and chaos tests in CI/CD pipelines.
Define golden paths for developers with pre-built reliability patterns, templates, and infrastructure‑as‑code modules.
Drive build‑time observability by integrating telemetry, logging, and tracing into application code during development.
Champion performance benchmarking and capacity modeling during build phases to prevent scalability issues post‑deployment.
Collaborate with architects to enforce reliability‑driven design reviews before major releases.
Operational Excellence
Own the reliability, availability, and performance of critical services and infrastructure.
Lead incident management, root cause analysis, and post‑mortem processes with a focus on continuous improvement.
Develop and monitor SLAs, SLOs, and SLIs to ensure service health and customer satisfaction.
Team Building & Mentorship
Build, mentor, and scale a world‑class SRE team with a focus on diversity, inclusion, and growth.
Foster a culture of ownership, accountability, and innovation within the team.
Collaborate with engineering, product, and business stakeholders to align reliability goals with product roadmaps.
Tooling & Automation
Drive automation of operational tasks, deployments, and incident response.
Lead efforts in observability, monitoring, alerting, and capacity planning.
Evaluate and implement modern SRE tools and platforms to improve efficiency and reduce toil.
Governance & Compliance
Ensure compliance with security, privacy, and regulatory requirements in all reliability practices.
Establish governance frameworks for change management, risk mitigation, and service continuity.
Qualifications
15‑20 years of experience.
8+ years in leadership roles managing large‑scale SRE programs.
Deep understanding of cloud‑native architectures (AWS, Azure, GCP), microservices, and distributed systems.
Proficiency with Application Performance Monitoring (APM) tools such as New Relic or Dynatrace for monitoring, logging, and tracing, and with Splunk for log monitoring.
Expertise in observability tools (e.g., Prometheus, Grafana, Datadog), CI/CD pipelines, and infrastructure as code (Terraform, Ansible).
Strong experience with incident response, chaos engineering, and reliability testing.
Proven ability to influence cross‑functional teams and drive organizational change.