Tata Consultancy Services Limited
Observability AIOps Architect
Tata Consultancy Services Limited, Atlanta, Georgia, United States, 39901
Must Have Technical/Functional Skills
What You Need to Succeed (Minimum Qualifications)
10+ years of experience in software engineering or infrastructure architecture.
5+ years of experience in observability, automation, or AIOps domains.
Strong understanding of cloud platforms (AWS preferred) and cloud-native architectures.
Experience with observability tools such as Datadog, Splunk, Prometheus, Grafana, ELK, and OpenTelemetry.
Proficiency in automation frameworks and scripting languages (Python, Bash, Terraform, Ansible).
Familiarity with AIOps platforms and techniques including anomaly detection, event correlation, and predictive analytics.
Experience with CI/CD pipelines and DevOps/DevSecOps practices.
Ability to work with large-scale telemetry data and build ML models for operational intelligence.
Strong communication skills and ability to present technical concepts to business audiences.
Experience with Agile methodologies and working in cross-functional teams.

Preferred Qualifications
Experience with Kubernetes, container orchestration, and service mesh technologies.
Familiarity with incident management platforms (PagerDuty, Opsgenie).
Knowledge of ITIL practices and SRE principles.
Experience integrating observability and AIOps into enterprise ITSM platforms.

This Job Might Be for You If You Are:
A problem-solver who thrives in complex, fast-paced environments.
Passionate about improving system reliability and operational efficiency through intelligent automation.
A collaborative team player who values empathy, communication, and continuous learning.
Curious and driven to explore new technologies and approaches to enhance observability and automation.
Committed to delivering high-quality solutions that make a real impact on customer and employee experiences.

Roles & Responsibilities
The Observability Architect/Lead Engineer is responsible for designing, implementing, and maintaining observability solutions that ensure the health, performance, and reliability of systems and applications. The architect works with development, operations, and security teams to integrate observability into the software development lifecycle and to define standards and best practices. This role requires strong technical skills in logging, monitoring, tracing, and alerting, as well as the ability to influence and lead teams.

Architect and implement enterprise-grade observability and automation solutions across distributed systems and cloud-native environments.
Lead the strategy and execution of AIOps initiatives to proactively detect, diagnose, and resolve incidents using machine learning and predictive analytics.
Collaborate with cross-functional teams including SREs, DevOps, software engineers, and business stakeholders to align observability and automation goals with business outcomes.
Design and maintain scalable telemetry pipelines for metrics, logs, traces, and events using modern observability stacks (e.g., OpenTelemetry, Prometheus, Grafana, ELK).
Drive automation of operational tasks and incident response using AI/ML models and rule-based systems.
Develop and maintain CI/CD pipelines integrated with observability and AIOps tools to ensure continuous feedback and improvement.
Provide technical leadership in selecting and integrating tools for monitoring, alerting, and automated remediation.
Promote best practices in observability, automation, and AIOps through documentation, training, and knowledge sharing.
Communicate architecture strategies and technical roadmaps to leadership and stakeholders.
Operate within Agile squads and contribute to sprint planning, reviews, and retrospectives.
Own and support the solutions you build, ensuring reliability, scalability, and performance.

Salary Range: $100,000-$150,000 a year