Cognizant
About the role
As a Site Reliability Engineer, you will make an impact by designing and implementing advanced observability solutions tailored for distributed edge computing environments. You will be a valued member of the Technology & Engineering team and collaborate closely with infrastructure, application, and DevOps teams to ensure system reliability across remote facilities and centralized platforms.
In this role, you will:
Design and implement observability frameworks for edge environments, including monitoring, logging, tracing, and metrics collection
Define and maintain SLIs, SLOs, and business KPIs to measure and improve system reliability
Build dashboards, visualizations, and alerting systems for real‑time insights and incident response
Implement distributed tracing and log aggregation to troubleshoot complex edge issues
Collaborate with engineering teams to embed observability best practices in resource‑constrained environments
Drive proactive issue detection and resolution, reducing MTTD and MTTR across distributed systems
Lead incident postmortems and implement observability‑driven improvements
Develop automation tools and scripts to enhance observability pipelines
Optimize data storage and querying strategies for performance and scalability
Stay current with emerging observability tools and trends, especially for edge computing
Work model: We believe hybrid work is the way forward as we strive to provide flexibility wherever possible. Based on this role’s business requirements, this is an onsite position requiring 5 days a week in a client or Cognizant office in Scottsdale, AZ. Regardless of your working arrangement, we are here to support a healthy work‑life balance through our various wellbeing programs.
What you need to have to be considered:
3–5 years of experience in service reliability/operations for large‑scale hybrid environments
3–5 years of experience in automation scripting and dashboard development for performance monitoring
2–4 years of experience with programming languages such as Go, Python, Java, or Rust
Working knowledge of databases like Oracle, SQL Server, Redis, ClickHouse, PostgreSQL, MongoDB, or time‑series databases
At least 2 years of experience with cloud platforms and containerization (GCP, AWS, Azure, Rancher, OpenShift)
Experience maintaining containerized apps in GKE/RKE/AKS environments
Hands‑on experience implementing observability using OpenTelemetry (OTEL)
Experience with GraphQL frameworks (Apollo, Prisma, Hasura)
Strong understanding of networking protocols (TCP/IP, HTTP, DNS, Load Balancing, Service Mesh)
These will help you stand out:
Proven experience managing 24/7 high‑availability platforms for critical applications
Familiarity with monitoring tools like Splunk, AppDynamics, Grafana/Prometheus, Dynatrace
Experience with CI/CD tools and platforms (Rally, Confluence, etc.)
Hands‑on experience with Redis and in‑memory caching solutions
Strong debugging skills across integrated platforms and API gateways
Experience with GCS, Cloud SQL, Spanner, and Firestore
Background in enterprise‑level infrastructure and operations
Expertise in Linux/Windows administration and distributed systems
Experience monitoring and troubleshooting HashiCorp Vault environments
Working knowledge of Vertex AI, Gen AI, and BigQuery
Benefits: Cognizant offers the following benefits for this position, subject to applicable eligibility requirements:
Medical/Dental/Vision/Life Insurance
Paid holidays plus Paid Time Off
401(k) plan and contributions
Long‑term/Short‑term Disability
Paid Parental Leave
Employee Stock Purchase Plan
We're excited to meet people who share our mission and can make an impact in a variety of ways. Don't hesitate to apply, even if you only meet the minimum requirements listed. Think about your transferable experiences and unique skills that make you stand out as someone who can bring new and exciting things to this role.