Mastech Digital
Role Overview
We're seeking a Site Reliability Engineer (SRE) with strong full-stack development expertise and hands-on experience in observability, automation, and reliability engineering. The ideal candidate will design monitoring solutions, optimize system performance, and drive reliability across distributed applications and infrastructure.

Must-Have Technical Skills (Level 3 - 5-7+ Years)
- Full Stack Development: Strong ability to navigate across front-end, back-end, and infrastructure layers for debugging and optimization.
- Observability: Deep understanding of logs, metrics, and traces for system monitoring and diagnostics.
- Monitoring & Analysis Tools: Dynatrace, BigPanda, Evolven, ThousandEyes

Nice to Have Skills
- Advanced experience with Grafana or Kibana for analytics and visualization.
- Familiarity with cloud platforms (AWS/Azure/GCP) and infrastructure-as-code tools.

Key Responsibilities
- Define and implement standardized methods to collect and analyze logs, traces, and metrics across systems and applications.
- Develop dashboards and monitoring frameworks to improve visibility into system health and performance.
- Collaborate with development teams to enhance service reliability, optimize deployments, and streamline release processes.
- Conduct root cause analysis, performance tuning, and fault detection using observability tools.
- Participate in system design reviews, platform management, and capacity planning.
- Build automation pipelines to reduce manual operations, improve efficiency, and ensure sustainable systems.
- Establish and maintain Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to ensure uptime and performance standards are met.

Education
Bachelor's degree preferred, but not required (Computer Science, Engineering, or related field).