No Limit Staffing, Inc.
Site Reliability Engineer
No Limit Staffing, Inc., San Jose, California, United States, 95199

Base pay range
$80.00/hr - $90.00/hr
This range is provided by No Limit Staffing, Inc. Your actual pay will be based on your skills and experience; talk with your recruiter to learn more.

Key Responsibilities
Design, build, and maintain observability tools, including metrics, logging, tracing, and alerting frameworks.
Develop and refine SLIs, SLOs, and error budgets, ensuring alignment with business goals and service expectations.
Partner with engineering and product teams to ensure systems are designed with reliability and observability in mind from the start.
Troubleshoot and resolve incidents, conduct root cause analyses, and implement long-term fixes.
Automate repetitive tasks and operational workflows to improve efficiency and reduce human error.
Review and enhance existing monitoring and observability pipelines for performance, scalability, and accuracy.
Educate teams on observability best practices and tooling, promoting a culture of proactive incident prevention and rapid remediation.
Work on infrastructure improvements that enhance fault tolerance and performance under load.
Advocate for and implement security best practices within observability tooling and data pipelines.

Qualifications
3+ years of experience as a Site Reliability Engineer or in a similar role with a focus on observability.
Strong experience with monitoring and observability tools such as Prometheus, Grafana, ELK stack, Datadog, New Relic, or similar platforms.
Solid understanding of distributed systems, cloud infrastructure (AWS, Azure, GCP), and container orchestration (Kubernetes, Docker).
Proven experience defining and implementing SLIs, SLOs, and monitoring strategies.
Deep expertise in logging, tracing, and metrics-based alerting, including tuning thresholds and reducing noise.
Hands-on experience with automation using scripting languages like Python, Bash, or Go.
Knowledge of CI/CD pipelines and infrastructure-as-code tools like Terraform, Ansible, or similar.
Strong troubleshooting and root cause analysis skills in high-pressure environments.
Excellent communication skills and ability to collaborate effectively across teams.
Self-starter mindset with the ability to independently lead initiatives and solve problems.

Preferred Qualifications
Experience with chaos engineering or reliability testing frameworks.
Familiarity with service mesh technologies like Istio or Linkerd.
Understanding of application performance optimization.
Experience with security observability and compliance monitoring.

Seniority level
Mid-Senior level

Employment type
Contract

Job function
Information Technology

Industries
IT Services and IT Consulting, IT System Custom Software Development, and Software Development