Optomi
This range is provided by Optomi. Your actual pay will be based on your skills and experience — talk with your recruiter to learn more.
Base pay range $70.00/hr - $85.00/hr
Staff Site Reliability Engineer (SRE), Remote – Must work PST hours
Optomi, in partnership with a leading consulting partner supporting large-scale modernization initiatives, is seeking a Staff Site Reliability Engineer (SRE) to drive performance, reliability, and operational excellence across critical applications. This role will serve as a key member of stream-aligned teams, ensuring the stability, scalability, and observability of enterprise systems through proactive monitoring, automation, and performance optimization. If you’re passionate about high-impact, hands‑on reliability engineering and thrive in complex, fast-moving environments, this is an ideal opportunity to make a measurable difference.
Key Responsibilities:
Conduct load and performance testing using tools like JMeter or Gatling to assess scalability, identify bottlenecks, and optimize system performance. Ensure applications can handle expected workloads while maintaining optimal performance.
Implement and maintain real-time monitoring solutions leveraging Datadog, Prometheus, Grafana, and New Relic. Track health, performance metrics, and SLAs to detect and resolve issues before user impact.
Investigate and resolve incidents, outages, and performance degradations using diagnostic tools and root‑cause analysis. Lead cross‑functional incident response efforts, coordinating communication and documenting post‑incident reviews.
Design and implement effective error handling, alerting, and logging strategies to capture and triage anomalies. Improve reliability through reduced error rates and faster recovery times.
Develop and maintain automation scripts and infrastructure‑as‑code frameworks using Go, Bash, Terraform, Ansible, Docker, and Helm. Streamline deployment, configuration, and operational workflows to reduce manual intervention.
Partner with security engineers to ensure alignment with Zero Trust policies, vulnerability remediation, and ISO standards. Utilize tools such as Snyk, Datadog, and Jira for ongoing compliance tracking.
Collaborate with development teams to promote reliability‑focused coding practices and improve observability within applications, reducing operational risk and increasing development velocity.
What the right candidate would enjoy:
Contribute to mission‑critical systems supporting statewide initiatives!
Work with modern DevOps and observability stacks in a large‑scale production environment!
Flexible opportunity with collaborative, outcome‑driven teams!
Required Skills & Experience:
Hands‑on SRE, DevOps, or Systems Engineering experience within enterprise or government environments
Proficiency with performance testing tools (e.g., JMeter, Gatling)
Strong experience with monitoring and observability tools (Datadog, New Relic, Prometheus, Grafana)
Expertise in automation and scripting (Go, Bash) and infrastructure‑as‑code tools (Ansible, Terraform)
Proven troubleshooting and problem‑solving abilities for complex, distributed systems
Understanding of Agile and DevOps methodologies with a focus on continuous improvement
Excellent communication and collaboration skills to work across cross‑functional teams
Technologies & Tools:
Testing: JMeter, Gatling
Seniority level: Mid‑Senior level
Employment type: Full‑time
Job function: IT Services and IT Consulting