Spectraforce Technologies
Job Title: Sr. Systems Reliability Engineer
Location: Seattle, WA
Duration: 12 months, contract-to-hire (CTH)
Key Responsibilities
Contribute to the SRE strategy and establish best practices for release management, automation, and system reliability.
Mentor and guide SRE, Engineering, and Product teams in adopting core SRE principles such as service ownership, reducing toil, and continuous improvement.
Lead initiatives across SLIs/SLOs, observability, incident management, and postmortem practices, ensuring insights and learnings are captured and acted upon.
Champion SRE practices by implementing repeatable templates for logging, monitoring, and alerting frameworks.
Drive observability and monitoring excellence using tools such as Grafana, AppDynamics (AppD), and Sumo Logic, ensuring proactive detection and resolution of issues.
Partner with engineering to design reliable, fault‑tolerant systems and reduce operational toil through automation.
Implement and leverage the Ansible Automation Platform to help teams automate infrastructure provisioning, configuration management, and event‑driven workflows.
Enable teams to automate operational events and infrastructure changes, reducing manual intervention and improving system resilience.
Exercise sound judgment to ensure operational compliance with security, privacy, audit, disaster recovery, and other company requirements.
Job‑Specific Skills, Experience & Education
Minimum of 5 years of experience in Site Reliability Engineering, IT operations, or related fields.
Bachelor's degree in computer science, engineering, or equivalent experience (2 additional years in lieu of degree).
Technical expertise in system reliability, scalability, application design, and performance.
Hands‑on experience with observability and monitoring tools such as Grafana, AppDynamics, and Sumo Logic.
Experience with automation platforms, particularly Ansible, for infrastructure and event‑driven automation.
Proven ability to mentor and guide engineers in adopting SRE practices and principles.
Excellent communication and collaboration skills across diverse teams and vendors.
Strong judgment and problem‑solving capabilities.
Experience working in multi‑cloud environments.
Strong interpersonal, organizational, communication, and customer service skills.
Preferred
Experience applying ITIL, SRE, and IT process best practices.
Experience in tracking major incidents, rollbacks, and hotfixes; leading root cause analysis (RCA) processes; and ensuring resolution and completion of action items.
Experience with technical engineering in IT operations.