ECS
Overview
ECS is seeking a Senior Site Reliability Engineer to work remotely. ECS is responsible for designing, building, deploying, operating, and maintaining a complete ‘Data Services’ solution which includes the collection, normalization, visualization, and sharing of cyber data from more than 100 Federal agencies as part of the Continuous Diagnostics and Mitigation (CDM) Cyber data solution for the Cybersecurity and Infrastructure Security Agency (CISA).
The CDM Data Services product is an integrated suite of Commercial Off-the-Shelf (COTS) products, software configuration packages, and custom code that operates as an integrated solution tailored to meet Department of Homeland Security (DHS) requirements. Our program operates within the Scaled Agile Framework (SAFe). An aptitude and enthusiasm for continuous learning, improvement, and cybersecurity are required.
Responsibilities
Define, implement, and grow our SRE practice to ensure reliability, availability, and performance of production environments.
Contribute to a culture of continuous improvement by identifying areas for enhancement and driving initiatives to improve system reliability, scalability, and efficiency.
Design, implement, and maintain solutions to ensure systems, including infrastructure and applications, are resilient, highly available, and performant.
Define and measure Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for our solution.
Set up comprehensive logging, monitoring, and alerting using the Elastic stack and other tools; respond to incidents, perform root cause analyses, and implement preventive actions.
Collaborate with cross-functional teams to integrate reliability and observability into the software development lifecycle.
Salary Range: $118,345 - $177,518
Qualifications
US citizenship with ability to obtain Public Trust Suitability.
6+ years of experience as an SRE or equivalent.
6+ years designing, implementing, and maintaining observability solutions (logging, monitoring, alerting).
6+ years hands-on with SRE tools (Elastic, Prometheus, Grafana, Splunk, etc.).
3+ years defining and measuring SLOs and SLIs.
3+ years of experience using cloud platforms (AWS GovCloud preferred).
3+ years of programming or scripting (e.g., Python, Bash).
Strong knowledge of microservices, containerization, and orchestration (Docker, Kubernetes).
Proven ability to collaborate with cross-functional teams to integrate reliability into the software lifecycle.
Strong problem-solving and analytical skills; proactive, detail-oriented.
Preferred Qualifications
Bachelor's degree in Computer Science, Engineering, or related field (or 4 additional years of related experience).
Experience in Agile/SAFe environments with ALM tools (Jira, Confluence, etc.).
Understanding of CI/CD principles and platforms (Jenkins, CircleCI, GitLab, GitHub Actions, Argo, Travis CI).
Experience with configuration management tools (Ansible, Puppet, Chef).
Experience with infrastructure as code tools (Terraform, CloudFormation).
Networking, security, and Linux system administration knowledge.
Knowledge of version control and branching strategies; disaster recovery planning and data replication.
Experience supporting large Federal programs.
ECS is an equal opportunity employer and does not discriminate on the basis of any protected characteristic. All qualified applicants will receive consideration for employment without regard to disability, veteran status, or any other status protected by law.
About ECS
ECS is a leading mid-sized provider of technology services to the United States Federal Government. We focus on people, values, and purpose. Our 3800+ employees support Federal Agencies and Departments to serve, protect, and defend the American people.