Jobs via Dice

Senior Site Reliability Engineer

Jobs via Dice, Fairfax, Virginia, United States, 22032

Apex Systems is seeking a Senior Site Reliability Engineer to contribute to the Continuous Diagnostics and Mitigation (CDM) Cyber data solution. The position operates within the Scaled Agile Framework (SAFe) and focuses on building, deploying, and maintaining a data services solution that collects, normalizes, visualizes, and shares cyber data from more than 100 federal agencies.

Role & Responsibilities

The Senior SRE will define, implement, and grow the SRE practice to ensure the reliability, availability, and performance of critical production environments. Key responsibilities include:

Design, implement, and maintain resilient, highly available, and performant infrastructure and applications.

Define and measure Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for the solution.

Set up comprehensive logging, monitoring, and alerting solutions using the Elastic stack and other tools.

Respond to incidents, perform root cause analyses, and implement preventive solutions.

Collaborate with developers, testers, infrastructure engineers, and DevOps engineers to integrate reliability and observability into the software development lifecycle.

Required Skills

U.S. citizenship with ability to obtain Public Trust Suitability.

6+ years of experience as a Site Reliability Engineer (SRE) or equivalent.

6+ years of demonstrated experience designing, implementing, and maintaining observability solutions, including logging, monitoring, and alerting.

6+ years of hands‑on experience with SRE tools (e.g., Elastic, Prometheus, Grafana, Splunk).

3+ years defining and measuring SLOs and SLIs.

3+ years of relevant experience using cloud platforms (AWS GovCloud preferred).

3+ years of hands‑on programming or scripting (e.g., Python, Bash).

Strong knowledge of microservices, containerization, and orchestration tools (Docker, Kubernetes).

Proven ability to collaborate with cross‑functional teams to integrate reliability and observability into the software development lifecycle.

Strong problem‑solving and analytical skills.

Proactive, detail‑oriented approach to identifying inefficiencies and implementing improvements.

Desired Skills

Bachelor’s degree in Computer Science, Engineering, or a related field (or 4 additional years of related experience).

Experience working in an Agile/SAFe environment using ALM tools (Jira, Confluence, or similar).

Strong understanding of CI/CD principles and platforms (Jenkins, CircleCI, GitLab, GitHub Actions, Argo, Travis CI, etc.).

Expertise in configuration management tools (Ansible, Puppet, Chef).

Experience with infrastructure as code (Terraform, CloudFormation).

In‑depth understanding of networking, security, and Linux system administration.

Knowledge of version control platforms and branching strategies.

Knowledge of disaster recovery planning, backup strategies, and data replication.

Experience supporting large federal programs ($200M+).

Equal Employment Opportunity

Apex Systems is an equal opportunity employer. We do not discriminate on the basis of race, color, religion, creed, sex, age, sexual orientation, gender identity, national origin, ancestry, citizenship, genetic information, marital status, disability, protected veteran status, or any other characteristic protected by law.
