National Student Clearinghouse

Site Reliability Engineer

National Student Clearinghouse, Herndon, Virginia, United States, 22070

Overview

By joining the National Student Clearinghouse, you can be sure that the work you do now will help shape the future of education and the workforce in the U.S. The Clearinghouse is the leading provider of transcript and data exchange services, automated enrollment and degree verifications, learner insights and research, and compliance solutions for schools, businesses, and learners nationwide. As a 501(c)(3) nonprofit organization, the Clearinghouse works with nearly 3,600 postsecondary institutions and thousands of high schools and districts. The Research Center publications inform policymakers and business leaders about student educational pathways. Security and privacy are paramount. Join us as we continue to invest in our talent and new advanced technologies to unlock the power of data on behalf of all learners.

About the Role

The Site Reliability Engineer role exists to ensure the reliability, scalability, and performance of an organization's systems and services. This position bridges software engineering and IT operations, applying automation, monitoring, and proactive incident management to maintain highly available and resilient platforms. SREs design and implement solutions that reduce operational toil, improve system efficiency, and enable development teams to deliver features quickly without compromising stability. By focusing on service availability, performance optimization, and continuous improvement, the SRE function is critical to sustaining customer trust and supporting business growth.

This position operates with a high degree of autonomy, making independent decisions on system reliability, performance optimization, and incident response within established service-level objectives. The role requires discretion in prioritizing tasks, implementing automation, and resolving complex technical issues to maintain system stability and meet business continuity goals.

Currently, this is a remote-first position, but the person in this role may be required to work on-site at our office periodically. Candidates must reside within a reasonable commuting distance of our office in Herndon or be willing to travel there when required.

How You Contribute

Demonstrate the Clearinghouse's core competencies: Customer Focus, Optimizes Work Processes, Collaborates, Communicates Effectively, and Be Open and Authentic.

Reliability Engineering & SLOs: Define SLIs/SLOs and manage error budgets; drive reliability reviews and continuous improvement to protect customer experience.

Observability & Monitoring: Build and operate end-to-end observability (metrics, traces, logs, synthetics, dashboards, alerting), leveraging tools such as Datadog; tune alerts for actionability and reduce noise.

Incident Management: Participate in and help coordinate incident response and on-call rotations; lead blameless post-incident reviews, root-cause analysis, and corrective action tracking.

Automation & CI/CD: Partner with engineering to automate build, test, deploy, and release processes (e.g., GitLab CI) and promote progressive delivery, change safety, and rollback strategies.

Infrastructure as Code & Cloud: Provision and manage cloud infrastructure with Terraform/CloudFormation on AWS/Azure/GCP; enforce configuration baselines and drift detection.

Containers & Orchestration: Operate containerized workloads at scale (Kubernetes, Helm); implement release strategies (blue/green, canary).

Performance & Capacity: Conduct performance testing and tuning; lead capacity planning and cost-aware scaling.

Security & Compliance: Embed security into pipelines and environments (e.g., IAM guardrails, policy-as-code, audit logging, vulnerability management) in partnership with DevSecOps.

Runbooks & Documentation: Create and maintain runbooks, operational SOPs, and service catalogs; promote knowledge sharing and operational readiness across teams.

Collaboration: Work across engineering, infrastructure, DevSecOps, security, and product to deliver reliable, scalable services; communicate clearly with technical and non-technical stakeholders.

Continuous Improvement: Identify toil, propose experiments (e.g., chaos testing, game days), and automate repetitive operations to improve MTTR and deployment safety (DORA metrics awareness).

Perform other duties as required.

What You Bring to the Table

Bachelor's degree in Computer Science, IT, or a related field. A combination of education and experience, including military service, will also be considered.

5 years of experience in Site Reliability Engineering, DevOps, or a related role, with demonstrated expertise in cloud platforms (AWS, Azure, or GCP), automation, and system monitoring.

Experience:

Operating and supporting production services in cloud environments (AWS, Azure, or GCP).

Implementing and managing CI/CD pipelines (e.g., GitLab or equivalent) and progressive delivery strategies (blue/green, canary, feature flags).

Managing containerized environments using Docker and Kubernetes.

Infrastructure as Code tools such as Terraform, Ansible, or CloudFormation for automated provisioning.

Automation scripting with Python, Bash, or PowerShell, including configuration baseline enforcement and drift detection.

Observability and monitoring (metrics, logs, traces) and actionable alerting; hands-on experience with tools like Datadog or similar.

Proven ability to lead incident management, perform root-cause analysis, and conduct blameless post-incident reviews.

Cloud Certification: AWS Certified DevOps Engineer or equivalent certification (e.g., Azure DevOps Engineer Expert, Google Professional Cloud DevOps Engineer).

Cloud Platforms: Proven proficiency in deploying and managing scalable infrastructure on AWS, Azure, or GCP.

Programming & Automation: Strong scripting and programming skills in Python, Bash, or Go, with experience automating operational tasks and building CI/CD pipelines.

Monitoring & Observability: Hands-on experience with system health and performance monitoring tools such as Prometheus, Grafana, and the ELK stack; prior experience with Datadog is strongly preferred.

CI/CD & Version Control: Expertise in Git-based workflows and CI/CD tools such as Jenkins, GitLab CI, or GitHub Actions.

Incident Response: Demonstrated ability to manage on-call rotations, perform root cause analysis, and lead post-mortem processes.

Troubleshooting: Skilled in diagnosing complex system issues quickly and effectively.

Residency: Must live within a commutable distance of Herndon, VA, or in one of the Clearinghouse's approved states for hiring purposes. See HR Policies Page for details.

Authorization: Must be currently authorized to work in the United States on a full-time basis. We do not intend to sponsor external applicants for work visas, and may consider sponsorship only if no qualified candidates can be found who are authorized to work without sponsorship.

Must be at least 18 years old.

Physical Demands

Use of a computer for 8 or more hours a day.

Use of a telephone and/or copy machine.

Frequently required to sit for 7 or more hours per day.

Occasionally required to use hands and fingers to operate, handle, and reach.

Vision abilities include close vision and the ability to adjust focus.

Flexibility to occasionally work overtime, evenings, or weekends as business needs arise.

Benefits and Related Information

The National Student Clearinghouse provides a robust benefit program designed to help meet the needs of each employee and their family, both now and in the future. We offer comprehensive medical, dental, and vision insurance, as well as life and disability insurance benefits, for employees and their qualified dependents. Health care accounts, flexible spending accounts, a health savings account with employer contributions, voluntary supplemental health plans, and infertility coverage are also available. We offer a generous 401(k) matching program, paid leave, holidays, and parental/military leave. New hires accrue vacation and sick time; the organization observes paid holidays annually. Additional perks include membership reimbursements, education assistance, and LinkedIn Learning access. The organization also supports PSLF eligibility for employees as a 501(c)(3) nonprofit. For details, request a copy of Benefits at a Glance.

The salary range is estimated at $120,000 – $140,000. The pay range guides compensation and is not a guarantee. Internal candidates are encouraged to apply, and compensation will be reviewed prior to offers. This job was posted on 12/3/2025 and remains open for at least 3 days.

Equal Opportunity and Additional Notices

Equal Opportunity Employer/Protected Veterans/Individuals with Disabilities:

The Clearinghouse is an Equal Opportunity Employer. Qualified applicants will receive consideration without regard to protected characteristics.

Pay Transparency Notice:

The Clearinghouse complies with applicable pay transparency laws. See policy details in the HR materials.
