Onebrief

Senior Site Reliability Engineer, Colorado Springs

Onebrief, Honolulu, Hawaii, United States, 96814

About Onebrief

Onebrief is collaboration and AI-powered workflow software designed specifically for military staffs. By transforming this work, Onebrief makes the staff as a whole superhuman: faster, smarter, and more efficient. Onebrief operates as an all-remote company, with many employees working alongside customers at military commands around the world. Onebrief was founded in 2019 and has raised significant funding to support growth and impact.

This role requires regularly working on-site at customer locations in Colorado Springs, Colorado. If you are not within commuting distance, relocation assistance is available. Active Top Secret clearance is required; SCI eligibility is a plus.

About The Role

We are hiring a Site Reliability Engineer to join our Infrastructure & Security team. You’ll report to our Director of Infrastructure and work closely with fellow SREs, security, and customer success. You will be the first line of support for mission-critical deployments and will be responsible for ensuring best-in-class service quality and issue resolution. You will work in both on-premises DoD environments and AWS cloud environments. Your field experience will shape how our team works, from policy to implementation. You will contribute to solutions that increase the stability, performance, and security of deployments and improve the overall experience of deploying and managing Onebrief on premises.

About You

You are a force multiplier who views reliability as the most critical feature of any application and believes that "reliability beats novelty." You see infrastructure and operability as a product to be automated, documented, and continuously improved, always leaving systems easier to operate than you found them. You are comfortable leading a post-incident review, designing SLOs in a system design session, or triaging complex production issues. You translate constraints and failure modes into clear, automated guardrails and scalable, resilient architecture. You treat robust monitoring, actionable alerting, and insightful runbooks as core parts of the engineering process. You mentor others, foster blameless postmortems, and collaborate with application and platform teams to build tools and observability that enable fast recovery.

What You'll Do

You will own the reliability, scalability, and security of the production application and/or platform. Responsibilities include:

- Building a world-class observability platform: design, implement, and manage the monitoring, logging, and alerting stack (e.g., Prometheus, Loki, Grafana). Create actionable insights and automated alerting to identify and resolve issues before they impact users.
- Defining and upholding reliability: define, measure, and own alerting that feeds into Service Level Objectives (SLOs) and increases trust.
- Leading incident response: act as incident responder and incident commander during critical incidents; lead blameless post-mortems / After Action Reviews (AARs) to identify root causes and drive automated solutions to prevent recurrence.
- Automating for scale and security: partner with platform engineers to design, build, and manage secure, resilient Kubernetes clusters and cloud/on-prem environments using Infrastructure-as-Code (Terraform, Ansible). Embed security and compliance controls (RMF, STIGs) into automation.
- Eliminating toil and scaling the team: proactively identify and eliminate operational toil by building automation; advise other teams on best practices in air-gapped environments and production readiness.

What We Look For

- 3 years of experience in Site Reliability Engineering or a related field, with firsthand experience managing mission-critical systems within DoD air-gapped environments
- Active Top Secret security clearance. U.S. citizenship required.
- Experience automating software delivery and deployment, and providing documentation and self-service tools for engineering teams and customers
- Strong understanding of Linux, containerization/orchestration, and virtual machines
- Experience with centralized logging, metrics, and observability using tools such as Prometheus, Loki, Grafana, ELK stack, or Datadog
- Networking fundamentals: core protocols and secure configurations
- Strong incident response skills with experience conducting root cause analyses and driving continuous improvement
- Clear, concise writing; strong documentation habits and async communication
- Core skills and technologies: VMware, Kubernetes, Docker, Helm, Ansible, Terraform, Linux, AWS, DoD compliance, Monitoring and Observability tools

Bonus points (nice to have)

- Experience with compliance frameworks (RMF, STIGs/SRGs, ICD 503)
- Security-minded design for air-gapped environments
- Active Security+ or another DoD 8570.01-approved credential, or ability to obtain within 3 months of employment

Seniority level

Mid-Senior level

Employment type

Full-time

Job function

Engineering and Information Technology

Industries

Software Development
