Northwood
Site Reliability Engineer (Space Communications)
Northwood, Torrance, California, United States, 90504
Overview
Site Reliability Engineer (Space Communications) at Northwood. Join us to help build and maintain observability infrastructure and ensure that the global space communications network operates reliably as we scale ground stations around the world.
Responsibilities
Build and maintain the observability stack with tools such as Grafana, Prometheus, Loki, Vector, CloudWatch, and VictoriaMetrics for metrics and log ingestion across environments
Support and improve CI/CD pipelines using GitLab and ArgoCD, collaborating with development teams on deployment best practices
Help build and maintain cloud infrastructure using Terraform on AWS, contributing to the scalability and reliability of space communication systems
Work with senior engineers to establish monitoring strategies, alerting, and incident response procedures
Deploy and manage Kubernetes applications using Helm charts, focusing on reliability and developer experience
Collaborate with engineering teams to implement performance monitoring and troubleshooting across microservices
Support identity and access management integration with Okta and HashiCorp Vault
Assist in managing NixOS-based infrastructure for reproducible system configurations
Participate in incident response efforts and contribute to post-incident reviews and improvements
Basic Qualifications
2-4 years of hands-on experience with infrastructure tools and monitoring systems in production environments
Experience with containerization (Docker, Kubernetes) and basic container orchestration
Familiarity with CI/CD tools (GitLab, Jenkins, or similar) and infrastructure as code concepts
Experience with cloud platforms (AWS preferred) and basic infrastructure automation
Programming skills in Python or a similar language and experience with configuration management
Startup mentality with the ability to work in fast-paced, high-growth environments and take on diverse responsibilities
Experience with logging and metrics collection for production systems
Understanding of system reliability principles and interest in learning SRE practices
Preferred Qualifications
Some exposure to observability tools like Vector, Loki, Grafana, Prometheus, or similar monitoring systems
Experience with Terraform or other infrastructure as code tools
Familiarity with NixOS or other declarative system configuration approaches
Basic knowledge of HashiCorp Vault, Okta, or similar identity/secrets management tools
Interest in distributed systems and troubleshooting complex technical issues
Previous startup experience or demonstrated ability to learn quickly and adapt
Linux system administration experience
AWS certification or demonstrated cloud platform knowledge
Additional Information
To conform to U.S. Government space technology export regulations, including the International Traffic in Arms Regulations (ITAR), you must be a U.S. citizen, lawful permanent resident of the U.S., protected individual as defined by 8 U.S.C. 1324b(a)(3), or eligible to obtain the required authorizations from the U.S. Department of State.
Northwood is an Equal Opportunity Employer; employment with Northwood is governed on the basis of merit, competence and qualifications and will not be influenced in any manner by race, color, religion, gender, national origin/ethnicity, veteran status, disability status, age, sexual orientation, gender identity, marital status, mental or physical disability or any other legally protected status.