Airspace Link

Site Reliability Engineer

Airspace Link, Detroit, Michigan, United States, 48228

Job Summary

In a highly collaborative, distributed agile team environment, this role will ensure the scalability, reliability and performance of Airspace Link’s systems and applications. Working with the platform and software engineers, this role will also automate and improve operational processes.

Duties and Responsibilities

Reliability and Performance: Design, implement and maintain systems to ensure reliability, high availability and performance.

Scalability: Optimize applications and infrastructure to handle growth.

Monitoring and Alerting: Implement monitoring systems and performance metrics to proactively identify and address issues before they impact end users.

Incident Response: Respond to incidents in a timely manner. Lead efforts to resolve critical issues, prevent their recurrence and run postmortems.

Automation: Develop tools and scripts to automate manual operational tasks, increasing efficiency.

Infrastructure Management: Manage infrastructure using Infrastructure as Code (IaC) tools such as Terraform and Ansible.

Capacity Planning: Analyze and forecast future infrastructure needs.

Change Management: Implement practices to safely release code (CI/CD, canary releases, feature flags). Reduce risk while increasing deployment velocity.

Collaboration: Work closely with development teams to improve software quality and reliability.

Disaster Recovery: Create disaster recovery plans to mitigate the impact of system failures.

Security and Compliance: Implement security controls and conduct security audits and vulnerability assessments. Ensure systems adhere to industry standards and regulations, and conduct compliance audits and assessments.

Position Type: Full-Time, 40 hours per week

Status: Exempt

Location: Hybrid

Requirements

B.S. in Computer Science, or equivalent years of relevant experience or education

3+ years of professional experience in a similar SRE or DevOps role

Strong programming skills (Python, Go, Java)

Experience with cloud platforms (Azure, AWS, GCP)

Experience with containerization technologies (Docker, Kubernetes)

Experience with monitoring and logging tools (Prometheus, Grafana)

Knowledge of system administration

Strong problem-solving and analytical skills

Great teamwork skills

An eagerness to learn and adapt to the needs of a greenfield industry

FAA Part 107 certification or another pilot’s license is a plus
