1872 Consulting

Site Reliability Engineer

1872 Consulting, Redwood City, California, United States, 94061

Site Reliability Engineer - 100% Remote

Role Summary:

Site Reliability Engineers (SREs) work with different developer teams to keep our systems running smoothly. They are a blend of pragmatic operators and software craftspeople who apply excellent problem-solving and communication skills to develop or configure tools that automate, monitor, and alert on the reliability of internal systems.

What you will be doing:

Join the on-call rotation to respond to LeadIQ availability incidents and support developers with customer incidents.

Use your on-call shift to prevent incidents from happening. Step in, either actively or in support of the engineers, when they do.

Run our infrastructure with AWS, Terraform, and Kubernetes (EKS).

Think about systems: edge cases, failure modes, behaviors, and specific implementations.

Set up monitoring and alerts that trigger on symptoms rather than on outages.

Document every action, so your findings turn into repeatable actions, and then into automation.

Improve the deployment process to make it as boring as possible.

Design, build, and maintain core infrastructure pieces that allow LeadIQ to scale to support hundreds of thousands of concurrent users.

Debug production issues across services and levels of the stack.

Plan the growth of LeadIQ infrastructure.

Support engineering teams in defining and building SLIs and SLOs.

The Requirements:

4+ years working with Terraform and AWS

2+ years working with:

GitLab (or similar) as a CI tool

Datadog (or similar) as an alerting tool

Kubernetes

Know your way around Linux and the Unix Shell.

Programming skills in Node.js and/or Go

Nice to Haves:

Experience with the tech stack: Nginx, Docker, Kubernetes, Terraform, Terragrunt, AWS, GitLab, Helm, ArgoCD, Datadog, or similar technologies

AWS, Terraform, or Kubernetes certifications
