SpringbokIT

Senior Site Reliability Engineer

SpringbokIT, Arlington, Texas, United States, 76000

We're seeking a skilled and proactive individual to help maintain and improve the stability, availability, and efficiency of our systems. In this role, you'll collaborate closely with both development and operations teams to enhance our infrastructure, support application delivery, and optimize for cost and performance.

Key Responsibilities:

Contribute to designing and deploying scalable, dependable systems using Kubernetes, Docker, and Istio
Analyze system performance and recommend optimizations for responsiveness, uptime, and throughput
Monitor production environments and manage incidents using observability tools like Datadog
Write automation scripts to streamline deployment, monitoring, and infrastructure management
Apply GitOps practices to ensure reliable, traceable production deployments
Work with engineers to identify and troubleshoot system reliability issues
Perform load testing to confirm capacity for upcoming product changes or launches
Implement progressive deployment strategies such as A/B testing, canary releases, and traffic mirroring
Support high-volume systems on AWS, including EKS clusters, load balancing, and network routing
Maintain high service availability and user experience while optimizing cloud spend
Participate in global on-call rotations to support production reliability
Create and maintain internal documentation and promote knowledge sharing
Assist in applying best practices for system resiliency and operational excellence

Qualifications:

2+ years in SRE, DevOps, or infrastructure-related roles
Working knowledge of AWS services
Hands-on experience with containerization and orchestration (Kubernetes, Docker, Istio)
Familiar with observability platforms like Datadog, Prometheus, Grafana, AppDynamics, or ELK
Understanding of auto-scaling using Horizontal Pod Autoscalers (HPAs)
Experience with GitOps tools such as Argo CD
Familiarity with deployment techniques like blue/green, canary, and traffic splitting
Proficiency with infrastructure-as-code and automation tools (e.g., Terraform, Ansible)
Awareness of cloud cost optimization principles
Strong analytical and troubleshooting skills
Adaptability in learning and applying new tools and technologies
Self-motivated, detail-oriented, and accountable
Excellent collaboration and communication abilities
Upholds high standards for work quality and integrity
Experience with Golang or Rust is a plus, but not mandatory

Job Details

Seniority level: Mid-Senior level
Employment type: Contract
Job function: Information Technology
Industries: Technology, Information and Media
