Software Guidance & Assistance, Inc. (SGA, Inc.)

Sr Site Reliability Engineer

Software Guidance & Assistance, Inc. (SGA, Inc.), Alpharetta, Georgia, United States, 30239

Software Guidance & Assistance, Inc. (SGA) is searching for a Sr Site Reliability Engineer for a contract assignment with one of our premier financial services clients in Alpharetta, GA.

We are looking for an experienced Senior Site Reliability Engineer to join our team and play a critical role in ensuring the reliability, scalability, and performance of our cloud infrastructure. You will be a technical leader who combines deep operational expertise with strong automation skills to build and maintain highly available systems. As a Kubernetes expert, you will drive our container orchestration strategy and serve as a technical authority for our platform teams.

Job Responsibilities

Infrastructure & Automation

Design, deploy, and manage cloud infrastructure across AWS and Azure using Terraform and infrastructure-as-code principles

Architect, deploy, and maintain production-grade Kubernetes clusters with a focus on reliability, security, and performance

Serve as the subject matter expert on Kubernetes, providing guidance and best practices to engineering teams

Build and maintain automated provisioning pipelines to ensure consistent, repeatable deployments

Implement and maintain HashiCorp Vault on AWS for secrets management and security, including Vault integration with Kubernetes

Design and implement automated High Availability and Disaster Recovery (HA/DR) capabilities through CI/CD pipelines

Optimize cloud resources and Kubernetes workloads for performance, cost efficiency, and reliability

Observability & Monitoring

Architect and implement comprehensive observability solutions using Datadog for cloud-native applications and Kubernetes infrastructure

Build monitoring, logging, and alerting frameworks for containerized workloads that provide actionable insights into system health

Implement Kubernetes-native monitoring patterns and troubleshoot complex container orchestration issues

Integrate Datadog with PagerDuty and other incident management platforms

Define and track SLIs, SLOs, and error budgets to drive reliability improvements

Create custom dashboards and monitors to track infrastructure, application, and Kubernetes cluster performance

CI/CD & Pipeline Management

Design, build, and maintain robust CI/CD pipelines that enable rapid, safe deployments to Kubernetes

Implement GitOps workflows and automated deployment strategies for containerized applications

Implement automated testing, security scanning, and quality gates within pipelines

Drive solutions through test, QA, and production environments with appropriate controls and safeguards

Automate deployment strategies including blue‑green, canary, and rolling deployments in Kubernetes

Security & Vulnerability Management

Identify, assess, and remediate security vulnerabilities in infrastructure, applications, and Kubernetes clusters

Implement Kubernetes security best practices, including RBAC, Pod Security Standards (or legacy pod security policies), and network policies

Collaborate with security teams to implement and maintain security best practices

Manage and maintain HashiCorp Vault infrastructure for secure secrets management

Ensure compliance with security policies and industry standards across all environments

Incident Management & Response

Participate in a 24/7 on‑call rotation to respond to critical production incidents

Serve as Incident Commander, coordinating cross‑functional response teams during major outages

Lead post‑incident reviews and drive thorough root cause analysis across engineering teams

Troubleshoot complex Kubernetes and distributed systems issues under pressure

Develop and refine incident response procedures and runbooks

Collaboration & Leadership

Partner with engineering teams to improve system reliability and performance

Mentor junior SREs and promote SRE best practices across the organization

Lead Kubernetes adoption efforts and educate teams on container orchestration best practices

Drive initiatives to reduce toil through automation and process improvement

Contribute to architectural decisions with a reliability and operability lens

Required Skills

10-15+ years of experience in Site Reliability Engineering, DevOps, or similar roles

Expert‑level knowledge of Kubernetes, including architecture, operations, and troubleshooting in production environments

Proven track record as a go‑to Kubernetes resource and technical authority

Deep understanding of container technologies (Docker, containerd) and orchestration patterns

Strong hands‑on experience with AWS and Azure cloud platforms

Proficiency in Terraform for infrastructure automation and management

Expert‑level knowledge of Datadog for monitoring, logging, and observability

Experience with HashiCorp Vault, including deployment and management on AWS and Kubernetes integration

Deep understanding of CI/CD pipelines, including design, implementation, and optimization for containerized workloads

Proven ability to implement automated HA/DR solutions through CI/CD workflows

Strong programming skills in Python for automation, tooling, and analysis

Proven experience building observability solutions for distributed cloud applications

Experience configuring monitoring and alerting systems and integrating them with paging platforms such as PagerDuty

Demonstrated experience identifying and remediating security vulnerabilities

Experience driving deployments through multiple environments (test/QA/production) with proper gates and controls

Demonstrated experience participating in on‑call rotations and responding to production incidents

Experience serving as Incident Commander or leading incident response efforts

Track record of conducting root cause analysis and driving systemic improvements

Strong understanding of networking, security, and cloud architecture principles

Excellent communication skills, with the ability to work across multiple teams and explain complex Kubernetes concepts

Preferred Qualifications

Experience with Google Cloud Platform (GCP) and GKE

Certified Kubernetes Administrator (CKA) or Certified Kubernetes Security Specialist (CKS)

Experience with service mesh technologies (Istio, Linkerd, Consul)

Knowledge of Helm, Kustomize, and other Kubernetes tooling

Experience with GitOps tools (ArgoCD, Flux)

Familiarity with additional CI/CD tools (Jenkins, GitLab CI, GitHub Actions, CircleCI)

Experience with configuration management tools (Ansible, Chef, Puppet)

Background in software engineering or systems programming

Understanding of chaos engineering and reliability testing methodologies

Experience with cost optimization strategies in cloud and Kubernetes environments

Security certifications (AWS Security Specialty, CISSP, CKS, etc.)

Experience with compliance frameworks (SOC 2, ISO 27001, etc.)

Contributions to open‑source Kubernetes projects or active participation in the Kubernetes community

SGA is a technology and resource solutions provider driven to stand out. We are a women‑owned business. Our mission: to solve big IT problems with a more personal, boutique approach. Each year, we match consultants like you to more than 1,000 engagements. When we say let’s work better together, we mean it. You’ll join a diverse team built on these core values: customer service, employee development, and quality and integrity in everything we do. Be yourself, love what you do and find your passion at work. Please find us at https://sgainc.com/.

SGA is an Equal Opportunity Employer and does not discriminate on the basis of Race, Color, Sex, Sexual Orientation, Gender Identity, Religion, National Origin, Disability, Veteran Status, Age, Marital Status, Pregnancy, Genetic Information, or Other Legally Protected Status. We are committed to providing access, equal opportunity, and reasonable accommodation for individuals with disabilities in employment and in our services, programs, and activities. Please visit our company EEO page to request an accommodation or assistance regarding our policy.
