ArrowCore Group
Data Center Site Reliability Engineer (SRE)
ArrowCore Group, Atlanta, Georgia, United States, 30383
Title: Data Center Site Reliability Engineer (SRE)
Duration: FTE

About the Role
We are seeking a Data Center Site Reliability Engineer (SRE) to ensure the reliability, scalability, and performance of large-scale data center infrastructure supporting advanced AI workloads. In this role, you will collaborate with cross-functional teams to automate operations, enhance observability, and maintain high availability for distributed systems. This is a hands-on technical position in a dynamic environment, offering the opportunity to tackle complex challenges at the intersection of AI, data center operations, and software reliability.

Key Responsibilities
- Maintain and improve the reliability and uptime of on-premises and cloud-based data center environments.
- Design, implement, and manage monitoring, logging, and alerting systems (e.g., Prometheus, Grafana, PagerDuty).
- Develop and maintain infrastructure-as-code (Pulumi, Terraform) and continuous deployment pipelines (Buildkite, ArgoCD).
- Participate in on-call rotations, respond to incidents, perform root cause analysis, and drive post-mortem processes.
- Analyze system performance, forecast capacity needs, and optimize resource utilization.
- Collaborate with hardware, networking, and software engineering teams to design and implement resilient, scalable solutions.
- Create and maintain documentation and standard operating procedures.

Required Qualifications
- Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent experience).
- 5+ years in site reliability engineering, data center operations, or large-scale infrastructure management.
- Expert-level knowledge of Kubernetes (on-prem and cloud), infrastructure-as-code tools (Pulumi, Terraform), and CI/CD systems.
- Proficiency in at least one systems programming language (Rust, C++, Go) and strong scripting/automation skills.
- Deep understanding of monitoring and observability technologies.
- Strong troubleshooting skills across hardware, networking, and distributed software systems.
- Experience with incident management and root cause analysis.
- Excellent communication and documentation skills.

Preferred Qualifications
- Experience supporting AI/ML workloads or high-density compute environments.
- Familiarity with data center electrical, cooling, and network systems.
- Certifications in SRE, Kubernetes, or data center operations.
- Experience with both on-premises and cloud infrastructure at scale.
Seniority level: Mid-Senior level
Employment type: Full-time
Job function: Information Technology and Consulting
Industries: IT Services and IT Consulting, IT System Operations and Maintenance, and Software Development