ACL Digital

Cloud Operations/Platform Manager

ACL Digital, Tate, Georgia, United States, 30177


Job Description - Manager, Cloud/Platform Operations
Location: Atlanta / Roswell, GA (Onsite with Offshore Team Management)

Role Summary

Rheem is seeking a Manager, Cloud Operations to lead, transform, and scale its digital operations landscape across CloudOps, SRE, NOC, Observability, AIOps, and MLOps. This individual will serve as the single point of accountability for operational stability and innovation, managing offshore teams while working closely with Rheem's U.S. digital leadership. This role is not a steady-state manager position. The successful candidate will:

Identify operational gaps.
Suggest and implement best practices and tools.
Introduce automation and innovation strategies.
Guide daily deliverables for offshore teams.
Demonstrate tangible business impact each quarter (improved uptime, reduced MTTR, cost savings, predictive alerting, etc.).

The Manager will report to the Director of Digital Operations and act as Rheem's Cloud Operations Leader in practice.

Key Responsibilities

Operations Strategy & Governance

Define the vision, strategy, and roadmap for CloudOps, Reliability, and Operational Excellence.
Establish KPIs and OKRs aligned with Rheem's business goals (availability, MTTR, cloud cost per device, customer churn reduction).
Deliver quarterly impact reports to business leadership showcasing operational improvements and ROI.

Cloud Operations & FinOps

Own multi-region cloud operations across AWS and Azure platforms.
Drive cost transparency and optimization via FinOps practices and dashboards.
Build capacity and resiliency models for predictable operations.
Conduct resiliency drills and game days to ensure high availability and compliance.

Site Reliability Engineering (SRE)

Establish SLIs, SLOs, and error budgets to measure reliability.
Build incident management playbooks and drive blameless postmortems.
Proactively improve reliability through automation, self-healing, and continuous testing.

Network Operations Center (NOC) Modernization

Transform the NOC from alert-driven to predictive, AIOps-enabled operations.
Consolidate monitoring tools and reduce alert fatigue with intelligent correlation.
Ensure 24x7 support coverage through offshore team alignment and escalation management.

Observability & Telemetry

Build a unified observability stack (logs, metrics, traces, RUM) leveraging OpenTelemetry.
Enable business-oriented dashboards (device uptime, customer adoption, churn trends).
Ensure end-to-end visibility from connected devices → cloud microservices → customer-facing apps.

AIOps & MLOps [optional]

Deploy AIOps solutions for anomaly detection, predictive alerts, event correlation, and automated remediation.
Operationalize ML models: rollout, monitoring, drift detection, rollback strategies.
Showcase measurable value, e.g., warranty claim reduction, improved customer experience metrics.

Process Innovation & Automation

Audit the current toolchain and processes; identify redundancies, gaps, and opportunities for automation.
Align with DevOps/SecOps to streamline release-to-operations handshakes.
Drive Infrastructure-as-Code for operations (Terraform, Ansible, GitOps).

Team Leadership & Offshore Management

Manage and mentor a distributed team (offshore + onsite), setting clear goals and accountability.
Define roles, responsibilities, and shift structures for 24x7 global coverage.
Build a culture of continuous improvement and operational excellence.

Compliance, Security & Risk

Ensure Rheem operations align with compliance standards (SOC2, ISO, HIPAA where applicable).
Own business continuity planning and disaster recovery testing.
Proactively identify operational risks and mitigate them before they impact the business.

Business Alignment & Change Leadership

Act as the voice of operations at business leadership tables.
Translate technical improvements into business outcomes (lower churn, improved uptime, faster installs, fewer complaints).
Champion a quarterly innovation agenda to showcase improvements in uptime, cost, and reliability.

Qualifications

Must-Have

Experience & Leadership

10+ years of experience in Cloud Operations, Site Reliability Engineering, or Digital Operations.
Proven track record of owning operational outcomes (uptime, MTTR, cost optimization, observability).
Experience managing offshore/global delivery teams with 24x7 coverage.
Strong leadership presence - able to act as a change agent, operate autonomously, and deliver measurable outcomes without day-to-day direction.

Cloud & Technical Expertise

Hands-on experience with AWS and/or Azure (multi-account, multi-region operations).
Solid expertise with observability & monitoring tools (Datadog, Dynatrace, Splunk, Grafana, Prometheus, ELK/EFK).
Familiarity with Infrastructure-as-Code (Terraform, Ansible, GitOps).
Strong understanding of SRE principles (SLIs, SLOs, error budgets, incident management frameworks).

Process & Governance

Demonstrated ability to design and implement operations frameworks (Ops playbooks, NOC modernization, incident command systems).
Knowledge of FinOps practices (cloud cost visibility, optimization, showback/chargeback).
Experience ensuring compliance with SOC2, ISO, HIPAA, or equivalent standards.

Soft Skills

Excellent stakeholder communication skills - ability to link operational KPIs with business outcomes.
Strong team leadership and mentoring skills, especially across distributed teams.

Nice-to-Have

Exposure to AIOps platforms (Moogsoft, BigPanda, OpsRamp, ServiceNow AI modules).
Experience with MLOps tooling (MLflow, Kubeflow, SageMaker, Azure ML) for model deployment and monitoring.
Prior background in platform operations at a product/SaaS company (vs pure IT Ops).
Experience leading automation-first initiatives (predictive alerts, self-healing infra, auto-remediation pipelines).
Hands-on experience with CI/CD → Ops handshakes and change-impact assessments.

Cloud certifications:

AWS Certified Solutions Architect / DevOps Engineer
Microsoft Certified: Azure Administrator / Solutions Architect
FinOps Certified Practitioner