ACL Digital
Job Description - Manager, Cloud/Platform Operations
Location: Atlanta / Roswell, GA (Onsite with Offshore Team Management)
Role Summary
Rheem is seeking a Manager, Cloud Operations to lead, transform, and scale its digital operations landscape across CloudOps, SRE, NOC, Observability, AIOps, and MLOps. This individual will serve as the single point of accountability for operational stability and innovation, managing offshore teams while working closely with Rheem's U.S. digital leadership. This role is not a steady-state manager position. The successful candidate will:
Identify operational gaps.
Suggest and implement best practices and tools.
Introduce automation and innovation strategies.
Guide daily deliverables for offshore teams.
Demonstrate tangible business impact each quarter (improved uptime, reduced MTTR, cost savings, predictive alerting, etc.).
The Manager will report to the Director of Digital Operations and act as Rheem's Cloud Operations Leader in practice.

Key Responsibilities
Operations Strategy & Governance
Define the vision, strategy, and roadmap for CloudOps, Reliability, and Operational Excellence.
Establish KPIs and OKRs aligned with Rheem's business goals (availability, MTTR, cloud cost per device, customer churn reduction).
Deliver quarterly impact reports to business leadership showcasing operational improvements and ROI.

Cloud Operations & FinOps
Own multi-region cloud operations across AWS and Azure platforms.
Drive cost transparency and optimization via FinOps practices and dashboards.
Build capacity and resiliency models for predictable operations.
Conduct resiliency drills and game days to ensure high availability and compliance.

Site Reliability Engineering (SRE)
Establish SLIs, SLOs, and error budgets to measure reliability.
Build incident management playbooks and drive blameless postmortems.
Proactively improve reliability through automation, self-healing, and continuous testing.

Network Operations Center (NOC) Modernization
Transform NOC from alert-driven to predictive, AIOps-enabled operations.
Consolidate monitoring tools and reduce alert fatigue with intelligent correlation.
Ensure 24x7 support coverage through offshore team alignment and escalation management.

Observability & Telemetry
Build a unified observability stack (logs, metrics, traces, RUM) leveraging OpenTelemetry.
Enable business-oriented dashboards (device uptime, customer adoption, churn trends).
Ensure end-to-end visibility from connected devices → cloud microservices → customer-facing apps.

AIOps & MLOps [optional]
Deploy AIOps solutions for anomaly detection, predictive alerts, event correlation, and automated remediation.
Operationalize ML models: rollout, monitoring, drift detection, rollback strategies.
Showcase measurable value, e.g., warranty claim reduction, improved customer experience metrics.

Process Innovation & Automation
Audit current toolchain and processes; identify redundancies, gaps, and opportunities for automation.
Align with DevOps/SecOps to streamline release-to-operations handshakes.
Drive Infrastructure-as-Code for operations (Terraform, Ansible, GitOps).

Team Leadership & Offshore Management
Manage and mentor a distributed team (offshore + onsite), setting clear goals and accountability.
Define roles, responsibilities, and shift structures for 24x7 global coverage.
Build a culture of continuous improvement and operational excellence.

Compliance, Security & Risk
Ensure Rheem operations align with compliance standards (SOC2, ISO, HIPAA where applicable).
Own business continuity planning and disaster recovery testing.
Proactively identify operational risks and mitigate them before they impact the business.

Business Alignment & Change Leadership
Act as the voice of operations at business leadership tables.
Translate technical improvements into business outcomes (lower churn, improved uptime, faster installs, fewer complaints).
Champion a quarterly innovation agenda to showcase improvements in uptime, cost, and reliability.

Qualifications
Must-Have
Experience & Leadership
10+ years of experience in Cloud Operations, Site Reliability Engineering, or Digital Operations.
Proven track record of owning operational outcomes (uptime, MTTR, cost optimization, observability).
Experience managing offshore/global delivery teams with 24x7 coverage.
Strong leadership presence - able to act as a change agent, operate autonomously, and deliver measurable outcomes without day-to-day direction.

Cloud & Technical Expertise
Hands-on experience with AWS and/or Azure (multi-account, multi-region operations).
Solid expertise with observability & monitoring tools (Datadog, Dynatrace, Splunk, Grafana, Prometheus, ELK/EFK).
Familiarity with Infrastructure-as-Code (Terraform, Ansible, GitOps).
Strong understanding of SRE principles (SLIs, SLOs, error budgets, incident management frameworks).

Process & Governance
Demonstrated ability to design and implement operations frameworks (Ops playbooks, NOC modernization, incident command systems).
Knowledge of FinOps practices (cloud cost visibility, optimization, showback/chargeback).
Experience ensuring compliance with SOC2, ISO, HIPAA or equivalent standards.

Soft Skills
Excellent stakeholder communication skills - ability to link operational KPIs with business outcomes.
Strong team leadership and mentoring skills, especially across distributed teams.

Nice-to-Have
Exposure to AIOps platforms (Moogsoft, BigPanda, OpsRamp, ServiceNow AI modules).
Experience with MLOps tooling (MLflow, Kubeflow, SageMaker, Azure ML) for model deployment and monitoring.
Prior background in platform operations at a product/SaaS company (vs pure IT Ops).
Experience leading automation-first initiatives (predictive alerts, self-healing infra, auto-remediation pipelines).
Hands-on experience with CI/CD → Ops handshakes and change-impact assessments.
Cloud certifications:
AWS Certified Solutions Architect / DevOps Engineer
Microsoft Certified: Azure Administrator / Solutions Architect
FinOps Certified Practitioner