SAMO Technologies
Lead Site Reliability Engineer (On-Site) — Glendale, California
SAMO Technologies, Glendale, California, US, 91222
Overview
Drive reliability, scalability, and performance across production systems, using AI and automation to reduce toil, speed incident response, and forecast capacity.

What you’ll do
Own monitoring, alerting, and observability with anomaly detection and intelligent root-cause analysis.
Lead incident response and post-mortems; implement automated remediation.
Define SLOs/error budgets; use predictive analytics to anticipate reliability risks.
Architect automation for deploys, scaling, and infra management (including ML-driven workflows).
Lead capacity planning and performance optimization with forecasting models.
Mentor SREs; partner with engineering and data science to embed AI-enhanced reliability practices.
Contribute to on-call while continuously reducing noise and manual work through automation.

AI-Enhanced SRE Leadership
Implement and maintain AI-powered incident prediction and prevention systems.
Design intelligent alerting systems that reduce noise and provide contextual insights using natural language processing and machine learning.
Develop AI-driven capacity planning models that predict resource needs and optimize cost efficiency.
Build and maintain chatbots and AI assistants for operational tasks, documentation search, and incident triage.
Implement automated root cause analysis using AI correlation engines and log analysis.

Required experience
6+ years in SRE/DevOps/infrastructure roles; 2+ years leading initiatives or teams.
Strong Python skills and applied AI/ML for ops (anomaly detection, predictive analytics, NLP).
Deep cloud experience (AWS and/or GCP), including native AI/ML services.
Infrastructure as Code and config management; CI/CD with automated quality gates/decisions.
Monitoring/observability at scale, including time-series analysis and statistical modeling.
Expert-level in: AWS (core services, networking, compute, databases, storage), Terraform, Kubernetes (incl. Karpenter), and Helm.
Built/ran in-house observability stacks: OpenTelemetry, Loki, Grafana, Prometheus, CloudWatch, X-Ray/Jaeger.
Bonus: ArgoCD/Argo Workflows, LLM integration and prompt engineering, automated incident response, intelligent service mesh, AI-driven dashboards and cost optimization, cloud certs.

Location
On-site in Glendale, California

Compensation
$185,000 - $195,000 a year