Optomi

Site Reliability Engineer

Optomi, Florida, New York, United States

Overview

Base pay range

This range is provided by Optomi. Your actual pay will be based on your skills and experience. Talk with your recruiter to learn more.

$145,000.00/yr - $160,000.00/yr

Optomi, in partnership with a leading global media organization, is seeking a Senior Site Reliability Engineer to join their data platform team. This position operates at the intersection of DevOps, data engineering, and platform reliability, working closely with cross-functional teams to ensure the scalability, observability, and reliability of high-throughput data systems.

Requirements and skills

6+ years of professional software engineering experience, with a focus on reliability, infrastructure, or platform engineering
Strong programming skills in Python and at least one statically typed language (e.g., Java, TypeScript, Go)
Deep hands-on experience with AWS services (Lambda, ECS/EKS, S3, IAM, API Gateway, SNS/SQS, Kinesis)
Proven experience operating and scaling distributed systems in production environments
Expertise in observability and telemetry design: tracing, metrics, logging
Proficiency in CI/CD automation, infrastructure-as-code (e.g., Terraform, AWS CDK), and DevOps best practices
Solid understanding of SQL/NoSQL data stores and architectural trade-offs
Familiarity with agile development workflows, code reviews, and collaborative SDLC processes
Experience leading incident response, root cause analysis, and driving continuous improvement
Ability to design and maintain SLAs, SLOs, and SLIs in production systems
Strong communication and cross-functional collaboration skills

Key responsibilities

Build, deploy, and maintain highly available and scalable infrastructure for data pipelines and platform services using AWS and infrastructure-as-code tools like Terraform or AWS CDK
Automate operational processes and reduce toil through scripting (Python, Go, etc.), CI/CD pipelines, and workflow automation
Monitor, analyze, and improve system performance, latency, and reliability using tools like CloudWatch, Datadog, and custom telemetry
Manage observability for services: design and implement SLIs, SLOs, and SLAs; maintain dashboards and alerts for distributed systems
Lead incident response, root cause analysis, and post-mortem reviews; continuously improve incident detection and remediation processes
Collaborate with data engineering and product teams to ensure reliable integration of new services and features into the data platform
Optimize cost and performance of cloud infrastructure, including autoscaling, provisioning strategies, and storage lifecycle policies
Participate in on-call rotation, ensuring timely resolution of production issues and follow-up improvements
Maintain compliance and security best practices within data infrastructure, including IAM, auditing, and resource governance
Review code, architecture, and infrastructure changes to ensure adherence to reliability and scalability standards

Seniority level

Mid-Senior level

Employment type

Full-time

Job function

Information Technology

Industries

Entertainment Providers
