R Systems International Limited

Senior Site Reliability Engineer

R Systems International Limited, Poland, New York, United States

R Systems is seeking an experienced Senior Site Reliability Engineer to design, build, and operate resilient, scalable, and secure systems across multi-cloud environments. The role emphasizes AWS expertise (80%) with a strong Azure foundation (20%). You will lead initiatives in automation, observability, incident management, and release reliability to ensure mission-critical applications run smoothly at enterprise scale.

Responsibilities

Cloud Infrastructure (AWS & Azure)

Architect, implement, and manage highly available, fault-tolerant infrastructure.

AWS (primary): EKS, ECS, Lambda, API Gateway, S3, RDS, DynamoDB, IAM, CloudWatch, CloudTrail, CloudFormation/Terraform.

Azure (secondary): AKS, App Services, Azure Functions, Azure Monitor, Azure DevOps Pipelines.

Implement best practices for multi-cloud security, networking, and DR/BCP.

SRE & Reliability Engineering

Define and maintain SLIs, SLOs, and SLAs across distributed systems.

Conduct capacity planning, fault-tolerance reviews, chaos engineering, and DR drills.

Lead incident response, on-call rotations, and blameless postmortems.

Continuously optimize performance, cost, and reliability.

Automation & Infrastructure as Code (IaC)

Automate infrastructure provisioning with Terraform, Helm, Ansible, and GitOps workflows.

Design and maintain CI/CD pipelines (GitHub Actions, Jenkins, GitLab CI, Azure DevOps).

Enforce policy-as-code and integrate security & compliance automation.

Observability, Monitoring & Telemetry

Build comprehensive monitoring and observability solutions using CloudWatch, Prometheus, ELK/EFK, Datadog, Grafana, Splunk, and New Relic.

Implement centralized logging and distributed tracing based on OpenTelemetry standards.

Enable proactive alerting, anomaly detection, and automated remediation.

Release & Incident Management

Collaborate with DevOps and engineering teams to ensure reliable, safe, and repeatable releases.

Implement blue/green, rolling, and canary deployment strategies.

Drive root cause analysis (RCA), knowledge sharing, and preventive engineering.

Establish incident playbooks and integrate with ITSM and incident-alerting tools (ServiceNow, PagerDuty, Opsgenie).

Qualifications

7+ years of experience in SRE, DevOps, or cloud engineering roles.

Deep AWS expertise (60%) with working knowledge of Azure (40%).

Strong proficiency with Kubernetes (EKS/AKS), containers, and microservices.

Hands-on experience with Terraform, Helm, CI/CD platforms, and observability stacks.

Solid foundation in networking, IAM, cloud security, and compliance (SOC 2, HIPAA, NIST).

Proven track record of handling high-severity incidents and driving RCA.

Preferred Certifications

AWS Solutions Architect – Professional

Azure Solutions Architect Expert

Certified Kubernetes Administrator (CKA)

What’s In It For You

Hybrid work policy with equipment provided to support work-life balance.

Health coverage through a private medical subscription.

Professional development with access to Udemy and paid study time for eligible learners.

Referral bonuses and long-term contribution rewards.
