F2OnSite

Sr. Site Reliability Engineer

F2OnSite, Jersey City, New Jersey, United States, 07390

Senior Site Reliability Engineer - Kubernetes & Middleware Platforms

Role Overview:

As a Senior Site Reliability Engineer, you'll bring software engineering practices to operations: building the reliability framework, defining Service Level Objectives (SLOs), and automating toil away. You'll own the health and performance of container platforms (EKS and OpenShift), middleware platforms (Kafka, Redis), and the CI/CD and observability pipelines that power modern, distributed applications.

Key Responsibilities:

Platform Operations:

Administer and optimize Kubernetes clusters: Amazon EKS and Red Hat OpenShift.

Manage platform lifecycle, upgrades, scaling, and security controls.

Middleware Management:

Operate and tune event platforms like Apache Kafka.

Administer in-memory data stores like Redis Enterprise clusters.

Administer and maintain the 3scale API Gateway platform.

Automation:

Fine-tune Infrastructure-as-Code (IaC) pipelines and platform components.

Automate manual operations through IaC and configuration management tools/platforms.

Observability & Instrumentation:

Design and implement monitoring dashboards and alerts with Prometheus, Grafana, ELK stack, and Splunk.

Instrument distributed Java, Node.js, and Python apps: embed tracing, metrics, and logs at code level to meet SLOs.

Reliability Engineering:

Define SLIs/SLOs and manage error budgets, using data-driven insights to balance reliability and feature velocity.

Lead on-call rotations and incident response, and conduct blameless root cause analysis to drive continuous improvement.

Performance & Capacity:

Forecast and right-size resource usage across clusters and middleware.

Profile and tune application performance (CPU, memory, GC, threading) in production.

Required Skills & Qualifications:

12+ years of overall industry experience.

6+ years in SRE, DevOps, Platform, or Production Engineering roles.

EKS and/or OpenShift administration certification (CKA, AWS Certified Kubernetes Administrator, Red Hat Certified OpenShift Administrator, or equivalent).

Hands-on with Kubernetes internals, networking, Helm charts, and Operators.

Middleware expertise: deploying, scaling, and securing Kafka and Redis clusters.

Strong IaC toolchain experience: Helm, ArgoCD, Terraform, Ansible, or equivalent tools/platforms.

Observability mastery: Prometheus, Grafana, ELK/Splunk, or equivalent tools/platforms.

Enforce container security and policy governance using tools like OPA/Gatekeeper, Kyverno, and scanners such as Trivy, Clair, and Snyk, integrated with CI/CD and admission controls for automated compliance.

Implement Kubernetes network segmentation using NetworkPolicy and/or Calico, ensuring secure east-west traffic and minimizing blast radius to protect service reliability.

Programming/scripting proficiency in Python, Shell Scripting, Groovy, or similar automation scripting.

Demonstrable experience instrumenting distributed applications (Java, Node.js, Python) with metrics, logs, and tracing libraries.

Proven track record of running large-scale production systems with minimal downtime.

Strong analytical, debugging, communication, and collaboration skills.

Nice-to-Have:

Service mesh experience (Istio, Linkerd).

Chaos engineering foundations (Chaos Monkey, LitmusChaos).

Familiarity with security/compliance in regulated environments.

Experience with any API gateway platform (e.g., Red Hat 3scale API Gateway).

What Makes This Role Unique:

You'll be the architect of reliability guardrails, building automation and pipelines that free developers and engineers from manual ops.

You'll define and enforce SLO-driven releases, leveraging error budgets to strike the right balance between innovation and uptime.

You'll own end-to-end instrumentation: from container runtime metrics through Kafka-backed event flows to application-level traces in code.