F2OnSite
Senior Site Reliability Engineer - Kubernetes & Middleware Platforms
Role Overview:
As a Senior Site Reliability Engineer, you'll bring software engineering practices to operations - building the reliability framework, defining Service Level Objectives (SLOs), and automating toil away. You'll own the health and performance of container platforms (EKS & OpenShift), Middleware Platforms (Kafka, Redis), and the CI/CD/observability pipelines that power modern, distributed applications.
Key Responsibilities:
Platform Operations:
Administer and optimize Kubernetes clusters - Amazon EKS and Red Hat OpenShift.
Manage platform lifecycle, upgrades, scaling, and security controls.
Middleware Management:
Operate and tune event platforms like Apache Kafka.
Administer in-memory data stores like Redis Enterprise clusters.
Administer and maintain the 3scale API Gateway platform.
Automation:
Fine-tune Infrastructure-as-Code (IaC) pipelines and platform components.
Automate manual operations through IaC and configuration management tools/platforms.
Observability & Instrumentation:
Design and implement monitoring dashboards and alerts with Prometheus, Grafana, ELK stack, and Splunk.
Instrument distributed Java, Node.js, and Python apps - embed tracing, metrics, and logs at code level to meet SLOs.
Reliability Engineering:
Define SLIs/SLOs and manage error budgets - use data-driven insights to balance reliability and feature velocity.
Lead on-call rotations and incident response, and conduct blameless root cause analysis to drive continuous improvement.
Performance & Capacity:
Forecast and right-size resource usage across clusters and middleware.
Profile and tune application performance (CPU, memory, GC, threading) in production.
Required Skills & Qualifications:
12+ years of overall industry experience.
6+ years in SRE, DevOps, Platform, or Production Engineering roles.
EKS and/or OpenShift administration certification (CKA, AWS Certified Kubernetes Administrator, Red Hat Certified OpenShift Administrator, or equivalent).
Hands-on with Kubernetes internals, networking, Helm charts, and Operators.
Middleware expertise: Deploying, scaling, and securing Kafka and Redis clusters.
Strong IaC toolchain experience: Helm, ArgoCD, Terraform, Ansible, or equivalent tools/platforms.
Observability mastery: Prometheus, Grafana, ELK/Splunk, or equivalent tools/platforms.
Enforce container security and policy governance using tools like OPA/Gatekeeper, Kyverno, and scanners such as Trivy, Clair, and Snyk, integrated with CI/CD and admission controls for automated compliance.
Implement Kubernetes network segmentation using NetworkPolicy and/or Calico, ensuring secure east-west traffic and minimizing blast radius to protect service reliability.
Programming/scripting proficiency in Python, Shell Scripting, Groovy, or similar automation scripting.
Demonstrable experience instrumenting distributed applications (Java, Node.js, Python) with metrics, logs, and tracing libraries.
Proven track record of running large-scale production systems with minimal downtime.
Strong analytical, debugging, communication, and collaboration skills.
Nice-to-Have:
Service mesh experience (Istio, Linkerd).
Chaos engineering foundations (Chaos Monkey, LitmusChaos).
Familiarity with security/compliance in regulated environments.
Experience with any API gateway platform (e.g., Red Hat 3scale API Gateway).
What Makes This Role Unique:
You'll be the architect of reliability guardrails - building automation and pipelines that free developers and engineers from manual ops. You'll define and enforce SLO-driven releases, leveraging error budgets to strike the right balance between innovation and uptime. You'll own end-to-end instrumentation: from container runtime metrics through Kafka-backed event flows to application-level traces in code.