Purple Drive

Site Reliability Engineer (SRE), Apache Flink & Kubernetes

Purple Drive, Jersey City, New Jersey, United States, 07390

LOCAL PREFERRED

We are seeking a highly skilled Site Reliability Engineer (SRE) with strong expertise in Apache Flink, Kubernetes, and automation. The ideal candidate will be responsible for designing, deploying, and maintaining scalable, resilient systems, while ensuring high availability and performance in production environments. This role requires a solid background in distributed systems, container orchestration, and DevOps practices.

Key Responsibilities

Design, implement, and maintain scalable Apache Flink deployments on Kubernetes.
Develop automation tools and scripts to streamline deployment, monitoring, and maintenance of Flink jobs and infrastructure.
Ensure high availability, scalability, and reliability of production systems.
Collaborate with development and infrastructure teams to optimize application performance.
Build and manage monitoring/alerting systems using Prometheus, Grafana, ELK stack, or similar tools.
Work with cloud platforms (AWS, GCP, Azure) to design and manage infrastructure.
Apply best practices for networking, security, and container orchestration.
Troubleshoot complex production issues and drive root cause analysis.
Contribute to CI/CD pipelines for deployment automation.
Participate in on-call rotations to ensure uptime and reliability.

Required Skills & Qualifications

Strong hands-on experience with Apache Flink in production environments.
Expertise in Kubernetes (Helm, Operators, CRDs).
Proficiency in scripting languages (Python, Bash, Go).
Experience with monitoring & observability tools (Prometheus, Grafana, ELK, etc.).
Solid understanding of cloud platforms (AWS, GCP, Azure).
Strong knowledge of networking, security, and container orchestration.
Familiarity with CI/CD pipelines and DevOps practices.
Excellent problem-solving, debugging, and communication skills.