Purple Drive
Site Reliability Engineer (SRE) Apache Flink & Kubernetes
Purple Drive, Jersey City, New Jersey, United States, 07390
LOCAL PREFERRED
We are seeking a highly skilled Site Reliability Engineer (SRE) with strong expertise in Apache Flink, Kubernetes, and automation. The ideal candidate will be responsible for designing, deploying, and maintaining scalable, resilient systems, while ensuring high availability and performance in production environments. This role requires a solid background in distributed systems, container orchestration, and DevOps practices.
Key Responsibilities
Design, implement, and maintain scalable Apache Flink deployments on Kubernetes.
Develop automation tools and scripts to streamline deployment, monitoring, and maintenance of Flink jobs and infrastructure.
Ensure high availability, scalability, and reliability of production systems.
Collaborate with development and infrastructure teams to optimize application performance.
Build and manage monitoring/alerting systems using Prometheus, Grafana, ELK stack, or similar tools.
Work with cloud platforms (AWS, GCP, Azure) to design and manage infrastructure.
Apply best practices for networking, security, and container orchestration.
Troubleshoot complex production issues and drive root cause analysis.
Contribute to CI/CD pipelines for deployment automation.
Participate in on-call rotations to ensure uptime and reliability.
Required Skills & Qualifications
Strong hands-on experience with Apache Flink in production environments.
Expertise in Kubernetes (Helm, Operators, CRDs).
Proficiency in scripting languages (Python, Bash, Go).
Experience with monitoring & observability tools (Prometheus, Grafana, ELK, etc.).
Solid understanding of cloud platforms (AWS, GCP, Azure).
Strong knowledge of networking, security, and container orchestration.
Familiarity with CI/CD pipelines and DevOps practices.
Excellent problem-solving, debugging, and communication skills.