Mega Cloud Lab
Overview
We are seeking a skilled Data Pipeline Engineer with deep expertise in building, orchestrating, and optimizing large-scale data ingestion pipelines. This role is perfect for someone who thrives on working with high-volume telemetry sources, refining complex data workflows, and solving challenges like schema drift in a distributed systems environment.
Location:
San Jose, CA (Onsite 2 days per week). Final-round interviews will be conducted in person.
Key Skills:
Proven experience designing and building multiple data pipelines, with deep expertise in Airflow, Kafka, Python (PySpark), and cloud platforms. Must have hands-on experience with large-scale, multi-terabyte data warehouses.
Key Responsibilities
Design, build, and manage scalable batch and real-time streaming pipelines for ingesting telemetry, log, and event data.
Develop, implement, and maintain robust data orchestration workflows using Apache Airflow or similar platforms.
Onboard new data sources by building efficient connectors (API, Kafka, file-based) and normalizing diverse, security-related datasets.
Proactively monitor and manage schema evolution and drift across various source systems and data formats.
Implement comprehensive pipeline observability, including logging, performance metrics, and alerting systems.
Continuously optimize data ingestion pipelines for performance, reliability, and cost efficiency.
Collaborate with cross-functional teams, including detection, threat intelligence, and platform engineering, to align data ingestion with security objectives.
Required Qualifications
5+ years of professional experience in data engineering or infrastructure roles with a focus on pipeline development.
Strong proficiency in Python and extensive experience with distributed data processing frameworks like Apache Spark/PySpark.
Hands-on experience with orchestration and workflow management tools such as Apache Airflow, Dagster, or Prefect.
Deep understanding of data ingestion patterns, schema management, and strategies for handling schema drift.
Practical experience with messaging/streaming platforms (e.g., Kafka) and cloud-native storage services (e.g., S3).
Proven experience developing solutions in a major cloud environment (AWS preferred; Azure or GCP also considered).