Mega Cloud Lab
Overview
We are seeking a skilled Data Pipeline Engineer with deep expertise in building, orchestrating, and optimizing large-scale data ingestion pipelines. This role is ideal for someone who thrives on working with high-volume telemetry sources, refining complex data workflows, and solving challenges such as schema drift in a distributed systems environment.
Location: San Jose, CA (Onsite 2 days per week). Final-round interviews will be conducted in person.
Key Skills: Proven experience designing and building data pipelines, with strong command of Airflow, Kafka, Python (PySpark), and cloud platforms. Hands-on experience managing large-scale, multi-terabyte data warehouses is required.
Key Responsibilities
- Design, build, and manage scalable batch and real-time streaming pipelines for ingesting telemetry, log, and event data.
- Develop, implement, and maintain robust data orchestration workflows using tools like Apache Airflow or similar platforms.
- Onboard new data sources by building efficient connectors (API, Kafka, file-based) and normalizing diverse, security-related datasets.
- Proactively monitor and manage schema evolution and drift across various source systems and data formats.
- Implement comprehensive pipeline observability, including logging, performance metrics, and alerting systems.
- Continuously optimize data ingestion for enhanced performance, reliability, and cost-effectiveness.
- Collaborate with cross-functional teams, including detection, threat intelligence, and platform engineering, to align data ingestion with security objectives.
Required Qualifications
- 5+ years of professional experience in data engineering or infrastructure roles with a focus on pipeline development.
- Strong proficiency in Python and extensive experience with distributed data processing frameworks like Apache Spark/PySpark.
- Hands-on experience with orchestration and workflow management tools such as Apache Airflow, Dagster, or Prefect.
- Deep understanding of data ingestion patterns, schema management, and strategies for handling schema drift.
- Practical experience with messaging/streaming platforms (e.g., Kafka) and cloud-native storage services (e.g., S3).
- Proven experience developing solutions in a major cloud environment (AWS preferred; Azure or GCP also considered).