
Data Pipeline Engineer

Mega Cloud Lab, San Jose


Overview

We are seeking a skilled Data Pipeline Engineer with deep expertise in building, orchestrating, and optimizing large-scale data ingestion pipelines. This role is perfect for someone who thrives on working with high-volume telemetry sources, refining complex data workflows, and solving challenges like schema drift in a distributed systems environment.

Location: San Jose, CA (on-site 2 days per week). Final-round interviews will be conducted in person.

Key Skills: Proven experience designing and building data pipelines, with deep expertise in Airflow, Kafka, Python (PySpark), and cloud platforms. Hands-on experience with large-scale, multi-terabyte data warehouses is required.

Key Responsibilities

  • Design, build, and manage scalable batch and real-time streaming pipelines for ingesting telemetry, log, and event data.
  • Develop, implement, and maintain robust data orchestration workflows using tools like Apache Airflow or similar platforms.
  • Onboard new data sources by building efficient connectors (API, Kafka, file-based) and normalizing diverse, security-related datasets.
  • Proactively monitor and manage schema evolution and drift across various source systems and data formats.
  • Implement comprehensive pipeline observability, including logging, performance metrics, and alerting systems.
  • Continuously optimize data ingestion for enhanced performance, reliability, and cost-effectiveness.
  • Collaborate with cross-functional teams, including detection, threat intelligence, and platform engineering, to align data ingestion with security objectives.

Required Qualifications

  • 5+ years of professional experience in data engineering or infrastructure roles with a focus on pipeline development.
  • Strong proficiency in Python and extensive experience with distributed data processing frameworks like Apache Spark/PySpark.
  • Hands-on experience with orchestration and workflow management tools such as Apache Airflow, Dagster, or Prefect.
  • Deep understanding of data ingestion patterns, schema management, and strategies for handling schema drift.
  • Practical experience with messaging/streaming platforms (e.g., Kafka) and cloud-native storage services (e.g., S3).
  • Proven experience developing solutions in a major cloud environment (AWS preferred; Azure or GCP also considered).

