Inside Higher Ed

ML Data Engineer – Healthcare Data Curation & Cleaning (1 Year Fixed Term)

Inside Higher Ed, Palo Alto, California, United States, 94306

Stanford University is seeking an experienced ML Data Engineer to drive programmatic curation, cleaning, and generation of healthcare data. This role focuses on developing and maintaining automated, ML-accelerated pipelines to ensure high-quality data for machine learning applications in a healthcare environment.

About the Organization

The Department of Biomedical Data Science merges biomedical informatics, biostatistics, computer science, and AI to advance precision health across molecular, tissue, medical imaging, EHR, biosensory, and population data.

Responsibilities

Design scalable, optimized, fault-tolerant Big Data systems and pipelines for programmatic cleaning, transformation, and curation of healthcare data.

Collaborate with scientific staff, IT professionals, and project managers to understand data requirements for Big Data projects.

Develop, test, implement, and maintain database applications; optimize and tune data designs to be reusable, repeatable, and robust.

Contribute to guidelines, standards, and processes to ensure data quality, integrity, and security.

Participate in setting data architecture strategy and standards using Big Data and analytics tools.

Work with IT and data owners to understand data collected across databases and data warehouses.

Research and suggest new tools and methods to improve data ingestion, storage, and access.

Key Responsibilities

Data Pipeline Engineering:

Design, implement, and maintain robust pipelines for cleaning, transforming, and curating healthcare data.

Develop automated processes to curate and validate data for compliance with healthcare standards (e.g., OMOP CDM, FHIR).

ML Data Engineering:

Apply core ML techniques to generate datasets, clean health records, and join heterogeneous data sources to improve data quality for model training.

Detect and correct data inconsistencies and anomalies in large-scale healthcare datasets.

Healthcare Data Expertise:

Work with patient-level health data in compliance with industry regulations and ethical standards.

Utilize the OMOP Common Data Model (OMOP CDM) to standardize and harmonize disparate healthcare data sources for interoperability and scalability.

Collaboration & Continuous Improvement:

Collaborate with data scientists, clinical informaticians, and engineers to align data engineering with analytical and clinical needs.

Monitor, troubleshoot, and optimize data workflows to support evolving research and operational requirements.

The expected pay range for this position is $157,945 to $177,385 per annum. Stanford provides pay ranges as part of its good faith estimate for a position; actual compensation will be determined based on scope, qualifications, budget, equity, location, and market factors.

Our benefits and rewards package is discussed during the hiring process. Stanford provides reasonable accommodations to applicants and employees with disabilities. Applicants requesting accommodation should contact Stanford Human Resources.

Equal Employment Opportunity

Stanford is an equal employment opportunity and affirmative action employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, protected veteran status, or any other characteristic protected by law.

Qualifications

Education & Experience (required):

Bachelor’s degree in a scientific or analytic field and five years of relevant experience, or an equivalent combination of education and experience.

Knowledge, Skills, and Abilities (required):

Knowledge of data structures, algorithms, and techniques for high-volume data systems; experience with relational, NoSQL, or NewSQL databases; experience in parallel and distributed data processing; proficiency in scripting and in at least one high-performance language; ability to document use cases and solutions; excellent written and verbal communication skills.

Certifications & Licenses:

None required.

Desired & Preferred Qualifications

3+ years of experience in software development and data engineering, with a focus on data cleaning and transformation.

Proficiency in Python with data processing libraries (Pandas, Polars, NumPy).

Experience with automated data pipelines for large-scale data processing.

Familiarity with ML frameworks (PyTorch, JAX, scikit-learn) as applied to data quality tasks.

Experience with OMOP CDM and healthcare data standards; Linux/UNIX environment proficiency.

Ability to work collaboratively in multidisciplinary teams.

Physical Requirements

Typical office environment; occasional light lifting.

Hybrid work arrangement available.
