OSI Engineering
Our client is scaling production ML systems and needs a hands-on engineer to help build, maintain, and run essential ML data pipelines. You'll own high-throughput data ingestion and transformation workflows (including image- and array-type modalities), enforce rigorous data quality standards, and partner with research and platform teams to keep models fed with reliable, versioned datasets.

Responsibilities:
- Design, build, and operate reliable ML data pipelines for batch and/or streaming use cases across cloud environments.
- Develop robust ETL/ELT processes (ingest, validate, cleanse, transform, and publish) with clear SLAs and monitoring.
- Implement data quality gates (schema checks, null/outlier handling, drift and bias signals) and data versioning for reproducibility.
- Optimize pipelines for distributed computing and large modalities (e.g., images, multi-dimensional arrays).
- Automate repetitive workflows with CI/CD and infrastructure-as-code; document, test, and harden for production.
- Collaborate with ML, Data Science, and Platform teams to align datasets, features, and model training needs.

Minimum Qualifications:
- 5+ years building and operating data pipelines in production.
- Cloud: Hands-on with AWS, Azure, or GCP services for storage, compute, orchestration, and security.
- Programming: Strong proficiency in Python and common data/ML libraries (pandas, NumPy, etc.).
- Distributed compute: Experience with at least one of Spark, Dask, or Ray.
- Modalities: Experience handling image-type and array-type data at scale.
- Automation: Proven ability to automate repetitive tasks (shell/Python scripting, CI/CD).
- Data Quality: Implemented validation, cleansing, and transformation frameworks in production.
- Data Versioning: Familiar with tools/practices such as DVC, LakeFS, or similar.
- Languages: Fluent in English or Farsi.

Strongly Preferred:
- SQL expertise (writing performant queries; optimizing on large datasets).
- Data warehousing/lakehouse concepts and tools (e.g., Snowflake/BigQuery/Redshift; Delta/Lakehouse patterns).
- Data virtualization/federation exposure (e.g., Presto/Trino) and semantic/metadata layers.
- Orchestration (Airflow, Dagster, Prefect) and observability/monitoring for data pipelines.
- MLOps practices (feature stores, experiment tracking, lineage, artifacts).
- Containers & IaC (Docker; Terraform/CloudFormation) and CI/CD for data/ML workflows.
- Testing for data/ETL (unit/integration tests, great_expectations or similar).

Soft Skills:
- Executes independently and creatively; comfortable owning outcomes in ambiguous environments.
- Proactive communicator who collaborates cross-functionally with DS/ML/Platform stakeholders.

Location: Seattle, WA
Duration: 1+ year
Pay: $56/hr