Apple Inc.

Senior Research Engineer, Training Data Infrastructure in Foundation Models

Apple Inc., Cupertino, California, United States, 95014

Our team is dedicated to solving the high-quality training data problem at the scale required to train advanced Foundation Models. We believe that advanced model performance (including reasoning, coding, and agentic planning) fundamentally depends on a data-centric approach to Machine Learning. Our objective is to engineer a large-scale system that acquires, processes, and curates the data required to advance the state of the art in Artificial Intelligence. We are seeking a Senior Research Engineer who possesses a deep understanding of distributed systems and a strong intuition for Machine Learning. You will join a culture that values engineering craftsmanship, privacy, and rigorous scientific inquiry, using advanced cloud technologies to build the data systems that power our most capable models.

Description This position operates at the convergence of Software Engineering and Machine Learning Research. Unlike traditional backend roles, this position requires you to design systems where the outcome is the statistical distribution and quality of data itself. You will work alongside Research Scientists to transform theoretical observations into concrete, scalable engineering solutions. Your core focus will be the architecture of our Data Acquisition, Processing, and Repository Management systems for Large Model training. You will lead technical efforts to enable active, quality-driven data curation, including filtering, deduplication, synthetic data generation, and data mixing, ensuring our models are trained on the highest-quality information available.

Responsibilities

Architect Scalable Ingestion Systems: Design and implement high-throughput distributed systems to ingest petabytes of text and multimodal data from diverse sources, including web crawls and third-party partnerships.
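
As a rough illustration of the kind of high-throughput ingestion worker this involves, here is a minimal sketch in Python using asyncio and aiohttp. The seed URLs, concurrency limit, and output path are invented placeholders; a production system would pull from a crawl frontier or partner feed and write to distributed storage rather than a local file.

```python
import asyncio
import json

import aiohttp

# Hypothetical seed list and concurrency limit for the example only.
SEED_URLS = ["https://example.com/doc1", "https://example.com/doc2"]
MAX_CONCURRENCY = 16


async def fetch_one(session: aiohttp.ClientSession, url: str,
                    sem: asyncio.Semaphore) -> dict:
    """Fetch a single document and return a small JSON-serializable record."""
    async with sem:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
            text = await resp.text()
            return {"url": url, "status": resp.status, "text": text}


async def ingest(urls: list[str], out_path: str) -> None:
    """Fetch documents concurrently and append them as JSON lines."""
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        records = await asyncio.gather(
            *(fetch_one(session, u, sem) for u in urls), return_exceptions=True
        )
    with open(out_path, "w", encoding="utf-8") as f:
        for rec in records:
            if isinstance(rec, dict):  # skip failed fetches in this sketch
                f.write(json.dumps(rec) + "\n")


if __name__ == "__main__":
    asyncio.run(ingest(SEED_URLS, "ingested.jsonl"))
```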

Repository Optimization: Manage the lifecycle of large-scale datasets across data storage and high-performance file systems. Optimize data formats for efficient random access and sequential scanning during model training.
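
To make "efficient random access and sequential scanning" concrete, here is a small sketch using Apache Parquet via pyarrow. The schema, file name, and row-group size are assumptions for the example, not a description of any actual storage layout.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical document table; a real training shard would hold far more rows.
table = pa.table({
    "doc_id": list(range(1_000)),
    "text": [f"document {i}" for i in range(1_000)],
})

# Smaller row groups keep random access cheap: a reader can pull one group
# without scanning the whole file.
pq.write_table(table, "shard-00000.parquet", row_group_size=128)

pf = pq.ParquetFile("shard-00000.parquet")

# Sequential scan (e.g., streaming documents during training): iterate batches.
for batch in pf.iter_batches(batch_size=256, columns=["text"]):
    _ = batch.num_rows

# Random access (e.g., inspection or targeted resampling): read one row group.
group = pf.read_row_group(3, columns=["doc_id", "text"])
print(group.num_rows, group["doc_id"][0])
```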

Data Governance & Privacy: Engineer robust data governance and privacy solutions for the training data, in collaboration with compliance and legal teams, to ensure adherence to stringent regulatory standards.
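
One narrow, illustrative example of the tooling side of this work is a regex-based scrubber for obvious PII patterns. The patterns and replacement policy below are invented for the sketch; real governance relies on far broader detection methods defined together with compliance and legal teams.

```python
import re

# Illustrative patterns only; production PII detection needs much wider coverage.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\(?\b\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b")


def scrub(text: str) -> str:
    """Replace obvious email addresses and phone numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


print(scrub("Contact jane.doe@example.com or (408) 555-0100 for details."))
```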

High-Performance Processing Pipelines: Build and maintain distributed data processing workflows using modern, large-scale frameworks on cloud infrastructure (e.g., GCP, AWS).
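
A hedged sketch of one stage such a pipeline might contain, written with PySpark purely as an example framework; the column names, quality heuristic, and storage paths are assumptions, not details from the posting.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Example only: a tiny quality-filtering stage. Real pipelines chain many
# stages (language ID, heuristics, model-based scoring, deduplication).
spark = SparkSession.builder.appName("quality-filter-example").getOrCreate()

docs = spark.read.parquet("s3://example-bucket/raw-docs/")  # hypothetical path

filtered = (
    docs
    # Drop very short documents.
    .filter(F.length(F.col("text")) > 200)
    # Drop documents that are mostly non-alphabetic (a crude quality proxy).
    .withColumn(
        "alpha_ratio",
        F.length(F.regexp_replace(F.col("text"), r"[^A-Za-z]", "")) / F.length(F.col("text")),
    )
    .filter(F.col("alpha_ratio") > 0.6)
    .drop("alpha_ratio")
)

filtered.write.mode("overwrite").parquet("s3://example-bucket/filtered-docs/")
```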

Algorithmic Data Curation: Implement sophisticated data filtering and selection logic to remove low-quality content and develop semantic deduplication at scale to prevent model memorization and improve training efficiency.
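
As a minimal sketch of near-duplicate detection, here is a pure-Python MinHash over word shingles. The shingle size, number of hash functions, and toy corpus are illustrative; full semantic deduplication would typically cluster document embeddings, and at petabyte scale the comparison step would use locality-sensitive hashing inside a distributed pipeline.

```python
import hashlib
from itertools import combinations


def shingles(text: str, n: int = 5) -> set[str]:
    """Word n-gram shingles of a document (n=5 is an arbitrary example choice)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash_signature(text: str, num_hashes: int = 64) -> list[int]:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    sig = []
    for seed in range(num_hashes):
        min_val = min(
            int.from_bytes(hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(), "big")
            for s in shingles(text)
        )
        sig.append(min_val)
    return sig


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


docs = {
    "a": "the quick brown fox jumps over the lazy dog near the river bank",
    "b": "the quick brown fox jumps over the lazy dog near the river bend",
    "c": "completely unrelated text about training data infrastructure at scale",
}
sigs = {k: minhash_signature(v) for k, v in docs.items()}
for x, y in combinations(docs, 2):
    print(x, y, round(estimated_jaccard(sigs[x], sigs[y]), 2))
```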

Benchmark Decontamination: Design automated systems to detect and remove benchmark leakage, ensuring that evaluation datasets remain strictly isolated from training corpora.
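
A simplified sketch of n-gram-based contamination checking follows. The 13-gram window is a common choice in published decontamination recipes, but the exact matching policy here is an assumption for the example, and the toy strings are invented.

```python
import re


def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Lower-cased word n-grams after stripping punctuation."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def is_contaminated(train_doc: str, benchmark_texts: list[str], n: int = 13) -> bool:
    """Flag a training document if it shares any n-gram with a benchmark item."""
    bench_ngrams = set()
    for t in benchmark_texts:
        bench_ngrams |= ngrams(t, n)
    return bool(ngrams(train_doc, n) & bench_ngrams)


benchmark = ["What is the capital of France? The capital of France is Paris, "
             "a city on the Seine known for its museums and architecture."]
clean_doc = "A short unrelated paragraph about data pipelines and storage formats."
leaky_doc = ("... the capital of France is Paris, a city on the Seine known for "
             "its museums and architecture, which appeared verbatim in a crawl ...")

print(is_contaminated(clean_doc, benchmark))  # False
print(is_contaminated(leaky_doc, benchmark))  # True
```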

Infrastructure for Scaling Laws: Collaborate with researchers to enable data ablations and scaling experiments. Build tools to support systematic data mixture optimization and empirical data studies. …
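
To make the data-mixture side concrete, here is a small hedged sketch of sampling training batches from multiple domains under explicit mixture weights, which is the kind of knob a mixture-optimization study would sweep. The domain names and weights are invented for the example.

```python
import random

# Hypothetical domain corpora and mixture weights for illustration only.
DOMAINS = {
    "web": ["web doc 1", "web doc 2", "web doc 3"],
    "code": ["code doc 1", "code doc 2"],
    "books": ["book doc 1"],
}
MIXTURE_WEIGHTS = {"web": 0.6, "code": 0.3, "books": 0.1}


def sample_batch(batch_size: int, seed: int = 0) -> list[str]:
    """Draw a batch whose domain composition follows MIXTURE_WEIGHTS.

    An ablation study would re-run training under different weight settings
    and compare downstream evaluations to choose a mixture.
    """
    rng = random.Random(seed)
    names = list(MIXTURE_WEIGHTS)
    weights = [MIXTURE_WEIGHTS[n] for n in names]
    batch = []
    for _ in range(batch_size):
        domain = rng.choices(names, weights=weights, k=1)[0]
        batch.append(rng.choice(DOMAINS[domain]))
    return batch


print(sample_batch(8))
```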
