Jobright.ai
LLM Training Dataset Optimization Engineer, Mid-Level
Jobright.ai, San Francisco, California, United States, 94199
Together AI is a leader in developing AI infrastructure that powers the training of state-of-the-art models. The company is seeking a Training Dataset and Checkpoint Acceleration Engineer to optimize data pipelines and checkpoint mechanisms for large-scale machine learning workloads, ensuring high performance and reliability in training workflows.

Responsibilities:
• Design and optimize high-throughput data pipelines for streaming and processing massive training datasets.
• Implement caching, sharding, and prefetching techniques to maximize data-loading efficiency (a data-loading sketch follows the company overview below).
• Ensure efficient integration with distributed storage systems (e.g., S3, GCS, Lustre, Ceph).
• Build and optimize distributed checkpoint mechanisms for large-scale training workflows.
• Implement techniques to minimize checkpoint I/O overhead and ensure fault tolerance (see the asynchronous checkpointing sketch below).
• Develop incremental and differential checkpointing solutions to reduce storage costs.
• Profile and debug bottlenecks in data pipelines and checkpoint systems.
• Optimize GPU/TPU utilization by ensuring efficient data feeding and fast checkpoint recovery.
• Develop systems that scale efficiently across thousands of nodes and petabyte-scale datasets.
• Ensure fault-tolerant recovery and resume mechanisms for long-running training jobs.
• Work closely with ML researchers, data engineers, and infrastructure teams to understand workload requirements.
• Build tools and frameworks that integrate dataset and checkpointing systems seamlessly with existing ML workflows.

Qualifications:

Required:
• 5+ years of experience in data engineering, distributed systems, or ML infrastructure.
• Expertise in high-performance data processing libraries (e.g., PyTorch DataLoader, TensorFlow tf.data, DALI).
• Proficiency with distributed storage systems and data formats (e.g., Parquet, HDF5).
• Strong understanding of checkpointing frameworks and file systems (e.g., POSIX, Lustre, GPFS).
• Proficiency in Python, C++, or Go for performance-critical systems.
• Experience with I/O optimization techniques (e.g., asynchronous data loading, prefetching).
• Familiarity with compression and serialization for large datasets and checkpoints.
• Analytical and problem-solving mindset.
• Strong communication and collaboration skills across teams.

Preferred:
• Experience with ML frameworks (e.g., PyTorch, TensorFlow, JAX) and distributed training.
• Familiarity with hardware accelerators (e.g., GPUs, TPUs) and storage optimizations.
• Open-source contributions or projects related to data pipelines or checkpointing.
• Experience with incremental and real-time checkpointing solutions.

Company:
Together AI is a cloud-based platform for building open-source generative AI and the infrastructure for developing AI models. Founded in 2022, the company is headquartered in San Francisco, California, USA, with a team of 201-500 employees. The company is currently at the growth stage and has a track record of offering H-1B sponsorship.
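The responsibilities above name sharding and prefetching without tying them to a specific implementation. As a minimal sketch, assuming PyTorch (one of the frameworks listed), the example below splits a set of shard files across DataLoader workers via an IterableDataset and relies on num_workers and prefetch_factor to keep batches ready ahead of the training step. The ShardedTextDataset class and the shard_*.txt files are hypothetical placeholders, not part of the role description.

```python
# Minimal sketch, assuming PyTorch. The dataset class, shard files, and sizes are
# hypothetical placeholders used only to illustrate sharded, prefetched loading.
import torch
from torch.utils.data import DataLoader, IterableDataset


class ShardedTextDataset(IterableDataset):
    """Streams lines from a list of shard files, splitting shards across workers."""

    def __init__(self, shard_paths):
        self.shard_paths = shard_paths

    def __iter__(self):
        info = torch.utils.data.get_worker_info()
        # Each DataLoader worker reads a disjoint subset of shards, so no sample
        # is duplicated across workers.
        shards = (
            self.shard_paths
            if info is None
            else self.shard_paths[info.id::info.num_workers]
        )
        for path in shards:
            with open(path) as f:
                for line in f:
                    yield line.rstrip("\n")


if __name__ == "__main__":
    # Write tiny placeholder shards so the sketch runs end to end.
    paths = []
    for i in range(8):
        path = f"shard_{i}.txt"
        with open(path, "w") as f:
            f.write(f"example document from shard {i}\n" * 4)
        paths.append(path)

    loader = DataLoader(
        ShardedTextDataset(paths),
        batch_size=4,
        num_workers=4,      # parallel reader processes pull I/O off the critical path
        prefetch_factor=2,  # batches each worker keeps ready ahead of the training step
    )
    for batch in loader:
        print(batch)  # a real training step would consume the batch here
```

The same split-by-worker pattern carries over to the object stores mentioned above (e.g., S3 or GCS) by replacing the local file reads with streaming reads.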
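Similarly, the checkpointing bullets state the goals (low I/O overhead, fault tolerance) without prescribing a method. One common pattern, sketched below under the same PyTorch assumption, is asynchronous checkpointing: the training thread pays only for a quick copy of model state into CPU memory, and a background thread performs the slow write. The async_checkpoint helper, the toy nn.Linear model, and the output path are illustrative, not anything specified in the posting.

```python
# Minimal sketch, assuming PyTorch; the helper name, toy model, and checkpoint path
# are hypothetical. The training thread blocks only for the in-memory state copy,
# while the slow torch.save call runs in a background thread.
import threading

import torch
import torch.nn as nn


def async_checkpoint(model: nn.Module, step: int, path: str) -> threading.Thread:
    # Snapshot parameters into CPU memory; this is the only work on the critical path.
    cpu_state = {
        name: tensor.detach().to("cpu", copy=True)
        for name, tensor in model.state_dict().items()
    }

    def _write():
        # Disk (or object-store) I/O happens off the training loop's critical path.
        torch.save({"step": step, "model": cpu_state}, path)

    writer = threading.Thread(target=_write, daemon=True)
    writer.start()
    return writer  # join() this before exiting to guarantee the file is durable


if __name__ == "__main__":
    model = nn.Linear(16, 16)  # toy stand-in for a real model
    pending = async_checkpoint(model, step=100, path="ckpt_step100.pt")
    # ... training would continue here while the write completes ...
    pending.join()
```

In a large multi-node job, this copy-then-write split would typically be combined with the incremental or differential checkpointing mentioned above, so that only changed shards are written at each interval.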
Seniority level: Mid-Senior level
Employment type: Full-time
Industries: Software Development
Benefits inferred from the job description: Medical insurance, Vision insurance, 401(k)