AION
AION is transforming high-performance computing (HPC) through its decentralized AI cloud, the next generation of AI cloud platform. Purpose-built for bare-metal performance, AION democratizes access to compute power for AI training, fine-tuning, inference, data labeling, and beyond. By leveraging underutilized resources such as idle GPUs and data centers, AION provides a scalable, cost-effective, and sustainable solution tailored for developers, researchers, and enterprises. Led by high-pedigree founders with previous exits, AION is well funded by major VCs and has strategic global partnerships. Headquartered in the US with a global presence, the company is building its initial core team in India, London, and Seattle.
Who You Are
You're an ML systems engineer who is passionate about building high-performance training infrastructure. You don't need to be an expert in everything (this field is evolving too rapidly for that), but you have strong fundamentals and the curiosity to dive deep into optimization challenges. You thrive in early-stage environments where you'll learn cutting-edge techniques while building production systems. You think systematically about performance bottlenecks and are excited to push the boundaries of what's possible in AI infrastructure.
Responsibilities
Architect and implement distributed training solutions for customers running pre-training, fine-tuning, and RL workloads on AION infrastructure
Guide customers through large-scale training implementations including data parallelism, model parallelism, and pipeline parallelism strategies
Design and optimize multi-GPU training setups with proper gradient synchronization, communication strategies, and scaling configurations
Develop and optimize proofs of concept for accelerating customer training, including efficient data loading pipelines, gradient checkpointing, and memory optimization techniques
Create comprehensive monitoring and debugging frameworks for distributed training jobs with performance tracking and bottleneck resolution
Conduct technical workshops and training sessions on distributed training, reasoning techniques, and post-training optimization methodologies
Support customers with advanced fine-tuning workflows including reward model training, constitutional AI, and alignment techniques
Troubleshoot and resolve customer training bottlenecks including scaling inefficiencies and optimization challenges
Collaborate with tech and product teams to translate customer needs into platform improvements and feature requirements
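At the heart of the gradient synchronization and data-parallelism work listed above is one operation: averaging gradients across replicas (an all-reduce). A minimal, framework-free sketch of the idea, where every name is illustrative and the "replicas" are just lists rather than GPUs:

```python
# Conceptual sketch of data-parallel gradient synchronization.
# Each "replica" computes gradients on its own data shard; an
# all-reduce then averages them so every replica applies the same
# update. All names here are hypothetical, not a real framework API.

def local_gradients(shard):
    # Stand-in for a backward pass over one data shard: gradient of
    # a squared-error loss d/dw (w*x - y)^2 evaluated at w = 0.
    return [2 * x * (0 * x - y) for x, y in shard]

def all_reduce_mean(per_replica_grads):
    # Element-wise average across replicas, as an averaging
    # all-reduce would do across GPUs in DDP.
    n = len(per_replica_grads)
    return [sum(g) / n for g in zip(*per_replica_grads)]

shards = [
    [(1.0, 2.0), (2.0, 4.0)],  # replica 0's data shard
    [(3.0, 6.0), (4.0, 8.0)],  # replica 1's data shard
]
grads = [local_gradients(s) for s in shards]
synced = all_reduce_mean(grads)  # identical gradients on every replica
```

In production this averaging is what PyTorch DDP performs (via NCCL) during the backward pass, overlapping communication with computation; the sketch only shows the math being agreed upon.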
Skills & Experience
High-agency individual looking to own customer success and influence training platform architecture
4+ years of ML engineering experience with focus on training large-scale models and distributed systems
Expert-level PyTorch experience including distributed training, DDP implementation, and multi-GPU optimization
Production experience with distributed training techniques including data parallelism, model parallelism, pipeline parallelism
Strong understanding of gradient synchronization and communication strategies for multi-node training
Hands-on experience with large dataset handling and efficient data loading at scale
Proficiency in training infrastructure tools such as Megatron-LM, DeepSpeed, FairScale, or similar frameworks
Excellent communication and teaching skills with ability to explain complex technical concepts to diverse audiences
Customer-facing experience in technical consulting, solutions engineering, or developer relations roles
Experience with RLHF and fine-tuning pipelines including reward model training and post-training optimization
Understanding of reasoning techniques including Chain-of-Thought prompting and advanced reasoning workflows
Nice to have: large-scale pre-training experience (7B+ parameters); advanced reasoning implementation (Tree-of-Thought, self-consistency); DPO and Constitutional AI expertise; open-source contributions to training frameworks; conference speaking or technical evangelism experience.
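Gradient checkpointing, one of the memory optimization techniques named in this role, trades compute for memory: instead of storing every activation for the backward pass, only segment-boundary activations are kept and the rest are recomputed on demand. A toy, framework-free sketch of the trade-off (all names hypothetical; a real stack would use PyTorch's activation checkpointing utilities):

```python
# Toy gradient checkpointing: a "network" of identical layers f(x) = 2x.
# Without checkpointing we keep every activation; with checkpointing we
# keep only every k-th one and recompute the rest when needed.

def layer(x):
    return 2 * x  # stand-in for one layer's forward computation

def forward_full(x, n_layers):
    # Baseline: store all n_layers + 1 activations for backward.
    acts = [x]
    for _ in range(n_layers):
        acts.append(layer(acts[-1]))
    return acts

def forward_checkpointed(x, n_layers, k):
    # Store only activations at segment boundaries (every k layers).
    ckpts = {0: x}
    h = x
    for i in range(1, n_layers + 1):
        h = layer(h)
        if i % k == 0:
            ckpts[i] = h
    return h, ckpts

def recompute(ckpts, i, k):
    # Recover activation i from the nearest earlier checkpoint, as the
    # backward pass would before differentiating through a segment.
    base = (i // k) * k
    h = ckpts[base]
    for _ in range(i - base):
        h = layer(h)
    return h

full = forward_full(1, 8)                    # 9 stored activations
out, ckpts = forward_checkpointed(1, 8, 4)   # 3 stored checkpoints
```

The recomputed activation matches the stored one exactly; the saving is that memory grows with the number of checkpoints rather than the number of layers, at the cost of roughly one extra forward pass per segment.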
Benefits
Join the ground floor of a mission-driven AI startup revolutionizing compute infrastructure
Work with a high-caliber, globally distributed team backed by major VCs
Competitive compensation and benefits
Fast-paced, flexible work environment with room for ownership and impact
Hybrid model: 3 days in-office, 2 days remote with flexibility to work remotely for part of the year
If you have any questions about the role, please reach out to the hiring manager on LinkedIn or X.