Jobgether
We are currently looking for a Machine Learning Systems Engineer in the European Union.
We are seeking a talented Machine Learning Systems Engineer to join a remote-first, globally distributed team working on cutting-edge AI infrastructure. In this role, you will contribute to the development of large-scale language model systems, focusing on high-performance training, inference, and self-improving AI agents. You will work at the intersection of machine learning research, distributed systems, and high-performance computing, building tools and frameworks that enable researchers and organizations worldwide to deploy advanced AI solutions. This role offers the chance to work on technically demanding, open-source projects while collaborating with a passionate international team. Your work will have a direct impact on the future of scalable AI systems.
Accountabilities
Contribute to the development and optimization of large-scale language model frameworks
Implement high-performance distributed training algorithms using Megatron-LM, DeepSpeed, and vLLM
Develop and optimize inference engines and tools for model deployment, fine-tuning, and AI agent self-improvement
Integrate diverse machine learning ecosystems including HuggingFace and other LLM tools
Optimize performance across multi-GPU, multi-node architectures, leveraging HPC and CUDA/ROCm programming
Collaborate with the open-source community to enhance the codebase, implement features, and resolve issues
Research and implement advanced techniques for self-improving AI agents and high-efficiency ML pipelines
Requirements
3+ years of experience in machine learning engineering or research
Proficiency in Python and C/C++, with strong systems programming skills
Deep understanding of high-performance computing concepts, including MPI, BSP, and distributed multi-GPU training
Solid experience with transformer architectures, gradient descent, backpropagation, and deep learning training
Familiarity with distributed training strategies: data parallelism, model parallelism, pipeline parallelism
Experience with containerization (Docker, Kubernetes) and cluster orchestration
Demonstrated experience with ML frameworks like vLLM, Megatron-LM, HuggingFace, or similar
Commitment to open-source development and community collaboration
Excellent problem-solving, debugging, and performance optimization skills
Bonus: Advanced degrees (MS/PhD), experience with SLURM, mixed-precision training, MLOps, or prior contributions to major open-source ML projects
Benefits
Competitive compensation including salary and equity participation
Fully remote, work-from-anywhere flexibility
Comprehensive global benefits including mental health support
Open PTO policy and flexible working hours
Paid parental leave and support for personal well-being
Opportunities for continuous learning and professional development
Regular team offsites, virtual events, and global gatherings to foster team collaboration
Inclusive, transparent, and supportive culture prioritizing growth and knowledge-sharing
The application process is transparent, skills-based, and unbiased, focusing solely on your fit for the role. Thank you for your interest!