Jobgether

Machine Learning Systems Engineer (Remote - EU)

Jobgether, Germantown, Ohio, United States

We are currently looking for a Machine Learning Systems Engineer in the European Union.

We are seeking a talented Machine Learning Systems Engineer to join a remote-first, globally distributed team working on cutting-edge AI infrastructure. In this role, you will contribute to the development of large-scale language model systems, focusing on high-performance training, inference, and self-improving AI agents. You will work at the intersection of machine learning research, distributed systems, and high-performance computing, building tools and frameworks that enable researchers and organizations worldwide to deploy advanced AI solutions. This role offers the chance to work on technically demanding, open-source projects while collaborating with a passionate international team. Your work will have a direct impact on the future of scalable AI systems.

Accountabilities

Contribute to the development and optimization of large-scale language model frameworks

Implement high-performance distributed training and inference using Megatron-LM, DeepSpeed, and vLLM

Develop and optimize inference engines and tools for model deployment, fine-tuning, and AI agent self-improvement

Integrate diverse machine learning ecosystems including HuggingFace and other LLM tools

Optimize performance across multi-GPU, multi-node architectures, leveraging HPC and CUDA/ROCm programming

Collaborate with the open-source community to enhance the codebase, implement features, and resolve issues

Research and implement advanced techniques for self-improving AI agents and high-efficiency ML pipelines

Requirements

3+ years of experience in machine learning engineering or research

Proficiency in Python and C/C++, with strong systems programming skills

Deep understanding of high-performance computing concepts, including MPI, BSP, and distributed multi-GPU training

Solid experience with transformer architectures, gradient descent, backpropagation, and deep learning training

Familiarity with distributed training strategies: data parallelism, model parallelism, pipeline parallelism

Experience with containerization (Docker, Kubernetes) and cluster orchestration

Demonstrated experience with ML frameworks like vLLM, Megatron-LM, HuggingFace, or similar

Commitment to open-source development and community collaboration

Excellent problem-solving, debugging, and performance optimization skills

Bonus: an advanced degree (MS/PhD); experience with SLURM, mixed-precision training, or MLOps; or prior contributions to major open-source ML projects

Benefits

Competitive compensation including salary and equity participation

Fully remote, work-from-anywhere flexibility

Comprehensive global benefits including mental health support

Open PTO policy and flexible working hours

Paid parental leave and support for personal well-being

Opportunities for continuous learning and professional development

Regular team offsites, virtual events, and global gatherings to foster team collaboration

Inclusive, transparent, and supportive culture prioritizing growth and knowledge-sharing

The application process is transparent, skills-based, and unbiased, focusing solely on your fit for the role. Thank you for your interest!
