Amazon

Senior ML Infrastructure Engineer - Distributed Training

Amazon, Seattle, Washington, US, 98127


At Annapurna Labs, we accelerate innovation through cutting-edge silicon and software design. As part of AWS Neuron, the complete software stack for the Trainium and Inferentia ML accelerators, you will play a crucial role in the development, enablement, and performance tuning of a wide range of ML model families, including large-scale models such as GPT and Llama, Stable Diffusion, and Vision Transformers (ViT).

As a Senior Machine Learning Engineer on the Distributed Training team for AWS Neuron, you will:
- Enhance distributed training capabilities in popular machine learning frameworks such as PyTorch and JAX, utilizing AWS's specialized AI hardware.
- Collaborate with our compiler and runtime teams to optimize ML models for efficient operation on AWS's custom AI chips.
- Build strong foundations in distributed systems while bridging the gap between ML frameworks and hardware acceleration.

This role is ideal for engineers with strong programming skills, enthusiasm for learning complex systems, and a working understanding of machine learning concepts. It offers excellent growth prospects in the rapidly evolving field of ML infrastructure.

About Annapurna Labs:
Acquired by AWS in 2015, we work seamlessly across disciplines including silicon engineering, hardware design, software, and operations. Our recent products include AWS Nitro, ENA, EFA, Graviton, and the Trainium and Inferentia ML accelerators.

Basic Qualifications:
- Bachelor's or Master's degree completed between December 2022 and September 2025.
- Proficiency in C++ and Python.
- Experience with ML frameworks, especially PyTorch and/or JAX.
- Familiarity with parallel computing concepts and CUDA programming.

Preferred Qualifications:
- Contributions to open-source ML frameworks or tools.
- Experience optimizing ML workloads for performance.
- Direct knowledge of PyTorch internals or CUDA optimization.
- Hands-on experience with LLM infrastructure tools such as vLLM and TensorRT.

We are committed to an inclusive culture that empowers our team members to deliver the best results for our customers. We encourage all qualified applicants, including those with disabilities, to apply.

Compensation reflects the competitive labor market across the US, with base salaries ranging from $99,500 to $200,000 depending on factors including experience and location. This position is based in Los Angeles County; job duties include effective communication with staff and adherence to company policies that ensure a professional work environment. We look forward to your application!