Amazon
Software Engineer- AI/ML, AWS Neuron Distributed Training
Amazon, Seattle, Washington, us, 98127
DESCRIPTION

Annapurna Labs designs silicon and software that accelerates innovation. Customers choose us to create cloud solutions that solve challenges that were unimaginable a short time ago, even yesterday. Our custom chips, accelerators, and software stacks enable us to take on technical challenges that have never been seen before, and deliver results that help our customers change the world.

AWS Neuron is the complete software stack for the AWS Trainium (Trn1/Trn2) and Inferentia (Inf1/Inf2) cloud-scale Machine Learning accelerators. This role is for a Senior Machine Learning Engineer in the Distributed Training team for AWS Neuron, responsible for development, enablement, and performance tuning of various ML model families, including large-scale Large Language Models (LLMs) like GPT and Llama, as well as Stable Diffusion, Vision Transformers (ViT), and more.

The ML Distributed Training team collaborates with chip architects, compiler engineers, and runtime engineers to create, build, and optimize distributed training solutions with Trainium instances. Experience with training large models using Python is essential. Distributed training libraries such as FSDP (Fully-Sharded Data Parallel), DeepSpeed, NeMo, and others are central to the work, and extending them for Neuron-based systems is key.

Key job responsibilities

You will lead efforts to build distributed training support into PyTorch and JAX using XLA, the Neuron compiler, and runtime stacks. You will optimize models for peak performance and efficiency on AWS custom silicon, including Trainium and Inferentia, as well as Trn2, Trn1, Inf1, and Inf2 servers. Strong software development skills, the ability to deep dive and collaborate effectively within cross-functional teams, and a solid foundation in Machine Learning are critical for success.

About The Team

Annapurna Labs, acquired by AWS in 2015, is now fully integrated. We provide the infrastructure of AWS, covering silicon engineering, hardware design, software, and operations. Our products include AWS Nitro, ENA, EFA, Graviton, F1 EC2 Instances, AWS Neuron, Inferentia and Trainium ML Accelerators, and scalable NVMe storage. Our team fosters knowledge-sharing, with experienced members providing mentorship and code reviews. We prioritize career growth and assign projects that develop engineering expertise, empowering team members to handle more complex tasks.

Diversified Experiences

We value diverse experiences. Even if you do not meet all qualifications, we encourage you to apply. Whether you're starting your career, have an unconventional background, or have alternative experiences, don't hesitate to apply.

About AWS

AWS is the world's most comprehensive cloud platform, trusted by startups to Fortune 500 companies. We innovate continuously to support our customers' success.

Inclusive Team Culture

We promote a culture of learning, curiosity, and inclusion through employee-led affinity groups and events that celebrate diversity.

Work/Life Balance

We value work-life harmony and offer flexibility to support your success at work and home.

Mentorship & Career Growth

We provide resources for professional development, mentorship, and career advancement to help you grow as a professional.

Basic Qualifications

3+ years of professional software development experience
2+ years of experience in system design or architecture
Proficiency in at least one programming language

Preferred Qualifications

3+ years of full software development lifecycle experience
Bachelor's degree in computer science or equivalent

Amazon is an equal opportunity employer. We consider qualified applicants with arrest and conviction records as per the Los Angeles County Fair Chance Ordinance.