Amazon Web Services (AWS)

Software Engineer- AI/ML, AWS Neuron Distributed Training

Amazon Web Services (AWS), Seattle

DESCRIPTION

Annapurna Labs designs silicon and software that accelerates innovation. Customers choose us to create cloud solutions that solve challenges that were unimaginable a short time ago—even yesterday. Our custom chips, accelerators, and software stacks enable us to take on technical challenges that have never been seen before, and deliver results that help our customers change the world.

AWS Neuron is the complete software stack for the AWS Trainium (Trn1/Trn2) and Inferentia (Inf1/Inf2) cloud-scale Machine Learning accelerators. This role is for a Senior Machine Learning Engineer in the Distributed Training team for AWS Neuron, responsible for development, enablement, and performance tuning of various ML model families, including Large Language Models (LLMs) like GPT and Llama, as well as Stable Diffusion, Vision Transformers (ViT), and more.

The ML Distributed Training team collaborates with chip architects, compiler engineers, and runtime engineers to create, build, and optimize distributed training solutions with Trainium instances. Experience with training large models using Python is essential. Distributed training libraries such as FSDP (Fully Sharded Data Parallel), DeepSpeed, and NeMo are central to the work, and extending them for Neuron-based systems is a key part of the role.

Key job responsibilities

You will lead efforts to build distributed training support into PyTorch and JAX using XLA and the Neuron compiler and runtime stacks. You will optimize models for peak performance and efficiency on AWS custom silicon (Trainium and Inferentia) and on Trn1, Trn2, Inf1, and Inf2 servers. Strong software development skills, the ability to dive deep into problems and collaborate effectively within cross-functional teams, and a solid foundation in Machine Learning are critical for success.

About The Team

Annapurna Labs, acquired by AWS in 2015, is now fully integrated into AWS. We provide the infrastructure of AWS, covering silicon engineering, hardware design, software, and operations. Our products include AWS Nitro, ENA, EFA, Graviton, F1 EC2 Instances, AWS Neuron, the Inferentia and Trainium ML accelerators, and scalable NVMe storage.

Our team fosters knowledge-sharing and mentorship, with experienced members offering guidance and code reviews. We prioritize career growth and assign projects that develop engineering expertise, empowering team members to take on more complex tasks.

Diversified Experiences

We value diverse experiences. Even if you do not meet all qualifications, we encourage you to apply. Whether you’re just starting your career, have an unconventional background, or bring alternative experiences, don’t let that stop you.

About AWS

AWS is the world’s most comprehensive cloud platform, trusted by customers ranging from startups to Fortune 500 companies. We innovate continuously to support our customers’ success.

Inclusive Team Culture

We promote a culture of learning, curiosity, and inclusion through employee-led affinity groups and events that celebrate diversity.

Work/Life Balance

We value work-life harmony and offer flexibility to support your success at work and home.

Mentorship & Career Growth

We provide resources for professional development, mentorship, and career advancement to help you grow as a professional.

Basic Qualifications

3+ years of professional software development experience
2+ years of experience in system design or architecture
Proficiency in at least one programming language

Preferred Qualifications

3+ years of full software development lifecycle experience
Bachelor's degree in computer science or equivalent

Amazon is an equal opportunity employer. We consider qualified applicants with arrest and conviction records as per the Los Angeles County Fair Chance Ordinance.
