Amazon

Sr. Software Engineer- AI/ML, AWS Neuron Distributed Training

Amazon, Seattle, Washington, US, 98127

AWS Utility Computing (UC) provides product innovations, from foundational services such as Amazon S3 and EC2 to new product features that distinguish AWS in the industry. As part of the UC organization, you will support the development and management of Compute, Database, Storage, IoT, Platform, and Productivity services, including customer security solutions.

Annapurna Labs, within AWS UC, designs silicon and software to accelerate innovation, creating custom chips and accelerators that make it possible to tackle unprecedented technical challenges. AWS Neuron is the software stack for AWS Inferentia and Trainium accelerators, supporting large-scale ML models like GPT-3, Stable Diffusion, and Vision Transformers.

This role is for a senior software engineer on the ML Applications team, focusing on development, enablement, and performance tuning of these models. The ML Apps team collaborates with chip architects and with compiler and runtime engineers to develop distributed training solutions with Trn1. Experience in training large models using Python and familiarity with libraries like FSDP and DeepSpeed are essential.

Key responsibilities

Lead efforts to integrate distributed training and inference support into frameworks like PyTorch, TensorFlow, and JAX using the XLA, Neuron compiler, and runtime stacks.
Optimize models for performance and efficiency on AWS Trainium and Inferentia silicon and on Trn1 and Inf1 servers.

Qualifications

Basic

5+ years of professional software development experience
Proficiency in at least one programming language
Experience in system design, reliability, and scaling
Full software development lifecycle experience
Leadership or mentorship experience

Preferred

Bachelor’s degree in CS or equivalent
Knowledge of machine learning frameworks and end-to-end training
