Amazon

Software Engineer - AI/ML Distributed Training

Amazon, Cupertino, California, United States, 95014

Are you passionate about solving complex challenges and developing software that transforms how businesses operate? Join the Annapurna Labs team at Amazon Web Services (AWS) as a Software Development Engineer II. In this role, you will design, deliver, and maintain sophisticated software solutions that serve millions of customers worldwide.

As part of our innovative organization that encompasses various disciplines, you'll work on AWS Neuron, the software stack for cloud-scale machine learning accelerators such as AWS Inferentia and Trainium. Your contributions will focus on the development, enablement, and optimization of diverse ML model families, including large language models like GPT-2 and GPT-3, Stable Diffusion models, and Vision Transformers.

The Machine Learning Applications (ML Apps) team is seeking a talented individual to enhance distributed training capabilities in popular frameworks like PyTorch and TensorFlow, utilizing XLA and the Neuron compiler. You will tune models to achieve the highest performance and efficiency running on AWS Trainium and Inferentia silicon.

Key Responsibilities:

Lead efforts to integrate distributed training support into leading ML frameworks.
Tune models to maximize performance on Trainium and Inferentia servers.
Collaborate with chip architects, compiler engineers, and runtime engineers to create optimal distributed training solutions.
Utilize libraries like FSDP and DeepSpeed for large model training.

About Us:

We embrace an inclusive culture at AWS, celebrating diversity among our team members. We offer innovative benefits, flexible working hours, and a focus on achieving work-life balance. Our team values mentorship and career growth, providing opportunities for knowledge sharing and professional development.

Basic Qualifications:

3+ years of professional software development experience.
3+ years of experience in system design or architecture.
Proficiency in at least one programming language.
Experience in the Deep Learning industry.

Preferred Qualifications:

3+ years of experience in the full software development lifecycle.
Bachelor's degree in computer science or a related field.
Expertise with PyTorch/JAX/TensorFlow and distributed training frameworks.

The salary for this position ranges from $129,300/year to $223,600/year, depending on location and job-related experience. We offer a comprehensive compensation package, including medical, financial, and other benefits.

This position will remain posted until filled. Applicants should apply through our internal or external career sites. We are an equal opportunity employer and do not discriminate based on protected status.