Join Our Innovative Team!
As a Senior Software Engineer focusing on AI/ML and Distributed Training, you will contribute to the groundbreaking work at Annapurna Labs. We design silicon and software that drives forward-thinking cloud solutions, enabling our customers to tackle challenges previously thought impossible.
Your role will specifically involve working with the AWS Neuron software stack for Trainium and Inferentia, our powerful machine learning accelerators. You will be at the forefront of developing, enabling, and optimizing the performance of a broad range of ML model families, including Large Language Models (LLMs) such as GPT and Llama, as well as Stable Diffusion and Vision Transformers (ViT).
Key Responsibilities:
Lead efforts to build distributed training support for PyTorch and JAX, leveraging XLA, the Neuron compiler, and the Neuron runtime stack.
Optimize models to achieve peak performance and efficiency on AWS custom silicon, including Trainium and Inferentia.
Collaborate closely with cross-functional teams comprising chip architects, compiler engineers, and runtime engineers.
Apply your expertise in distributed training libraries and frameworks such as FSDP (Fully Sharded Data Parallel), DeepSpeed, and NeMo.
About Our Team:
Annapurna Labs, acquired by AWS in 2015, is integral to AWS's infrastructure foundation. Our diverse team spans silicon engineering, hardware design, software, and operations. We are the driving force behind innovations such as AWS Nitro, ENA, EFA, Graviton EC2 Instances, and, of course, AWS Neuron.
Diversity is Important to Us:
We value varied experiences. If you feel your background is unconventional or if you don't meet every qualification, we encourage you to apply! Regardless of your path, we believe in the potential of candidates from all walks of life.
Basic Qualifications:
Bachelor's degree in computer science or related field.
5+ years of non-internship professional software development experience.
5+ years of programming experience in at least one language.
5+ years of experience leading the design or architecture of new and existing systems.
5+ years of full software development life cycle experience.
Experience mentoring or leading engineering teams and projects.
Knowledge of machine learning, data mining, information retrieval, statistics, or NLP.
Preferred Qualifications:
Master's degree in computer science or equivalent.
Experience in computer architecture.
Previous software engineering experience with PyTorch, JAX, or TensorFlow and with distributed training libraries.
Proficiency in end-to-end model training processes.
We are an equal opportunity employer and encourage diverse candidates to apply. Let’s innovate together!