
Software Development Engineer - AI/ML, AWS Neuron, Multimodal Inference

Amazon Jobs, Seattle, Washington, US, 98127


Overview

The Annapurna Labs team at Amazon Web Services (AWS) builds AWS Neuron, the software development kit used to accelerate deep learning and GenAI workloads on Amazon’s custom machine learning accelerators, Inferentia and Trainium. The Neuron SDK includes an ML compiler, runtime, and application framework that integrate with popular ML frameworks such as PyTorch and JAX to deliver high-performance inference and training.

The Inference Enablement and Acceleration team works across the stack, from frameworks to hardware, building systematic infrastructure, novel methods, and high-performance kernels for ML functions that fine-tune compute units for demanding workloads. We combine deep hardware knowledge with ML expertise to push the boundaries of AI acceleration. As part of the broader Neuron organization, we collaborate across multiple technology layers (frameworks, kernels, compiler, runtime, and collectives), optimize current performance, and contribute to future architecture designs while enabling customer models to run at their best.

This role offers an opportunity to work at the intersection of machine learning, high-performance computing, and distributed architectures, and to help shape the future of AI acceleration technology. You will architect and implement business-critical features and mentor a team of engineers. We work in small, agile teams with no predefined blueprint, inventing and experimenting. The team collaborates with customers on model enablement and provides optimization expertise to ensure workloads perform optimally on AWS ML accelerators. We also collaborate with open source ecosystems to deliver seamless integration and peak performance at scale for customers and developers.

This role is responsible for development, enablement, and performance tuning of a wide variety of LLM model families, including large models such as the Llama family, DeepSeek, and beyond. The team works with compiler and runtime engineers to create distributed inference solutions on Trainium and Inferentia. Experience optimizing inference performance for latency and throughput across the stack, from system-level optimizations to PyTorch or JAX, is required.

You can learn more about Neuron via the following resources:
Neuron Guide: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuron-cc/index.html
AWS Neuron page: https://aws.amazon.com/machine-learning/neuron/
Neuron SDK on GitHub: https://github.com/aws/aws-neuron-sdk
Amazon Science article on silicon innovation: https://www.amazon.science/how-silicon-innovation-became-the-secret-sauce-behind-awss-success
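To make the workflow concrete, here is a minimal sketch (not taken from the posting) of how a PyTorch model is typically compiled for Inferentia or Trainium using the Neuron SDK's torch-neuronx package. The toy model, shapes, and file name are illustrative assumptions, and usage of torch_neuronx.trace may vary between SDK releases; consult the Neuron documentation linked above for the current API.

```python
# Minimal sketch: compiling a small PyTorch model for NeuronCores with torch-neuronx.
# Assumes a Neuron-enabled instance (Inf2/Trn1) with the torch-neuronx package installed.
import torch
import torch_neuronx

class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(128, 64)

    def forward(self, x):
        return torch.relu(self.linear(x))

model = TinyModel().eval()
example = torch.rand(1, 128)

# Ahead-of-time compile the model for Neuron hardware using an example input.
neuron_model = torch_neuronx.trace(model, example)

# The compiled artifact is a TorchScript module and can be saved/reloaded normally.
torch.jit.save(neuron_model, "tiny_model_neuron.pt")
restored = torch.jit.load("tiny_model_neuron.pt")
print(restored(example).shape)
```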

Key job responsibilities

This role will help lead efforts in building distributed inference support for PyTorch in the Neuron SDK, tuning models for peak performance on AWS Trainium and Inferentia silicon and servers.

- Demonstrate strong software development skills in Python, system-level programming, and ML knowledge; collaborate with compiler, runtime, framework, and hardware teams to optimize workloads.
- Bring expertise in low-level optimization, system architecture, and ML model acceleration; work at the intersection of software, hardware, and ML systems.
- Design, develop, and optimize machine learning models and frameworks for deployment on custom ML hardware accelerators.
- Participate in all stages of the ML system development lifecycle, including distributed computing architecture design, performance profiling, hardware-specific optimizations, testing, and production deployment.
- Build infrastructure to analyze and onboard multiple models with diverse architectures.
- Design and implement high-performance kernels and features for ML operations using Neuron architecture and programming models.
- Analyze and optimize system-level performance across multiple generations of Neuron hardware.
- Conduct detailed performance analysis with profiling tools to identify and resolve bottlenecks.
- Implement optimizations such as fusion, sharding, tiling, and scheduling (a simple tiling sketch follows this list).
- Conduct comprehensive testing, including unit and end-to-end model testing with CI/CD pipelines.
- Work directly with customers to enable and optimize ML models on AWS accelerators.
- Collaborate across teams to develop innovative optimization techniques.
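As an illustration only (not from the posting), the snippet below shows the idea behind one of the optimizations named above, tiling: the reduction dimension of a matrix multiply is processed in fixed-size chunks so each partial product stays small enough for fast on-chip memory. Production Neuron kernels would express this against NeuronCore memory hierarchies and the Neuron programming model rather than plain PyTorch; the tile size and shapes here are arbitrary assumptions.

```python
# Toy tiling example in plain PyTorch: split the K (reduction) dimension of a
# matmul into fixed-size tiles and accumulate partial products.
import torch

def tiled_matmul(a: torch.Tensor, b: torch.Tensor, tile: int = 128) -> torch.Tensor:
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    out = torch.zeros(m, n, dtype=a.dtype)
    for start in range(0, k, tile):
        stop = min(start + tile, k)
        # Each iteration multiplies a (m x tile) slice by a (tile x n) slice.
        out += a[:, start:stop] @ b[start:stop, :]
    return out

a = torch.rand(256, 512)
b = torch.rand(512, 64)
# The tiled result matches the untiled matmul up to floating-point accumulation order.
assert torch.allclose(tiled_matmul(a, b), a @ b, atol=1e-4)
```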

A day in the life

You will collaborate with a cross-functional team of applied scientists, system engineers, and product managers to deliver state-of-the-art inference capabilities for Generative AI applications. Your work will involve debugging performance issues, optimizing memory usage, and shaping Neuron’s inference stack. You’ll design and code solutions to drive software architecture efficiencies, create metrics, implement automation, and resolve root causes of defects. You will also build high-impact solutions for a large customer base and participate in design discussions, code reviews, and stakeholder communications. You will work cross-functionally to inform business decisions with your technical input in a startup-like development environment.

About the team

The Inference Enablement and Acceleration team fosters a builder’s culture with experimentation and measurable impact. We emphasize collaboration, technical ownership, and continuous learning. We support new members through mentoring and thorough but constructive code reviews. We aim to assign projects that help you develop engineering expertise and empower you to take on more complex tasks in the future. Join us to solve AI/ML infrastructure challenges today.

Qualifications

- 3+ years of non-internship professional software development experience
- Bachelor’s degree in computer science or equivalent
- 3+ years of non-internship design or architecture experience (design patterns, reliability, scaling)
- Fundamentals of machine learning and LLMs, including their architectures and training and inference lifecycles, with experience optimizing model execution
- Software development experience in C++ or Python (at least one language required)
- Strong understanding of system performance, memory management, and parallel computing principles
- Proficiency in debugging, profiling, and applying best software engineering practices in large-scale systems
- Familiarity with PyTorch, JIT compilation, and AOT tracing
- Familiarity with CUDA kernels or equivalent ML or low-level kernels
- Experience with performant kernel development (e.g., CUTLASS, FlashInfer) is a plus
- Familiarity with Triton-like syntax and tile-level semantics
- Experience with online/offline inference serving using vLLM, SGLang, TensorRT, or similar platforms in production
- Deep understanding of computer architecture, operating-system-level software, and parallel computing

Amazon is an equal opportunity employer and does not discriminate on the basis of protected veteran status, disability, or other legally protected status. Our inclusive culture empowers Amazonians to deliver the best results for our customers. If you have a disability and need a workplace accommodation or adjustment during the application and hiring process, please visit the accommodations page for more information. If the country/region you’re applying in isn’t listed, please contact your Recruiting Partner.

Our compensation reflects the cost of labor across several US geographic markets. The base pay for this position ranges from $129,300/year to $223,600/year depending on location and experience. Amazon is a total compensation company; equity, sign-on payments, and other benefits may be provided in addition to a full range of benefits. For more information, please visit the workplace benefits page. This position will remain posted until filled. Applicants should apply via our internal or external career site.
