Amazon Jobs
Senior Software Development Engineer - AI/ML, AWS Neuron, Multimodal Inference
Amazon Jobs, Seattle, Washington, US, 98127
Overview
The Annapurna Labs team at Amazon Web Services (AWS) builds AWS Neuron, the software development kit used to accelerate deep learning and GenAI workloads on Amazon's custom ML accelerators, Inferentia and Trainium. The AWS Neuron SDK includes an ML compiler, runtime, and application framework that integrates with popular ML frameworks like PyTorch and JAX to deliver high-performance ML inference and training.

The Inference Enablement and Acceleration team focuses on running a wide range of models and optimizing their performance on AWS custom ML accelerators. Working across the stack from frameworks to hardware, the team builds infrastructure, develops new methods, and creates high-performance kernels for ML functions to fine-tune compute for demanding workloads. We combine hardware knowledge with ML expertise to push the boundaries of AI acceleration. As part of the Neuron organization, the team collaborates across technology layers, from frameworks and kernels to the compiler, runtime, and collectives, while optimizing for current performance and contributing to future architecture designs.

You will work at the intersection of machine learning, high-performance computing, and distributed architectures to help shape the future of AI acceleration technology. You will architect and implement business-critical features and mentor a team of experienced engineers. We operate in a fast, startup-like environment with a culture of experimentation and learning. The team works closely with customers on model enablement and optimization for AWS accelerators, and collaborates with open source ecosystems to deliver peak performance at scale for customers and developers.

Responsibilities
Lead efforts in building distributed inference support for PyTorch in the Neuron SDK; tune models to maximize performance on AWS Trainium and Inferentia silicon and servers.
Design, develop, and optimize ML models and frameworks for deployment on custom ML hardware accelerators.
Participate in the ML system development lifecycle, including architecture design, distributed computing, performance profiling, hardware-specific optimizations, testing, and production deployment.
Build infrastructure to analyze and onboard multiple models with diverse architectures.
Design and implement high-performance kernels and features for ML operations, leveraging Neuron hardware and programming models.
Analyze and optimize system-level performance across multiple Neuron generations; conduct detailed profiling to identify bottlenecks and apply optimizations such as fusion, sharding, tiling, and scheduling.
Conduct comprehensive testing, including unit and end-to-end model tests with CI/CD pipelines; work with customers to enable and optimize ML models on AWS accelerators.
Collaborate across teams to develop innovative optimization techniques.

Qualifications
5+ years of non-internship professional software development experience; bachelor's degree in computer science or equivalent.
5+ years of experience in the design or architecture of large-scale systems; strong fundamentals in ML and LLMs, training and inference lifecycles, and model optimization.
Software development experience in C++ and Python, or in other languages combined with strong Python experience.
Strong understanding of system performance, memory management, and parallel computing; proficiency in debugging, profiling, and applying software engineering best practices to large-scale systems.
Familiarity with PyTorch, JIT compilation, and AOT tracing; familiarity with CUDA kernels or equivalent ML kernels; experience developing performant kernels (e.g., with CUTLASS or FlashInfer) is a plus.
Experience with Triton-like syntax and tile-level programming semantics; experience with online/offline inference serving (e.g., vLLM, TensorRT) in production environments.
Deep understanding of computer architecture, operating system concepts, and parallel computing.
Strong collaboration skills across compiler, runtime, framework, and hardware teams.

Amazon is an equal opportunity employer and does not discriminate on the basis of protected veteran status, disability, or other legally protected status. If you require a workplace accommodation during the application or hiring process, please visit the Amazon accommodations page for more information. If the country/region you're applying in isn't listed, please contact your Recruiting Partner.

Our compensation reflects the cost of labor across several US geographic markets. The base pay for this position ranges from $151,300/year to $261,500/year, depending on location, knowledge, skills, and experience. Amazon is a total compensation company; equity, sign-on payments, and other benefits may be offered as part of the package. For more information, please visit the Amazon employee benefits page.