Amazon Development Center U.S., Inc.

Senior Software Engineer - EC2 Instance Networking Solutions

Amazon Development Center U.S., Inc., Sunnyvale, California, United States, 94087

Join our innovative team working on the critical networking backbone that powers the largest AI training clusters globally. We're focused on developing high-performance RDMA and RoCE solutions that facilitate the distributed training of trillion-parameter models across numerous compute nodes within AWS infrastructure. As a senior engineer, you will play a pivotal role in shaping technical architecture decisions and lead the development of next-generation distributed AI training infrastructure. Your expertise will contribute to creating the software that connects massive AI accelerator clusters, focusing on SmartNIC integration, collective communication optimization, and ultra-high-bandwidth inter-rack connectivity.

Key Responsibilities:
Design and develop high-performance networking software solutions utilizing RDMA and RoCE technologies for extensive AI clusters.
Architect SmartNIC integration strategies with EC2 control plane systems and define API specifications.
Enhance collective communication patterns and multi-rack networking protocols for distributed AI training.
Establish comprehensive performance monitoring, metrics collection, and benchmarking infrastructure.
Create automated testing frameworks and stress testing methodologies for large-scale distributed systems.
Lead complex system-level debugging across hardware acceleration, kernel networking, and distributed applications.
Define technical architecture and strategy for cutting-edge scale-out AI cluster networking.
Provide technical leadership and mentorship to engineering teams.
Collaborate across functions with hardware, cloud infrastructure, and AI platform teams.
Facilitate technical design reviews and advocate engineering best practices.

About Our Team:
Utility Computing (UC) at AWS is dedicated to pioneering product innovations that set our services apart in the cloud computing landscape.
You'll be involved in the development and management of critical services such as EC2 and S3, supporting customers with specialized security needs. We pride ourselves on fostering an inclusive environment that promotes knowledge-sharing and mentorship. Enjoy one-on-one mentoring, constructive code reviews, and opportunities for career advancement that empower team members to tackle progressively complex challenges.

Diverse Experiences:
We encourage applications from candidates with diverse backgrounds. Even if your career path is unconventional or you do not meet every listed qualification, we encourage you to apply.

About AWS:
Amazon Web Services is the world's leading cloud platform, trusted by customers from startups to Global 500 companies for its robust suite of products and services.

Inclusive Team Culture:
At AWS, we commit to a culture that embraces diversity and learning. We host employee-led affinity groups and ongoing events that celebrate our uniqueness.

Work/Life Balance:
We value work-life harmony and provide flexibility to ensure success in and out of the workplace.

Mentorship & Career Growth:
Our goal is to continuously elevate performance. You will find abundant mentorship and resources to help you grow into a well-rounded professional.

Basic Qualifications:
Experience mentoring or leading engineering teams.
5+ years of professional software development experience.
5+ years programming experience in at least one language.
5+ years leading design or architecture of new and existing systems.
5+ years programming in C/C++ focusing on high-performance distributed systems.
5+ years leading design of large-scale networked systems.
Deep expertise in RDMA technologies and RoCE implementations.
Extensive experience with collective communication libraries such as NCCL, RCCL, OneCCL, and MPI.
Experience as a technical lead on complex infrastructure projects.

Preferred Qualifications:
5+ years of experience across the full software development life cycle.
Expert-level experience with SmartNIC programming and network acceleration hardware APIs.
In-depth knowledge of AI training infrastructure and cluster networking.
Proven track record of performance optimization in distributed environments.
Experience with cloud infrastructure integration and large-scale deployment.
Understanding of modern AI accelerator architectures.
Experience building systems for training large models.
Strong communication and technical leadership skills.
Master's degree in Computer Science, Computer Engineering, or a related field.
Experience with AWS cloud infrastructure is a plus.

Amazon is an equal opportunity employer and does not discriminate based on protected status. We will consider qualified applicants with arrest and conviction records. In this position, you are expected to work effectively with team members, adhere to standards of excellence, communicate respectfully, and follow all federal, state, and local laws. Pursuant to local ordinances, we will consider qualified applicants with criminal histories.

If you require accommodations during the application process, please refer to our resources for more information.

Compensation reflects labor market variations across the U.S. The salary for this role ranges from $151,300 to $261,500 annually, influenced by multiple factors including experience and job-related skills.

This position will remain open until filled. Please apply through our career site.