xAI
AI/HPC Network Development Engineer - Networking
xAI, Palo Alto, California, United States, 94306
About xAI
xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All engineers are expected to have strong communication skills and to share knowledge concisely and accurately with teammates.
About the Role
xAI was the first in the world to build a 100k GPU cluster on an Ethernet network and then did it again in 92 days, floors, walls and all. We need an engineer with deep experience in RoCEv2 to develop at hyperscale while optimizing performance and availability. xAI is building at a furious pace with the latest hardware to help people understand the universe. To make the next significant leap forward, we need to understand our network performance and availability, optimize it for training models, and improve customer inference queries. You will spend most of your days deep inside NCCL, building metric dashboards, and tweaking configurations to maximize performance. You will also help design the next iteration of our backend and front-end networks to seamlessly expand GPU infrastructure with minimal engineering assistance. This role involves significant travel to Memphis for capacity building, participating in a team on-call rotation, and supporting scaling and maintenance efforts. As the team grows, engineers will contribute to deployment and operations frameworks to automate repetitive tasks.
Location
We have two openings: one in Palo Alto, California, and another in Dublin, Ireland. Expect significant travel to Memphis, Tennessee, for data center buildouts, and to Palo Alto for team collaboration.
Ideal Experiences
At least 10 years designing and operating large-scale networks, with 5+ years in Ethernet AI/HPC environments.
Deep understanding of congestion control on Ethernet; InfiniBand experience is a plus.
Knowledge of AI training and inference workloads and network operation; ability to debug NCCL and contribute to the library.
Proficiency in creating performance and operational metrics to optimize fleet performance.
Experience with Python for automation and data analysis.
Interview Process
After you apply, your CV and work statement will be reviewed. Successful candidates will be invited to an initial 45-minute to 1-hour interview with basic questions. If you pass, the process continues with five interviews, including:
Coding assessment in your preferred language.
Discussion on data center network technologies and RoCEv2.
Presentation of a body of work to the team.
Our total rewards include salary, equity, medical, vision, dental, 401(k), disability, life insurance, and other perks.