ByteDance
Software Engineer, ML System Architecture
Overview
Join a team that develops and maintains massively distributed ML training and inference systems for LLM/AIGC/AGI, with a focus on high performance, reliability, and scalability. You will work on a large-scale heterogeneous system integrating GPUs, NPUs, RDMA, and storage, contribute to system design and optimization, and collaborate with a global team across the United States, China, and Singapore.
Responsibilities
Design and develop machine learning infrastructure for LLM, AIGC, and related workloads.
Build a very large-scale machine learning system integrating GPUs, RDMA networking, and high-performance storage.
Address technical challenges related to high stability and availability of the system.
Organize and coordinate multiple teams, including data center, network, computing, storage, and resource teams, to complete system construction.
Qualifications
Minimum Qualifications:
Proficient in one or two programming languages, such as C++, Go, Python, or Shell, in a Linux environment.
Understand the principles of distributed systems and have experience in design, development and maintenance of large-scale machine learning systems.
Familiar with Kubernetes architecture, with extensive experience in system-level development and tuning.
Excellent logical analysis skills, with the ability to abstract and decompose business logic.
Strong sense of responsibility, good learning ability, communication skills, and self-drive.
Preferred Qualifications:
Familiar with the ML infrastructure of large model training and inference.
Experience in AI infrastructure, HW/SW co-design, high-performance computing, or ML hardware architecture (GPUs, accelerators, networking).
About Doubao (Seed)
Founded in 2023, the ByteDance Doubao (Seed) Team is dedicated to pioneering advanced AI foundation models, with research areas spanning deep learning, reinforcement learning, language, vision, audio, AI infrastructure, and AI safety. Our team has labs and research positions across China, Singapore, and the US.
Why Join ByteDance
ByteDance focuses on inspiring creativity and building products that help people express themselves, discover, and connect. Our global, diverse teams work together with curiosity, humility, and an "Always Day 1" mindset to achieve meaningful breakthroughs for our company and users.
Diversity & Inclusion
ByteDance is committed to creating an inclusive space where employees are valued for their skills and perspectives. We celebrate diverse voices and aim to reflect the communities we reach.
Reasonable Accommodation
ByteDance provides reasonable accommodations in our recruitment processes for candidates with disabilities or for other protected reasons. If you need assistance, please reach out to us at the provided accommodation contact.
Job Information
Compensation: The base salary range for this position is stated in the posted location. Total compensation may include bonuses and equity and varies by qualifications and location. Benefits include medical, dental, vision, 401(k) with company match, paid parental leave, disability coverage, life insurance, wellbeing benefits, and paid time off. The company reserves the right to modify benefits at any time.
Job Function
Engineering and Information Technology
Industry
Technology, Information and Internet