ByteDance

Software Engineer, ML System Scheduling

ByteDance, San Jose, California, United States, 95199


Responsibilities

The Machine Learning (ML) System sub-team combines system engineering and the art of machine learning to develop and maintain massively distributed ML training and inference systems and services around the world, providing high-performance, highly reliable, scalable systems for LLM/AIGC/AGI. In our team, you'll have the opportunity to build large-scale heterogeneous systems integrating with GPU/NPU/RDMA/Storage and keep them running stably and reliably, enrich your expertise in coding, performance analysis and distributed systems, and be involved in the decision-making process. You'll also be part of a global team with members from the United States, China and Singapore working collaboratively toward a unified project direction.

- Design and development of resource scheduling, including model training, model evaluation and model inference in various scenarios (LLM/AIGC/NLP/CV/Speech, etc.)
- Optimal orchestration of computing resources (GPU, CPU, other heterogeneous hardware) to realize stable, efficient, and multi-resource utilization across clouds
- Optimal combination of computing resources, RDMA high-speed networks, and storage resources to maximize performance of large-scale distributed clusters
- Offline and online workload scheduling in global data centers integrating multi-cloud scenarios to achieve rational distributions

Qualifications

Minimum Qualifications

- Proficiency in 1 to 2 programming languages such as Go, Python or Shell in a Linux environment
- Familiarity with Kubernetes architecture and container technologies (Docker/Containerd/Kata/Podman), with rich experience in ML system practice and development
- Understanding of distributed systems principles and experience designing, developing and maintaining large-scale distributed systems
- Strong logical analysis ability, with the capacity to abstract and split business logic
- Strong sense of responsibility, good learning ability, communication skills and self-drive; able to respond quickly

Preferred Qualifications

- Familiarity with at least one major ML framework (TensorFlow/PyTorch)
- Experience in AI Infrastructure, HW/SW Co-Design, High Performance Computing, or ML Hardware Architecture (GPU, accelerators, networking)

About Doubao (Seed)

Founded in 2023, the ByteDance Doubao (Seed) Team is dedicated to pioneering advanced AI foundation models. Our goal is to lead in cutting-edge research and drive technological and societal advancements. Our research areas span deep learning, reinforcement learning, language, vision, audio, AI Infra and AI Safety. Our team has labs and research positions across China, Singapore, and the US.

Why Join ByteDance

Inspiring creativity is at the core of ByteDance's mission. Our innovative products help people authentically express themselves, discover and connect. Our global, diverse teams make that possible. We strive to create value for communities, inspire creativity and enrich life through collaboration. We aim to do great things with great people and foster an "Always Day 1" mindset to achieve meaningful breakthroughs for our company, our users, and our communities.

Diversity & Inclusion

ByteDance is committed to creating an inclusive space where employees are valued for their skills, experiences and unique perspectives. Our platform connects people globally and our workplace reflects this diversity. We celebrate diverse voices and strive to create an environment that reflects the communities we reach.

Reasonable Accommodation

ByteDance provides reasonable accommodations in our recruitment processes for candidates with disabilities, pregnancy, sincerely held religious beliefs or other protected reasons. If you need assistance, please reach out to us at the provided accommodation contact.

Job Information

Compensation

Description (Annually): The base salary range for this position in the selected city is $177,688 - $341,734 annually. Compensation may vary outside this range based on qualifications, skills, competencies, experience, and location. Base pay is part of the Total Package and may include discretionary bonuses/incentives and restricted stock units. Benefits may vary by location. Employees have day-one access to medical, dental, and vision insurance, a 401(k) plan with company match, paid parental leave, disability coverage, life insurance, wellbeing benefits, and paid time off. The company reserves the right to modify benefits programs at any time, with or without notice.