ByteDance

Software Engineer, ML System Scheduling

ByteDance, San Jose, California, United States, 95199

Overview

The Machine Learning (ML) System sub-team combines system engineering and the art of machine learning to develop and maintain massively distributed ML training and inference systems and services around the world, providing high-performance, highly reliable, scalable systems for LLM/AIGC/AGI. You will have the opportunity to build large-scale heterogeneous systems that integrate GPU/NPU/RDMA/Storage, keep them running reliably, deepen your skills in coding, performance analysis, and distributed systems, and participate in decision-making. You will be part of a global team with members from the United States, China, and Singapore working toward a unified project direction.

Responsibilities

Design and develop resource scheduling for model training, model evaluation, and model inference across various scenarios (LLM/AIGC/NLP/CV/Speech, etc.).
Orchestrate computing resources (GPU, CPU, and other heterogeneous hardware) to achieve stable, efficient, tidal, mixed, and multi-cloud resource usage.
Combine computing resources, RDMA high-speed network resources, and storage resources to maximize the power of large-scale distributed clusters.
Schedule offline and online workloads across global data centers, integrating multi-cloud scenarios to achieve rational distribution.

Qualifications

Minimum Qualifications

Proficiency in 1 to 2 programming languages such as Go, Python, or Shell in a Linux environment.
Familiarity with Kubernetes architecture and container technologies (Docker/Containerd/Kata/Podman), with rich experience in ML system practice and development.
Understanding of distributed systems principles and experience designing, developing, and maintaining large-scale distributed systems.
Excellent logical analysis ability, with the capacity to abstract and split business logic effectively.
Strong sense of responsibility, good learning ability, communication skills, and self-drive; able to respond quickly.

Preferred Qualifications

Familiarity with at least one major ML framework (TensorFlow/PyTorch).
Experience in AI Infrastructure, HW/SW Co-Design, High Performance Computing, or ML hardware architecture (GPU, accelerators, networking).

About Doubao (Seed)

Founded in 2023, the ByteDance Doubao (Seed) Team is dedicated to pioneering advanced AI foundation models. Our goal is to lead in cutting-edge research and drive technological and societal advancements. Our research areas span deep learning, reinforcement learning, language, vision, audio, AI infrastructure, and AI safety. Our team has labs and research positions across China, Singapore, and the US.

Why Join ByteDance

Inspiring creativity is at the core of ByteDance’s mission. Our innovative products help people authentically express themselves, discover, and connect. Our global, diverse teams make that possible. We strive to create value for communities, inspire creativity, and enrich life every day. As ByteDancers, we aim to do great things with great people. We lead with curiosity, humility, and a drive to make an impact in a rapidly growing tech company. By iterating and maintaining an ‘Always Day 1’ mindset, we achieve meaningful breakthroughs for our people, our company, and our users. Join us.

Diversity & Inclusion

ByteDance is committed to creating an inclusive space where employees are valued for their skills, experiences, and unique perspectives. We connect people across the globe, and so does our workplace. Our mission is to inspire creativity and enrich life, celebrating diverse voices and creating an environment that reflects the communities we reach.

Reasonable Accommodation

ByteDance provides reasonable accommodations in our recruitment processes for candidates with disabilities, pregnancy, sincerely held religious beliefs, or other reasons protected by applicable laws. If you need assistance or a reasonable accommodation, please reach out to us at https://tinyurl.com/RA-request

Job Information

For Pay Transparency: Compensation Description (Annually)

The base salary range for this position in the selected city is $177,688 - $341,734 annually. Compensation may vary outside this range based on qualifications, skills, competencies, experience, and location. Base pay is part of the Total Package and may be eligible for bonuses, incentives, and restricted stock units.

Benefits may vary by location and employment type. Day-one medical, dental, and vision insurance; 401(k) with company match; paid parental leave; disability coverage; life insurance; wellbeing benefits; and paid time off are included where applicable. The Company reserves the right to modify or change these benefits at any time, with or without notice.

For Los Angeles County (unincorporated) Candidates: Qualified applicants with arrest or conviction records will be considered for employment in accordance with federal, state, and local laws, including the Los Angeles County Fair Chance Ordinance and the California Fair Chance Act. This may affect job duties such as client interaction, handling confidential information, and exercising sound judgment.
