Essential AI
Member of Technical Staff: ML Infrastructure, Platform Engineer
Essential AI, San Francisco, California, United States, 94199
About Us
We believe that a small, focused team of motivated individuals can create outsized breakthroughs. We are building an open platform to fuel and accelerate AI breakthroughs globally. Essential AI’s technology and products have the means to shape AI advancements while supporting scalable and sustainable business models.
The Role
The ML Infra Platform Engineer will be responsible for architecting and building the compute infra that powers the training and serving of our models. This requires a full understanding of the complete backend stack, from frameworks to compilers to runtimes to kernels.
Running and training models at scale often requires solving novel system problems. As an Infra Systems Engineer, you'll be responsible for identifying these problems and then developing systems that optimize the throughput and robustness of distributed systems. Drawing on proven experience building large-scale platforms, you will build and advance the systems that allow research and engineering organizations to iteratively develop, test, and deploy new features reliably, with high velocity, and with a fast, frictionless development cycle.
What You’ll Be Working On
Design, build, and maintain scalable machine learning infrastructure to support our model training, inference, and applications.
Design and implement scalable machine learning and distributed systems that enable training and scaling of LLMs.
Work on parallelism methods that improve training in a fast and reliable way.
Help oversee and drive the vision of how we should build, test, and deploy models, while taking ownership of and transforming the state-of-the-art development experience for research.
Develop tools and frameworks to automate and streamline ML experimentation and management.
Collaborate with other researchers and product engineers to bring magical product experiences through large language models.
Work on lower levels of the stack to build high-performing and optimal training and serving infrastructure, including researching new techniques and writing custom kernels as needed to achieve improvements.
Be willing to optimize performance and efficiency across different accelerators.
What We Are Looking For
A strong understanding of the architectures of new AI accelerators such as GPU, TPU, IPU, and HPU, and their tradeoffs.
Knowledge of parallel computing concepts and distributed systems.
Experience with kernels, low-precision training, and MoE.
Prior experience in performance tuning of training and/or inference LLM workloads. Experience with MLPerf or internal production workloads will be valued.
6+ years of relevant industry experience in leading the design of large-scale, production ML infra systems.
Experience with communication libraries.
Experience with training and building large language models using frameworks such as Megatron and DeepSpeed, and deployment frameworks such as vLLM, TGI, and TensorRT-LLM.
Comfortable working under the hood with kernel languages like OpenAI Triton and Pallas, and compilers like XLA.
Experience with INT8/FP8 training and inference, quantization, and/or distillation.
Knowledge of container technologies like Docker and Kubernetes, and cloud platforms like AWS and GCP.
Intermediate fluency with network fundamentals like VPCs, subnets, routing tables, and firewalls.
We encourage you to apply for this position even if you don’t meet all of the above requirements but want to spend time pushing on these techniques.
Essential AI commits to providing a work environment free of discrimination and harassment, as well as equal employment opportunity regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity or veteran status. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements. You may view all of Essential AI’s recruiting notices here, including our EEO policy, recruitment scam notice, and recruitment agency policy.
Seniority level: Mid-Senior level
Employment type: Full-time
Job function: Engineering and Information Technology
Industries: Research Services