TrueFoundry
Staff ML Platform Engineer – Large Scale Training (LLMOps/MLOps)
TrueFoundry, San Mateo, California, United States, 94409
Posted 2 weeks ago
Build the Future of Scalable AI at TrueFoundry
At TrueFoundry, we're redefining how ML teams train, deploy, and scale their models. Our LLMOps and MLOps platform empowers organizations to experiment faster, train large-scale models reliably, and deploy them seamlessly on Kubernetes, with the same muscle as Big Tech.
We're looking for ML Systems Engineers who are passionate about scaling deep learning workloads, optimizing multi-GPU training, and shipping production-grade solutions. If you live and breathe PyTorch and multi-node training, and love solving gnarly infra challenges, this is your place.
What You'll Work On
Write clean, modular, and scalable Python code, with a strong emphasis on reliability and performance.
Build a platform for training and fine-tuning large-scale ML models across multi-GPU, multi-node clusters with PyTorch, Kubeflow, and other orchestration tools.
Own the infrastructure and code that enables high-throughput, low-latency inference pipelines for state-of-the-art models.
Build a platform for developing, deploying, and evaluating agentic applications for our end customers.
Help shape internal standards and best practices across the engineering team for high-scale ML workloads.
What We're Looking For
5+ years of hands-on experience building and deploying ML systems at scale.
5+ years of writing production-quality, high-performance code.
Deep experience with multi-GPU/multi-node training, ideally with PyTorch as your primary framework.
Experience working with torch, high-level ML frameworks, and inference engines (vLLM or TensorRT).
Experience with Kubernetes is highly preferred; exposure to Kubernetes-native tools is a huge plus.
A pragmatic mindset: you know when to optimize and when to ship.
Bonus: Familiarity with open-source LLM training/fine-tuning.
Why Join TrueFoundry?
Work directly with ex-Facebook engineers and founders from IIT Kharagpur, UC Berkeley, and Y Combinator alumni.
First-hand exposure to building and scaling a deep-tech startup, with insights you'll carry if you want to start your own one day.
Be part of a fearlessly experimental culture focused on customer success and long-term impact.
Flexible hours, learning credits, and the opportunity to work shoulder-to-shoulder with the co-founders (Abhishek & Nikunj).
Seniority level: Mid-Senior level
Employment type: Full-time
Job function: Engineering and Information Technology
Industries: Software Development