Rethink recruit
Senior ML Engineer — Distributed LLM Training Infrastructure
Rethink recruit, Poland, New York, United States
About Templar
Templar is at the forefront of community-driven AI development, redefining how large language models (LLMs) are trained. Our team enables permissionless pretraining, allowing collaborators across diverse computational environments to jointly train LLMs without centralized coordination. Our latest research, Incentivizing Permissionless Distributed Learning of LLMs, introduces Gauntlet, an incentive system deployed on-chain that powered a truly decentralized 1.2B parameter LLM training run. The paper is available on arXiv: Incentivizing Permissionless Distributed Learning of LLMs.
Role Overview
We're looking for a seasoned Senior ML Engineer to architect and scale the infrastructure that enables distributed LLM training. You will design robust systems atop existing frameworks, extend permissionless training protocols, and optimize for decentralized environments spanning heterogeneous hardware.
Responsibilities
Distributed Training Infrastructure
Architect scalable training across frameworks like TorchTitan, Megatron-LM, DeepSpeed, FairScale
Implement model, data, and pipeline parallelism with efficient gradient synchronization and all-reduce across heterogeneous clusters
Build fault-tolerant systems including checkpointing and node-failure recovery (a minimal sketch follows this list)
Optimize memory usage and GPU operations with custom CUDA kernels
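As a rough illustration of the kind of work this area covers, below is a minimal sketch of a fault-tolerant PyTorch DDP training loop with periodic checkpointing and resume-on-restart. The tiny model, synthetic batches, and checkpoint path are illustrative placeholders, not Templar's actual stack.

# Minimal sketch: DDP training with periodic checkpointing for node-failure recovery.
# The model, data, and checkpoint path are placeholders, not Templar's actual stack.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")                   # NCCL backend for GPU all-reduce
    rank = dist.get_rank()
    device = rank % torch.cuda.device_count()
    torch.cuda.set_device(device)

    model = torch.nn.Linear(1024, 1024).to(device)    # stand-in for a real LLM
    model = DDP(model, device_ids=[device])           # gradients are all-reduced in backward()
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    start_step, ckpt_path = 0, "checkpoint.pt"
    if os.path.exists(ckpt_path):                     # resume after a node failure or preemption
        ckpt = torch.load(ckpt_path, map_location=f"cuda:{device}")
        model.module.load_state_dict(ckpt["model"])
        opt.load_state_dict(ckpt["opt"])
        start_step = ckpt["step"] + 1

    for step in range(start_step, 1000):
        x = torch.randn(32, 1024, device=device)      # placeholder batch
        loss = model(x).pow(2).mean()
        loss.backward()                               # triggers the all-reduce under DDP
        opt.step()
        opt.zero_grad()
        if step % 100 == 0 and rank == 0:             # periodic checkpoint from rank 0 only
            torch.save({"model": model.module.state_dict(),
                        "opt": opt.state_dict(),
                        "step": step}, ckpt_path)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()    # launch with: torchrun --nproc_per_node=N this_script.py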
Framework Development & Optimization
Extend frameworks to enable multi‑party permissionless training
Implement optimization features such as gradient compression, quantization, and sparsification (see the compression sketch after this list)
Build resilient communication backends for high‑latency, unreliable networks
Develop resource managers, schedulers, and profiling tools for distributed training
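To make the compression item above concrete, here is a minimal sketch of top-k gradient sparsification before communication; the ratio and tensors are illustrative and not a statement of Templar's actual scheme.

# Minimal sketch: top-k gradient sparsification ahead of a slow or unreliable link.
# Illustrative only; the compression scheme actually used may differ.
import torch

def topk_compress(grad: torch.Tensor, ratio: float = 0.01):
    """Keep only the largest `ratio` fraction of gradient entries by magnitude."""
    flat = grad.flatten()
    k = max(1, int(flat.numel() * ratio))
    _, indices = torch.topk(flat.abs(), k)
    return flat[indices], indices, grad.shape         # values, positions, original shape

def topk_decompress(values, indices, shape):
    """Scatter the kept entries back into a dense gradient of the original shape."""
    flat = torch.zeros(shape, device=values.device, dtype=values.dtype).flatten()
    flat[indices] = values
    return flat.view(shape)

# Usage: compress before sending, decompress on the receiving peer.
g = torch.randn(4, 1024)
vals, idx, shape = topk_compress(g, ratio=0.05)
g_hat = topk_decompress(vals, idx, shape)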
System Architecture & Scaling
Architect full training pipelines (data ingestion ➜ deployment)
Build containerized systems via Kubernetes/Docker across cloud platforms
Design model sharding strategies for 100B+ parameter models (see the FSDP sketch after this list)
Implement CI/CD pipelines for distributed infrastructure
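As a hedged example of the sharding item above, the sketch below uses PyTorch FSDP to shard a small stand-in transformer across ranks; real 100B+ parameter runs would add activation checkpointing, offload tuning, and carefully chosen wrap policies.

# Minimal sketch: sharding parameters, gradients, and optimizer state with PyTorch FSDP.
# The tiny encoder below is a placeholder for a real LLM.
import functools
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

layer = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
model = torch.nn.TransformerEncoder(layer, num_layers=4).cuda()

# Shard at the granularity of encoder layers so each rank holds only a slice of the model.
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={torch.nn.TransformerEncoderLayer},
)
model = FSDP(model, auto_wrap_policy=wrap_policy)

x = torch.randn(2, 128, 512, device="cuda")       # placeholder batch
loss = model(x).mean()
loss.backward()                                   # gradients reduce-scattered across ranks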
Performance Engineering
Profile and optimize throughput, memory, communication patterns
Leverage mixed precision, gradient accumulation, and fused kernels (see the sketch after this list)
Build benchmarking and performance regression suites
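A minimal sketch of the mixed-precision and gradient-accumulation item above; the model, batch size, and accumulation window are placeholders rather than a prescribed setup.

# Minimal sketch: fp16 autocast with loss scaling plus gradient accumulation.
# Model and data are placeholders; a real run would also overlap communication and profile memory.
import torch

model = torch.nn.Linear(1024, 1024).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()              # rescales fp16 gradients to avoid underflow
accum_steps = 8                                   # effective batch = micro-batch x 8

for step in range(1, 101):
    x = torch.randn(16, 1024, device="cuda")      # placeholder micro-batch
    with torch.cuda.amp.autocast():
        loss = model(x).pow(2).mean() / accum_steps
    scaler.scale(loss).backward()                 # gradients accumulate across micro-batches
    if step % accum_steps == 0:
        scaler.step(opt)                          # unscale + optimizer step once per window
        scaler.update()
        opt.zero_grad(set_to_none=True)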
Required Qualifications
Bachelor’s or Master’s in CS, Engineering, or related field
5+ years of experience with large‑scale distributed systems / HPC
Deep expertise with distributed LLM frameworks (TorchTitan, Megatron‑LM, DeepSpeed, FairScale)
Expert‑level PyTorch knowledge and hands‑on with DDP/FSDP/RPC
Strong experience in Python and C++/CUDA systems programming
Familiarity with Kubernetes, Docker, and cloud platforms (AWS/GCP/Azure)
Proven track record scaling ML training workloads efficiently
Preferred Experience
Training models >10B parameters via model parallelism
CUDA and GPU optimization for deep learning workloads
Proficiency with NCCL, MPI, high‑throughput data pipelines
Familiarity with decentralized systems, P2P, or blockchain technologies
Contributions to open‑source ML/distributed training projects
Immediate Priorities (0–6 Months)
Benchmark distributed frameworks for permissionless setups
Build proof‑of‑concept infrastructure: gradient compression, node management, fault tolerance
Develop performance benchmarking and observability tools (see the throughput sketch after this list)
Collaborate with research teams to transition algorithms to production
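As a small example of the benchmarking item above, the sketch below times training steps and reports tokens per second; the model, batch, and sequence length are placeholders, and a fuller suite would also track memory and per-step communication time.

# Minimal sketch: measuring training-step throughput (tokens/sec) on a single GPU.
# Placeholder model and data; illustrative only.
import time
import torch

model = torch.nn.Linear(1024, 1024).cuda()
opt = torch.optim.AdamW(model.parameters())
batch, seq, steps = 8, 2048, 20
x = torch.randn(batch, seq, 1024, device="cuda")

torch.cuda.synchronize()                          # start timing from an idle GPU
start = time.perf_counter()
for _ in range(steps):
    loss = model(x).pow(2).mean()
    loss.backward()
    opt.step()
    opt.zero_grad()
torch.cuda.synchronize()                          # wait for queued GPU work before stopping the clock
elapsed = time.perf_counter() - start
print(f"{batch * seq * steps / elapsed:,.0f} tokens/sec")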
Longer-Term (6+ Months)
Lead development of the next-generation distributed training platform, supporting 1000+ participants
Implement advanced CUDA/kernel optimizations
Release SDKs/APIs for permissionless participation
Establish performance benchmarks and open‑source infrastructure components
Why Join Templar?
Your work will directly enable the democratization of LLM training, making large-scale models accessible across the globe without reliance on centralized resources. You'll be central to building the distributed systems that support permissionless AI innovation.