Rethink recruit
Senior ML Engineer — Distributed LLM Training Infrastructure
Rethink recruit, Poland, New York, United States
About Templar
Templar is at the forefront of community-driven AI development, redefining how large language models (LLMs) are trained. Our team enables permissionless pretraining, allowing collaborators across diverse computational environments to jointly train LLMs without centralized coordination. Our latest research, Incentivizing Permissionless Distributed Learning of LLMs, introduces Gauntlet, an incentive system deployed on-chain that powered a truly decentralized 1.2B parameter LLM training run. The paper is available on arXiv: Incentivizing Permissionless Distributed Learning of LLMs.
Role Overview
We're looking for a seasoned Senior ML Engineer to architect and scale the infrastructure that enables distributed LLM training. You will design robust systems atop existing frameworks, extend permissionless training protocols, and optimize for decentralized environments spanning heterogeneous hardware.
Responsibilities
Distributed Training Infrastructure
Architect scalable training across frameworks like TorchTitan, Megatron-LM, DeepSpeed, FairScale
Implement model, data, and pipeline parallelism with efficient gradient synchronization and all-reduce across heterogeneous clusters
Build fault-tolerant systems including checkpointing and node-failure recovery (a minimal sketch follows this list)
Optimize memory usage and GPU operations with custom CUDA kernels
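As a rough illustration of the kind of work this area covers, below is a minimal sketch of a fault-tolerant PyTorch DDP training loop with periodic checkpointing and resume-on-restart. The tiny model, synthetic batches, and checkpoint path are illustrative placeholders, not Templar's actual stack.

# Minimal sketch: DDP training with periodic checkpointing for node-failure recovery.
# The model, data, and checkpoint path are placeholders, not Templar's actual stack.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")                   # NCCL backend for GPU all-reduce
    rank = dist.get_rank()
    device = rank % torch.cuda.device_count()
    torch.cuda.set_device(device)

    model = torch.nn.Linear(1024, 1024).to(device)    # stand-in for a real LLM
    model = DDP(model, device_ids=[device])           # gradients are all-reduced in backward()
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    start_step, ckpt_path = 0, "checkpoint.pt"
    if os.path.exists(ckpt_path):                     # resume after a node failure or preemption
        ckpt = torch.load(ckpt_path, map_location=f"cuda:{device}")
        model.module.load_state_dict(ckpt["model"])
        opt.load_state_dict(ckpt["opt"])
        start_step = ckpt["step"] + 1

    for step in range(start_step, 1000):
        x = torch.randn(32, 1024, device=device)      # placeholder batch
        loss = model(x).pow(2).mean()
        loss.backward()                               # triggers the all-reduce under DDP
        opt.step()
        opt.zero_grad()
        if step % 100 == 0 and rank == 0:             # periodic checkpoint from rank 0 only
            torch.save({"model": model.module.state_dict(),
                        "opt": opt.state_dict(),
                        "step": step}, ckpt_path)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()    # launch with: torchrun --nproc_per_node=N this_script.py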
Framework Development & Optimization
Extend frameworks to enable multi‑party permissionless training
Implement optimization features such as gradient compression, quantization, and sparsification (see the compression sketch after this list)
Build resilient communication backends for high‑latency, unreliable networks
Develop resource managers, schedulers, and profiling tools for distributed training
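To make the compression item above concrete, here is a minimal sketch of top-k gradient sparsification before communication; the ratio and tensors are illustrative and not a statement of Templar's actual scheme.

# Minimal sketch: top-k gradient sparsification ahead of a slow or unreliable link.
# Illustrative only; the compression scheme actually used may differ.
import torch

def topk_compress(grad: torch.Tensor, ratio: float = 0.01):
    """Keep only the largest `ratio` fraction of gradient entries by magnitude."""
    flat = grad.flatten()
    k = max(1, int(flat.numel() * ratio))
    _, indices = torch.topk(flat.abs(), k)
    return flat[indices], indices, grad.shape         # values, positions, original shape

def topk_decompress(values, indices, shape):
    """Scatter the kept entries back into a dense gradient of the original shape."""
    flat = torch.zeros(shape, device=values.device, dtype=values.dtype).flatten()
    flat[indices] = values
    return flat.view(shape)

# Usage: compress before sending, decompress on the receiving peer.
g = torch.randn(4, 1024)
vals, idx, shape = topk_compress(g, ratio=0.05)
g_hat = topk_decompress(vals, idx, shape)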
System Architecture & Scaling
Architect full training pipelines (data ingestion ➜ deployment)
Build containerized systems via Kubernetes/Docker across cloud platforms
Design model sharding strategies for 100B+ parameter models (see the FSDP sketch after this list)
Implement CI/CD pipelines for distributed infrastructure
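As a hedged example of the sharding item above, the sketch below uses PyTorch FSDP to shard a small stand-in transformer across ranks; real 100B+ parameter runs would add activation checkpointing, offload tuning, and carefully chosen wrap policies.

# Minimal sketch: sharding parameters, gradients, and optimizer state with PyTorch FSDP.
# The tiny encoder below is a placeholder for a real LLM.
import functools
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

layer = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
model = torch.nn.TransformerEncoder(layer, num_layers=4).cuda()

# Shard at the granularity of encoder layers so each rank holds only a slice of the model.
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={torch.nn.TransformerEncoderLayer},
)
model = FSDP(model, auto_wrap_policy=wrap_policy)

x = torch.randn(2, 128, 512, device="cuda")       # placeholder batch
loss = model(x).mean()
loss.backward()                                   # gradients reduce-scattered across ranks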
Performance Engineering
Profile and optimize throughput, memory, communication patterns
Leverage mixed precision, gradient accumulation, and fused kernels (see the sketch after this list)
Build benchmarking and performance regression suites
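A minimal sketch of the mixed-precision and gradient-accumulation item above; the model, batch size, and accumulation window are placeholders rather than a prescribed setup.

# Minimal sketch: fp16 autocast with loss scaling plus gradient accumulation.
# Model and data are placeholders; a real run would also overlap communication and profile memory.
import torch

model = torch.nn.Linear(1024, 1024).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()              # rescales fp16 gradients to avoid underflow
accum_steps = 8                                   # effective batch = micro-batch x 8

for step in range(1, 101):
    x = torch.randn(16, 1024, device="cuda")      # placeholder micro-batch
    with torch.cuda.amp.autocast():
        loss = model(x).pow(2).mean() / accum_steps
    scaler.scale(loss).backward()                 # gradients accumulate across micro-batches
    if step % accum_steps == 0:
        scaler.step(opt)                          # unscale + optimizer step once per window
        scaler.update()
        opt.zero_grad(set_to_none=True)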
Required Qualifications
Bachelor’s or Master’s in CS, Engineering, or related field
5+ years of experience with large‑scale distributed systems / HPC
Deep expertise with distributed LLM frameworks (TorchTitan, Megatron‑LM, DeepSpeed, FairScale)
Expert‑level PyTorch knowledge and hands‑on with DDP/FSDP/RPC
Strong experience in Python and C++/CUDA systems programming
Familiarity with Kubernetes, Docker, and cloud platforms (AWS/GCP/Azure)
Proven track record scaling ML training workloads efficiently
Preferred Experience
Training models >10B parameters via model parallelism
CUDA and GPU optimization for deep learning workloads
Proficiency with NCCL, MPI, high‑throughput data pipelines
Familiarity with decentralized systems, P2P, or blockchain technologies
Contributions to open‑source ML/distributed training projects
Immediate Priorities (0–6 Months)
Benchmark distributed frameworks for permissionless setups
Build proof‑of‑concept infrastructure: gradient compression, node management, fault tolerance
Develop performance benchmarking and observability tools (see the throughput sketch after this list)
Collaborate with research teams to transition algorithms to production
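As a small example of the benchmarking item above, the sketch below times training steps and reports tokens per second; the model, batch, and sequence length are placeholders, and a fuller suite would also track memory and per-step communication time.

# Minimal sketch: measuring training-step throughput (tokens/sec) on a single GPU.
# Placeholder model and data; illustrative only.
import time
import torch

model = torch.nn.Linear(1024, 1024).cuda()
opt = torch.optim.AdamW(model.parameters())
batch, seq, steps = 8, 2048, 20
x = torch.randn(batch, seq, 1024, device="cuda")

torch.cuda.synchronize()                          # start timing from an idle GPU
start = time.perf_counter()
for _ in range(steps):
    loss = model(x).pow(2).mean()
    loss.backward()
    opt.step()
    opt.zero_grad()
torch.cuda.synchronize()                          # wait for queued GPU work before stopping the clock
elapsed = time.perf_counter() - start
print(f"{batch * seq * steps / elapsed:,.0f} tokens/sec")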
Longer-Term (6+ Months)
Lead development of the next-generation distributed training platform, supporting 1000+ participants
Implement advanced CUDA/kernel optimizations
Release SDKs/APIs for permissionless participation
Establish performance benchmarks and open‑source infrastructure components
Why Join Templar?
Your work will directly enable the democratization of LLM training, making large-scale models accessible across the globe without reliance on centralized resources. You'll be central to building the distributed systems that support permissionless AI innovation.