Clockwork Systems
Clockwork Systems, Palo Alto, CA, US, 94306
Software Engineer - Distributed Training
Clockwork.io is a Silicon Valley startup that delivers state-of-the-art AI compute acceleration.
We were founded by Stanford researchers and veteran systems engineers with a shared belief: distributed systems powering modern AI require a new approach to managing time, reliability, and performance. Unlike traditional solutions that rely on specialized hardware or embedded telemetry in switches, Clockwork's system brings insane visibility, resilience, acceleration, and efficiency to the network layer entirely through software. As AI workloads continue to scale in size, urgency, and impact, networks must evolve to keep up. Clockwork exists to make that evolution possible.
About Us
Clockwork.io: A Software-Driven Revolution in AI Networking
Clockwork Systems was founded by Stanford researchers and veteran systems engineers who share a vision for redefining the foundations of distributed computing. As AI workloads grow increasingly complex, traditional infrastructure struggles to meet the demands of performance, reliability, and precise coordination. Clockwork is pioneering a software-driven approach to AI networking, delivering deterministic time, ultra-low latency, and seamless scalability for modern distributed systems.
To learn more, visit www.clockwork.io.
About the Role
We are looking for an experienced software engineer to help build, optimize, and maintain large-scale distributed training infrastructure based on the PyTorch ecosystem. This role focuses on production-grade training workflows involving multi-GPU and multi-node orchestration, high-performance communication layers, and advanced parallelism strategies.
You'll work alongside infrastructure and machine learning teams to ensure training jobs are efficient, scalable, and resilient.
What You'll Do
Develop and support distributed PyTorch training jobs using torch.distributed / c10d
Integrate and maintain frameworks like Megatron-LM, DeepSpeed, and related LLM training stacks
Diagnose and resolve distributed training issues (e.g., NCCL hangs, OOM, checkpoint corruption)
Optimize performance across communication, I/O, and memory bottlenecks
Implement fault tolerance, checkpointing, and recovery mechanisms for long-running jobs
Write tooling and scripts to streamline training workflows and experiment management
Collaborate with ML engineers to ensure compatibility with orchestration and container environments (e.g., Slurm, Kubernetes)
What We're Looking For
Deep experience with PyTorch and torch.distributed (c10d)
Hands-on experience with at least one of: Megatron-LM, DeepSpeed, or FairScale
Proficiency in Python and Linux shell scripting
Experience with multi-node GPU clusters using Slurm, Kubernetes, or similar
Strong understanding of NCCL, collective communication, and GPU topology
Familiarity with debugging tools and techniques for distributed systems
Preferred Skills
Experience scaling LLM training across 8+ GPUs and multiple nodes
Knowledge of tensor, pipeline, and data parallelism
Familiarity with containerized training environments (Docker, Singularity)
Exposure to HPC environments or cloud GPU infrastructure
Experience with training workload orchestration tools or custom job launchers
Comfort with large-scale checkpointing, resume/restart logic, and model I/O
Bonus Skills
Profiling tools: PyTorch Profiler, Nsight, nvprof, or equivalent
Experience with performance tuning in distributed training environments
Contributions to ML infrastructure open-source projects
Familiarity with storage, networking, or RDMA/GPUDirect technologies
Understanding of observability in ML pipelines (metrics, logs, dashboards)
Enjoy
Challenging projects.
A friendly and inclusive workplace culture.
Competitive compensation.
A great benefits package.
Catered lunch.
Clockwork is assembling world-class teams to build cutting-edge software. We look for bright people from all walks of life, and we grow together. All qualified applicants will receive consideration for employment without regard to race, color, ancestry, religion, age, sex, sexual orientation, gender identity, national origin, or protected veteran status and will not be discriminated against on the basis of disability.
Seniority level: Mid-Senior level
Employment type: Full-time
Job function: Engineering and Information Technology
Industries: Software Development