Causal Labs, Inc.
Machine Learning - Infrastructure
Causal Labs, Inc., San Francisco, California, United States, 94199
About us
Our mission is to build causal intelligence, starting with physics models to predict and control the weather.
We're building a small team driven by a deep passion and urgency to solve this civilizationally important problem.
Our founding team has led & shipped models across self‑driving cars, humanoid robotics, protein folding, and video generation at world‑class institutions including Google DeepMind, Cruise, Waymo, Meta, Nabla Bio, and Apple.
Responsibilities
Design, deploy, and maintain large distributed ML training and inference clusters
Develop efficient, scalable end‑to‑end pipelines to manage petabyte‑scale datasets and model training throughout the entire ML lifecycle
Research and test various training approaches including parallelization techniques and numerical precision trade‑offs across different model scales
Analyze, profile, and debug low‑level GPU operations to optimize performance
Stay up to date on research and bring new ideas into our work
What we’re looking for
We value a relentless approach to problem‑solving, rapid execution, and the ability to quickly learn in unfamiliar domains.
Strong grasp of state‑of‑the‑art techniques for optimizing training and inference workloads
Demonstrated proficiency with distributed training frameworks (e.g., FSDP, DeepSpeed) to train large foundation models
Knowledge of cloud platforms (GCP, AWS, or Azure) and their ML/AI service offerings
Familiarity with containerization and orchestration frameworks (e.g., Kubernetes, Docker)
Background working on distributed task management systems and scalable model serving & deployment architectures
Understanding of monitoring, logging, observability, and version control best practices for ML systems
You don’t have to meet every single requirement above.
Benefits
Work on deeply challenging, unsolved problems
Competitive cash and equity compensation
Medical, dental, and vision insurance
Catered lunch & dinner
Unlimited paid time off
Visa sponsorship & relocation support