Applied Compute

Infrastructure Engineer, ML Systems

Applied Compute, San Francisco, California, United States, 94199


Applied Compute builds Specific Intelligence for enterprises, unlocking the knowledge inside a company to train custom models and deploy an in-house agent workforce.

Today’s state-of-the-art AI isn’t one-size-fits-all—it’s a tailored system that continuously learns from a company’s processes, data, expertise, and goals. The same way companies compete today by having the best human workforce, the companies building for the future will compete by having the best agent workforce supporting their human bosses. We call this Specific Intelligence, and we’re already building it today.

We are a small, talent-dense team of engineers, researchers, and operators who have built some of the most influential AI systems in the world, including reinforcement learning infrastructure at OpenAI and data foundations at Scale AI, with additional experience from Together, Two Sigma, and Watershed.

We’re backed with $80M from Benchmark, Sequoia, Lux, Hanabi, Neo, Elad Gil, Victor Lazarte, Omri Casspi, and others. We work in-person in San Francisco.

The Role

As a founding Infrastructure Engineer, ML Systems, you’ll be responsible for designing, implementing, and optimizing the large-scale machine learning systems that power both customer deployments and frontier reinforcement learning research. Frontier systems are exciting yet brittle, and they require diligence and attention to detail to engineer correctly. We value performance with correctness: each on its own is necessary but not sufficient to train frontier models effectively.

You’ll work closely with our researchers and product engineers to bring frontier LLM post-training software into enterprise deployments. This role is perfect for systems enthusiasts who thrive on implementing high-performance, reliable systems at scale.

What You’ll Do

Design and optimize a frontier LLM post-training stack, including our training and inference pipelines

Implement and debug systems with an eye toward how they affect ML outcomes (e.g., when using low-precision numerics)

Design tooling and observability to allow researchers and customers to inspect and profile our large training systems

What We’re Looking For

Fearlessness and curiosity to understand all levels of the training system

Uncompromising desire to learn and keep up with frontier techniques

Background in programming on, and managing training jobs across, large-scale GPU systems

Bias toward fast implementation, paired with a high bar for reliability and efficiency

Experience with open-weights models (architecture and inference)

Background in reinforcement learning or integration of inference with RL training loops

Demonstrated technical creativity through published projects, OSS contributions, or side projects

Logistics

Location: This role is based in San Francisco, California.

Benefits:

Applied Compute offers generous health benefits, unlimited PTO, paid parental leave, lunches and dinners at the office, and relocation support as needed. We work in-person at a beautiful office in San Francisco’s Design District.

Visa sponsorship:

We sponsor visas. While we can’t guarantee success for every candidate or role, if you’re the right fit, we’re committed to working through the visa process with you.

We encourage you to apply even if you do not believe you meet every single qualification. As set forth in Applied Compute’s Equal Employment Opportunity policy, we do not discriminate on the basis of any protected group status under any applicable law.
