The Recruiting Guy

Senior Cloud Infrastructure Engineer

The Recruiting Guy, Boston, Massachusetts, US, 02298

Location:

San Francisco, CA (On‑Site only, no remote availability). Must live within commuting distance of San Francisco or be willing to relocate.

Relocation Assistance:

No.

Employment Type:

Salaried W2 Full‑Time.

Salary Range:

$175,000 – $250,000 per year.

About The Company

We represent a pioneering open‑source technology company in San Francisco that is transforming the way creators interact with generative AI. We build a powerful, node‑based visual interface that lets artists, developers, and innovators design, control, and customize AI workflows with complete flexibility. The platform allows users to connect modular components, build complex pipelines, and run everything locally with impressive speed and precision. Our mission is to make generative AI open, transparent, and accessible to everyone, empowering users to experiment freely and bring their ideas to life.

About The Role

In this role, you will lead the design, deployment, and maintenance of large‑scale distributed systems that power AI workloads. You should be deeply technical, self‑sufficient, and motivated by solving complex infrastructure challenges. You will work closely with core engineers to shape the company’s long‑term infrastructure vision while ensuring scalability, performance, and reliability across all environments.

What You’ll Do

Design, build, and maintain the core infrastructure that powers AI workloads at scale.
Manage and automate GPU compute clusters using Python, Kubernetes, Terraform, and Ansible.
Architect and operate systems for orchestration, observability, distributed storage, and networking.
Ensure reliability, scalability, and performance across production environments.
Collaborate closely with core engineers to design infrastructure for new features and systems.
Contribute to technical strategy and the company’s long‑term infrastructure vision.
Drive best practices for infrastructure automation, deployment, and monitoring.

Requirements

5+ years of experience as an Infrastructure Engineer or Site Reliability Engineer building and operating large‑scale distributed systems.
Strong skills in Python and comfort working with infrastructure‑as‑code tools such as Terraform and Ansible.
Familiarity with container orchestration systems (Kubernetes) and related tooling (FluxCD, Prometheus, Grafana).
Ability to manage high‑performance GPU environments across cloud and bare‑metal setups.
Highly adaptable, resourceful, and motivated by building things from the ground up.
Excited to work in a small, fast‑growing team where autonomy and accountability are key.
Comfortable working on‑site in a startup setting where collaboration and speed matter most.

Bonus Points

Experience contributing to or maintaining open‑source projects.
Background working with AI infrastructure, ML pipelines, or GPU orchestration.
Strong computer science fundamentals and ability to work across different programming languages or frameworks.
