The Recruiting Guy

Senior Cloud Infrastructure Engineer

The Recruiting Guy, San Francisco, California, United States, 94199


If this role is still posted, we are still recruiting and accepting applications.

Job Title: Senior Cloud Infrastructure Engineer

Location: San Francisco, CA. Remote unavailable.

Modality: On-Site only. Must live within commuting distance of San Francisco or be willing to relocate.

Relocation Assistance: No

Employment Type: Salaried W2 Full-Time.

Salary Range: $175,000 - $250,000

Company Overview

We represent a pioneering open source technology company in San Francisco that is transforming the way creators interact with generative AI. They are the team behind a powerful, node‑based visual interface that gives artists, developers, and innovators the ability to design, control, and customize AI workflows with complete flexibility. Their platform allows users to connect modular components, build complex pipelines, and run everything locally with impressive speed and precision. Their mission is to make generative AI open, transparent, and accessible to everyone. Built around community collaboration and creative empowerment, their tools help users experiment freely and bring their ideas to life. Whether it is visual storytelling, image generation, or advanced machine learning, their technology gives creators the freedom to explore without limitations.

About The Role

In this role, you will take the lead on designing, deploying, and maintaining large‑scale distributed systems that power AI workloads. The ideal candidate is deeply technical, self‑sufficient, and motivated by solving complex infrastructure challenges. You will work closely with core engineers to shape the company’s long‑term infrastructure vision while ensuring scalability, performance, and reliability across environments.

What You’ll Do

Design, build, and maintain the core infrastructure that powers AI workloads at scale

Manage and automate GPU compute clusters using tools such as Python, Kubernetes, Terraform, and Ansible

Architect and operate systems for orchestration, observability, distributed storage, and networking

Ensure reliability, scalability, and performance across production environments

Collaborate closely with core engineers to design infrastructure for new features and systems

Contribute to technical strategy and long‑term infrastructure vision

Drive best practices for infrastructure automation, deployment, and monitoring

Requirements

5+ years of experience as an Infrastructure Engineer or Site Reliability Engineer building and operating large‑scale distributed systems

Skilled in Python and comfortable working with infrastructure‑as‑code tools such as Terraform and Ansible

Familiar with container orchestration systems such as Kubernetes and related tooling like FluxCD, Prometheus, and Grafana

Capable of managing high‑performance GPU environments across cloud and bare metal setups

Highly adaptable, resourceful, and motivated by building things from the ground up

Excited to work in a small, fast‑growing team where autonomy and accountability are key

Comfortable working on‑site in a startup setting where collaboration and speed matter most

Bonus Points

Experience contributing to or maintaining open‑source projects

Background working with AI infrastructure, ML pipelines, or GPU orchestration

Strong computer science fundamentals and ability to work across different programming languages or frameworks

Skills: FluxCD, Ansible, Kubernetes, Grafana, Prometheus, Python, Terraform, infrastructure
