Fluidstack

Senior / Staff Site Reliability Engineer

Fluidstack, Kenmore, New York, United States

Overview

Senior / Staff Site Reliability Engineer (SRE) role at Fluidstack. Fluidstack is building GPU supercomputers for AI labs, governments, and enterprises. Our customers include Mistral, Poolside, Black Forest Labs, Meta, and more. Our team is small, highly motivated, and focused on providing a world-class supercomputing experience. We strive to win repeat business and customer referrals by placing the customer first in everything we do.

You must work hard, take ownership from inception to delivery, and approach every problem with an open mind and a positive attitude. We value effectiveness, competence, and a growth mindset.

About the Role

Senior / Staff SREs sit at the core of Fluidstack's infrastructure, working across software, hardware, and operations to ensure reliability and performance of our global GPU cloud. They partner with teams including networking, platform engineering, and data center operations to build systems that scale with AI workloads. SREs are hands-on and possess deep systems knowledge with strong communication skills. You’ll tackle complex production issues, deploy resilient infrastructure, and continuously improve the stability and observability of our platform as we grow.

Responsibilities

Deploy clusters of 1,000+ GPUs using custom-written playbooks; modify tools as necessary to provide the best solution for a customer.

Validate correctness and performance of compute, storage, and networking infrastructure; collaborate with providers to optimize subsystems.

Migrate petabytes of data from public cloud platforms to local storage quickly and cost-effectively.

Debug issues across the stack, from hardware anomalies to data loader optimization across regions.

Build internal tooling to decrease deployment time and increase cluster reliability, including automation where the customer benefits outweigh the overhead.

Participate in an on-call rotation of up to one week per month.

Qualifications

2+ years of SRE, DevOps, Sysadmin, and/or HPC engineering experience.

Strong verbal and written communication skills in English.

Experience deploying and operating Kubernetes and/or Slurm clusters.

Experience writing Go, Python, and Bash.

Experience using Ansible, Terraform, and other automation or IaC tools.

Strong engineering background, preferably in Computer Science, Software Engineering, Math, Computer Engineering, or similar fields.

Exceptional candidates have one or more of the following experiences:

Built and operated an AI workload at 1,000+ GPU scale.

Built multi-tenant, hyperscale Kubernetes-based services.

Physically deployed infrastructure in a data center and managed bare-metal hardware via MAAS or NetBox.

Deployed and managed multi-tenant InfiniBand or RoCE networks.

Deployed and managed petabyte-scale all-flash storage systems (e.g., DDN, VAST, Weka) or open-source storage systems such as Ceph or Lustre.

Benefits

Competitive total compensation package (cash + equity).

Retirement or pension plan, in line with local norms.

Health, dental, and vision insurance.

Generous PTO policy, in line with local norms.

Seniority level

Mid-Senior level

Employment type

Full-time

Job function

Engineering and Information Technology

Industries

Technology, Information and Internet
