Fluidstack
Senior / Staff Site Reliability Engineer, Compute
Fluidstack, San Francisco, California, United States, 94199
About Fluidstack
Fluidstack is building GPU supercomputers for top AI labs, governments, and enterprises. Our customers include Mistral, Poolside, Black Forest Labs, Meta, and more. Our team is small, highly motivated, and focused on providing a world-class supercomputing experience. We put our customers first in everything we do, working hard not just to win the sale, but to win repeat business and customer referrals. We hold ourselves and each other to high standards. We expect you to care deeply about the work you do, the products you build, and the experience our customers have in every interaction with us. You must work hard, take ownership from inception to delivery, and approach every problem with an open mind and a positive attitude. We value effectiveness, competence, and a growth mindset.
About the Role
Our Senior / Staff Site Reliability Engineers (Compute) are the backbone of Fluidstack’s platform. You’ll use deep systems expertise and software engineering to keep our bare-metal and virtualised compute fleet fast, reliable and cost-efficient at petabyte scale.
Focus
Super-charge virtualisation. Tune hypervisors (KVM/QEMU), kernel subsystems and NUMA layouts to squeeze microseconds off tail latency for AI & HPC jobs.
Deploy & optimise at scale. Roll out new CPU/GPU/DPU nodes, validate SmartNIC and BlueField offloads and harden workload isolation.
Automate observability. Build kernel-to-orchestrator telemetry, incident-response bots and performance dashboards.
Root-cause the gnarly stuff. Lead crash-dump and kexec/kdump analyses and investigate performance regressions; turn insights into upstream patches and config templates.
Drive kernel & hardware collaboration. Pair with silicon and Linux teams to debug drivers, accelerate I/O paths and integrate emerging compute hardware (TPUs, DPUs).
Continuously improve. Inject chaos, run game-days and codify post-mortem learnings into SLIs/SLOs that matter to customers.
About you
5+ years in compute-heavy SRE, kernel or virtualisation engineering.
Mastery of Linux internals (scheduler, memory, drivers) and system-level debugging.
Production experience with KVM, Xen, QEMU, VMware or similar hypervisors.
Fluency in C, Go or Rust; solid Infra-as-Code & CI/CD chops.
Familiarity with SmartNICs/DPUs and kernel-bypass networking.
Proven track record scaling high-throughput compute or HPC platforms.
Benefits
Competitive total compensation package (cash + equity).
Retirement or pension plan, in line with local norms.
Health, dental, and vision insurance.
Generous PTO policy, in line with local norms.