E2B

Distributed Systems Engineer

E2B, San Francisco, California, United States, 94199


Languages: 5+ years of experience in Go, Terraform

Skills: Go, Building and managing large clusters, Linux, Networking, Kubernetes, Virtualization

Who we are

E2B is a fast-growing Series A startup with 7-figure revenue. We've raised over $32M in total since our funding in 2023 and are supported by great investors like Insight Partners. Our customers are companies like Perplexity, Hugging Face, Manus, and Groq.

We're building the next hyperscaler for AI agents.

About the role

You will be building the next cloud platform for running AI software - a cloud where AI apps are building other software apps.

Your job will be:

Building a distributed system for millions and billions of AI agents running on E2B

Building an orchestrator for placing sandboxes in the right nodes

Adding support for sandbox live migrations

Making sure our self-hosting DX is as smooth as possible (we’re open-source)

Keeping sandbox start times at 200ms or less (measured from the moment the user hits enter)

Scaling to millions and later billions of sandboxes running at the same time

Building an observability stack starting at the kernel level of virtual machines

We’re looking for an infrastructure engineer passionate about making things run fast and efficiently, and running A LOT of them at the same time.

If you aren’t afraid of going into the kernel of a VM and words like Firecracker, eBPF, UFFD, block device, L4 load balancing, noisy neighbor problem, or hugepages sound exciting to you, we want to hear from you!

What we're looking for

7+ years building distributed systems

- You've operated infrastructure at serious scale (100K+ RPS, multi-region, PB-scale data) and understand the trade-offs between consistency, availability, and partition tolerance in practice, not just theory

Deep Linux internals expertise

- You're comfortable working at the kernel level. You've debugged performance issues using eBPF, understand CPU scheduling, memory management, and can explain the difference between cgroups v1 and v2 without looking it up

VM hypervisor experience

- You've worked with Firecracker, QEMU, KVM, or similar. You understand virtio, know what a hypercall is, and have opinions about nested virtualization trade-offs

Systems programming skills

- Strong in at least one of: Go, Rust, C/C++. You've written performance-critical code and know when to reach for lock-free data structures, memory-mapped files, or io_uring

Production orchestration experience

- You've built or operated orchestration systems (Kubernetes, Nomad, or custom). You understand bin-packing algorithms, resource scheduling, and have dealt with noisy neighbor problems at scale

Performance obsession

- You've shaved milliseconds off hot paths, understand CPU caches and memory locality, and have profiled production systems under load. You know what "p99 latency" means and care deeply about making it better

Networking expertise

- Strong understanding of L4/L7 load balancing, network namespaces, iptables/nftables, and how to build secure, isolated network topologies for multi-tenant systems

Located in San Francisco or willing to relocate

- We work in person as a team and believe in the magic that happens when engineers collaborate face-to-face on hard problems

Excited about open source

- Comfortable with our code and infrastructure being public. You contribute to discussions, write clear documentation, and help the community succeed with self-hosting

Bonus points for:

Experience with userfaultfd (UFFD), copy-on-write mechanisms, or lazy loading

GPU passthrough or PCIe device virtualization experience

Built or maintained infrastructure for AI/ML workloads

Contributions to Firecracker, Cloud Hypervisor, or similar open source projects

Experience with observability at scale (distributed tracing, kernel-level metrics)
