Andiamo

Network Reliability Engineer - Decentralized High-Performance Computing Leader

Andiamo, Seattle, Washington, US, 98127

About The Role

We’re searching for an expert Network Reliability Engineer to architect, optimize, and operate the high-performance network fabrics that power large-scale AI and HPC workloads. You’ll be at the core of the engineering team responsible for building ultra-low-latency, high-throughput networks that connect thousands of GPUs and servers across global datacenters. This isn’t a traditional networking role; it’s an opportunity to shape the performance backbone of some of the world’s most demanding compute environments. You’ll blend deep networking expertise with software engineering to deliver systems that are reliable, scalable, fast, and efficient.

What You’ll Do

Engineer next-generation network performance: Fine‑tune TCP/IP, RDMA (RoCE), kernel‑bypass and in‑kernel fast‑path technologies (DPDK, XDP, eBPF), and NIC offloads to push latency and throughput toward their physical limits for high‑performance computing workloads.

Deploy at massive scale: Roll out and optimize large‑scale network fabrics across datacenters using top‑tier hardware (Arista, NVIDIA/Mellanox, Juniper, and more). Configure advanced BGP/EVPN topologies, spine‑leaf architectures, and congestion management for lossless transport.

Automate network intelligence: Build telemetry pipelines and automated systems for real‑time performance monitoring, packet‑loss detection, and predictive congestion analysis across complex environments.

Debug at the deepest levels: Lead investigations into packet loss, latency anomalies, and congestion hot spots — diving into kernel traces, switch firmware, and flow control mechanisms to pinpoint and resolve issues.

Collaborate with the industry’s best: Work directly with hardware and silicon vendors to debug firmware, optimize RDMA (RoCE) paths, validate optics, and integrate emerging technologies like 800G+ links and co‑packaged or linear pluggable optics (CPO/LPO).

Design for resilience and reliability: Simulate large‑scale network failures, run game‑day exercises, and turn lessons learned into robust automation, playbooks, and SLOs that drive measurable reliability improvements.

Who You Are

7+ years of experience in network engineering, SRE, or performance infrastructure roles — ideally within AI, HPC, or large‑scale cloud environments.

Deep understanding of the Linux networking stack, including kernel‑level debugging, TCP/IP, InfiniBand, and RoCE.

Proven hands‑on experience managing multi‑layer datacenter networks, network overlays (VXLAN, Geneve), and multi‑vendor environments (Arista, NVIDIA/Mellanox, Juniper, etc.).

Strong programming proficiency in Python, Go, or Rust, and experience with Infrastructure‑as‑Code and modern CI/CD practices.

Practical knowledge of DPDK, XDP, eBPF, and hardware acceleration frameworks used in low‑latency networking.

Demonstrated success in building and scaling high‑performance, low‑latency network architectures for data‑intensive systems or compute clusters.

Why This Role Matters

Modern AI and high‑performance computing workloads push data through networks at unprecedented speed and scale. This role sits at the intersection of innovation and reliability, where every microsecond and packet matters. As a Senior Network Reliability Engineer, you’ll design and operate the connective tissue of advanced compute infrastructure, ensuring the world’s most powerful systems run seamlessly, efficiently, and at peak performance.