Head of Site Reliability Engineering at ChipStack
ChipStack - San Jose, California, United States, 95199
About ChipStack
Chips power everything, yet chip‑design tooling hasn’t kept up with the exploding complexity. ChipStack reinvents verification with AI‑native software already in use at 10+ semiconductor innovators. Backed by Khosla Ventures, Cerberus, and Clear Ventures, our small, fast team ships at the intersection of AI, EDA, and systems engineering.

Locations
San Jose, CA – On‑site

Employment Type
Full‑time

Job Function
Engineering and Information Technology

Industries
Software Development

The Opportunity
We need reliable, low‑latency deployments, often inside customer data centers with no internet egress. As our first dedicated reliability owner, you’ll design, automate, and operate these hybrid/on‑prem environments to ensure “five nines” availability without touching the underlying plumbing.

What You’ll Do
- Own end‑to‑end reliability by architecting, deploying, and monitoring production clusters (on‑prem & cloud) running micro‑services, LLM workloads, and GPU back‑ends.
- Automate the stack with IaC pipelines (Terraform), GitOps workflows, and zero‑downtime rollout strategies.
- Observe & respond by instrumenting apps with Prometheus/Grafana, setting SLOs/SLIs, leading incident response, performing root‑cause analysis, and hardening runbooks.
- Secure & comply by implementing network segmentation, secrets management, RBAC, and vulnerability scanning to meet industry standards.
- Collaborate with product engineers on performance, scalability, and customer issues.
- Improve practices in testing, CI/CD, and chaos engineering to promote a fast, quality-focused culture.

Must‑Have Skills
- 5+ years building and operating production systems as an SRE/DevOps/Platform Engineer.
- Hands‑on experience with Kubernetes and Docker in hybrid or bare‑metal setups.
- Strong Python skills; proficiency with TypeScript services.
- Deep Linux administration knowledge (kernel tuning, networking, storage, security hardening).
- Track record of delivering 99.9%+ uptime for latency-sensitive services.
- Experience with observability tools (Prometheus, Grafana, Loki/ELK, Alertmanager).
- Proficiency with Terraform or similar IaC tools and Git workflows.
- Excellent communication skills and a proactive approach to solving first‑of‑its‑kind problems.

Nice‑to‑Have
- Experience with GPU workloads, ML inference, or EDA toolchains in production.
- Familiarity with air‑gapped/restricted‑network deployments and data‑center operations.
- Knowledge of security certifications (SOC 2, ISO 27001) or semiconductor customer audits.
- Experience working at an early-stage startup.

Our Culture
- Challenge status quo
- Strong opinions, loosely held
- Ship fast, ship quality
- Proud of our craft

Ready to harden the infrastructure that will redefine chip design?
Apply now and help keep ChipStack running flawlessly for the world’s most advanced silicon teams.

Additional Details
Seniority level: Mid‑Senior level
Employment type: Full‑time
Job function: Engineering and IT
Industries: Software Development