ZipRecruiter

Head of Site Reliability Engineering at ChipStack (San Jose)


Job Description

Locations: San Jose, CA | Onsite | Full-time | Engineering

About ChipStack

Chips power everything, yet chip-design tooling hasn't kept up with the exploding complexity. ChipStack reinvents verification with AI software already in use at 10+ semiconductor innovators. Backed by Khosla Ventures, Cerberus, and Clear Ventures, our small, fast team ships at the intersection of AI, EDA, and systems engineering.

The Opportunity

We need rock-solid, low-latency deployments, often inside customer data centers with no internet egress. As our first dedicated reliability owner, you'll design, automate and operate these hybrid/on-prem environments so customers experience "five nines" availability without touching the underlying plumbing.

What You'll Do

Own end-to-end reliability: architect, deploy, and monitor production clusters (on-prem & cloud) running our Python/TypeScript microservices, LLM workloads and GPU backends.
Automate the stack: build IaC pipelines (Terraform), GitOps workflows and zero-downtime rollout strategies.
Observe & respond: instrument apps with Prometheus/Grafana, set SLOs/SLIs, lead incident response, perform root-cause analysis, and harden runbooks.
Secure & comply: implement network segmentation, secrets management, RBAC and vulnerability scanning to satisfy strict semiconductor-industry requirements.
Collaborate: pair with product engineers on performance profiling, scalability bottlenecks and customer issue triage.
Continually improve: champion best practices in testing, CI/CD, and chaos drills to push our "ship fast, ship quality" culture.

Must-Have Skills

5+ years building and operating production systems as an SRE / DevOps / Platform Engineer.
Hands-on expertise with Kubernetes and Docker in hybrid or bare-metal setups.
Strong Python for automation tooling; proficiency reading TypeScript services.
Deep Linux administration knowledge (kernel tuning, networking, storage, security hardening).
Proven track record delivering 99.9%+ uptime for latency-sensitive services.
Observability stack experience (Prometheus, Grafana, Loki / ELK, Alertmanager).
Proficiency with Terraform (or equivalent IaC) and Git-based workflows.
Excellent communication and a bias for action when facing vague, first-of-its-kind problems.

Nice-to-Have

Experience running GPU workloads, ML inference or EDA toolchains in production.
Familiarity with air-gapped / restricted-network deployments and data-center operations.
Exposure to security certifications (SOC 2, ISO 27001) or semiconductor customer audits.
Prior work at an early-stage startup.

Our Culture (What You'll Thrive In)

Challenge the status quo
Strong opinions, loosely held
Ship fast, ship quality
Proud of our craft

Ready to harden the infrastructure that will redefine chip design? Apply now and keep ChipStack running flawlessly for the world's most advanced silicon teams.
