Head of Site Reliability Engineering at ChipStack
About ChipStack
Chips power everything, yet chip‑design tooling hasn’t kept up with the exploding complexity. ChipStack reinvents verification with AI‑native software already in use at 10+ semiconductor innovators. Backed by Khosla Ventures, Cerberus, and Clear Ventures, our small, fast team ships at the intersection of AI, EDA, and systems engineering.
Locations
- San Jose, CA – On‑site
Employment Type
Full‑time
Job Function
Engineering and Information Technology
Industries
Software Development
The Opportunity
We need reliable, low‑latency deployments—often inside customer data centers with no internet egress. As our first dedicated reliability owner, you’ll design, automate, and operate these hybrid/on‑prem environments to ensure “five nines” availability without touching the underlying plumbing.
What You’ll Do
- Own end‑to‑end reliability by architecting, deploying, and monitoring production clusters (on‑prem & cloud) running micro‑services, LLM workloads, and GPU back‑ends.
- Automate the stack with IaC pipelines (Terraform), GitOps workflows, and zero‑downtime rollout strategies.
- Observe & respond by instrumenting apps with Prometheus/Grafana, setting SLOs/SLIs, leading incident response, performing root‑cause analysis, and hardening runbooks.
- Secure & comply by implementing network segmentation, secrets management, RBAC, and vulnerability scanning to meet industry standards.
- Collaborate with product engineers on performance, scalability, and customer issues.
- Improve practices in testing, CI/CD, and chaos engineering to promote a fast, quality-focused culture.
Must‑Have Skills
- 5+ years building and operating production systems as an SRE/DevOps/Platform Engineer.
- Hands‑on experience with Kubernetes and Docker in hybrid or bare‑metal setups.
- Strong Python skills; proficiency with TypeScript services.
- Deep Linux administration knowledge (kernel tuning, networking, storage, security hardening).
- Track record of delivering 99.9%+ uptime for latency-sensitive services.
- Experience with observability tools (Prometheus, Grafana, Loki/ELK, Alertmanager).
- Proficiency with Terraform or similar IaC tools and Git workflows.
- Excellent communication skills and a proactive approach to solving first‑of‑its‑kind problems.
Nice‑to‑Have
- Experience with GPU workloads, ML inference, or EDA toolchains in production.
- Familiarity with air‑gapped/restricted‑network deployments and data‑center operations.
- Knowledge of security certifications (SOC 2, ISO 27001) or semiconductor customer audits.
- Experience working at an early-stage startup.
Our Culture
- Challenge the status quo
- Strong opinions, loosely held
- Ship fast, ship quality
- Proud of our craft
Ready to harden the infrastructure that will redefine chip design? Apply now and help keep ChipStack running flawlessly for the world’s most advanced silicon teams.
Additional Details
Seniority level: Mid‑Senior level