ZipRecruiter

Head of Site Reliability Engineering at ChipStack

ZipRecruiter, San Jose

Job Description

Locations • San Jose, CA – On‑site • Full‑time • Engineering

About ChipStack

Chips power everything, yet chip‑design tooling hasn’t kept up with their exploding complexity. ChipStack reinvents verification with AI software already in use at 10+ semiconductor innovators. Backed by Khosla Ventures, Cerberus, and Clear Ventures, our small, fast team ships at the intersection of AI, EDA, and systems engineering.

The Opportunity

We need rock‑solid, low‑latency deployments, often inside customer data centers with no internet egress. As our first dedicated reliability owner, you’ll design, automate, and operate these hybrid/on‑prem environments so customers experience “five nines” availability without touching the underlying plumbing.

What You’ll Do

  • Own end‑to‑end reliability – architect, deploy, and monitor production clusters (on‑prem & cloud) running our Python/TypeScript micro‑services, LLM workloads, and GPU back‑ends.
  • Automate the stack – build IaC pipelines (Terraform), GitOps workflows, and zero‑downtime rollout strategies.
  • Observe & respond – instrument apps with Prometheus/Grafana, set SLOs/SLIs, lead incident response, perform root‑cause analysis, and harden runbooks.
  • Secure & comply – implement network segmentation, secrets management, RBAC, and vulnerability scanning to satisfy strict semiconductor‑industry requirements.
  • Collaborate – pair with product engineers on performance profiling, scalability bottlenecks, and customer issue triage.
  • Continually improve – champion best practices in testing, CI/CD, and chaos drills to push our “ship fast, ship quality” culture.

Must‑Have Skills

  • 5+ years building and operating production systems as an SRE / DevOps / Platform Engineer.
  • Hands‑on expertise with Kubernetes and Docker in hybrid or bare‑metal setups.
  • Strong Python for automation tooling; proficiency reading TypeScript services.
  • Deep Linux administration knowledge (kernel tuning, networking, storage, security hardening).
  • Proven track record delivering 99.9%+ uptime for latency‑sensitive services.
  • Observability stack experience (Prometheus, Grafana, Loki / ELK, Alertmanager).
  • Proficiency with Terraform (or equivalent IaC) and Git‑based workflows.
  • Excellent communication and a bias for action when facing vague, first‑of‑its‑kind problems.

Nice‑to‑Have

  • Experience running GPU workloads, ML inference, or EDA toolchains in production.
  • Familiarity with air‑gapped / restricted‑network deployments and data‑center operations.
  • Exposure to security certifications (SOC 2, ISO 27001) or semiconductor customer audits.
  • Prior work at an early‑stage startup.

Our Culture (What You’ll Thrive In)

  • Challenge the status quo
  • Strong opinions, loosely held
  • Ship fast, ship quality
  • Proud of our craft

Ready to harden the infrastructure that will redefine chip design? Apply now and keep ChipStack running flawlessly for the world’s most advanced silicon teams.