IKR Enterprises

Senior / Staff Site Reliability Engineer, San Francisco

IKR Enterprises, San Francisco, California, United States, 94199

Job Title: Senior / Staff Site Reliability Engineer

Location: San Francisco or New York City (Hybrid, 3+ days/week in-office)

Compensation: $211,000 – $248,000 base salary + competitive equity

About the Company

High-growth, AI-focused healthcare technology company building secure, cloud-native products used by clinicians and health systems. The team cares about reliability, performance, and enabling engineers to move fast without breaking things.

Role Overview

We’re hiring a Senior / Staff Site Reliability Engineer to focus on application performance, reliability, and platform scale. This role sits between backend engineering and platform/SRE: you will design and run load and chaos tests, use observability and profiling tools to find bottlenecks, and then drive the code and infrastructure changes needed to fix them.

You’ll often embed with product teams for weeks or months at a time, helping them rehome services to more scalable infrastructure and adopt better SLOs, error budgets, and incident practices as the platform grows.

What You’ll Do

Use load testing, chaos engineering, and other test practices to identify performance and latency issues across services, and fix them in application code

Drive software changes that move applications to more scalable infrastructure (runtimes, event-driven systems, databases, multi-tenant setups)

Tune application and infrastructure configuration to improve performance and scalability

Build internal tools and modules that help engineers ship more safely and quickly, with better defaults

Work with the Platform team to shape and roll out elements of the internal developer platform (service templates, self-serve infrastructure, etc.)

Partner with application teams to define and adopt SLOs, error budgets, and health metrics that support canary releases and better monitoring

Improve incident response by strengthening observability, dashboards, runbooks, and on-call practices

Document patterns, run training, and coach teams on cloud-native and performance best practices

Represent the work in the broader platform/SRE community (talks, OSS, etc.) as interest and time allow

Tech Stack

Kubernetes, Terraform, GCP

Python, TypeScript

Datadog, Grafana, and related observability tools

Requirements

5–10 years of experience as a backend or platform/SRE engineer working on distributed systems or developer tooling

At least 2 years focused on system performance and scalability at the application layer

Experience improving reliability and scalability of production systems (e.g., service migrations, database performance work, resilience improvements)

Proven examples of reducing latency by multiples using observability and profiling tools

Hands-on experience building on Kubernetes and scaling compute services on Kubernetes

Experience with at least one major cloud provider (GCP preferred)

Strong skills in Python and TypeScript (or strong in one and willing to ramp on the other)

Experience using Datadog, Grafana, or similar tools for monitoring and alerting

Clear interest in reliability, scalability, and engineering enablement (not just product features)

Willingness and ability to work in person at least 3 days per week in San Francisco or New York City

What We’re Looking For (Green Flags)

Core skill: application-layer performance optimization with real examples of cutting latency by multiples

Domain expertise: Kubernetes + Python/TypeScript + distributed systems scaling experience

Scale experience: has supported systems during rapid growth or major traffic spikes

Success pattern: has migrated production systems, improved database performance, or built internal developer tooling that changed how teams work

Traits That Are Not a Fit

Mainly interested in backend product features rather than reliability and platform work

Multiple short tenures (less than ~2 years in most recent roles)

High ego or low interest in cross-team collaboration

Fewer than 5–6 total years of software engineering experience

If you like solving hard performance problems, improving reliability at scale, and raising the bar for how teams ship software, we’d like to hear from you.
