Whatfix
Senior Software Engineer – Reliability & Kubernetes (E5)
Location: San Jose, CA (Onsite)
We are looking for an experienced Software Engineer (E5) who is passionate about building systems that are resilient, observable, and designed for scale from day one. This role sits within our Reliability Engineering charter and focuses on strengthening the core platform that powers all Whatfix products, including our next‑generation AI offerings.
You will design and implement reliability frameworks, evolve our Kubernetes‑based infrastructure, and create automation that allows engineering teams to operate their services with confidence. This is a senior individual contributor role where you will directly influence system architecture, lead reliability initiatives across teams, and mature the technical foundations required to support our enterprise and federal customers.
Candidate must be authorized to work in the United States on a full‑time basis without employer sponsorship, either now or in the future.
What You’ll Own
Architect and deliver platform components that improve reliability, fault tolerance, and system performance
Build reusable tooling and automation to reduce manual operations and scale reliability practices across engineering
Lead the design and rollout of observability and monitoring frameworks that give teams deep visibility into their services
Serve as a technical escalation point for critical incidents and drive long‑term remediation through blameless RCAs
Strengthen our Kubernetes platform with better automation, deployment workflows, and resource efficiency
Partner with engineering, platform, and product teams to define SLIs/SLOs and embed them into how we operate services
Support on‑prem and regulated environment deployments by ensuring high availability and compliance requirements are met
What You’ll Bring
Strong hands‑on programming experience in Java (Python or Go is a bonus)
Expertise running and scaling Kubernetes workloads in production environments
Experience with GitOps practices and tooling (ArgoCD, Helm)
Strong grounding in CI/CD, infrastructure as code, and automated deployment pipelines
Background in observability (metrics, logs, traces) and designing systems that are measurable and diagnosable
Proven experience driving post‑incident reviews and converting findings into permanent engineering improvements
Ability to break down complex distributed systems problems into practical, high‑impact solutions
Nice‑to‑Have Experience
Log aggregation tools or stacks (e.g., ELK)
Chaos engineering or resilience testing approaches
Building internal developer platforms or reliability frameworks
Exposure to large‑scale or regulated enterprise environments
Who Thrives in This Role
Engineers who enjoy working across systems, infrastructure, and platform layers
ICs who like solving ambiguous problems and setting high technical standards
People who think in automation, self‑healing patterns, and long‑term system health
Engineers who want their work to directly influence the reliability posture of company‑wide products
Soft Skills That Matter
Strong ownership and problem‑solving mindset
Ability to collaborate across multiple engineering groups
Clear communication, especially during high‑pressure incident scenarios
Mentoring and uplifting other engineers through reviews, patterns, and best practices
Uncapped incentives
Equity plan
Mac shop, work with the newest technologies
Unlimited PTO policy
Paid maternity/paternity leave
Monthly cell phone stipend
Medical, Dental, and Vision coverage (Whatfix pays 80% of the premium for individuals and their families; for the HSA, Whatfix contributes $1,000 for individuals and $2,000 for a family)
Team and company outings
Learning and Development benefits
At Whatfix, we value collaboration, innovation, and human connection. We believe that working together in the office five days a week fosters open communication, strengthens our community, and drives innovation, helping us achieve our goals more effectively.
To facilitate global collaboration, our US teams start and end early, while our India teams start and end late. US teams do not have any evening meetings. Relocation and Sponsorship offered.
Whatfix is an Equal Opportunity Employer and an E‑Verify participant. All activities must comply with applicable Equal Opportunity laws, the ADA, and other regulations, as appropriate.
We are an equal opportunity employer and value diverse people because of, not in spite of, their differences. We do not discriminate on the basis of race, religion, color, national origin, ethnicity, gender, sexual orientation, age, marital status, veteran status, or disability status.