IBM

Senior SRE

IBM, San Jose, California, United States, 95199

Introduction

A career in IBM Software means you'll be part of a team that transforms our customers' challenges into solutions. Seeking new possibilities and always staying curious, we are a team dedicated to creating the world's leading AI-powered, cloud-native software solutions for our customers. Our renowned legacy creates endless global opportunities for our IBMers, so the door is always open for those who want to grow their career.

IBM's product and technology landscape includes Research, Software, and Infrastructure. Entering this domain positions you at the heart of IBM, where growth and innovation thrive.

Your role and responsibilities

We are seeking a Senior Customer Support / SRE engineer to join the team responsible for delivering Astra Streaming (Apache Pulsar as a Service). You will help our users succeed by resolving complex incidents, improving service reliability, and driving operational excellence across environments.

You will work closely with engineering, product, and customer support teams to ensure the Astra Streaming platform runs with high availability, low latency, and predictable performance, so that it meets and exceeds enterprise workload expectations.

Key Responsibilities

Serve as Tier 2 / Tier 3 escalation point for customer-reported incidents, performance issues, and operational anomalies.

Troubleshoot issues across the full stack

Develop and maintain runbooks, monitoring dashboards, and alerting rules

Participate in and improve the on-call rotation, including leading incident response and post-mortems when necessary

Collaborate with Engineering to identify root causes and drive fixes for long-term improvements

Implement SLOs, SLIs and error budgets to ensure platform reliability aligns with customer expectations

Automate common tasks (toil)

Contribute to and lead observability and telemetry improvements (Prometheus, Grafana, Thanos, or equivalent).

Provide detailed and empathetic customer communication during incidents and post-incident reviews.

Act as a voice of the customer in reliability, scalability, and usability discussions

Mentor junior support and operations engineers

Success in this Role

In the first six months, success means:

Handling escalations independently and guiding complex incident responses.

Improving MTTR through new automation or monitoring enhancements.

Earning customer trust by delivering transparent communication and reliable resolution

Identifying recurring failure modes and driving engineering changes to eliminate them.

Required technical and professional expertise

5+ years of experience in SRE, DevOps, or Production Engineering for large-scale distributed systems.

Deep understanding of Apache Pulsar, Apache BookKeeper, or similar messaging systems (Kafka, RabbitMQ)

Experience operating Pulsar clusters in Kubernetes in public clouds

Solid troubleshooting skills across Linux, networking, JVM-based applications, and containers / Kubernetes as a service.

Strong knowledge of monitoring, logging, and tracing tools (Prometheus, Grafana, Splunk, etc.)

Preferred technical and professional experience

Experience contributing to open-source Apache Pulsar or BookKeeper

Familiarity with multi-tenant architectures and managed-service operations

Experience with IaC and GitOps workflows

IBM is committed to creating a diverse environment and is proud to be an equal-opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, gender, gender identity or expression, sexual orientation, national origin, caste, genetics, pregnancy, disability, neurodivergence, age, veteran status, or other characteristics. IBM is also committed to compliance with all fair employment practices regarding citizenship and immigration status.
