TikTok

Data Center Incident Response Manager - San Jose

TikTok, San Jose, California, United States, 95199


About the Team

The Data Systems Infrastructure (DSI) team sits within the global technology structure and supports the company's fast growth by building and operating hyper‑scale datacenters, managing the lifecycle of the server fleet, providing cloud solutions, and developing infrastructure services that are scalable and reliable.

Job Description

We are seeking a technically skilled and detail‑oriented professional to serve as a front‑line responder for incident detection, triage, and response across infrastructure, facilities, and security operations. The ideal candidate will possess a solid foundation in IT, infrastructure, or engineering disciplines, with experience in critical environments and the ability to analyze incidents, identify patterns, and drive long‑term improvements. This role requires composed performance under pressure, data‑driven thinking, and a proactive approach to continuous improvement and operational resilience.

Responsibilities

Serve as the first responder in the IRC Operation Center, detecting and responding to events across infrastructure and facilities using tools such as server automation, Data Center Infrastructure Management (DCIM), network monitoring, Grafana, and related systems.

Respond promptly to events involving environmental systems (high temperature, humidity, power fluctuations or failures); IT infrastructure (server performance issues, network outages, system failures); facility alerts; and external‑facing services (colocation maintenance notices, CDN partner service requests, critical notifications).

Conduct detailed investigations to diagnose root causes of events, assess impacts, and determine appropriate response actions.

Monitor and analyze detected events, accurately classify incidents based on potential or actual customer impact, and proactively communicate risks.

Coordinate timely escalations by notifying and collaborating with relevant support teams to ensure swift incident resolution.

Monitor incident response performance against agreed SLAs, ensuring timely alerts and notifications.

Manage incidents calmly and efficiently, performing in‑depth investigations to determine root causes and impacts, while promptly engaging and coordinating with designated resolver teams.

Draft detailed incident reports and conduct post‑mortem reviews to document lessons learned.

Generate regular reports to deliver comprehensive insights into the effectiveness of incident response and recovery processes.

Analyze trends and patterns in events to identify opportunities for improvement and optimization.

Own and drive the Incident, Problem, and Change Management processes in alignment with ITIL or internal ITSM frameworks.

Develop and maintain a comprehensive library of SOPs, MOPs, runbooks, and operational guides to ensure consistency and readiness across teams.

Lead or support continuous improvement projects aimed at enhancing incident response capabilities, operational security, system reliability, and overall infrastructure performance.

Provide technical and operational leadership to the incident response center team, ensuring consistent performance and adherence to best practices.

Minimum Qualifications

Bachelor’s degree in Computer Science, Information Technology, Engineering, or a related technical field.

Strong technical background, with priority given to experience in Data Center Facility Operations Center (DC FOC) management. Experience in IT infrastructure, network operations, or systems monitoring is also desirable.

Proven ability to analyze complex systems, investigate incidents, and identify root causes effectively.

Familiarity with monitoring and alerting tools such as Grafana, Nagios, or similar platforms.

Experience in incident and problem management processes, with the ability to drive corrective actions and coordinate cross‑functional teams.

Excellent troubleshooting skills and the ability to work calmly under pressure during critical incidents.

Strong communication skills to draft reports, conduct reviews, and liaise with technical and non‑technical stakeholders.

Preferred Qualifications

5 years of experience in IT environments, such as data centers or enterprise systems, combined with hands‑on incident and problem management experience.

Proactive mindset with a focus on continuous improvement and operational excellence.

Proven ability to perform effectively under pressure and within tight time constraints to resolve issues and meet deliverables.

Hands‑on experience with ticketing systems, monitoring tools such as Grafana, server infrastructure, and data center systems.

Knowledge or certifications: ITIL Foundation, CompTIA Server+, Schneider Electric Data Center Certified Associate (DCCA), Cisco Certified Network Associate (CCNA), Project Management Professional (PMP), Data Analytics and Visualization tools or methodologies.

Experience driving or contributing to improvement projects focused on operational efficiency, security enhancements, or infrastructure reliability.

Ability to manage multiple tasks and projects, ensuring timely delivery and alignment with organizational goals.

Strong adaptability and problem‑solving skills in ambiguous and rapidly changing environments.

Willingness to be on call during weekends, nights, and holidays.

About TikTok

TikTok is the leading destination for short‑form mobile video. At TikTok, our mission is to inspire creativity and bring joy. TikTok's global headquarters are in Los Angeles and Singapore, and we also have offices in New York City, London, Dublin, Paris, Berlin, Dubai, Jakarta, Seoul, and Tokyo.

Why Join Us

Inspiring creativity is at the core of TikTok's mission. Our innovative product is built to help people authentically express themselves, discover and connect – and our global, diverse teams make that possible. Together, we create value for our communities, inspire creativity and bring joy – a mission we work toward every day.

Diversity & Inclusion

TikTok is committed to creating an inclusive space where employees are valued for their skills, experiences, and unique perspectives. Our platform connects people from across the globe and so does our workplace.

TikTok Accommodation

TikTok is committed to providing reasonable accommodations in our recruitment processes for candidates with disabilities, pregnancy, sincerely held religious beliefs or other reasons protected by applicable laws. If you need assistance or a reasonable accommodation, please reach out to us at https://tinyurl.com/RA-request.

Compensation & Benefits

The base salary range for this position in San Jose is $109,600 - $203,534 annually. Employees have access to medical, dental, and vision insurance, a 401(k) savings plan with company match, paid parental leave, short‑term and long‑term disability coverage, life insurance, wellbeing benefits, 10 paid holidays per year, 10 paid sick days, and 17 days of paid personal time.

EEO Statement

TikTok is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, age, disability, or any other protected status.
