CoreWeave
CoreWeave is the AI Hyperscaler, delivering a cloud platform of cutting-edge services powering the next wave of AI. Our technology provides enterprises and leading AI labs with the most performant, efficient, and resilient solutions for accelerated computing.
As the leader in the industry, we thrive in an environment where adaptability and resilience are key. Our culture offers career-defining opportunities for those who excel amid change and challenge. If you’re someone who thrives in a dynamic environment, enjoys solving complex problems, and is eager to make a significant impact, CoreWeave is the place for you.
The Fleet Reliability Operations team is responsible for the day-to-day provisioning, management, and uptime of CoreWeave’s ever-expanding fleet of server nodes. We are seeking curious, creative, and persistent problem solvers to join the team and help drive batches of server nodes through our provisioning and validation processes while efficiently troubleshooting node or cluster problems as they arise.
Responsibilities include:
Configure and maintain large-scale high-performance supercomputing clusters running state-of-the-art GPUs
Troubleshoot hardware and software issues; escalate and coordinate as needed with data center, network, hardware, and platform teams to drive resolution
Monitor and analyze system performance and take appropriate remediation actions to maintain cloud health
Approach your work with flexibility and optimism, anticipating shifting business and technical priorities
Create and maintain documentation of team processes, knowledge, and best practices for system management
Think critically about your day-to-day work and work collaboratively to improve team processes and efficiency
Participate in on-call rotations, which include after-hours and weekend work
Minimum Qualifications:
Strong understanding of Linux system administration and internals
Ability to troubleshoot hardware and software issues and perform system maintenance tasks consistently and reliably
Experience with software development or scripting languages (Bash, Python, PowerShell, etc.)
Preferred Qualifications:
2+ years of experience troubleshooting or administering data center or on-prem infrastructure (servers, storage, network, or a mix)
Grafana, Prometheus, PromQL queries, or similar observability platforms
Data center environments, including server racks, HVAC systems, fiber trays
Kubernetes administration
HPC - administering GPU-related workloads
Bachelor’s degree in a related field or equivalent experience
We offer a competitive salary range of $83,000 to $110,000, depending on job-related knowledge, skills, experience, and market location. Our total rewards package includes a discretionary bonus, equity awards, and a comprehensive benefits program.
Benefits include:
Medical, dental, and vision insurance - 100% paid for by CoreWeave
Company-paid Life Insurance
Voluntary supplemental life insurance
Short- and long-term disability insurance
Flexible Spending Account
Health Savings Account
Tuition Reimbursement
Ability to Participate in Employee Stock Purchase Program (ESPP)
Mental Wellness Benefits through Spring Health
Family-Forming support provided by Carrot
Paid Parental Leave
Flexible, full-service childcare support with Kinside
401(k) with a generous employer match
Flexible PTO
Catered lunch each day in our office and data center locations
A casual work environment
A work culture focused on innovative disruption