Anyscale

Software Engineer (Site Reliability Engineer)

Anyscale, San Francisco, California, United States, 94199

About Anyscale

At Anyscale, we're on a mission to democratize distributed computing and make it accessible to software developers of all skill levels. We're commercializing Ray, a popular open-source project that's creating an ecosystem of libraries for scalable machine learning. Companies like OpenAI, Uber, Spotify, Instacart, Cruise, and many more have Ray in their tech stacks to accelerate the progress of AI applications out into the real world. With Anyscale, we're building the best place to run Ray, so that any developer or data scientist can scale an ML application from their laptop to the cluster without needing to be a distributed systems expert.

We are proud to be backed by Andreessen Horowitz, NEA, and Addition, with $250+ million raised to date.

About The Role

As a Site Reliability Engineer, you will play a crucial role in ensuring the smooth operation of all user-facing services and other Anyscale production systems. This includes processes for provisioning, negotiating prices, managing costs, and identifying opportunities for teams to reduce wastage by finding applications across the company. You will apply sound engineering principles, operational discipline, and mature automation to our environments and the Anyscale codebase as we scale.

Responsibilities:

- Develop a unified perspective on how cloud components are utilized across the company, considering diverse needs and requirements.
- Ensure deployment methodologies align with the company's reliability goals.
- Build systems that promote understanding of production environments, enabling quick issue identification through robust observability infrastructure for metrics, logging, and tracing.
- Create monitoring and alerting systems at different levels, allowing teams to contribute and enhance overall monitoring capabilities.
- Establish testing infrastructure to support effective writing and execution of tests.
- Develop tools for measuring service level objectives (SLOs) and define organization-wide SLOs.
- Implement best practices and on-call systems to ensure efficient incident management and improve incident response processes.
- Coordinate the creation and deployment of cloud-based services, including tracking deployments and establishing effective communication channels for issue resolution.

Qualifications:

- At least 3 years of relevant work experience in a similar role.

Compensation & Benefits:

We offer a market-based compensation approach, including:

- Stock Options
- Healthcare plans, with premiums covered by Anyscale at 99%
- 401k Retirement Plan
- Wellness stipend
- Education stipend
- Paid Parental Leave
- Flexible Time Off
- Commute reimbursement
- 100% coverage of in-office meals

We are an Equal Opportunity Employer. We value diversity and inclusion at our company, and we encourage applications from individuals of all backgrounds.

Additional Details

Seniority level: Mid-Senior level
Employment type: Full-time
Job function: Engineering and Information Technology
Industry: Software Development
