ServiceNow

Senior Staff Machine Learning Engineer - DevOps/Site Reliability Engineer

ServiceNow, Santa Clara, California, US, 95053

Full-time
Employee Type: Regular
Region: AMS - North America and Canada
Work Persona: Flexible

It all started in sunny San Diego, California in 2004 when a visionary engineer, Fred Luddy, saw the potential to transform how we work. Fast forward to today: ServiceNow stands as a global market leader, bringing innovative AI-enhanced technology to over 8,100 customers, including 85% of the Fortune 500. Our intelligent cloud-based platform seamlessly connects people, systems, and processes to empower organizations to find smarter, faster, and better ways to work. Join us as we pursue our purpose to make the world work better for everyone.

This position requires passing a ServiceNow background screening, USFedPASS (US Federal Personnel Authorization Screening Standards). This includes a credit check, a criminal/misdemeanor check, and a drug test. Employment is contingent upon passing the screening. Due to Federal requirements, only US citizens, US naturalized citizens, or US Permanent Residents with a green card will be considered.

Please note that this role requires you to be in our Santa Clara office for two days per week.

PLATO (Platform Engineering and AI Technology Organization) at ServiceNow is a customer-focused innovative group building intelligent software using various technology stacks to enable industry-leading work experiences for our customers. We are deeply invested in our customers' success, with expertise in advanced technologies and software engineering best practices. We prioritize robustness, performance, and user experience over the specific technology stack and tools. We are a team of technology professionals and platform engineers with a dual mission: to build and evolve the AI platform and to partner with teams to develop products and end-to-end AI-powered work experiences. We also focus on foundational research, experimentation, and de-risking AI technologies to unlock future work experiences.
As a Senior Staff Machine Learning Engineer - Site Reliability Engineer, you will:

Contribute to designing, developing, and implementing infrastructure, platform, deployment, and observability features that support AI workloads.
Collaborate with researchers, AI engineers, and infrastructure teams to ensure GPU clusters perform efficiently, scale effectively, and remain reliable.
Enhance the SRE practice by transforming operational use cases into software tooling requirements.
Assist in deployment and support activities for AI/ML developers.
Write high-quality, scalable, and reusable code, adhering to best practices in software engineering (e.g., code reviews, unit testing).
Work with product owners to understand detailed requirements and manage your code from design to implementation, testing, automation, and delivery.
Experience with operating Large Language Models (LLMs) on NVIDIA GPUs.
Mentor colleagues and promote knowledge sharing.

To be successful in this role, you should have:

Experience integrating AI into workflows, decision-making, or problem-solving, including automating workflows or analyzing AI-driven insights.
Proficiency in prompt engineering and developing LLM-based features.
Experience with training and fine-tuning large language models, such as distillation, supervised fine-tuning, and policy optimization.
Experience with AI productivity tools like Cursor, Windsurf, etc.
8+ years in infrastructure and platform operations, deployments, SRE, and DevOps, with a focus on platform health.
6+ years operating highly available distributed workloads on Kubernetes using a DevOps approach.
6+ years developing with Python, GoLang, Java, or similar languages.
Experience with DevOps tools (e.g., Helm, Ansible, Kubernetes, Prometheus, Splunk, GitLab CI).
Strong experience managing distributed systems on Linux and J2EE.
Knowledge of software-defined networking, infrastructure as code, and configuration management.
Experience building secure, compliant software for regulated environments.
Ability to lead projects with significant technical risks to successful outcomes.

We support flexible and trusting work arrangements. Work personas (flexible, remote, in-office) are assigned based on work nature and location.

Learn more here. ServiceNow may verify your proximity to an office using a third-party service.

Equal Opportunity Employer

ServiceNow is committed to diversity and inclusion. All qualified applicants will receive consideration regardless of race, color, creed, religion, sex, sexual orientation, national origin, age, disability, gender identity or expression, marital status, veteran status, or any other protected category. We also consider applicants with arrest or conviction records in accordance with legal requirements.

Accommodations

If you need a reasonable accommodation during the application process or cannot use the online application, contact [emailprotected] for assistance.

Export Control Regulations

Employment may be contingent upon obtaining necessary export licenses if required by law, including the U.S. Export Administration Regulations (EAR).
