Machine Learning Systems Administrator - HPC Infrastructure
ZipRecruiter - Palo Alto 3 days ago
Job Description: Zyphra is an artificial intelligence company based in Palo Alto, California. The R...
Senior Linux Infrastructure Engineer (HPC)
The Voleon Group - San Francisco 1 day ago
Join to apply for the Senior Linux Infrastructure Engineer (HPC) role at The Voleon Group 5 days ago Be ...
Senior Systems Administrator
Together AI - San Francisco 3 days ago
Join to apply for the Senior Systems Administrator role at Together AI 2 days ago Be among the first 25 applicants Join to apply for the Senior Sys...
Senior Linux Infrastructure Engineer (IaC)
The Voleon Group - San Francisco 1 day ago
Join to apply for the Senior Linux Infrastructure Engineer (IaC) role at The Voleon Group Join to apply ...
Senior Systems Administrator (HPC Lab Integration & Development)
Fuse Engineering - Annapolis Junction, Maryland, United States, 20701 3 days ago
Task Description. The System Administrator shall be responsible for providing network management and systems administration to maintain the customer s...
System Administrator
Bear River Mutual Insurance - Murray 1 day ago
Join to apply for the System Administrator role at Bear River Mutual 3 days ago Be among the first 25 applicants Join to apply for the System Admin...
System Administrator
Bear River Mutual Insurance - Salt Lake City, Utah, United States, 84193 3 days ago
Join to apply for the System Administrator role at Bear River Mutual 3 days ago Be among the first 25 applicants Join to apply for the System Administ...
HPC Systems Administrator (Columbus, OH)
Conexess - Columbus 1 day ago
HPC Systems Administrator Position Overview: We are seeking an experienced HPC Administrator to manage, mai...
System Administrator
Bear River Mutual - Murray, Utah, United States 2 days ago
Join to apply for the System Administrator role at Bear River Mutual 3 days ago Be among the first 25 applicants Join to ap...
Senior HPC Systems Administrator (HR Title: Systems Administrator...
Southern Methodist University - Dallas 3 days ago
Job Description - Senior HPC Systems Administrator (HR Title: Systems Administrator III) (INF00000186) Job Title: Senior HPC Systems Administrator ...

ZipRecruiter

Machine Learning Systems Administrator - HPC Infrastructure

ZipRecruiter - Palo Alto

Overview

Job Description: Zyphra is an artificial intelligence company based in Palo Alto, California.

The Role:

As a Machine Learning Systems Administrator - HPC Infrastructure, you will be responsible for maintaining and developing the core infrastructure behind our machine learning research and production efforts. You’ll work closely with various training and inference teams to ensure the smooth operation of our systems while laying the groundwork for scalable, secure, and efficient workflows.

You’ll work across:

Administration and automation of our Linux-based cluster environments
Managing user onboarding/offboarding, security auditing, and access control
Monitoring system resources and job scheduling
Supporting and improving developer workflows (e.g., VSCode compatibility, Docker)
Enabling and supporting AI/ML workloads, including large-scale training jobs
You’re comfortable operating across a wide range of infrastructure concerns and excited to own and improve critical systems.
You’ll have a significant impact on both developer productivity and training and inference performance.

Requirements:

Strong experience with Linux system administration, user and access management, and automation
Demonstrated expertise in scripting for system tooling and automation (bash, Python, etc.)
Familiarity with containerized environments (e.g., Docker) and job scheduling systems like Slurm
Experience building tooling for cluster validation and reliability (GPU, networking, storage health checks)
Experience setting up and managing developer tools and third-party services (e.g., cloud storage providers, Docker Hub, Slack, Gmail, Telegraf, experiment trackers)
Excellent debugging and troubleshooting skills across compute, storage, and networking
Strong communication skills and ability to collaborate across technical and non-technical teams

Bonus Qualifications:

Experience with infrastructure as code (e.g., Ansible, Terraform)
Prior work supporting ML/AI infrastructure, including GPU management and workload optimization
Exposure to backend development for ML model serving (e.g., vLLM, Ray, SGLang)
Experience working with cloud platforms such as AWS, Azure, or GCP
Familiarity with containers (Docker, Apptainer) and their integration with scheduling systems (Slurm, Kubernetes)

Why Work at Zyphra:

Our research methodology is to make grounded, methodical steps toward ambitious goals. Deep research and engineering excellence are equally valued
We strongly value new and crazy ideas and are very willing to bet big on them
We move as quickly as we can; we aim to keep the bar to impact as low as possible
We all enjoy what we do and love discussing AI

Benefits and Perks:

Comprehensive medical, dental, vision, and FSA plans
Competitive compensation and 401(k)
Relocation and immigration support on a case-by-case basis
On-site meals prepared by a dedicated culinary team; Thursday Happy Hours
In-person team in Palo Alto, CA, with a collaborative, high-energy environment

If you are excited to bring reliability best practices to the frontier of AI infrastructure, this job is for you. Apply today!
