FAR.AI

Infrastructure Engineer

FAR.AI — Work From Home


About FAR.AI

FAR.AI is a non‑profit AI research institute dedicated to ensuring advanced AI is safe and beneficial for everyone. Our mission is to facilitate breakthrough AI safety research, advance global understanding of AI risks and solutions, and foster a coordinated global response. Since our founding in July 2022, we have grown to more than 30 staff, produced over 40 influential academic papers, and hosted leading AI safety events. We collaborate with researchers, industry partners, and government institutions to drive practical change through red‑teaming, research roadmaps, and community support.

About the Role

We are seeking an Infrastructure Engineer to develop and manage scalable infrastructure that supports our research workloads. You will own our Kubernetes cluster, deployed on bare‑metal H100 cloud instances, and evolve it to meet new needs such as multi‑node LoRA training, onboarding a research team that is doubling in size, and advanced compute‑usage tracking.

Responsibilities

  • Empower researchers by delivering a reliable, easy‑to‑use compute cluster, enabling daily tasks such as debugging batch experiments and launching interactive devboxes.
  • Maintain and enhance in‑cluster services, including backups, experiment tracking, and a custom LLM‑based cluster support bot.
  • Ensure >95% uptime by proactively monitoring cluster stability and addressing issues that could interrupt research workloads.
  • Track the cloud GPU market, assist leadership with vendor comparisons, and recommend the most effective compute platforms.
  • Implement security measures to protect cluster integrity, including isolation for sensitive workloads, timely patching, and secure deployment workflows.
  • Champion security across FAR.AI, maintaining and extending our mobile device management (MDM) system and promoting secure practices.
  • Support novel ML workloads by architecting the cluster for new types of work, designing bespoke infrastructure solutions, and improving observability of GPU utilization.

Requirements

  • Experience with Kubernetes or comparable systems administration.
  • Strong curiosity and ability to quickly learn a new domain.
  • Self‑directed, comfortable with ambiguous or rapidly evolving requirements.
  • Willingness to provide on‑call support during working hours for cluster issues.
  • Interest in improving security posture by implementing policies.
  • Experience supporting ML/AI workloads and administering GPU clusters.
  • Prior work in research environments or startups is preferred.
  • Knowledge of SAML/SCIM and experience automating access control will be an advantage.

Preferred Qualifications

  • Experience with large‑scale multi‑node training (e.g., InfiniBand).
  • Familiarity with DevOps best practices for research teams.
  • Background in security operations or mobile device management.

Example Projects

  • Configure the cluster and user‑space development environments to support InfiniBand nodes.
  • Improve the default devbox K8s pod template with best‑practice workflows.
  • Roll out a new mobile device management system to meet corporate security requirements.
  • Streamline onboarding for new cluster users across time zones.
  • Manage permissions and access control for FAR.AI team members, automating via SAML or SCIM where appropriate.
  • Analyze backup and disaster‑recovery storage patterns and propose improvements.

Logistics

  • Location: Remote or in‑person in Berkeley, CA (at least 2 hours of working‑hours overlap with the team required). Visa sponsorship available for California‑based employees.
  • Hours: Full‑time, 40 hours per week.
  • Compensation: $100,000 – $175,000 per year, location‑adjusted. Travel and equipment expenses covered. Berkeley office provides lunch and dinner.
  • Application Process: Programming assessment, screening call, two 1‑hour interviews, one‑week paid work trial.

Contact: Please apply directly via the online application form; we do not accept resumes sent by email.

