Prime Intellect, Inc.

Member of Technical Staff - GPU Infrastructure

Prime Intellect, Inc., San Francisco, California, United States, 94199


Building the Future of Decentralized AI Development

At Prime Intellect, we're enabling the next generation of AI breakthroughs by helping our customers deploy and optimize massive GPU clusters. As our Solutions Architect for GPU Infrastructure, you'll be the technical expert who transforms customer requirements into production-ready systems capable of training the world's most advanced AI models. We recently raised $15M (for a total of $20M raised) in a round led by Founders Fund, with participation from Menlo Ventures and prominent angels including Andrej Karpathy (Eureka AI, Tesla, OpenAI), Tri Dao (Chief Scientific Officer of Together AI), Dylan Patel (SemiAnalysis), Clem Delangue (Hugging Face), Emad Mostaque (Stability AI), and many others.

Core Technical Responsibilities

This customer-facing role combines deep technical expertise with hands-on implementation. You'll be instrumental in:

Customer Architecture & Design

Partner with clients to understand workload requirements and design optimal GPU cluster architectures

Create technical proposals and capacity plans for clusters ranging from 100 to 10,000+ GPUs

Develop deployment strategies for LLM training, inference, and HPC workloads

Present architectural recommendations to technical and executive stakeholders

Infrastructure Deployment & Optimization

Deploy and configure orchestration systems including SLURM and Kubernetes for distributed workloads

Implement high-performance networking with InfiniBand, RoCE, and NVLink interconnects

Optimize GPU utilization, memory management, and inter-node communication

Configure parallel filesystems (Lustre, BeeGFS, GPFS) for optimal I/O performance

Tune system performance from kernel parameters to CUDA configurations

Production Operations & Support

Serve as primary technical escalation point for customer infrastructure issues

Diagnose and resolve complex problems across the full stack - hardware, drivers, networking, and software

Implement monitoring, alerting, and automated remediation systems

Provide 24/7 on-call support for critical customer deployments

Create runbooks and documentation for customer operations teams

Technical Requirements

Required Experience

3+ years of hands-on experience with GPU clusters and HPC environments

Deep expertise with SLURM and Kubernetes in production GPU settings

Proven experience with InfiniBand configuration and troubleshooting

Strong understanding of NVIDIA GPU architecture, CUDA ecosystem, and driver stack

Experience with infrastructure automation tools (Ansible, Terraform)

Proficiency in Python, Bash, and systems programming

Track record of customer-facing technical leadership

Infrastructure Skills

NVIDIA driver installation and troubleshooting (CUDA, Fabric Manager, DCGM)

Container runtime configuration for GPUs (Docker, Containerd, Enroot)

Linux kernel tuning and performance optimization

Network topology design for AI workloads

Power and cooling requirements for high-density GPU deployments

Nice to Have

Experience with 1,000+ GPU deployments

NVIDIA DGX, HGX, or SuperPOD certification

Distributed training frameworks (PyTorch FSDP, DeepSpeed, Megatron-LM)

ML framework optimization and profiling

Experience with AMD MI300 or Intel Gaudi accelerators

Contributions to open-source HPC/AI infrastructure projects

Growth Opportunity

You'll work directly with customers pushing the boundaries of AI, from startups training foundation models to enterprises deploying massive inference infrastructure. You'll collaborate with our world-class engineering team while having a direct impact on systems powering the next generation of AI breakthroughs.

We value expertise and customer obsession. If you're passionate about building reliable, high-performance GPU infrastructure and have a track record of successful large-scale deployments, we want to talk to you. Apply now and join us in our mission to democratize access to planetary-scale computing.
