Selby Jennings

Lead Engineer for HPC Hardware

Selby Jennings, Dallas

We're seeking a Lead Engineer for High-Performance Compute Hardware to take ownership of a large-scale compute infrastructure built on GPU and CPU systems at a leading trading firm. In this role, you'll be responsible for the design, expansion, and performance tuning of a platform that powers intensive computational workloads across research and engineering teams.
You'll work in close partnership with colleagues across infrastructure, data center operations, AI/ML, security, and software development to ensure the platform is robust, efficient, and ready to scale. This position emphasizes automation, hardware performance, and operational excellence, with a strong focus on mentorship and long-term infrastructure strategy.

What You'll Be Doing

      • Architect and oversee a high-throughput compute environment
      • Expand and optimize infrastructure to support growing technical demands
      • Manage a bare-metal provisioning stack, with emphasis on OpenStack Ironic
      • Continuously monitor system health and implement performance improvements
      • Establish and refine operational procedures to reduce downtime and hardware faults
      • Conduct diagnostics, performance tuning, and capacity forecasting
      • Review and enhance hardware lifecycle workflows
      • Collaborate across teams to align infrastructure with broader technical goals
      • Apply security best practices to hardware and platform-level systems
      • Guide and mentor junior team members, fostering a culture of technical growth

What You Bring

      • Hands-on experience managing complex HPC environments at scale
      • In-depth understanding of server architecture, including compute, memory, storage, and networking components
      • Strong background in bare-metal provisioning and infrastructure-as-code practices
      • Proven ability to troubleshoot and resolve hardware issues in production environments
      • Familiarity with automation frameworks such as Ansible, Puppet, or Chef
      • Experience with out-of-band (BMC) management interfaces and APIs (e.g., Redfish, IPMI, iDRAC, iLO)
      • Skills in system tuning, diagnostics, and capacity planning
      • Knowledge of thermal and power efficiency in data center environments
      • Awareness of hardware-level security practices
      • Strong analytical and communication skills, with a collaborative mindset

Bonus Points For

      • Experience in hyperscale or large compute cluster environments
      • Knowledge of high-speed networking technologies (e.g., InfiniBand, 100GbE)
      • Familiarity with Linux systems and scripting languages (Python, Bash, PowerShell)
      • Exposure to OpenStack or similar cloud infrastructure platforms
      • Experience with GPU management and debugging tools (e.g., nvidia-smi)
      • Prior leadership experience in mentoring or managing technical teams