Selby Jennings

Lead Engineer for HPC Hardware

Selby Jennings, Dallas

We're seeking a Lead Engineer for High-Performance Compute Hardware to take ownership of a large-scale compute infrastructure built on GPU and CPU systems at this leading trading firm. In this role, you'll be responsible for the design, expansion, and fine-tuning of a platform that powers intensive computational workloads across research and engineering teams.

You'll work in close partnership with colleagues across infrastructure, data center operations, AI/ML, security, and software development to ensure the platform is robust, efficient, and ready to scale. This position emphasizes automation, hardware performance, and operational excellence, with a strong focus on mentorship and long-term infrastructure strategy.

What You'll Be Doing

Architect and oversee a high-throughput compute environment
Expand and optimize infrastructure to support growing technical demands
Manage a bare-metal provisioning stack, with emphasis on OpenStack Ironic
Continuously monitor system health and implement performance improvements
Establish and refine operational procedures to reduce downtime and hardware faults
Conduct diagnostics, performance tuning, and capacity forecasting
Review and enhance hardware lifecycle workflows
Collaborate across teams to align infrastructure with broader technical goals
Apply security best practices to hardware and platform-level systems
Guide and mentor junior team members, fostering a culture of technical growth

What You Bring

Hands-on experience managing complex HPC environments at scale
In-depth understanding of server architecture, including compute, memory, storage, and networking components
Strong background in bare-metal provisioning and infrastructure-as-code practices
Proven ability to troubleshoot and resolve hardware issues in production environments
Familiarity with automation frameworks such as Ansible, Puppet, or Chef
Experience with out-of-band management tools and APIs (e.g., Redfish, IPMI, and vendor BMC interfaces such as iDRAC and iLO)
Skills in system tuning, diagnostics, and capacity planning
Knowledge of thermal and power efficiency in data center environments
Awareness of hardware-level security practices
Strong analytical and communication skills, with a collaborative mindset

Bonus Points For

Experience in hyperscale or large compute cluster environments
Knowledge of high-speed networking technologies (e.g., InfiniBand, 100GbE)
Familiarity with Linux systems and scripting languages (Python, Bash, PowerShell)
Exposure to OpenStack or similar cloud infrastructure platforms
Experience with GPU management tools and debugging (e.g., nvidia-smi)
Prior leadership experience in mentoring or managing technical teams