Selby Jennings
We're seeking a Lead Engineer for High-Performance Compute Hardware to take ownership of large-scale compute infrastructure built on GPU and CPU systems at a leading trading firm. In this role, you'll be responsible for designing, expanding, and fine-tuning a platform that powers intensive computational workloads across research and engineering teams.
You'll work in close partnership with colleagues across infrastructure, data center operations, AI/ML, security, and software development to ensure the platform is robust, efficient, and ready to scale. This position emphasizes automation, hardware performance, and operational excellence, with a strong focus on mentorship and long-term infrastructure strategy.
What You'll Be Doing
- Architect and oversee a high-throughput compute environment
- Expand and optimize infrastructure to support growing technical demands
- Manage a bare-metal provisioning stack, with emphasis on OpenStack Ironic
- Continuously monitor system health and implement performance improvements
- Establish and refine operational procedures to reduce downtime and hardware faults
- Conduct diagnostics, performance tuning, and capacity forecasting
- Review and enhance hardware lifecycle workflows
- Collaborate across teams to align infrastructure with broader technical goals
- Apply security best practices to hardware and platform-level systems
- Guide and mentor junior team members, fostering a culture of technical growth
What You'll Bring
- Hands-on experience managing complex HPC environments at scale
- In-depth understanding of server architecture, including compute, memory, storage, and networking components
- Strong background in bare-metal provisioning and infrastructure-as-code practices
- Proven ability to troubleshoot and resolve hardware issues in production environments
- Familiarity with automation frameworks such as Ansible, Puppet, or Chef
- Experience with out-of-band management tools and APIs (e.g., Redfish, iDRAC, iLO, BMC, IPMI)
- Skills in system tuning, diagnostics, and capacity planning
- Knowledge of thermal and power efficiency in data center environments
- Awareness of hardware-level security practices
- Strong analytical and communication skills, with a collaborative mindset
Nice to Have
- Experience in hyperscale or large compute cluster environments
- Knowledge of high-speed networking technologies (e.g., InfiniBand, 100GbE)
- Familiarity with Linux systems and scripting languages (Python, Bash, PowerShell)
- Exposure to OpenStack or similar cloud infrastructure platforms
- Experience with GPU management and debugging tools (e.g., nvidia-smi)
- Prior leadership experience in mentoring or managing technical teams