Selby Jennings
This team at a leading trading firm designs and operates a large-scale, high-performance computing (HPC) platform that enables the business to perform complex research at scale. We're looking for a driven, skilled individual to join the team and help advance its capabilities in running batch workloads on Kubernetes.
About the Role
This is an opportunity to work alongside a seasoned team at the forefront of machine learning and large-scale computing, working on the operation and development of a project that will enable efficient multi-cluster batch job scheduling on Kubernetes.
What You'll Do
Design and build robust software solutions, primarily using Golang
Develop and maintain scalable, highly available, globally distributed systems for research workloads
Manage and optimize data flows across relational and non-relational databases, especially PostgreSQL
Build and operate containerized applications in Kubernetes, focusing on orchestration and scheduling
Support and troubleshoot Linux-based systems within the compute platform
Apply networking expertise to enhance system performance and connectivity
Diagnose and resolve complex issues across infrastructure and software layers
Use strong software architecture principles and computer science fundamentals to guide development
Contribute to CI/CD pipelines and promote engineering best practices
Stay current with emerging technologies and apply them across disciplines
What We're Looking For
We're seeking someone with a passion for Kubernetes and batch computing, and a broad background in software engineering and infrastructure. Ideal candidates will have experience in:
Developing Kubernetes components like controllers and operators
Event-driven programming and message queues (e.g., Apache Kafka, Pulsar)
High-performance computing, Kubernetes, or DAG-based workflows
Operating systems at scale on cloud platforms, preferably AWS
Monitoring and logging tools such as Prometheus and Grafana
Job scheduling systems like SLURM