INSPYR Solutions
HPC Performance and Validation Engineer
INSPYR Solutions, Dallas, Texas, United States, 75215
Title
HPC Performance and Validation Engineer
Location
Dallas, Texas (Hybrid – relocation assistance available)
Duration
Full-time – direct hire
Compensation
$200,000 – $300,000 (Base)
Work Requirements
US Citizen OR Green Card Holder
Job Description
G‑Research is a leading quantitative research and technology firm with offices in London and Dallas. This hybrid role is based in our new Dallas infrastructure hub, where we work with the latest technologies in a cutting‑edge environment. As an HPC Performance and Validation Engineer, you will take ownership of the validation and optimization of our HPC CPU and GPU compute farms. You will develop a validation and performance baselining framework that ensures system readiness for AI/ML and HPC workloads across multiple architectures. The role covers continuous performance benchmarking, real‑time observability and long‑term strategic readiness.
Key Responsibilities
- Architect and implement a validation framework to certify readiness and utilization of GPU nodes across a large, distributed HPC environment
- Define methodologies to continually assess performance and optimize infrastructure across AI/ML workloads
- Develop and execute comprehensive performance testing using industry and customer‑specific benchmarks, ensuring optimal performance across HPC compute, storage and networking
- Contribute to research reports describing benchmarking discoveries and evaluating hardware performance and efficiency
- Lead efforts to debug, identify and resolve bottlenecks in system performance
- Build robust, scalable tools for automated validation and testing using Python, Go, Kubernetes and CI/CD pipelines to streamline continuous validation and benchmarking processes
- Implement monitoring solutions using Prometheus, Grafana and other modern monitoring technologies to track performance metrics and real‑time health of the cluster
- Define and implement best practices for continuous performance validation, ensuring that the infrastructure remains reliable and efficient as new technologies emerge
- Stay informed on industry trends and advancements to ensure long‑term strategic alignment
- Work cross‑functionally with engineering, infrastructure and research teams to align validation efforts with broader business objectives, ensuring the platform meets evolving research demands
Who Are We Looking For?
- Accelerator performance experience, including profiling and tuning with large‑scale GPU clusters
- In‑depth understanding of NVIDIA ClusterKit, Nsight and Validation Suite, MLPerf and DCGM tools for GPUs and DPUs
- Networking and storage performance experience, including profiling and optimization with NVIDIA ClusterKit, iPerf or equivalent across InfiniBand/RoCE network implementations
- System benchmarking experience across Linux and familiarity with the Phoronix Test Suite or equivalent
- Experience with HPC workloads across distributed global locations, bringing performance data to inform key architectural decisions
- Strong proficiency in developing automation tools and micro‑benchmarking frameworks for validation using Python, Go and Kubernetes in an Ubuntu Linux environment
- Expertise with key monitoring platforms including OTEL, Prometheus, ELK and Grafana, and in defining and implementing the overall observability strategy for HPC validation and performance monitoring
- A deep understanding of emerging technologies, architectures and strategies, with the ability to assess their potential impact on infrastructure and adopt them as part of a long‑term plan
- Proven ability to lead complex technical projects, influence decisions and engage with stakeholders across technical and research teams
Equal Employment Opportunity
INSPYR Solutions provides Equal Employment Opportunities (EEO) to all employees and applicants for employment without regard to race, color, religion, sex, national origin, age, disability, or genetics. INSPYR Solutions complies with applicable state and local laws governing nondiscrimination in employment in every location in which the company has facilities.