RPMGlobal

High Compute Engineer

RPMGlobal, Bethesda, Maryland, US, 20811

Job Description

Base-2 Solutions is seeking a High Compute Engineer who will lead the design, optimization, and integration of GPU-centric high-performance compute environments. The engineer will manage existing NVIDIA A100 and DGX-1 systems while designing scalable architectures to incorporate emerging GPU hardware as mission demands evolve. This role is critical to advanced compute initiatives, where performance, stability, and future-readiness drive every architectural decision. The candidate will work cross-functionally with data scientists, AI/ML developers, cybersecurity experts, and infrastructure teams to create a robust, secure, and performant GPU compute ecosystem.

Required Skills

Proficiency with NVIDIA technologies: A100, DGX-1, CUDA, cuDNN, NCCL.

Strong background in Linux (RHEL/CentOS/Ubuntu), kernel tuning, and HPC stack deployment.

Experience with containerized GPU workloads using Docker, Kubernetes, and NVIDIA GPU Operator.

Familiarity with distributed compute frameworks such as SLURM, Kubernetes, or Ray.

Strong scripting skills in Bash, Python, or similar languages.

Proven ability to plan and execute large-scale system upgrades and migrations.

Qualifications

Bachelor’s or higher degree in Computer Engineering, Computer Science, or a related field with at least 12 years of related technical experience.

Additional years of experience may be considered in lieu of a degree.

5+ years of experience supporting GPU compute environments in mission-critical or enterprise settings.

Meets DoD 8570.11 IAT Level II certification requirements (Security+ CE, CCNA-Security, GICSP, GSEC, or SSCP with an appropriate Computing Environment certification).

An IAT Level III certification (CASP+, CCNP Security, CISA, CISSP, GCED, GCIH, CCSP) is also acceptable.

Active TS/SCI clearance with polygraph required, or active TS/SCI clearance and willingness to obtain and maintain a polygraph.

Capabilities

Manage, optimize, and monitor existing high-performance GPU systems including NVIDIA A100s and DGX-1 platforms.

Architect integration plans for scaling GPU compute infrastructure, including newer platforms such as H100, Grace Hopper, and AMD Instinct.

Collaborate with data science teams to fine-tune GPU workloads for AI/ML pipelines.

Design and implement high-speed networking (InfiniBand/RDMA) and storage solutions optimized for GPU data flow.

Develop automation workflows using infrastructure-as-code (IaC) tools such as Ansible, Terraform, or SaltStack.

Ensure system security, compliance, and patch management in alignment with NIST, RMF, or agency-specific controls.

Analyze compute performance metrics and provide strategic recommendations for system enhancements.

Maintain documentation on system architectures, configurations, and operational procedures.

Desired Skills

Experience with hybrid cloud GPU environments (AWS, GCP, or Azure with NVIDIA support).

Familiarity with AI/ML tooling such as PyTorch, TensorFlow, ONNX, and RAPIDS.

Experience integrating GPUs with storage systems such as Lustre, BeeGFS, or Ceph.

Exposure to hardware acceleration platforms such as FPGA or custom ASIC.