Cystems Logic

Infrastructure Engineer - Software Engineer Infrastructure & Hardware Optimization - Remote

Cystems Logic, Houston


Job Description

Hello,

Infrastructure Engineer - Software Engineer Infrastructure & Hardware Optimization - Remote

We have the following job opening.

If you are interested and your experience matches the job description, please send your updated resume as soon as possible.

Software Engineer Infrastructure & Hardware Optimization

Location: San Francisco, CA; Portland, OR; or Dallas, TX - Remote, but candidates must be local to one of these locations

Duration: 6+ month contract

Job Description: We are seeking a skilled low-level systems engineer to join the team. This individual will focus on infrastructure software that detects, configures, and optimizes AI inference pipelines across heterogeneous hardware accelerators (e.g., NVIDIA / AMD GPUs, TPUs, AWS Inferentia, FPGAs). You will work on hardware abstraction layers, containerized runtime environments, benchmarking, telemetry, and driver orchestration logic for multi-cloud agentic inference deployments.

Ideal Experience:

4-7 years of experience in systems software or infrastructure engineering, preferably with exposure to AI/ML workloads.

Deep expertise in CUDA, NCCL, ROCm, or other accelerator programming frameworks.

Familiarity with LLM inference runtimes (TensorRT-LLM, vLLM, ONNXRuntime).

Experience with Kubernetes scheduling, device plugin development, and runtime patching for heterogeneous compute.

Strong Python/C++ and Linux systems programming skills.

Passion for building scalable, portable, and secure AI infrastructure.

Responsibilities:

Design and implement cross-platform hardware detection systems for GPUs/TPUs/NPUs using CUDA, ROCm, and low-level runtime interfaces.

Build and maintain plugin-based infrastructure for capability scoring, power efficiency tuning, and memory optimization.

Develop hardware abstraction layers (HAL) and performance benchmarking tools to optimize AI agents for cloud-native inference.

Extend container-based MLOps systems (Docker/Kubernetes) with support for hardware-specific runtime containers (e.g., TensorRT, vLLM, ROCm).

Automate driver validation, container security hardening, and runtime health monitoring across deployments.

Integrate telemetry systems (Prometheus, Grafana) to surface per-device inference performance metrics and health status.

Collaborate with solutions and DevOps teams to ensure hardware-aware agent deployment across cloud providers.

Additional Information

All your information will be kept confidential according to EEO guidelines.