GMI Cloud

DC Operation System Engineer

GMI Cloud, Denver, Colorado, United States, 80285

Overview

We focus on LLMs, AI infrastructure, and AIGC, with opportunities across Silicon Valley and China.

We are seeking a Data Center Operation System Engineer to join the GMI Global Infrastructure team. This role is hands-on and critical to ensuring the stability, efficiency, and reliability of the large-scale, high-performance AI/ML clusters in our data center. The ideal candidate will bring expertise in system-level troubleshooting, hardware lifecycle support, and operational excellence to keep our infrastructure running at maximum performance. Experience with network operations is a strong plus, as the role interfaces closely with networking and interconnect technologies. Preferred Location: Denver.

Responsibilities

Operate and maintain GPU/CPU compute nodes and storage systems, ensuring stability and performance.

Troubleshoot issues across compute, accelerator, and storage systems (Linux OS, drivers, firmware, hardware).

Perform physical data center work including racking, stacking, labeling, cabling, and replacing hardware components (GPUs, CPUs, NICs, SSDs, Memory, PSUs).

Deploy and configure servers, storage arrays, and perform BIOS/firmware updates.

Collaborate with hardware vendors, OEMs, and supply chain teams to ensure timely system deployment and RMA parts replacement, and to resolve escalated issues related to DC infrastructure.

Support capacity planning and scaling for compute, GPU, and storage workloads.

Document standard operating procedures, build/configuration guides, and troubleshooting playbooks, including DC layout and network topology in DCIM software.

Manage parts inventory and track equipment throughout the whole system lifecycle.

Travel regionally and internationally to GMI data center locations.

Qualifications

Bachelor’s degree in Computer Science or related field.

3+ years of experience in data center operations, infrastructure, or systems engineering.

Strong knowledge of Linux administration (Ubuntu, CentOS, RHEL).

Hands-on experience with GPU compute platforms (NVIDIA GPUs, drivers, CUDA stack).

Solid understanding of CPU server systems and enterprise storage platforms (NVMe, SAN, NAS, object storage).

Familiarity with OEM hardware platforms (Dell, HPE, Supermicro, Wistron).

Hardware troubleshooting expertise across compute, storage, and accelerator components.

Ability to perform hands-on data center support including cabling and FRU replacement.

Familiarity with DC operations such as power distribution, air- or liquid-cooled environments, structured cabling, cable management, DCIM software, and hardware inventory tools.

Experience managing large-scale GPU/CPU compute clusters or HPC environments.

Familiarity with basic networking concepts (Ethernet, VLANs, TCP/IP); exposure to network operations (InfiniBand, RoCE, BGP, VXLAN) is an added advantage.

Meeting every qualification is not required—if you’re excited about this role, we’d love to hear from you. We believe diverse perspectives and experiences strengthen our team.

Details

Seniority level: Associate

Employment type: Full-time

Industries: IT Services and IT Consulting; IT System Installation and Disposal
