ArrowCore Group

Site Reliability Engineer (SRE) - Hardware Specialist

ArrowCore Group, Memphis, Tennessee, US, 37544

Location:

Memphis, TN

Department:

Site Reliability Engineering

Employment Type:

Full-Time

About The Role

As an SRE - Hardware Specialist, you will serve as a hardware reliability expert focused on firmware, hardware specifications, vendor relations, and failure analysis. You will proactively identify and resolve hardware issues, manage RMA processes, and stay ahead of emerging hardware technologies to support our datacenter operations. This role demands deep technical expertise in hardware diagnostics, vendor negotiations, and forward-looking hardware evaluation.

Key Responsibilities

Analyze firmware packages and hardware specifications for upcoming releases to ensure compatibility, performance, and reliability in our datacenter environment.

Investigate and diagnose hardware failures, including "grey failures" (ambiguous or intermittent issues), and prove through rigorous testing and data analysis that they are true hardware defects.

Manage vendor relationships, including initiating RMA (Return Merchandise Authorization) claims, negotiating beyond standard processes when necessary, and holding vendors accountable for resolutions.

Collaborate with Datacenter Operations Technicians to troubleshoot, repair, and optimize hardware systems in real time.

Research and evaluate next‑generation hardware technologies that are not yet released, providing insights and recommendations to inform our infrastructure roadmap.

Develop and implement monitoring tools, scripts, and processes to detect hardware anomalies early and minimize downtime.

Document failure modes, RMA outcomes, and hardware evaluations to build a knowledge base for the team.

Participate in on‑call rotations and incident response for hardware‑related issues in the Memphis datacenter.

Required Qualifications

Bachelor's degree in Systems Engineering, Electrical Engineering, Computer Science, or a related field (or equivalent experience).

5+ years of experience in hardware reliability engineering, preferably in high‑performance computing or datacenter environments.

Proven expertise in firmware analysis, hardware specifications review, and release validation.

Strong experience with RMA processes, including filing claims, vendor negotiations, and pushing for resolutions outside standard protocols.

Demonstrated ability to diagnose and prove complex hardware failures, including grey or intermittent issues, using tools like oscilloscopes, logic analyzers, or diagnostic software.

Familiarity with datacenter hardware components (e.g., servers, GPUs, networking equipment) and emerging technologies.

Proficiency in scripting languages (e.g., Python, Bash) for automation and analysis.

Excellent problem‑solving skills with a data‑driven approach to reliability engineering.

Ability to work collaboratively with cross‑functional teams, including operations technicians.

Preferred Qualifications

Experience in AI/ML infrastructure or supercomputing environments.

Knowledge of vendor ecosystems (e.g., NVIDIA, Dell, HP, Supermicro) and supply chain management.

Certifications in hardware engineering or reliability (e.g., CRE, CompTIA Server+).

Prior work at a fast‑paced startup or technology company.
