Supermicro

Sr. System Engineer/Rack Solution (27692)

Supermicro, San Jose, California, United States, 95199


Senior System Engineer (Job Req ID: 27692)

About Supermicro

Supermicro® is a top‑tier provider of advanced server, storage, and networking solutions for data centers, cloud computing, enterprise IT, Hadoop/Big Data, hyperscale, HPC, and IoT/embedded customers worldwide. We are the #5 fastest‑growing company among the Silicon Valley Top 50 technology firms and are expanding globally, creating many new opportunities for talented engineers and technologists.

Job Summary

As a Sr. System Engineer, you will be the go‑to person for rolling out and maintaining business‑critical applications and services for Supermicro. You will resolve escalated service issues, coach other engineers to resolution, and lead complex projects. Independent leadership and excellent communication skills are essential.

Essential Duties and Responsibilities

Execute comprehensive system‑level rack tests on the latest Nvidia and AMD GPUs, ARM‑based, Intel Xeon, and AMD EPYC processors, covering functionality, compatibility, performance, stress, and reliability using proprietary in‑house tools.

Develop expertise in HPC/AI applications and benchmarks, deliver training sessions to customers and partners, address complex support issues, and build robust processes and procedures for HPC/AI solutions.

Conduct proof‑of‑concept design and testing, provide optimized benchmarks for HPC/AI applications, fine‑tune BIOS settings, optimize OS/network configurations, and develop simulation configurations to increase efficiency across workloads.

Deliver on‑site deployment services, ensure customer acceptance verification, and provide post‑level 1 & 2 support. Create and maintain technical documentation, including notes, blogs, and diagrams.

Identify and document hardware and software quality issues and collaborate with Product Management and Engineering teams to integrate customer feedback into future product enhancements.

Proactively engage in HPC roadmap development, planning software and hardware upgrades to sustain exceptional HPC infrastructure performance.

Document and analyze test plans, reports, and logs, and actively contribute to the development of test utilities and automation scripts to streamline testing processes.

Qualifications

BS/MS in Electrical Engineering, Computer Engineering, or Computer Science.

8+ years of experience in deep learning and machine learning.

8+ years of Linux/networking debugging/testing or relevant experience preferred.

Experience with leading AI/ML frameworks such as PyTorch, TensorFlow, ONNX, etc.

Experience with DevOps or cloud environments, including Docker/containers and Kubernetes.

Hands‑on experience with workload/scheduler managers (Slurm) for rack/cluster environments.

Familiarity with MLPerf Training/Inference benchmarks, LLMs, HPL‑AI, or RCCL/NCCL.

Programming experience with Windows and Linux shell scripting.

Team player with strong communication skills.

Knowledge of Intel/AMD/Nvidia development toolkits such as oneAPI, ROCm, and CUDA is a plus.

Experience with server/network hardware debugging and troubleshooting is a plus.

CCNA certification or experience with OpenStack, OpenShift, Azure, or AWS is a plus.

Note: This position requires regular in‑office attendance. The successful candidate must be present in the office during standard working hours, with on‑site collaboration, training sessions, and meetings essential to the role.

Salary Range

$137,000 – $156,000 (depending on location, level, experience, and other factors). Compensation may include bonus and equity award programs.

EEO Statement

Supermicro is an Equal Opportunity Employer and embraces diversity. We provide equal opportunity to all qualified applicants and employees without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, age, disability, veteran status, marital status, pregnancy, genetic information, or any other legally protected status.

Location: San Jose, CA
