CLOUDENGINE DIGITAL PTE. LTD.
Position Overview
We are seeking a highly skilled Senior System Architect to lead the design and optimization of our server and data center infrastructure. The role focuses primarily on server systems architecture and also covers GPU-based compute solutions that support AI workloads. The ideal candidate has strong knowledge of network architecture and storage systems and can deliver scalable, end-to-end infrastructure solutions.
Key Responsibilities
System Architecture & Design
• Lead the end-to-end design of enterprise-grade server systems, including compute, storage, and networking components.
• Define architectural standards, guidelines, and best practices for large-scale deployments.
• Drive infrastructure scalability, performance, and reliability improvements.
GPU & AI Workloads
• Architect and optimize GPU server platforms for AI/ML training and inference.
• Evaluate emerging hardware accelerators (GPU, DPU, FPGA) for integration into server environments.
• Collaborate with AI/ML teams to align system designs with model training and deployment requirements.
Networking & Storage
• Design and oversee high-performance networking architectures (InfiniBand, RoCE, Ethernet) for HPC and AI clusters.
• Architect storage solutions (SAN, NAS, object storage, NVMe, distributed storage) to support data-intensive workloads.
• Ensure optimized data flow between compute, storage, and network layers.
Technical Leadership
• Mentor junior engineers and guide cross-functional teams in system integration projects.
• Conduct technology evaluations, PoCs, and vendor assessments.
• Contribute to long-term IT strategy and infrastructure roadmap.
Qualifications
Education & Experience
• Bachelor’s or Master’s degree in Computer Science, Electrical Engineering, or a related field.
• 8+ years of experience in system architecture, server design, or data center engineering.
Technical Skills
• Strong expertise in server hardware (CPU, GPU, memory, storage subsystems).
• Proven experience with GPU computing (e.g., NVIDIA CUDA, GPU clusters, HPC).
• Knowledge of network architecture (Ethernet, InfiniBand, RDMA, SDN) and storage technologies (RAID, SAN/NAS, NVMe, distributed storage).
• Familiarity with cloud platforms, virtualization, and container technologies (VMware, KVM, Docker, Kubernetes).
• Experience in system performance tuning, benchmarking, and capacity planning.
Soft Skills
• Strong analytical and problem-solving skills.
• Excellent communication and leadership abilities.
• Ability to work collaboratively across multiple teams and global environments.
Preferred Qualifications
• Hands-on experience with AI/ML infrastructure, large-scale GPU clusters, or HPC systems.
• Knowledge of emerging technologies such as DPUs, SmartNICs, and liquid cooling.
• Familiarity with DevOps practices and Infrastructure-as-Code (e.g., Ansible, Terraform).
• Industry certifications (e.g., NVIDIA, Red Hat, VMware, Cisco) are a plus.