NVIDIA

Solutions Architect - Cloud Infrastructure

NVIDIA, Redmond, Washington, United States, 98052


Overview

We are seeking a passionate individual for the Solutions Architect - Cloud Infrastructure role at NVIDIA, with a strong interest in large-scale GPU infrastructure and AI Factory deployments. If you are excited about contributing to projects that push the boundaries of cloud-based AI and resilience in large-scale environments, read on.

What You'll Be Doing

- Be a key member of the cloud solutions team, acting as the technical expert on NVIDIA AI Factory solutions and large-scale GPU infrastructure, helping clients architect and deploy resilient, telemetry-driven AI compute environments at unprecedented scale.
- Collaborate with engineering teams to secure design wins, address challenges, and deploy solutions into production, focusing on tooling for observability, failure recovery, and infrastructure-level performance optimization.
- Act as a trusted advisor to clients, understanding their cloud environment, translating requirements into technical solutions, and guiding optimization of NVIDIA AI Factories for scalable, reliable, and high-performance workloads.

Qualifications

- 2+ years of experience in large-scale cloud infrastructure engineering, distributed AI/ML systems, or GPU cluster deployment and management.
- A BS in Computer Science, Electrical Engineering, Mathematics, or Physics, or equivalent experience.
- Understanding of large-scale computing systems architecture, including multi-node GPU clusters, high-performance networking, and distributed storage.
- Experience with infrastructure-as-code, automation, and configuration management for large-scale deployments.
- Passion for machine learning and AI, with a drive to continually learn and apply new technologies.
- Excellent interpersonal skills, including the ability to explain complex technical topics to non-experts.

Ways To Stand Out From The Crowd

- Expertise with orchestration and workload management tools like Slurm, Kubernetes, Run:ai, or similar platforms for GPU resource scheduling.
- Knowledge of AI training and inference performance optimization at scale, including distributed training frameworks and multi-node communication patterns.
- Hands-on experience designing telemetry systems and failure recovery mechanisms for large-scale cloud infrastructures, including observability tools such as Grafana, Prometheus, and OpenTelemetry.
- Proficiency in deploying and managing cloud-native solutions using platforms such as AWS, Azure, or Google Cloud, with a focus on GPU-accelerated workloads.
- Deep expertise with high-performance networking technologies, particularly NVIDIA InfiniBand, NCCL, and GPUDirect RDMA for large-scale AI workloads.

Salary, Equity & Benefits

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 120,000 USD - 189,750 USD for Level 2, and 148,000 USD - 235,750 USD for Level 3. You will also be eligible for equity and benefits.

Additional Information

Applications for this job will be accepted at least until October 11, 2025.

NVIDIA is committed to fostering a diverse work environment and is an equal opportunity employer. We value diversity and do not discriminate on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status, or any other characteristic protected by law.

JR2005440
