Cango Inc.
Join Cango Inc as a Senior Solutions Architect focusing on LLM and diffusion model inference on large-scale GPU clusters.
Responsibilities
Design end-to-end technical architecture for LLM and diffusion model inference on large-scale GPU clusters.
Develop innovative solutions in KV Cache management, distributed scheduling, pipelining/batching strategies, memory allocation, and P2P/IB communication.
Architect a multi-tenant serving framework that balances throughput, latency, and cost.
Define product positioning and differentiation based on industry trends and company strategy.
Develop technical evolution plans (e.g., token streaming as in vLLM, syntax parsing as in SGLang, and diffusion acceleration).
Align closely with internal GPU infrastructure and business teams to ensure timely product delivery.
Lead performance engineering efforts, including NCCL tuning, NUMA binding, and CUDA kernel optimization.
Drive cross-team collaboration (GPU kernels, compilers, distributed systems, frontend APIs) to ensure system stability and scalability.
Organize benchmarking and performance testing against industry-leading frameworks (vLLM, SGLang, TensorRT, etc.).
Guide the engineering team on implementation strategies, experimental methodologies, and optimization pathways.
Engage with open-source communities and contribute core components to enhance technical influence.
Communicate directly with North America-based clients to understand their needs for AI inference, training, and deployment.
Translate customer needs into internal implementation plans and coordinate across operations, engineering, and delivery teams.
Qualifications
5+ years of experience in computer infrastructure, GPU cloud, or large-scale cloud computing in the U.S., with a deep understanding of the North American tech ecosystem.
Master’s or Ph.D. in Computer Science, Electrical Engineering, or related fields preferred.
5+ years of hands‑on experience in deep learning systems or GPU optimization, including leading the design of at least one large‑scale AI inference or training system.
Proficiency with PyTorch, CUDA, NCCL, Triton, TensorRT, MPI/IB/RDMA, etc.
Deep understanding of projects such as vLLM, SGLang, DeepSpeed, and FasterTransformer.
Practical experience in LLM inference optimization (e.g., KV Cache, P2P vs. CPU routing, batching strategies).
Ability to integrate system‑level optimization with product usability (API and Serving layers).
Strong architectural thinking and cross‑functional communication skills to translate complexity into clear product roadmaps.
Preferred
Open‑source contributions (e.g., to vLLM, DeepSpeed, Ray, Triton‑Server, SGLang).
Experience launching GPU cloud or AI infrastructure products (e.g., RunPod, Lambda, Modal, SageMaker).
Familiarity with emerging LLM inference trends such as speculative decoding, continuous batching, and streaming inference.
What We Offer
Hands‑on opportunity to manage and optimize GPU clusters at multi‑thousand‑card scale, operating at the forefront of global compute infrastructure.
Strategic partner role in both product architecture and business decisions, alongside the core leadership team.
Key role in building the next‑generation GPU‑based AI inference infrastructure.
High degree of autonomy in product and architectural decisions.
Competitive compensation package with equity incentives.
A global team and access to cross‑regional GPU cluster resources.