Quantiphi

Infrastructure Architect (GCP)

Quantiphi, Dallas, Texas, United States, 75215

Infrastructure Architect

While technology is the heart of our business, a global and diverse culture is the heart of our success. We love our people and we take pride in fostering a culture built on transparency, diversity, integrity, learning, and growth. If working in an environment that encourages you to innovate and excel, not just in your professional life but in your personal life too, interests you, you would enjoy a career with Quantiphi!

Quantiphi is an award-winning Applied AI and Big Data software and services company, driven by a deep desire to solve transformational problems at the heart of businesses. Our signature approach combines groundbreaking machine learning research with disciplined cloud and data-engineering practices to create breakthrough impact at unprecedented speed. Quantiphi has seen 2.5x growth YoY since its inception in 2013. We don't just innovate; we lead. We are headquartered in Boston, with 4,000+ Quantiphi professionals across the globe.

As an Elite/Premier Partner for Google Cloud, AWS, NVIDIA, Snowflake, and others, we've been recognized with:
- 17x Google Cloud Partner of the Year awards in the last 8 years.
- 3x AWS AI/ML award wins.
- 3x NVIDIA Partner of the Year titles.
- 2x Snowflake Partner of the Year awards.

We have also garnered top analyst recognitions from Gartner, ISG, and Everest Group. We offer first-in-class industry solutions across Healthcare, Financial Services, Consumer Goods, Manufacturing, and more, powered by cutting-edge Generative AI and Agentic AI accelerators. We have been certified as a Great Place to Work for the third year in a row: 2021, 2022, 2023.

Be part of a trailblazing team that's shaping the future of AI, ML, and cloud innovation. Your next big opportunity starts here!

Work Location: Dallas (preferred), but anywhere in the US works.
Role Overview: We are seeking a seasoned Infrastructure Architect with deep expertise in both cloud platforms and on-premises infrastructure to design, implement, and manage robust hybrid environments that can support high-compute AI and GenAI workloads. You will work onsite with one of our key enterprise clients to assess existing infrastructure, define scalable architectures, and ensure optimal performance for AI/ML and GenAI solutions. You'll play a critical role in bridging infrastructure, DevOps, and AI solution delivery, ensuring our client has the right foundational stack to scale advanced AI workloads across their enterprise.

Key Responsibilities:

Hybrid Infrastructure Design & Deployment:
- Architect and implement secure, scalable, and cost-effective infrastructure solutions across on-prem and cloud (GCP, AWS, Azure) environments.
- Evaluate existing systems and define migration or integration strategies for deploying AI/GenAI workloads in hybrid setups.
- Design infrastructure supporting GPU-intensive workloads, distributed training, inferencing, and vector database storage.

Cloud & On-Prem Operations:
- Manage provisioning, automation, and orchestration across virtual machines, containers, and Kubernetes clusters.
- Implement and monitor high-availability, low-latency, and disaster recovery strategies.
- Optimize infrastructure for latency-sensitive applications, including real-time GenAI agentic workflows.

Collaboration & Enablement:
- Work closely with AI/ML engineers, data scientists, solution architects, and DevOps to ensure smooth deployment and scaling of models and GenAI agents.
- Recommend best practices on hybrid infrastructure for LLM fine-tuning, RAG architecture, and multi-agent orchestration platforms.
- Guide teams on infrastructure security, IAM policies, and governance frameworks for GenAI applications.

Performance & Cost Optimization:
- Continuously benchmark, profile, and optimize infrastructure for performance and efficiency.
- Monitor resource utilization and propose capacity planning strategies for AI workload peaks.

Key Qualifications & Experience:
- Bachelor's or Master's degree in Computer Science, Information Systems, or a related field.
- 8-15 years of experience in enterprise infrastructure architecture, with significant experience in both on-prem and cloud-native environments.
- Proven track record in designing and deploying AI/ML or GenAI-supporting infrastructure (e.g., GPU clusters, Kubernetes for ML workloads, hybrid vector databases).
- Deep knowledge of cloud services (GCP preferred; AWS or Azure acceptable), on-prem virtualization, storage, networking, and container orchestration.
- Experience supporting multi-agentic GenAI frameworks, including task orchestration, distributed agents, and workflow automation.
- Hands-on experience in DevOps and IaC tools (Terraform, Helm, Ansible, CI/CD).
- Familiarity with AI governance, data security, and compliance in hybrid environments.

Required Skills:

GCP Infrastructure Design & Deployment

Deep hands-on expertise in architecting and managing solutions on Google Cloud Platform, including:
- VPC design, subnetting, firewall rules, Private Service Connect, and Cloud Interconnect for secure hybrid networking.
- Identity & Access Management (IAM), Workload Identity Federation, and service accounts for secure access control across services.
- Cloud Load Balancing, Cloud NAT, and Cloud Armor for high-availability, secure ingress/egress management.
- Resource hierarchy and organization policies to manage large-scale enterprise GCP environments.

AI/GenAI-Centric Compute & Storage Architecture

Strong understanding of compute services tailored to GenAI:
- Compute Engine for custom VM/GPU provisioning (A100/H100, T4).
- GKE (Google Kubernetes Engine) for containerized model deployments, including support for GPU workloads and node auto-provisioning.
- Vertex AI and Vertex AI Workbench for managing ML pipelines, training, model registry, and deployments.
Storage architecture experience with:
- Cloud Storage (Standard, Nearline, Coldline) for unstructured datasets.
- Filestore, Local SSDs, and Persistent Disks for high-throughput model training and inferencing.
- Integration with BigQuery and Spanner for structured data workloads supporting GenAI applications.

Containerization, Orchestration & IaC on GCP:

Advanced experience with GKE:
- Cluster autoscaling, Workload Identity, taints/tolerations for GPU scheduling.
- Helm-based deployments and integration with Artifact Registry.

Proficient in Infrastructure as Code using:
- Terraform (with GCP provider modules) for declarative infrastructure deployment.
- Cloud Build, Cloud Deploy, or integration with GitHub Actions for CI/CD pipelines.

Ability to automate infrastructure provisioning, policy enforcement, and environment standardization.

Support for GenAI Architectures:

Experience deploying and optimizing infrastructure for:
- LLM hosting using Triton Inference Server, vLLM, or Text Generation Inference on GKE or Compute Engine.
- Vector database integrations (Weaviate, ChromaDB, FAISS) with GCS and BigQuery.
- RAG pipeline infrastructure including document ingestion (e.g., via Pub/Sub, Cloud Functions) and scalable retrieval.
- Multi-agent frameworks like LangGraph, CrewAI, or AutoGen, with secure multi-service orchestration across GCP services.

Observability, Security, and Governance

Monitoring & observability stack:
- Cloud Monitoring, Cloud Logging, Cloud Trace, Profiler, and Error Reporting for full-stack visibility.
- Experience setting up custom dashboards, alerts, and uptime checks.

Security and compliance capabilities: VPC