ZipRecruiter

Senior DevOps Engineer

ZipRecruiter, Miami, Florida, US, 33222

Job Description

About Aldea

Aldea is a next-generation AI company focused on voice-based clinical and expert applications. Our flagship product, Advisor, uses proprietary AI to scale the impact of world-class minds across personal development, finance, parenting, relationships, and more. We're on a mission to bring the best expert guidance in the world to people navigating real-life challenges, whether that's parenting, relationships, health, or personal growth. Our consumer products are voice-first, AI-powered, and designed to meet people where they are.

As a multidisciplinary team of builders, researchers, and product thinkers, we value clear thinking, sharp writing, and strong user-first intuition.

This is a rare opportunity to join an early-stage startup that will help define a new category.

Why This Role Matters

We're building AI infrastructure that scales. Our 5 distinct environments run complex multi-cluster Kubernetes deployments, so we need infrastructure experts who can architect systems for production readiness while maintaining security and operational excellence. This isn't just maintaining servers: you'll be designing the backbone that powers our AI platform across development, staging, and production environments.

What You'll Own

Multi-Environment Kubernetes Architecture

Manage 5 distinct environments (NMS, Sandbox, Development, Staging, Production) with different security and access requirements

Design redundancy and failover mechanisms for our centralized NMS hub that manages all environments

Infrastructure as Code Excellence

Develop and maintain Pulumi-based infrastructure using Python

Manage complex cross-environment dependencies and VPC peering relationships

Automate resource provisioning and configuration management
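
As a rough illustration of the cross-environment dependency work above: AWS does not allow VPC peering between VPCs whose CIDR blocks overlap, so a pre-provisioning check along these lines can catch conflicts early. This is a minimal sketch using only the Python standard library. The CIDR values and the names ENV_CIDRS and find_overlaps are hypothetical and are not taken from Aldea's actual setup.

```python
import ipaddress
from itertools import combinations

# Hypothetical CIDR plan for the five environments (illustrative values only).
ENV_CIDRS = {
    "nms": "10.0.0.0/16",
    "sandbox": "10.1.0.0/16",
    "development": "10.2.0.0/16",
    "staging": "10.3.0.0/16",
    "production": "10.4.0.0/16",
}


def find_overlaps(cidrs: dict[str, str]) -> list[tuple[str, str]]:
    """Return every pair of environment names whose CIDR blocks overlap."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}
    return [
        (name_a, name_b)
        for (name_a, net_a), (name_b, net_b) in combinations(networks.items(), 2)
        if net_a.overlaps(net_b)
    ]


if __name__ == "__main__":
    conflicts = find_overlaps(ENV_CIDRS)
    if conflicts:
        raise SystemExit(f"Overlapping CIDR blocks, peering would fail: {conflicts}")
    print("No overlapping CIDR blocks; peering plan is consistent.")
```

A check like this could run before any infrastructure-as-code deployment step, so that an address conflict surfaces as a failed validation rather than a failed peering request.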

Zero-Trust Security Implementation

Implement and maintain certificate-based VPN access with internal DNS resolution

Configure WAF, security groups, and network policies for VPN-only access

Manage HashiCorp Vault integration for secure credential management across environments

Comprehensive Observability

Deploy and configure Prometheus, Grafana, Loki, Jaeger, and CloudWatch

Implement unified monitoring across distributed infrastructure

Design alerting and incident response procedures

API Platform Management

Deploy and maintain the centralized API that manages all environments from the NMS hub

Implement automation for managing training jobs and inference across multiple Kubernetes clusters

Optimize GPU and CPU resource utilization across node groups

Must-Have Requirements

Experience & Technical Depth

5+ years in DevOps, SRE, or infrastructure engineering

Expert-level Kubernetes experience with EKS and multi-cluster management

Strong Python programming skills for infrastructure automation and API development

Infrastructure & Cloud Expertise

Infrastructure as Code expertise with Pulumi, Terraform, or similar tools

Deep AWS knowledge: VPC, EKS, ECR, S3, CloudWatch, IAM, and networking

Linux system administration and containerization with Docker

Monitoring & Security

Hands-on experience with Prometheus, Grafana, and centralized logging systems

Network security experience including VPN, firewalls, and certificate management

Understanding of zero-trust architecture principles

Nice-to-Have Qualifications

Machine Learning infrastructure experience (GPU clusters, model serving, ML pipelines)

HashiCorp Vault administration and integration

GitOps experience with ArgoCD or similar tools

Service mesh experience (Istio, Linkerd)

Database administration (PostgreSQL, Redis, Elasticsearch)

CI/CD pipeline design and multi-cloud infrastructure experience

90-Day Success Metrics

Infrastructure Stability

Zero unplanned downtime across production environments

Successfully implement disaster recovery procedures with tested failover mechanisms

Achieve 99.9% uptime SLA across all critical services
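
To put the 99.9% target in concrete terms, the allowed downtime is the measurement window multiplied by (1 - availability). The sketch below works that out for a 90-day window; the function name downtime_budget and the choice of window are illustrative assumptions, since the posting does not say how the SLA is measured.

```python
from datetime import timedelta


def downtime_budget(sla_percent: float, window: timedelta) -> timedelta:
    """Return the maximum total downtime an availability target allows over a window."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    # Unavailable fraction of the window, e.g. 0.001 for a 99.9% target.
    return window * (1 - sla_percent / 100)


if __name__ == "__main__":
    budget = downtime_budget(99.9, timedelta(days=90))
    print(f"Allowed downtime over 90 days: {budget.total_seconds() / 60:.1f} minutes")
```

Under these assumptions, a 99.9% target over 90 days leaves a budget of about 129.6 minutes of total downtime, planned or unplanned.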

Security & Compliance

Complete VPN-only access implementation with certificate-based authentication

Successfully integrate HashiCorp Vault across all environments

Pass security audit with comprehensive logging and monitoring in place

Operational Excellence

Reduce infrastructure provisioning time by 50% through automation

Implement comprehensive monitoring with

Raise GPU utilization above 80% across training workloads

Key Challenges You'll Solve

Architectural Complexity

The NMS hub is a single point of failure—you'll architect redundancy without compromising centralized management

Balance VPN-only security requirements with operational efficiency for remote team access

Manage complex service discovery across 5 interconnected environments

Scale & Performance

Optimize GPU resources across competing training and inference workloads

Implement cost optimization strategies while maintaining performance requirements

Design monitoring systems that scale with our infrastructure growth

Compensation & Benefits

We are a well-funded, seed-stage company preparing for launch. We offer:

Competitive base salary

Performance-based bonus tied to achieving goals

Equity participation

Comprehensive benefits, including health, dental, vision, and paid time off

Flexible work environment—based in Miami, hybrid OK. Remote considered.