Aldea Inc
Infrastructure Engineer (Hybrid Cloud & Platform)
Aldea Inc, San Francisco, California, United States, 94199
Location: US Remote / Bay Area
Job Type: Full-time
Level: Mid-Level / Senior
About Aldea
Aldea is a multi-modal foundational AI company reimagining the scaling laws of intelligence. We believe today's architectures create unnecessary bottlenecks for the evolution of software. Our mission is to build the next generation of foundational models that power a more expressive, contextual, and intelligent human–machine interface.
The Mission
We are seeking an Infrastructure Engineer to bridge the gap between complex hybrid infrastructure and developer velocity. You will architect a unified platform spanning AWS and Bare Metal Kubernetes.
At this level, you bring technical direction and expertise to the table. You will participate in planning and discussion for architecting resilient infrastructure, drive cross-team initiatives, and mentor other engineers while remaining deeply hands‑on. Your ultimate goal is to build a “Golden Path” for engineering: automated releases, deep observability, and a platform experience that feels invisible to the end user.
Key Responsibilities
1. Hybrid Infrastructure & Bare Metal (AWS + K8s)
Unified IaC Strategy: Architect and maintain the Terraform codebase for both AWS services (EKS, RDS, VPC) and Bare Metal clusters. You will treat physical infrastructure as mutable software, using tools like Cluster API, Metal3, or Tinkerbell to manage hardware lifecycles.
Bare Metal Mastery: Manage multiple production clusters on bare metal with clear separation of environments. You will solve complex challenges including networking (BGP, ECMP), load balancing (MetalLB/Kube‑VIP), and storage orchestration (CSI/Rook‑Ceph) for stateful workloads.
2. Observability & AI Monitoring
Full-Stack Visibility: Contribute to building our observability stack (Prometheus, Grafana, ELK/Loki) to monitor both EKS and bare metal clusters.
AI/GPU Telemetry: Build specialized dashboards for AI workloads. You will track GPU metrics, CPU saturation, and memory pressure to ensure efficient resource utilization.
3. CI/CD & Release Architecture
CI/CD at Scale: Architect resilient, multi-region pipelines using GitHub Actions, and automate application delivery using ArgoCD. You will build and manage a fleet of self-hosted runners to control costs and accelerate feedback loops.
Secure Release Engineering: Implement end-to-end workflows: Docker image build → Helm chart release → deployment (GH Actions + ArgoCD). Apply semantic versioning, manage artifacts in centralized registries, and integrate vulnerability scanning.
Technical Direction: Lead design reviews and drive platform roadmaps that balance reliability, cost, and developer productivity.
Cross-Functional Partnership: Partner with product, security, and application teams to translate business needs into robust platform capabilities.
Requirements
Experience: Prior Infrastructure, DevOps, or SRE roles, with primary ownership of production systems in AWS and Bare Metal Kubernetes.
Technical Arsenal: Expert fluency in Terraform, Linux/Bash or Python scripting, GitHub Actions, and ArgoCD.
Bare Metal & K8s: Proven experience operating Kubernetes in production, including hybrid setups (EKS + On-Prem). You understand networking (CNI, BGP), storage (CSI), and cluster lifecycle management.
Observability Depth: You have moved beyond “out-of-the-box” dashboards. You understand high-cardinality metrics, log retention strategies, and how to debug distributed systems.
Platform Mindset: You don't just build servers; you build products for developers.
Bonus
Experience with OpenTelemetry (OTEL) for unified tracing.
Understanding of eBPF.
Experience configuring NVIDIA DCGM for GPU monitoring and handling AI training/inference workloads.
Aldea is proud to be an equal-opportunity employer. We are committed to building a diverse and inclusive culture that celebrates authenticity to win as one. We do not discriminate on the basis of race, religion, color, national origin, gender, gender identity, sexual orientation, age, marital status, disability, protected veteran status, citizenship or immigration status, or any other legally protected characteristics.
Aldea uses E-Verify to confirm employment eligibility in compliance with federal law. For more information, please visit: https://www.e-verify.gov
Please note: We do not accept unsolicited resumes from recruiters or employment agencies and will not be responsible for any fees related to unsolicited resumes.