Kumo
The Cloud Infrastructure team at Kumo is responsible for managing and scaling our Kubernetes-based, cloud-native AI platform across multiple cloud providers. They set service level objectives, optimize resource allocation, enforce security compliance, and drive cost efficiency for the Multi-Cloud Platform.
As a key team member, you will architect and operate a highly scalable, resilient Kubernetes infrastructure to support massive Big Data and AI workloads. You'll design and implement advanced cluster management strategies, scale fleet capacity, optimize workload scheduling, and enhance observability at scale. Your expertise in Kubernetes internals, networking, and performance tuning will be critical in ensuring high availability and seamless scaling.
Joining early, you'll play a pivotal role in shaping platform reliability, automating infrastructure, and enabling ML engineers with efficient commit-to-production automation, Continuous Provisioning, CI/CD, MLOps, and deployment orchestration workflows. You'll collaborate with ML scientists, product engineers, and leadership to influence scaling strategies, develop self-service tooling, and drive multi-cloud resilience. Engineers at Kumo take ownership of core system design, building infrastructure that powers the next generation of AI applications.

Key Responsibilities
Design, build, and scale Kubernetes-based infrastructure to support Kumo's multi-cloud AI platform, ensuring high availability, resilience, and performance.
Architect and optimize large-scale Kubernetes clusters, improving scheduling, networking (CNI), and workload orchestration for production environments.
Develop and extend Kubernetes controllers and operators to automate cluster management, lifecycle operations, and scaling strategies.
Enhance observability, diagnostics, and monitoring by building tools for real-time cluster health tracking, alerting, and performance tuning.
Lead efforts to automate fleet management, optimizing node pools, autoscaling, and multi-cluster deployments across AWS, GCP, and Azure.
Define and implement Kubernetes security policies, RBAC models, and best practices to ensure compliance and platform integrity.
Collaborate with ML engineers and platform teams to optimize Kubernetes for machine learning workloads, ensuring seamless resource allocation for AI/ML models.
Drive commit-to-production automation, cloud connectivity, and deployment orchestration, ensuring seamless application rollouts, zero-downtime upgrades, and global infrastructure reliability.

Required Skills and Experience
Kubernetes Mastery: 5-7+ years of experience managing large-scale Kubernetes clusters (EKS, GKE, AKS, or open source) in production. Deep expertise in Kubernetes internals, including controllers, operators, scheduling, networking (CNI), and security policies.
Cloud-Native Infrastructure: 5-7+ years of experience building cloud-native Kubernetes-based infrastructure across AWS, Azure, and GCP.
Platform Engineering: 5-7+ years of experience building Kubernetes service meshes (Istio/Envoy, Traefik), networking policies (Calico/Tigera), and distributed ingress/egress control.
Fleet Management & Scaling: Proven experience in optimizing, scaling, and maintaining Kubernetes clusters across multi-cloud environments, ensuring high availability and performance.
Software Development: 5-7+ years of experience writing production-grade controllers and operators in Python, Go, or Rust to extend Kubernetes functionality.
Infrastructure-as-Code & Automation: Hands-on experience with Terraform, CloudFormation, Ansible, and Bash and Make scripting to automate Kubernetes cluster provisioning and management.
Distributed Systems & SaaS: Expertise in building and operating large-scale distributed systems for cloud-native B2B SaaS applications running on Kubernetes.
Cloud Application Deployment: Deep expertise in building container orchestration, workload scheduling, and runtime optimizations using Kubernetes, Argo, or Flux.
Education: BS/MS in Computer Science or a related field (PhD preferred).

Nice to Have
Proficiency with cloud platforms such as AWS, GCP, or Azure.
Familiarity with chaos engineering tools and practices for testing system resilience.
Strong understanding of security best practices and compliance standards (GDPR, SOC2, ISO27001, vulnerability assessments, GRC, risk management).
Contributions to open-source projects, particularly in the Kubernetes or cloud-native ecosystem.
Expertise in Docker, Kubernetes, Jenkins, Flux, Argo, and Terraform in a Linux environment.
Hands-on experience with monitoring and observability tools such as Prometheus and Grafana.
Ability to develop customer-facing web frontends or public APIs/SDKs for platform services.

Benefits
Competitive salary and equity options.
Comprehensive medical and dental insurance.
An inclusive, diverse work environment where all employees are valued and supported.
We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.