Kumo

Software Engineer - Cloud Engineering, Kubernetes

Kumo, Mountain View, California, US, 94039

The Cloud Infrastructure team at Kumo is responsible for managing and scaling our Kubernetes-based, cloud-native AI platform across multiple cloud providers. They set service level objectives, optimize resource allocation, enforce security compliance, and drive cost efficiency for the Multi-Cloud Platform.

As a key team member, you will architect and operate a highly scalable, resilient Kubernetes infrastructure to support massive Big Data and AI workloads. You'll design and implement advanced strategies for cluster management and fleet capacity scaling, optimize workload scheduling, and enhance observability at scale. Your expertise in Kubernetes internals, networking, and performance tuning will be critical in ensuring high availability and seamless scaling.

Joining early, you'll play a pivotal role in shaping platform reliability, automating infrastructure, and enabling ML engineers with efficient commit-to-production automation, continuous provisioning, CI/CD, MLOps, and deployment orchestration workflows. You'll collaborate with ML scientists, product engineers, and leadership to influence scaling strategies, develop self-service tooling, and drive multi-cloud resilience. Engineers at Kumo take ownership of core system design, building infrastructure that powers the next generation of AI applications.

Key Responsibilities

Design, build, and scale Kubernetes-based infrastructure to support Kumo's multi-cloud AI platform, ensuring high availability, resilience, and performance.

Architect and optimize large-scale Kubernetes clusters, improving scheduling, networking (CNI), and workload orchestration for production environments.

Develop and extend Kubernetes controllers and operators to automate cluster management, lifecycle operations, and scaling strategies.

Enhance observability, diagnostics, and monitoring by building tools for real-time cluster health tracking, alerting, and performance tuning.

Lead efforts to automate fleet management, optimizing node pools, autoscaling, and multi-cluster deployments across AWS, GCP, and Azure.

Define and implement Kubernetes security policies, RBAC models, and best practices to ensure compliance and platform integrity.

Collaborate with ML engineers and platform teams to optimize Kubernetes for machine learning workloads, ensuring seamless resource allocation for AI/ML models.

Drive commit-to-production automation, cloud connectivity, and deployment orchestration, ensuring seamless application rollouts, zero-downtime upgrades, and global infrastructure reliability.

Required Skills and Experience

Kubernetes Mastery: 5-7+ years of experience managing large-scale Kubernetes clusters (EKS, GKE, AKS, or open source) in production. Deep expertise in Kubernetes internals, including controllers, operators, scheduling, networking (CNI), and security policies.

Cloud-Native Infrastructure: 5-7+ years of experience building cloud-native Kubernetes-based infrastructure across AWS, Azure, and GCP.

Platform Engineering: 5-7+ years of experience building Kubernetes service meshes (Istio/Envoy, Traefik), networking policies (Calico/Tigera), and distributed ingress/egress control.

Fleet Management & Scaling: Proven experience in optimizing, scaling, and maintaining Kubernetes clusters across multi-cloud environments, ensuring high availability and performance.

Software Development: 5-7+ years of experience writing production-grade controllers and operators in Python, Go, or Rust to extend Kubernetes functionality.

Infrastructure-as-Code & Automation: Hands-on experience with Terraform, CloudFormation, and Ansible, plus Bash and Make scripting, to automate Kubernetes cluster provisioning and management.

Distributed Systems & SaaS: Expertise in building and operating large-scale distributed systems for cloud-native B2B SaaS applications running on Kubernetes.

Cloud Application Deployment: Deep expertise in container orchestration, workload scheduling, and runtime optimizations using Kubernetes, Argo, or Flux.

Education: BS/MS in Computer Science or a related field (PhD preferred).

Nice to Have

Proficiency with cloud platforms such as AWS, GCP, or Azure.

Familiarity with chaos engineering tools and practices for testing system resilience.

Strong understanding of security best practices and compliance standards (GDPR, SOC2, ISO27001, vulnerability assessments, GRC, risk management).

Contributions to open-source projects, particularly in the Kubernetes or cloud-native ecosystem.

Expertise in Docker, Kubernetes, Jenkins, Flux, Argo, and Terraform in a Linux environment.

Hands-on experience with monitoring and observability tools such as Prometheus and Grafana.

Ability to develop customer-facing web frontends or public APIs/SDKs for platform services.

Benefits

Competitive salary and equity options.

Comprehensive medical and dental insurance.

An inclusive, diverse work environment where all employees are valued and supported.

We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.