HiringAgents.ai
Senior/Staff Software Engineer - ML Infrastructure
HiringAgents.ai, San Francisco, California, United States, 94199
Base pay range $200,000.00/yr - $250,000.00/yr
About The Role
Industrial labor is incredibly dangerous work: almost 3 million people in the US are injured in the workplace each year, from causes that are entirely preventable and, at times, fatal or debilitating. Protecting these essential people who power our world is what motivates Voxel, and Voxel would love for you to join them.
Voxel is transforming workplace safety and operations with a full‑stack AI and computer vision platform that powers site intelligence for leading enterprises across grocery and retail, manufacturing, warehousing, supply chain, and logistics. Based in San Francisco and backed by top‑tier VCs, Voxel’s technology helps safety and operations leaders see unseen risks, make better decisions, and prevent incidents before they happen.
As a Staff Machine Learning Infrastructure Engineer, you will own three core pillars of Voxel’s computer‑vision platform: ground‑truth data and labeling workflows, large‑scale training infrastructure, and continuous model lifecycle management. You’ll design and operate cloud‑native, distributed systems that turn raw video into production‑ready, version‑controlled models. You’ll work closely with ML researchers and engineers, providing technical leadership and building the infrastructure that lets them iterate quickly, safely, and at scale.
Responsibilities
Own data and labeling pipelines: architect scalable labeling services (storage, query, retrieval), design ontologies, automate annotation workflows, and build quality‑tiered datasets within cost constraints
Build and operate training infrastructure: create multi‑GPU/multi‑node training frameworks (e.g., Ray, Spark, Kubernetes), optimize distributed jobs, and integrate acceleration techniques (TensorRT, CUDA Graphs, FP8, etc.)
Manage the full model lifecycle: implement model registries, version control, evaluation suites, and continuous‑learning loops to push updates from dev → staging → prod with safe rollbacks
Provide technical leadership, mentorship, and lightweight project management for a small infra + research squad
Establish DevOps‑for‑ML best practices (IaC, CI/CD, observability, cost monitoring) and partner with ML engineers on architecture decisions from data schemas to inference optimizations
Requirements
Must be based in the San Francisco Bay Area, California, United States, with the ability to work on‑site at Voxel’s San Francisco office
At least 5 years of professional experience building and operating large‑scale infrastructure, including a minimum of 3 years focused on ML or other data‑intensive systems
Bachelor’s degree or higher in Computer Science, Electrical Engineering, or a closely related technical field
Hands‑on experience designing and operating highly available, distributed systems on Kubernetes (e.g., EKS, GKE, or on‑prem clusters)
Practical experience with ML or data infrastructure, including automating data‑labeling or ground‑truth workflows and maintaining dataset versioning
Practical experience with modern DevOps for ML, including infrastructure‑as‑code (e.g., Terraform or AWS CDK), CI/CD pipelines (e.g., GitHub Actions or ArgoCD), and metrics/alerting tooling (e.g., Prometheus and Grafana)
Preferred Skills
Experience running multi‑instance or multi‑GPU training jobs and applying mixed‑precision optimizations or TensorRT/Triton inference
Background with model registry tooling (e.g., MLflow, BentoML, or SageMaker Model Registry) and associated evaluation dashboards
Prior work with computer‑vision models (e.g., YOLO, DETR, Faster R‑CNN) or video understanding systems at scale
Experience shipping high‑quality production code in Python in ML or infrastructure‑heavy environments
Familiarity with active‑learning, continuous‑training, or online distillation pipelines
Exposure to edge deployment or real‑time inference systems
Why join Voxel?
Join a visionary team revolutionizing safety and operations, directly impacting the well‑being of millions of essential workers. This is your opportunity to build an extraordinary business and foster a vibrant company culture that demands your absolute best. You’ll work alongside AI experts, experienced entrepreneurs, and passionate problem‑solvers, playing a pivotal role in shaping Voxel’s growth trajectory and market position.
Voxel Offers
Extensive health, dental, and vision insurance
Highly competitive paid parental leave and support
Ownership through an Equity Incentive Plan
Generous paid time off and flexible work arrangements
Daily meals in‑office, vibrant company events, and team‑building
401(k) retirement plan, HSA options, and pre‑tax commuter benefits