Virtue AI
Software Engineering – Inference Engineer
Virtue AI, San Francisco, California, United States, 94199
About Virtue AI
Virtue AI sets the standard for advanced AI security platforms. Built on decades of foundational and award-winning research in AI security, its AI-native architecture unifies automated red‑teaming, real‑time multimodal guardrails, and systematic governance for enterprise apps and agents. Deploy in minutes—across any environment—to keep your AI protected and compliant. We are a well‑funded, early‑stage startup founded by industry veterans, and we're looking for passionate builders to join our core team.
What You’ll Do
As an Inference Engineer, you will own how models are served in production. Your job is to make inference fast, stable, observable, and cost‑efficient, even under unpredictable workloads.
You will:
Serve and optimize inference for LLMs, embedding models, and other ML models across multiple model families
Design and operate inference APIs with clear contracts, versioning, and backward compatibility
Build routing and load‑balancing logic for inference traffic: multi‑model routing, fallback and degradation strategies, and integration with vLLM or SGLang
Package inference services into production‑ready Docker images
Implement logging and metrics for inference systems: latency, throughput, token counts, and GPU utilization, exposed as Prometheus‑based metrics
Analyze server uptime and failure modes (GPU OOMs, hangs, slowdowns, fragmentation) and build recovery and restart strategies
Design GPU and model placement strategies: model sharding, replication, and batching, weighing tradeoffs between latency, cost, and availability
Work closely with backend, platform (Cloud DevOps), and ML teams to align inference behavior with product requirements
What Makes You a Great Fit
You understand that inference is a systems problem, not just a model problem. You think in QPS, p99 latency, GPU memory, and failure domains.
Required Qualifications
Bachelor’s degree or higher in CS, CE, or related field
Strong experience serving LLMs and embedding models in production
Hands‑on experience designing inference APIs, load balancing, and routing logic
Experience with SGLang, vLLM, TensorRT, or similar inference frameworks
Strong understanding of GPU behavior: memory limits, batching, fragmentation, utilization
Experience with Docker, Prometheus metrics, and structured logging
Ability to debug and fix real inference failures in production
Experience with autoscaling inference services
Familiarity with Kubernetes GPU scheduling
Experience supporting production systems with real SLAs
Comfortable operating in a fast‑paced startup environment with high ownership
Preferred Qualifications
Experience with GPU‑level optimization: memory planning and reuse, kernel launch efficiency, and reducing fragmentation and allocator overhead
Experience with kernel- or runtime‑level optimization: CUDA kernels, Triton kernels, or custom ops
Experience with model‑level inference optimization: quantization (FP8 / INT8 / BF16), KV‑cache optimization, and speculative decoding or batching strategies
Experience pushing inference efficiency boundaries (latency, throughput, or cost)
Why Join Virtue AI
Competitive base salary + equity commensurate with skills and experience.
Impact at scale – Help define the category of AI security and partner with Fortune 500 enterprises on their most strategic AI initiatives.
Work on the frontier – Engage with bleeding‑edge AI/ML and deploy AI security solutions for use cases that don't yet exist anywhere else.
Collaborative culture – Join a team of builders, problem‑solvers, and innovators who are mission‑driven and collaborative.
Opportunity for growth – Shape not only our customer engagements, but also the processes and culture of an early, lean team with plans for scale.
Equal Opportunity Employment
Virtue AI is an Equal Opportunity Employer. We welcome and celebrate diversity and are committed to creating an inclusive workplace for all employees. Employment decisions are made without regard to race, color, religion, sex, gender identity or expression, sexual orientation, marital status, national origin, ancestry, age, disability, medical condition, veteran status, or any other status protected by law.
We also provide reasonable accommodations for applicants and employees with disabilities or sincerely held religious beliefs, consistent with legal requirements.