AI Cybersecurity Company
Senior AI Infrastructure Engineer (LLMOps / MLOps)
AI Cybersecurity Company, San Jose, California, United States, 95199
Are you passionate about AI and eager to make a significant impact in the cybersecurity space?
Join us at our cutting‑edge AI startup in the San Francisco Bay Area, where we are assembling a world‑class team to tackle some of the most pressing challenges in cybersecurity.
As a Senior AI Infrastructure Engineer, you will own the design, deployment, and scaling of our AI infrastructure and production pipelines. You’ll bridge the gap between our AI research team and engineering organization, enabling the deployment of advanced LLM and ML models into secure, high‑performance production systems.
You will build APIs, automate workflows, optimize GPU clusters, and ensure our models perform reliably in real‑world cybersecurity applications. This role is ideal for someone who thrives in a startup environment — hands‑on, cross‑functional, and driven to build world‑class AI systems from the ground up.
Why Join Us
$25M Seed Funding:
We are well‑funded, with $25 million raised in our seed round, giving us the resources to innovate and scale rapidly.
Proven Early Success:
We’ve already partnered with Fortune 500 companies, demonstrating market traction and trust in our AI‑driven cybersecurity solutions.
Experienced Leadership:
Our founders are second‑ and third‑time entrepreneurs with 25+ years in cybersecurity — having led companies to valuations exceeding $3B.
World‑Class Leadership Team:
Heads of AI, Engineering, and Product come from top global tech companies, ensuring best‑in‑class mentorship and technical direction.
Cutting‑Edge AI Solutions:
We leverage the most advanced AI technologies, including Large Language Models (LLMs), Generative AI, and intelligent inference systems.
Generous Compensation:
Competitive salary, meaningful equity, and a high‑growth environment where your impact is recognized and rewarded.
Cybersecurity Knowledge Preferred but Not Required:
We value strong AI/ML and infrastructure engineering talent above all — cybersecurity expertise can be learned on the job.
Key Responsibilities
Core (Mission‑Critical)
Own and manage the AI infrastructure stack — GPU clusters, vector databases, and model serving frameworks (vLLM, Triton, Ray, or similar).
Productionize LLMs and ML models developed by the AI team, deploying them into secure, monitored, and scalable environments.
Design and maintain REST/gRPC APIs for inference and automation, integrating tightly with the core cybersecurity platform.
Collaborate closely with AI scientists, backend engineers, and DevOps to streamline deployment workflows and ensure production reliability.
Infrastructure & Reliability
Build and maintain infrastructure‑as‑code (IaC) setups using Terraform or Pulumi for reproducible environments.
Implement observability and monitoring — latency, throughput, model drift, and uptime dashboards with Prometheus, Grafana, and OpenTelemetry.
Automate CI/CD pipelines for model training, validation, and deployment using GitHub Actions, ArgoCD, or similar tools.
Architect scalable, hybrid AI systems across on‑prem and cloud, enabling cost‑effective compute scaling and fault tolerance.
Security, Data, and Performance
Enforce data privacy and compliance across AI pipelines (SOC 2, encryption, access control, VPC isolation).
Manage data and model artifacts, including versioning, lineage tracking, and storage for models, checkpoints, and embeddings.
Optimize inference latency, GPU utilization, and throughput using batching, caching, or quantization techniques.
Build fallback and failover mechanisms to maintain service reliability in case of model or API failure.
Innovation & Leadership
Research and integrate emerging LLMOps and MLOps tools (e.g., LangGraph, Vertex AI, Ollama, Triton, Hugging Face TGI).
Create sandbox environments for AI researchers to experiment safely.
Lead cost optimization and capacity planning, forecasting GPU and cloud needs.
Document and maintain runbooks, architecture diagrams, and standard operating procedures.
Mentor junior engineers and contribute to a culture of operational excellence and continuous improvement.
Qualifications
Required
5+ years of experience in ML Infrastructure, MLOps, or AI Platform Engineering.
Proven expertise with LLM serving, distributed systems, and GPU orchestration (e.g., Kubernetes, Ray, or vLLM).
Strong programming skills in Python and experience building APIs (FastAPI, Flask, gRPC).
Proficiency with cloud platforms (Azure, AWS, or GCP) and IaC tools (Terraform, Pulumi).
Solid understanding of CI/CD, Docker, containerization, and model registry practices.
Experience implementing observability, monitoring, and fault‑tolerant deployments.
Preferred
Familiarity with vector databases (FAISS, Pinecone, Weaviate, Qdrant).
Exposure to security‑ or compliance‑focused environments.
Experience with PyTorch/TensorFlow and MLflow/Weights & Biases.
Knowledge of distributed training or large‑scale inference optimization (DeepSpeed, TensorRT, quantization).
Prior work at startups or fast‑paced R&D‑to‑production environments.
Our Culture & Team
Collaborative Environment:
Join a fast‑moving, innovation‑driven startup where every engineer has a direct impact.
World‑Class Leadership:
Mentorship from leaders with deep expertise in AI, ML, and cybersecurity.
Growth Opportunities:
Access to professional development, top‑tier conferences, and bleeding‑edge AI projects.
Diversity and Inclusion:
We believe that diverse perspectives drive stronger innovation.
Comprehensive health, dental, and vision insurance.
Wellness and professional development stipends.
Equity options — share in the company’s success.
Access to the latest tools and GPUs for AI/ML development.