AI Cybersecurity Company

Senior AI Infrastructure Engineer (LLMOps / MLOps)

AI Cybersecurity Company, San Jose, California, United States, 95199


Are you passionate about AI and eager to make a significant impact in the cybersecurity space?

Join us at our cutting‑edge AI startup in the San Francisco Bay Area, where we are assembling a world‑class team to tackle some of the most pressing challenges in cybersecurity.

As a Senior AI Infrastructure Engineer, you will own the design, deployment, and scaling of our AI infrastructure and production pipelines. You'll bridge the gap between our AI research team and engineering organization, enabling the deployment of advanced LLM and ML models into secure, high‑performance production systems.

You will build APIs, automate workflows, optimize GPU clusters, and ensure our models perform reliably in real‑world cybersecurity applications. This role is ideal for someone who thrives in a startup environment — hands‑on, cross‑functional, and driven to build world‑class AI systems from the ground up.

Why Join Us

$25M Seed Funding: We are well‑funded, with $25 million raised in our seed round, giving us the resources to innovate and scale rapidly.

Proven Early Success: We've already partnered with Fortune 500 companies, demonstrating market traction and trust in our AI‑driven cybersecurity solutions.

Experienced Leadership:

Our founders are second‑ and third‑time entrepreneurs with 25+ years in cybersecurity — having led companies to valuations exceeding

$3B .

World‑Class Leadership Team: Heads of AI, Engineering, and Product come from top global tech companies, ensuring best‑in‑class mentorship and technical direction.

Cutting‑Edge AI Solutions: We leverage the most advanced AI technologies, including Large Language Models (LLMs), Generative AI, and intelligent inference systems.

Generous Compensation: Competitive salary, meaningful equity, and a high‑growth environment where your impact is recognized and rewarded.

Cybersecurity Knowledge Preferred but Not Required: We value strong AI/ML and infrastructure engineering talent above all; cybersecurity expertise can be learned on the job.

Key Responsibilities

Core (Mission‑Critical)

Own and manage the AI infrastructure stack: GPU clusters, vector databases, and model serving frameworks (vLLM, Triton, Ray, or similar).

Productionize LLMs and ML models developed by the AI team, deploying them into secure, monitored, and scalable environments.

Design and maintain REST/gRPC APIs for inference and automation, integrating tightly with the core cybersecurity platform.

Collaborate closely with AI scientists, backend engineers, and DevOps to streamline deployment workflows and ensure production reliability.

Infrastructure & Reliability

Build and maintain infrastructure‑as‑code (IaC) setups using Terraform or Pulumi for reproducible environments.

Implement observability and monitoring: latency, throughput, model drift, and uptime dashboards with Prometheus, Grafana, and OpenTelemetry.

Automate CI/CD pipelines for model training, validation, and deployment using GitHub Actions, ArgoCD, or similar tools.

Architect scalable, hybrid AI systems across on‑prem and cloud, enabling cost‑effective compute scaling and fault tolerance.

Security, Data, and Performance

Enforce data privacy and compliance across AI pipelines (SOC 2, encryption, access control, VPC isolation).

Manage data and model artifacts, including versioning, lineage tracking, and storage for models, checkpoints, and embeddings.

Optimize inference latency, GPU utilization, and throughput using batching, caching, or quantization techniques.

Build fallback and failover mechanisms to maintain service reliability in case of model or API failure.

Innovation & Leadership

Research and integrate emerging LLMOps and MLOps tools (e.g., LangGraph, Vertex AI, Ollama, Triton, Hugging Face TGI).

Create sandbox environments for AI researchers to experiment safely.

Lead cost optimization and capacity planning, forecasting GPU and cloud needs.

Document and maintain runbooks, architecture diagrams, and standard operating procedures.

Mentor junior engineers and contribute to a culture of operational excellence and continuous improvement.

Qualifications

Required

5+ years of experience in ML Infrastructure, MLOps, or AI Platform Engineering.

Proven expertise with LLM serving, distributed systems, and GPU orchestration (e.g., Kubernetes, Ray, or vLLM).

Strong programming skills in Python and experience building APIs (FastAPI, Flask, gRPC).

Proficiency with cloud platforms (Azure, AWS, or GCP) and IaC tools (Terraform, Pulumi).

Solid understanding of CI/CD, Docker, containerization, and model registry practices.

Experience implementing observability, monitoring, and fault‑tolerant deployments.

Preferred

Familiarity with vector databases (FAISS, Pinecone, Weaviate, Qdrant).

Exposure to security- or compliance‑focused environments.

Experience with PyTorch/TensorFlow and MLflow/Weights & Biases.

Knowledge of distributed training or large‑scale inference optimization (DeepSpeed, TensorRT, quantization).

Prior work at startups or in fast‑paced R&D‑to‑production environments.

Our Culture & Team

Collaborative Environment: Join a fast‑moving, innovation‑driven startup where every engineer has a direct impact.

World‑Class Leadership: Mentorship from leaders with deep expertise in AI, ML, and cybersecurity.

Growth Opportunities: Access to professional development, top‑tier conferences, and bleeding‑edge AI projects.

Diversity and Inclusion: We believe that diverse perspectives drive stronger innovation.

Comprehensive health, dental, and vision insurance.

Wellness and professional development stipends.

Equity options: share in the company's success.

Access to the latest tools and GPUs for AI/ML development.
