Inflection AI
Member of Technical Staff - Inference
Inflection AI, Palo Alto, California, United States, 94306
At Inflection AI, our public benefit mission is to harness the power of AI to improve human well-being and productivity.
The next era of AI will be defined by agents we trust to act on our behalf.
We're pioneering this future with human-centered AI models that unite emotional intelligence (EQ) and raw intelligence (IQ), transforming interactions from transactional to relational to create enduring value for individuals and enterprises alike.
Our work comes to life in two ways today:
Pi, your personal AI, designed to be a kind and supportive companion that elevates everyday life with practical assistance and perspectives.
Platform: large language models (LLMs) and APIs that enable builders, agents, and enterprises to bring Pi-class emotional intelligence into experiences where empathy and human understanding matter most.
We are building toward a future of AI agents that earn trust, deepen understanding, and create aligned, long-term value for all.
About the Role
As an Inference Engineer, you will own the real-time performance, scalability, and reliability of our LLM-powered systems. You'll optimize every layer, from GPU kernels to orchestration frameworks, to deliver sub-second latency, high throughput, and enterprise-grade uptime. Your work will also enable advanced capabilities such as tool usage, agentic flows, retrieval-augmented generation (RAG), and long-term memory.
This is a good role for you if you:
- Have direct experience deploying and optimizing large transformer models for real-time inference across multi-GPU or multi-node environments
- Are skilled with tools like Triton, TensorRT, TVM, ONNX Runtime, or custom CUDA kernels, and know when to use C++/Rust for critical performance gains
- Understand the balance between latency, throughput, accuracy, and cost, and make smart choices around quantization, speculative decoding, and caching (a minimal illustration follows this list)
- Have developed or integrated agent-based orchestration systems, RAG pipelines, or memory architectures in production environments
- Automate at every layer: CI/CD for model artifacts, load testing, canary rollouts, and auto-scaling
- Communicate clearly with both infrastructure teams and product stakeholders
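To make the quantization/caching trade-off concrete, here is a minimal, illustrative sketch using vLLM's offline API. The model name and the exact flag set are assumptions (these arguments have moved between vLLM releases), so treat this as a sketch of the levers involved rather than a description of our production stack.

```python
# Sketch: serving-side levers for the latency/throughput/accuracy/cost balance.
# Model choice and flags are hypothetical; verify against your vLLM version.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # hypothetical model choice
    quantization="fp8",            # trade a little accuracy for throughput/cost
    enable_prefix_caching=True,    # reuse KV cache across shared prompt prefixes
    gpu_memory_utilization=0.90,   # memory headroom vs. batch size trade-off
)
# Speculative decoding is configured separately; its API differs across releases.

params = SamplingParams(temperature=0.7, max_tokens=256)

# Continuous batching: vLLM schedules these requests together on the GPU.
outputs = llm.generate(
    [
        "Summarize the benefits of KV-cache reuse.",
        "Explain speculative decoding in two sentences.",
    ],
    params,
)
for out in outputs:
    print(out.outputs[0].text)
```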
Responsibilities include:
- Design and optimize high-performance inference pipelines using PyTorch, vLLM, Triton, TensorRT, and FSDP/DeepSpeed
- Integrate agentic runtimes (tool calling, function execution, and multi-step planning) while meeting strict latency requirements
- Build robust retrieval-augmented generation (RAG) stacks combining vector search, caching, and real-time context packing (see the sketch after this list)
- Develop memory services to support conversation continuity and user personalization at scale
- Monitor, instrument, and autotune GPU performance, kernel fusion, and batching strategies across clusters of NVIDIA H100 and Intel Gaudi accelerators
- Partner with training, safety, and product teams to transform research into stable, production-grade systems
- Contribute upstream to open-source performance libraries and share insights with the community
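As one concrete picture of "real-time context packing" in a RAG stack, here is a deliberately simplified, hypothetical sketch: given chunks already scored by a vector index, pack the best ones under a prompt-token budget. Every name here is illustrative, not Inflection's production code.

```python
# Sketch: greedy context packing for a RAG prompt under a token budget.
# Scores would come from a vector index (e.g., FAISS); token counts from
# the serving model's tokenizer. All names are illustrative.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    score: float   # retrieval relevance score
    tokens: int    # token count of the chunk

def pack_context(chunks: list[Chunk], budget: int) -> str:
    """Greedily pack the highest-scoring chunks that fit the token budget."""
    packed, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if used + chunk.tokens > budget:
            continue  # skip chunks that would blow the prompt budget
        packed.append(chunk.text)
        used += chunk.tokens
    return "\n\n".join(packed)

# Usage example with made-up chunks and scores.
chunks = [
    Chunk("Pi supports long-term memory.", 0.92, 8),
    Chunk("Inference latency targets are sub-second.", 0.88, 9),
    Chunk("Unrelated boilerplate.", 0.10, 5),
]
print(pack_context(chunks, budget=16))
```

In production this greedy step would sit behind the retrieval cache, so repeated queries skip both the vector search and the re-pack.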
Employee Pay Disclosures

At Inflection AI, we aim to attract and retain the best employees and to compensate them in a way that appropriately and fairly values their individual contributions to the company. For this role, Inflection AI estimates that the starting annual base salary will fall in the range of approximately $175,000 - $350,000. The estimate varies with factors such as experience, so the actual starting annual base salary may be above or below this range.
Interview Process
Apply: Please apply on LinkedIn or our website for a specific role.
After speaking with one of our recruiters, you'll enter our structured interview process, which includes the following stages:
- Hiring Manager Conversation: an initial discussion with the hiring manager to assess fit and alignment.
- Technical Interview: a deep dive with an Inflection Engineer to evaluate your technical expertise.
- Onsite Interview: a comprehensive assessment, including a domain-specific interview, a system design interview, and a final conversation with the hiring manager.
Depending on the role, we may also ask you to complete a take-home exercise or deliver a presentation.
For non-technical roles, be prepared for a role-specific interview, such as a portfolio review.
Decision Timeline
We aim to provide feedback within one week of your final interview.