Artmac
About the Role
Job Title: AI/ML Inference Optimization Engineer
Job Type: W2/C2C
Experience: 8–15 years
Location: San Jose, California
Responsibilities
8+ years of experience in MLOps, Machine Learning Engineering, or a specialized Inference Optimization role.
Portfolio or project experience demonstrating successful deployment of high-performance, containerized ML models at production scale.
Model‑Specific Optimization: Analyze and understand the underlying logic and dependencies of various AI/ML models (primarily using PyTorch and TensorFlow) to identify bottlenecks in the inference pipeline.
High‑Performance Serving Implementation: Design, implement, and manage high‑performance inference serving solutions using specialized inference servers (e.g., vLLM) to achieve low latency and high throughput (a brief illustrative sketch follows this list).
GPU Utilization Optimization: Optimize model serving configurations specifically for GPU hardware to maximize resource efficiency and performance metrics in a production environment.
Containerization for Deployment: Create minimal, secure, and production‑ready Docker images for streamlined deployment of optimized models and inference servers across various environments.
Collaboration: Work closely with core engineering and data science teams to ensure a smooth transition from model development to high‑scale production deployment.
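For context on the serving item above, here is a minimal sketch of throughput‑oriented batch inference with vLLM's offline Python API; the model name, prompts, and sampling settings are illustrative assumptions, not requirements of the role.

# Minimal sketch: batch inference with vLLM (assumed model id and settings).
from vllm import LLM, SamplingParams

prompts = [
    "Explain continuous batching in one sentence.",
    "Why does paging the KV cache help throughput?",
]
sampling_params = SamplingParams(temperature=0.7, max_tokens=64)

# vLLM batches requests internally (continuous batching) and manages GPU
# memory with PagedAttention, which is where much of the latency/throughput
# headroom comes from.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model id
outputs = llm.generate(prompts, sampling_params)

for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)

In production the same engine is typically exposed through vLLM's OpenAI‑compatible HTTP server rather than called offline; the offline API is shown only because it is the shortest self‑contained illustration.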
Required Skillsets
AI/ML Domain Expertise: Deep understanding of the AI/ML domain with a core focus on model performance and serving rather than general infrastructure.
ML Frameworks: Expertise in PyTorch and TensorFlow, with proven ability to troubleshoot model‑specific dependencies, logic, and graph structures within these frameworks.
Inference Optimization:
Production inference experience designing and implementing high‑throughput, low‑latency model serving solutions.
Mandatory experience with high‑performance inference servers, specifically vLLM or similar dedicated LLM serving frameworks.
GPU Optimization: Demonstrated ability to optimize model serving parameters and infrastructure to maximize performance on NVIDIA or equivalent GPU hardware (see the profiling sketch after this list).
Deployment and Infrastructure:
Containerization (Docker): Proficiency in creating minimal, secure, and efficient Docker images for model and server deployment.
Infrastructure Knowledge (Helpful, but Secondary): General knowledge of cloud platforms (AWS, GCP, Azure) and Kubernetes/orchestration is beneficial, but the primary focus remains on model serving and optimization.
Qualification: Bachelor's degree or equivalent combination of education and experience.
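As a companion to the GPU optimization item, here is a minimal bottleneck‑profiling sketch with torch.profiler; the model (a torchvision ResNet‑50) and input shape are placeholder assumptions used only to make the example self‑contained.

# Minimal sketch: find where inference time goes with torch.profiler.
import torch
from torch.profiler import profile, ProfilerActivity
from torchvision.models import resnet50

device = "cuda" if torch.cuda.is_available() else "cpu"
model = resnet50().eval().to(device)                  # placeholder model
batch = torch.randn(8, 3, 224, 224, device=device)    # placeholder input

activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with torch.inference_mode():
    with profile(activities=activities) as prof:
        for _ in range(10):
            model(batch)

# Sort by device time to see which operators dominate the inference pipeline.
sort_key = "cuda_time_total" if device == "cuda" else "cpu_time_total"
print(prof.key_averages().table(sort_by=sort_key, row_limit=10))

The resulting table is usually the first step before deciding between batching changes, precision changes (e.g., FP16/INT8), or a different serving backend.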
Job Details
Seniority level: Mid‑Senior; Employment type: Contract; Job function: Engineering and Information Technology; Industries: Software Development.