Energy Jobline ZR

Principal Engineer - AI Infrastructure Abstractions in San Jose

Energy Jobline ZR, San Jose, California, United States, 95199

Energy Jobline is the largest and fastest-growing global Energy Job Board and Energy Hub. We have an audience reach of over 7 million energy professionals, advertise 400,000+ global energy and engineering jobs each month, and work with the leading energy companies worldwide.

We focus on the Oil & Gas, Renewables, Engineering, Power, and Nuclear markets as well as emerging technologies in EV, Battery, and Fusion. We are committed to ensuring that we offer the most exciting career opportunities from around the world for our jobseekers.

Job Description

As a Principal AI Infrastructure Abstraction Engineer, you will design and implement the foundational systems that make shared AI compute environments scalable, secure, and developer-friendly. Your work will focus on creating abstractions that hide hardware complexity while providing predictable, cloud-native interfaces for AI workloads.

This position bridges infrastructure and applied AI—turning raw GPUs and accelerators into programmable, elastic, and multi-tenant resources for both internal developers and enterprise clients.

Key Responsibilities

Architect abstractions that map logical compute constructs (vGPUs, GPU pools, workload queues) to physical devices.

Build APIs, services, and control planes that expose GPU and accelerator resources with strong isolation and quality-of-service guarantees.

Develop mechanisms for secure GPU sharing, including time-slicing, partitioning, and namespace isolation.

Work with orchestration and scheduling systems to ensure intelligent mapping of resources based on utilization, priority, and network topology.

Define policies for quotas, fair allocation, and resource elasticity in shared environments.

Integrate with AI/ML frameworks (PyTorch, TensorFlow, Triton, etc.) to optimize model training and inference workflows.

Deliver observability and monitoring capabilities that trace resource usage from logical abstractions to hardware.

Partner with platform security teams to strengthen access controls, onboarding processes, and tenant isolation.

Support internal developer adoption of abstraction APIs while maintaining high performance and low overhead.

Contribute to long-term compute platform strategy with a focus on modularity, abstraction, and scale.

Minimum Qualifications

Bachelor’s degree with 15+ years of experience, Master’s with 12+ years, or PhD with 8+ years.

Proven track record building production-grade infrastructure systems, preferably in Go, Python, or C++.

Strong experience with containerization and orchestration platforms (Kubernetes, Docker, KubeVirt).

Background in designing logical abstractions for compute, storage, or networking in multi-tenant systems.

Familiarity with integrating machine learning platforms (e.g., PyTorch, TensorFlow, Triton, MLflow).

Preferred Qualifications

Hands-on experience with GPU sharing, scheduling, or isolation (MIG, MPS, vGPUs, time-slicing, or device plugin models).

Deep knowledge of resource management: quotas, prioritization, fairness, elasticity.

Strong ability to think across hardware/software boundaries and design abstractions that scale.

If you are interested in applying for this job, please press the Apply button and follow the application process. Energy Jobline wishes you the very best of luck in your next career move.
