iTCO Solutions

Sr. ML Cloud Infrastructure Engineer

iTCO Solutions, Palo Alto, California, United States, 94306


Senior Machine Learning & Cloud Infrastructure Engineer

Location: Palo Alto, CA (Onsite 4-5 days/week)

Employment Type: Long-term contract with potential for direct hire

Company: Early-Stage Healthcare + AI Startup - Series B

About Us

We are a fast-growing, venture-backed startup at the intersection of healthcare and artificial intelligence. We are building digital agents that support healthcare professionals and patients in non-diagnostic use cases - from scheduling and medication reminders to post-procedure follow-ups and routine care. Our mission is to help address the shortage of medical professionals by delivering scalable, AI-driven solutions that improve patient care and operational efficiency.

Role Overview

We are seeking cutting-edge, hands-on Machine Learning and Cloud Infrastructure Engineers with deep experience in ML Ops, LLM Ops, and distributed cloud systems. You will work on training and inference pipelines, custom Python jobs, and building multi-cloud, serverless infrastructure to support large-scale AI deployments.

This role requires onsite presence in our Palo Alto office (minimum 3 days/week) and is ideal for engineers from top-tier universities or high-caliber industry backgrounds (FAANG, leading AI startups, or equivalent).

Key Responsibilities

Design, develop, and maintain ML Ops & LLM Ops pipelines for both training and inference workloads.

Build multi-cloud, serverless infrastructure leveraging AWS, Google Cloud, Azure, and emerging GPU-as-a-service providers (CoreWeave, Vast.ai, Lambda Labs, etc.).

Create scalable, elastic platforms to manage GPU resources efficiently for cost optimization.

Develop automation, observability, and CI/CD tooling to accelerate feature delivery and improve reliability.

Collaborate with model-building teams on pre-training, fine-tuning, and deploying large language models (50B+ parameters).

Ensure infrastructure compliance with data sovereignty and local regulatory requirements across multiple regions.

Contribute to platform-level initiatives around security, benchmarking, and operational excellence.

Work cross-functionally to support rapid product experimentation and scaling in a high-growth startup environment.

Required Skills & Experience

Strong systems engineering background with a proven track record in ML Ops, LLM Ops, or large-scale distributed systems.

Expert-level Python programming skills.

Hands-on experience with cloud infrastructure (AWS, GCP, Azure) and GPU provisioning/orchestration.

Familiarity with GPU-as-a-service platforms (CoreWeave, Vast.ai, Lambda Labs, etc.).

Experience with OpenAI APIs and building AI-powered products.

Deep knowledge of training and inference pipelines for large-scale machine learning models.

Proven ability to build scalable, fault-tolerant infrastructure and automation.

Strong background in observability, CI/CD, FinOps, and platform tooling.

Bachelor's, Master's, or PhD in Computer Science, Electrical Engineering, or a related field (top-tier institutions preferred).

Preferred Qualifications

Experience in multi-cloud orchestration and serverless architecture.

Prior work on LLM training/fine-tuning and deployment at scale.

Knowledge of security best practices for AI/ML infrastructure.

Experience in high-growth startups or VC-backed companies.

Work Environment

Onsite in Palo Alto (minimum 3 days/week, preference for 5 days/week).

Highly collaborative, fast-paced startup culture.

Opportunity to move between infrastructure and model-building teams.

Potential for full-time conversion if joining as a contractor.

Why Join?

Build AI infrastructure that directly impacts healthcare delivery.

Work with a world-class engineering team from top Silicon Valley companies and universities.

Influence the technical direction of a high-growth, mission-driven company.

Competitive compensation and equity opportunities.

E-Verify:

United States Employment Opportunities Only

E-Verify is an internet-based system, operated by the Department of Homeland Security and the Social Security Administration, that allows employers to confirm an individual's eligibility to work in the United States. Under the E-Verify rules, effective September 8, 2009, federal agencies subject to the Federal Acquisition Regulation are required to modify, and include in new contracts, a provision that requires federal contractors and subcontractors to use E-Verify. iTCO Solutions is required to adhere to these requirements.
