Surge IT

AI/ML Platform Engineer

Surge IT, Alexandria, Virginia, US, 22350

We are seeking an experienced AI/ML Platform Engineer with a strong background in building, deploying, and operationalizing AI/ML solutions. The ideal candidate will have deep expertise in both AWS and Databricks environments, along with hands‑on experience designing scalable machine learning workflows, pipelines, and model management systems.

Key Responsibilities

Design, build, and maintain scalable AI/ML platforms and pipelines for production environments.

Develop and operationalize ML workflows, including data ingestion, transformation, training, and deployment.

Collaborate with data scientists and engineers to enable efficient experimentation and model lifecycle management.

Work with AWS (Lambda, SQS, EC2, EBS, S3) and Databricks to optimize performance and reliability of AI systems.

Implement infrastructure‑as‑code solutions using tools like Terraform and manage containerized workloads using Kubernetes.

Develop, test, and maintain code in Python (including PySpark) and other languages such as R, JavaScript, and PowerShell.

Leverage generative AI tools and frameworks, including LangChain, for building advanced AI applications.

Apply prompt engineering techniques for fine-tuning and improving generative AI models.

Monitor and troubleshoot system performance using tools such as AWS X-Ray and Azure monitoring suites.

Required Qualifications

10+ years of overall IT experience, with at least 5 years focused on AI/ML engineering and platform development.

Proven experience in AWS and Databricks ecosystems.

Strong proficiency in Python, PySpark, and related ML frameworks.

Hands‑on experience with data engineering, model management, and MLOps workflows.

Strong understanding of cloud infrastructure, automation, and container orchestration.

Demonstrated experience in AI/ML coding, prompt writing, and generative AI development.

Preferred Qualifications

Experience building scalable ML data platforms and cloud‑native architectures.

Familiarity with LangChain and modern LLM‑based application development.

Knowledge of Terraform, Kubernetes, AWS X-Ray, and Azure Databricks.

Experience with machine learning model deployment, monitoring, and optimization.
