Be The Match
Overview
The AI/ML Engineer will play a key role within the AI Center of Excellence (CoE), focusing on building, scaling, and maintaining robust ML and GenAI operational infrastructure. This position is responsible for developing and automating end-to-end machine learning pipelines, deploying models into production, and ensuring their performance and stability over time. The ideal candidate is a hands-on engineer with strong experience in MLOps, LLMOps, and cloud-native tools, and a passion for reliable, scalable, and efficient AI systems.
Accountabilities
The primary functions, scope, and responsibilities of the role:

Engineering and Operations
- Develop, deploy, and maintain production-grade ML/GenAI pipelines using AWS cloud-native and open-source MLOps tools.
- Automate model training, evaluation, testing, deployment, and monitoring workflows.
- Implement LLMOps practices for prompt versioning, model tracking, and continuous evaluation of GenAI systems.
- Integrate ML systems with CI/CD pipelines and infrastructure-as-code tools.
- Support model inference at scale via APIs, containers, and microservices (see the sketch after this list).
- Work closely with data engineering to ensure high-quality real-time and batch data availability for ML workflows.
- Ensure high availability, reliability, and performance of AI services in production environments.
- Maintain robust monitoring and observability across the AWS, Snowflake, Salesforce, and Oracle ecosystems.
- Implement feature stores and data versioning systems to ensure reproducible ML experiments and deployments.
- Deploy and optimize vector databases and embedding models for semantic search and RAG applications.
- Configure GPU-enabled cloud infrastructure and implement monitoring solutions to optimize resource utilization, cost, and performance for ML training and inference workloads.
- Establish automated model validation, testing, and rollback procedures for safe production deployments.
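To make the inference responsibility above concrete, here is a minimal, hedged sketch of calling an already-deployed SageMaker real-time endpoint through the boto3 sagemaker-runtime client. The endpoint name, region, and payload schema are illustrative assumptions, not details taken from this posting.

```python
"""Hypothetical sketch: invoke a deployed SageMaker real-time endpoint."""
import json

import boto3

# Credentials and permissions come from the standard AWS configuration chain.
runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")  # assumed region


def predict(features: list[float], endpoint_name: str = "churn-model-prod") -> dict:
    """Send one JSON record to the endpoint and return the parsed response."""
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,                 # hypothetical endpoint name
        ContentType="application/json",
        Body=json.dumps({"instances": [features]}),  # assumed payload schema
    )
    return json.loads(response["Body"].read())


if __name__ == "__main__":
    print(predict([0.2, 1.5, 3.0]))
```

In practice this call would sit behind an API or microservice layer, with retries, authentication, and latency monitoring around it.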
Tooling and Infrastructure
- Build and manage model registries, feature stores, and metadata tracking systems (see the sketch after this list).
- Leverage containerization (e.g., Docker) and orchestration (e.g., Kubernetes, Airflow, Kubeflow) for scalable deployment.
- Implement role-based access control, auditing, and governance for ML infrastructure.
- Manage cost-effective cloud infrastructure on AWS.
- Build and maintain data quality monitoring systems with automated alerting for data drift and anomalies.
- Implement cost optimization strategies, including auto-scaling, spot instances, and resource right-sizing for ML workloads.
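As one hypothetical illustration of the model registry and metadata tracking responsibilities above, the sketch below trains a toy scikit-learn model, logs its parameters and metrics to an MLflow tracking server, and registers it in the MLflow Model Registry. The tracking URI, experiment name, and model name are placeholder assumptions; MLflow is shown as a representative open-source option, not a tool mandated by the posting.

```python
"""Hypothetical sketch: log and register a model with MLflow."""
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

mlflow.set_tracking_uri("http://mlflow.internal:5000")  # assumed tracking server
mlflow.set_experiment("churn-model")                     # assumed experiment name

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

with mlflow.start_run():
    model = LogisticRegression(max_iter=200).fit(X, y)
    mlflow.log_param("max_iter", 200)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    # Registering under a name creates a new version in the model registry.
    mlflow.sklearn.log_model(model, "model", registered_model_name="churn-model")
```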
Collaboration and Support
- Partner with data engineers, data scientists, ML engineers, architects, software engineers, infrastructure, and security teams to support scalable and efficient AI/ML workflows.
- Contribute to incident response, performance tuning, and continuous improvement of ML pipelines.
- Provide guidance and documentation to promote reproducibility and best practices across teams.
- Work as part of an agile development team and participate in planning and code reviews.
Required Qualifications
(Minimum qualifications needed for this position, including education, experience, certification, knowledge, and/or physical requirements)

Knowledge of:
- Cloud-native AI/ML development with AWS.
- MLOps/LLMOps frameworks and lifecycle tools on AWS.
- Monitoring and observability platforms on AWS.
- ML model deployment strategies (e.g., batch, real-time, streaming).
- Feature stores and data versioning tools on AWS and Snowflake.
- Model serving frameworks such as AWS SageMaker and AWS Bedrock for scalable inference deployment.
- Vector databases and embedding deployment (e.g., Pinecone, Weaviate, FAISS, pgvector) for LLM and RAG applications (see the sketch after this list).
- LLMOps-specific tools, including prompt management platforms and LLM serving optimization on AWS.
- Docker registries and artifact management.
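To make the vector database item concrete, here is a minimal sketch of similarity search with FAISS, one of the libraries named above. The embedding dimensionality and the random vectors are placeholders; in a real RAG pipeline the vectors would come from an embedding model, and a production index would likely be an approximate one (IVF or HNSW) rather than a flat index.

```python
"""Hypothetical sketch: nearest-neighbor search over embeddings with FAISS."""
import faiss
import numpy as np

dim = 384                       # assumed sentence-embedding dimensionality
rng = np.random.default_rng(0)

# Stand-in document embeddings; FAISS expects float32 arrays.
doc_vectors = rng.random((1000, dim), dtype=np.float32)

index = faiss.IndexFlatL2(dim)  # exact L2 search; swap for IVF/HNSW at larger scale
index.add(doc_vectors)

query = rng.random((1, dim), dtype=np.float32)
distances, ids = index.search(query, 5)   # top-5 nearest documents
print(ids[0], distances[0])
```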
Required Skills and Abilities
- Strong Python programming and scripting skills.
- Hands-on experience deploying and managing ML/GenAI models in production.
- Experience with Docker, Kubernetes, and workflow orchestration tools such as Airflow or Kubeflow.
- Proficiency in infrastructure-as-code tools (e.g., Terraform, CloudFormation).
- Ability to debug, troubleshoot, and optimize AI/ML pipelines and systems.
- Comfortable working in agile teams and collaborating cross-functionally.
- Proven ability to automate processes and build reusable ML operational frameworks.
- Experience with A/B testing frameworks and canary deployments for ML models in production environments.
- Knowledge of GPU resource management and optimization for training and inference workloads.
- Understanding of data pipeline quality monitoring, drift detection, and automated retraining triggers (see the sketch after this list).
- Experience with secrets management, role-based access control, and secure credential handling for ML systems.
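The drift detection skill above can be illustrated with a small, hedged sketch: a per-feature two-sample Kolmogorov-Smirnov check that could feed an alert or an automated retraining trigger. The p-value threshold and feature names are illustrative assumptions.

```python
"""Hypothetical sketch: per-feature drift check with a two-sample KS test."""
import numpy as np
from scipy.stats import ks_2samp

P_VALUE_THRESHOLD = 0.01  # assumption: flag drift when distributions differ at the 1% level


def drifted_features(reference: dict[str, np.ndarray],
                     current: dict[str, np.ndarray]) -> list[str]:
    """Return the features whose current distribution differs from the reference."""
    flagged = []
    for name, ref_values in reference.items():
        statistic, p_value = ks_2samp(ref_values, current[name])
        if p_value < P_VALUE_THRESHOLD:
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    ref = {"age": rng.normal(40, 10, 5000), "income": rng.normal(60, 15, 5000)}
    cur = {"age": rng.normal(45, 10, 5000), "income": rng.normal(60, 15, 5000)}
    print(drifted_features(ref, cur))   # likely flags "age" only
```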
Education and/or Experience
- Bachelor's degree in Computer Science, Engineering, or a related field (master's preferred).
- 2-3 years of experience in ML engineering, DevOps, or MLOps roles.
- Demonstrated experience managing production AI/ML workloads and systems.
Preferred Qualifications
- Experience with LLMOps and GenAI pipeline monitoring.
- Cloud certifications in AWS, Azure, or GCP.
- Experience supporting AI applications in regulated industries (e.g., healthcare, finance).
- Contributions to open-source MLOps tools or infrastructure projects.
- Experience with edge deployment and model optimization techniques such as quantization, pruning, and distillation (see the sketch after this list).
- Knowledge of compliance frameworks (SOC 2, GDPR, HIPAA) and security best practices for AI/ML systems.
- Experience with real-time streaming data pipelines (e.g., Kafka, Kinesis) and event-driven ML architectures.
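As a small illustration of the model optimization techniques mentioned above, the sketch below applies post-training dynamic quantization to a toy PyTorch network. The architecture is a placeholder; the same call is commonly applied to the Linear layers of a trained model before CPU or edge deployment.

```python
"""Hypothetical sketch: post-training dynamic quantization in PyTorch."""
import torch
import torch.nn as nn

# Placeholder network standing in for a trained model.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2)).eval()

# Quantize Linear layer weights to int8; activations are quantized dynamically at runtime.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 128)
with torch.no_grad():
    print(quantized(x))
```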