athenahealth

Lead Machine Learning & AI Evaluation Engineer

athenahealth, Boston, Massachusetts, United States, 02108


Researcher

AI Evaluation (Verification & Validation)

Join us as we work to create a thriving ecosystem that delivers accessible, high-quality, and sustainable healthcare for all. We are looking for a Lead Machine Learning Engineer focused on AI Evaluation to join the Research team in the Core AI Subdivision. Evaluating LLMs and applications that integrate LLMs and agents presents unique challenges compared to traditional software or machine learning models, due to their inherently non-deterministic nature and the complexity of assessing the quality of their multimodal outputs. Effective verification and validation of these systems is paramount for ensuring accuracy, reliability, safety, and user trust. You will work with the team to establish scalable methodologies, designs, and tooling to accomplish this.

About you:
You love to own important work and find it difficult to turn down a good challenge. You are excited about the latest developments in AI and ML and keep abreast of the latest models, methods, and technologies. You have experience building, tuning, evaluating, and deploying ML models at scale. You have strong communication skills and can work with colleagues from a variety of technical and non-technical backgrounds. You enjoy both learning and teaching, and you are excited to share knowledge across a multi-thousand-person company. You love collaboration and working closely with a team of other experts and with technical and non-technical stakeholders. Finally, you have a strong interest in improving the delivery of healthcare.

About the team:
The Core AI Subdivision is bringing artificial intelligence to bear on the hardest problems in healthcare. We are working with product and engineering leaders across the company to build AI into our Best in KLAS suite of products, and we work together with athenahealth engineers to deploy state-of-the-art machine learning models and agents.
Job Responsibilities:
- Leveraging standardized benchmarks for initial assessment
- Calculating and interpreting quantitative metrics such as accuracy, precision, recall, F1, perplexity, BLEU, ROUGE, text similarity, and exact match
- Human evaluation
- Conventional testing, such as unit, functional, and scale/load testing
- Model explainability and output consistency
- Testing for bias, toxicity, and fairness
- Prompt variation/robustness testing
- Factual accuracy, coherence, relevance, fluency, and hallucination testing
- Security testing
- Monitoring in production (especially important given the non-deterministic nature of LLMs)
- Overall observability (accuracy, performance metrics, traces/explainability, cost, usage)
- Techniques and approaches for improving key aspects of overall model performance, such as accuracy and latency, e.g., advanced prompt engineering, RAG, domain-specific fine-tuning, reasoning, and self-checking
- Incorporating end-user feedback loops
- Establishing best practices for evaluating applications that integrate LLMs
- Automating as much as practical to make AI evaluation reliable, scalable, and repeatable, including integration into CI/CD pipelines

As a member of the Research team, you will:
- Identify opportunities to make AI evaluation deterministic, performant, and cost-effective
- Understand and follow conventions and best practices for modeling, coding, architecture, and statistics, and hold other team members accountable for doing so
- Apply rigorous testing to statistics, models, and code
- Contribute to the development of internal tools and Core AI team standards

Typical Qualifications:
- Excellent verbal communication and writing skills
- Bachelor's degree in a relevant field: math, computer science, data science, or economics
- At least 6 years of professional experience developing and evaluating machine learning models
- At least 2 years of enterprise experience training, evaluating, and deploying models
- Proficiency in Python
- Experience using machine learning models and libraries
- Familiarity with NLP, computer vision, and ambient computing techniques
- Experience with commercial and open-source AI evaluation tooling, frameworks, and best practices
- Experience with the AWS ecosystem a bonus, including Kubernetes, Kubeflow, or EKS