Eli Lilly
Advisor Federated Learning Data Scientist
At Lilly, we unite caring with discovery to make life better for people around the world. We are a global healthcare leader headquartered in Indianapolis, Indiana. Our employees around the world work to discover and bring life-changing medicines to those who need them, improve the understanding and management of disease, and give back to our communities through philanthropy and volunteerism. We give our best effort to our work, and we put people first. We're looking for people who are determined to make life better for people around the world.

The Advisor Federated Learning Data Scientist plays an essential leadership role, responsible for identifying, assessing, and implementing cutting-edge algorithmic solutions that leverage diverse datasets while ensuring data privacy and security for our partners. This position requires comprehensive knowledge in small molecule drug development, ADME/Tox, antibody engineering, and/or genetic medicine, combined with expertise in data science and statistical analysis to develop sophisticated models utilizing federated learning. This position will be instrumental in advancing Lilly's pipeline by designing critical algorithms and workflows that expedite the creation of transformative therapies.

This role focuses on building large-scale, pre-trained models in a decentralized, privacy-preserving manner. The ideal candidate will pioneer the development of semi-supervised foundation models that can learn from vast, distributed datasets without centralizing sensitive information.

Key Responsibilities

Foundation Model Architecture:
Design and develop novel deep learning architectures (e.g., Transformer, Graph Neural Network-based) for large-scale, federated pre-training on unlabeled or partially labeled data distributed across multiple sources.

Semi-Supervised & Self-Supervised Learning:
Implement and advance state-of-the-art semi-supervised and self-supervised learning algorithms (e.g., contrastive learning, masked auto-encoding) tailored for the unique constraints of federated learning, such as communication bottlenecks and data heterogeneity.

Federated Optimization & Aggregation:
Develop and implement robust and communication-efficient federated aggregation strategies (e.g., FedAvg, FedProx, SCAFFOLD) that are stable for large, complex models and can handle non-IID (non-independently and identically distributed) data.

Downstream Task Adaptation:
Create efficient and effective protocols for fine-tuning and adapting the pre-trained federated foundation models for a wide range of specific downstream tasks, ensuring knowledge transfer while maintaining privacy.

Data Curation & Simulation:
Collaborate with data engineering teams to establish pipelines for accessing and simulating distributed datasets. Develop high-fidelity simulation environments to test, debug, and benchmark federated pre-training strategies before real-world deployment.

Scalability and Performance:
Profile, analyze, and optimize the computational performance (e.g., memory, latency, communication cost) of federated training and inference to ensure scalability to a large number of clients and massive datasets.

Scientific Dissemination:
Author high-impact research papers for publication in top-tier machine learning conferences (e.g., NeurIPS, ICML, ICLR) and relevant scientific journals. Prepare and deliver compelling presentations to both internal and external audiences.

Code & Model Governance:
Write clean, high-quality, and reproducible code. Contribute to internal libraries and ML platforms. Implement version control for data, code, and models to ensure robust and transparent research.

Cross-Functional Collaboration:
Work in a collaborative, multi-disciplinary team alongside software engineers, MLOps specialists, privacy experts, and domain scientists to translate research concepts into practical, impactful solutions.

Literature Review & Innovation:
Maintain a thorough understanding of the latest advancements in federated learning, deep learning, and related fields to drive innovation and contribute to the team's research strategy.

Basic Qualifications

PhD in a data science field such as Biostatistics, Statistics, Machine Learning, Computational Biology, Computational Chemistry, Physics, Applied Mathematics, or a related field from an accredited college or university.
Minimum of 2 years of experience in the biopharmaceutical industry or related fields, with demonstrated expertise in drug discovery and early development.

Additional Preferences

Experience in developing statistical and machine learning models for complex endpoints.
Broad understanding of emerging scientific and technical breakthroughs.
Exceptional interpersonal and communication skills, with a keen ability to understand, empathize, and navigate complex relationships and dynamics.
Outstanding EQ and strong problem-solving, analytical, and project management skills.
Highly self-motivated and organized.
Demonstrated ability to connect and influence at various levels across disciplines, both externally and internally.
Learning Agility: Ability to quickly adapt to changing circumstances, learn from past experiences, and apply those learnings to new situations.
Portfolio Mindset: Strong ability to think with a portfolio-level mentality, ensuring that individual program decisions align with the overall goals of Catalyze360.
Independent self-starter, able to work without supervision.

This is a site-based role in Indianapolis (preferred), San Diego (preferred), San Francisco, or Boston. Relocation is provided.