Allen Institute for AI (Ai2)
Overview
This Data Engineer role helps integrate a large U.S. patent corpus into the Semantic Scholar platform. The fixed-term, NSF-funded position focuses on high-impact data engineering: linking patent and academic research data, resolving citations, disambiguating inventors and authors, applying topic models, and extending data products and APIs. This is not a research role, but you should be confident implementing ML-driven solutions when off-the-shelf tools don't cut it.
Who You Are: The Allen Institute for AI (Ai2) is hiring a Data Engineer to work in a high-performing engineering environment and own full-stack data tasks, including building pipelines, integrating or training practical ML models, and deploying production services. This is a fixed-term position scheduled for 2 years with the possibility of renewal.
Responsibilities
- Build scalable data pipelines (Airflow) for citation resolution and corpus integration
- Develop and deploy lightweight ML models for inventor disambiguation and author linking
- Train or adapt a topic model to classify patents using titles, abstracts, claims, and specs
- Extend REST APIs to expose linked metadata and topic classifications
- Contribute to dashboards and tools for evaluating data quality and model precision
- Collaborate with Ai2 engineers to ensure maintainability, test coverage, and robust deployment
- Produce reliable, well-documented code and contribute technical designs that support long-term maintainability
Qualifications
What You'll Need:
- Bachelor's degree and 8+ years of technical experience; relevant experience may substitute for education
- Strong Python engineering skills, especially for building and maintaining data pipelines
- Experience with SQL and schema design in production settings (PostgreSQL preferred)
- Familiarity with common ML workflows (training classifiers, tuning models, and deploying for inference), particularly for large-scale or ambiguous structured datasets
- Comfort working with structured datasets (XML/JSON/Parquet) and writing ETL code
- Experience with workflow orchestration tools (Airflow or similar) and cloud infrastructure (e.g., AWS, S3, Docker)
- Strong communication skills and a strong sense of ownership for results
Preferred:
- Experience with author disambiguation, entity resolution, or record linkage problems
- Experience applying vector-based similarity or topic modeling techniques to real-world corpora at scale
- Exposure to citation networks or scholarly data systems (e.g., arXiv, OpenAlex, USPTO)
- Comfort building internal APIs and dashboards to support ML and data quality review
About Ai2 and Environment
The Semantic Scholar team builds open, production-grade systems that power scientific discovery and large-scale AI research. We focus on creating high-quality structured datasets, integrating diverse content types, and enabling downstream applications across search, citation analysis, and model training. The team collaborates across Ai2's product and research orgs to deliver tools and infrastructure used by millions of researchers and developers worldwide.
What We Offer
- Team members and their families are covered by medical, dental, vision, and an employee assistance program
- Enrollment in health savings account plan, healthcare reimbursement arrangement plan, and flexible spending accounts
- Enrollment in the company's 401k plan
- Monthly stipends for commuting/internet and fitness/wellbeing expenses
- Vacation, personal days, sick leave, and paid holidays; bonus opportunities
Equal Opportunity and Compliance
Ai2 is proud to be an Equal Opportunity employer. We do not discriminate based on race, religion, color, national origin, sex (including pregnancy, childbirth, or related medical conditions), sexual orientation, gender identity, gender expression, transgender status, age, veteran status, disability, or other protected characteristics. You may view related Know Your Rights and Pay Transparency notices. Ai2 participates in E-Verify and will provide Form I-9 information to confirm work authorization. We are committed to reasonable accommodations under the ADA. If you need accommodations, contact recruiting@allenai.org.
Voluntary Self-Identification
We collect voluntary demographic information for government reporting purposes. The completion of this form is voluntary and confidential. More details are provided in the voluntary self-identification materials attached to this posting.
How to Apply
Apply for this job. This posting includes fields for name, contact information, location, resume, education, experience, and confirmation of eligibility to work in the U.S. You may be asked to certify information and consent to a background check as part of the process.