The candidate will work on high-visibility projects as a Data Scientist, bringing Data Science and NLP expertise to them. The candidate will be part of the RD Data Science team and will collaborate with product managers, domain experts, and knowledge representation experts to build high-value outcomes from Elsevier content. The candidate will have an opportunity to impact virtually all Elsevier applications related to Research and Operations.
In scientific publishing, consistent and clear organization of content is crucial for comprehension and navigation. However, the lack of standardized “section” labeling practices makes it hard for readers to navigate scholarly works efficiently. Inconsistent section titles, such as "Methods" versus "Approach," hinder information retrieval across scientific articles. Regular expressions are the conventional and still most widely used way to detect section types, but they handle diverse terminology and context-specific variations poorly, so they lack generalizability and accuracy. This project aims to explore the use of modern natural language processing (NLP) models, such as large language models (LLMs) and BERT-based models, to accurately classify article sections into standardized section types, improving the consistency and usability of scientific articles on platforms like ScienceDirect.
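To make the limitation concrete, here is a minimal sketch of title-based labeling with regular expressions. The patterns and label names are hypothetical and chosen only for illustration; they are not the expressions used in production.

```python
import re

# Hypothetical patterns and label names, for illustration only.
# Each pattern allows an optional leading section number such as "2." or "3".
SECTION_PATTERNS = {
    "introduction": re.compile(r"^\s*(\d+\.?\s*)?(introduction|background)\b", re.IGNORECASE),
    "methods": re.compile(r"^\s*(\d+\.?\s*)?(materials and methods|methods|methodology)\b", re.IGNORECASE),
    "results": re.compile(r"^\s*(\d+\.?\s*)?results\b", re.IGNORECASE),
    "conclusion": re.compile(r"^\s*(\d+\.?\s*)?(conclusions?|concluding remarks)\b", re.IGNORECASE),
}


def label_section_title(title):
    """Return the first matching section type, or None if no pattern matches."""
    for label, pattern in SECTION_PATTERNS.items():
        if pattern.match(title):
            return label
    return None
```

With these patterns, a title such as "2. Methods" is labeled `methods`, while "Approach" matches nothing and is left unlabeled, which is the kind of gap a learned classifier is meant to close.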
The primary objectives of this project are:
Data Collection and Preprocessing: Compile a dataset of scientific articles with sections labeled using regular expressions. This step will involve collecting a diverse set of scientific articles from the ScienceDirect database and using existing regular expressions to label sections based on their titles. In addition to the text, metadata such as article domain, publication year, and author information will be used to enhance classification performance.
Model Training: Train NLP models on the noisy, regex-labeled dataset to classify sections into standardized types. This will involve exploring modern NLP models, including LLMs and other transformer-based models.
Active Learning: Use an active learning method to improve the quality of the training data. In this setting, a set of samples will be selected (based on the classifier’s confidence) to be evaluated and labeled by an LLM. This process will be repeated for several iterations; at each iteration, a set of high-quality samples will be extracted and added to the dataset to boost the performance of the classifier.
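The active learning step above could be sketched roughly as follows. This is an illustrative simplification under stated assumptions: it assumes the least confident samples are the ones sent for relabeling (the selection rule is not specified beyond "classifier's confidence"), it overwrites the noisy label in place rather than maintaining a separate high-quality set, and `llm_label` is a hypothetical callable standing in for the LLM.

```python
def select_least_confident(probabilities, k):
    """Return indices of the k samples whose top predicted probability is lowest."""
    confidences = [max(p) for p in probabilities]
    ranked = sorted(range(len(confidences)), key=lambda i: confidences[i])
    return ranked[:k]


def active_learning_round(probabilities, noisy_labels, k, llm_label):
    """Run one relabeling round.

    probabilities: per-sample class probabilities from the current classifier.
    noisy_labels:  current (regex-derived) labels, one per sample.
    llm_label:     hypothetical callable mapping a sample index to a new label.
    """
    updated = list(noisy_labels)  # copy so the input list is not mutated
    for i in select_least_confident(probabilities, k):
        updated[i] = llm_label(i)
    return updated
```

In a full loop, the classifier would be retrained on the updated labels after each round and the probabilities recomputed before the next selection.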
-----------------------------------------------------------------------
Elsevier is an equal opportunity employer: qualified applicants are considered for and treated during employment without regard to race, color, creed, religion, sex, national origin, citizenship status, disability status, protected veteran status, age, marital status, sexual orientation, gender identity, genetic information, or any other characteristic protected by law. We are committed to providing a fair and accessible hiring process. If you have a disability or other need that requires accommodation or adjustment, please let us know by completing our Applicant Request Support Form: https://forms.office.com/r/eVgFxjLmAK , or please contact 1-855-833-5120.
Please read our Candidate Privacy Policy.