Machine Learning in Chemistry by Hugh M Cartwright;

Author: Hugh M. Cartwright
Language: eng
Format: epub
ISBN: 9781839160240
Publisher: Book Network Int'l Limited trading as NBN International (NBNi)
Published: 2020-06-29


12.2 Extracting Data from the Literature

To predict new materials and their properties with AI approaches, it is crucial to have a large quantity of high-quality, machine-readable data for training. The scientific literature contains a massive amount of experimental and theoretical data in the form of unstructured text, figures, and tables in journal articles, technical documents, and patents.19,20 Manually collecting and curating these data into a machine-readable database becomes intractable at large scale. Fortunately, the scientific literature is often written in a formulaic way, which allows text-mining algorithms to extract data automatically. However, ambiguity and variability in the conventions of different scientific domains limit the universality of established dictionary- and rule-based approaches to text mining.19–21 The current trend in large-scale information extraction is therefore towards machine learning (ML)-based (or hybrid) text-mining workflows, which are much more flexible than hard-coded methods.19,22–24
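The limitation of hard-coded rules can be illustrated with a minimal sketch. The regular expression, the function name `extract_melting_points`, and the two example sentences below are all invented for illustration and are not taken from any published text-mining tool. The rule captures one phrasing of a melting-point statement and silently misses an equally valid alternative phrasing.

```python
import re

# Hypothetical hard-coded rule (an assumption for illustration):
# "<Compound> has|shows|exhibits a melting point of <number> <unit>"
MELTING_POINT_RULE = re.compile(
    r"(?P<compound>[A-Z][A-Za-z0-9()\-]*)\s+(?:has|shows|exhibits)\s+"
    r"a\s+melting\s+point\s+of\s+"
    r"(?P<value>-?\d+(?:\.\d+)?)\s*(?P<unit>°C|K)"
)


def extract_melting_points(text):
    """Return one record for each span of text that matches the rule."""
    records = []
    for match in MELTING_POINT_RULE.finditer(text):
        records.append(
            {
                "compound": match.group("compound"),
                "value": float(match.group("value")),
                "unit": match.group("unit"),
            }
        )
    return records


# Invented sentences: the first follows the rule's phrasing, the second
# reports the same kind of fact in a form the rule does not anticipate.
text = (
    "CompoundA has a melting point of 80.2 °C. "
    "The m.p. of CompoundB was found to be 216 °C."
)
records = extract_melting_points(text)
print(records)
```

Only the first sentence yields a record. Covering the second would require another hand-written rule, and every new convention in a different subdomain would require more, which is the brittleness that motivates ML-based or hybrid workflows.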

Recently, Cole and co-workers developed ChemDataExtractor (CDE), open-source software for the automatic extraction of chemical information.25 CDE implements supervised and unsupervised ML models throughout the entire natural language processing (NLP) and named entity recognition (NER) workflow (see Figure 12.2). NER finds named entities within the text, which can include chemical species, processes, synthetic conditions, material properties, and characterisation methods. It can be performed using ML23 or using hybrid ML, dictionary, and rule-based approaches.19 A significant challenge in text mining is reliably creating a unique identifier for each entity, i.e. matching synonymous terms, which is called entity normalisation. Weston et al. applied tokenisation using CDE, NER using a neural network (NN), and entity normalisation using word embeddings (a machine-learned map of each word onto a dense vector space based on its co-occurrences in text) to the materials science literature, working toward a structured and accessible database of materials.22 Similarly, Tshitoyan et al. learned a word embedding from the text of scientific abstracts and used it to uncover relationships between materials and their potential properties.24 As a result of its inbuilt flexibility, CDE is useful in many subdomains. In the field of organic materials, Cole and co-workers applied CDE to the discovery of optimal organic dye pairs for co-sensitised solar cells.26 They text-mined 9431 dye candidates into a database that linked chemical structure with optical absorption peak and molar extinction coefficient. A high-throughput screening workflow was then applied to this initial database to identify suitable candidate dye pairs, and the improved power conversion efficiency of the resulting top candidates was validated experimentally.
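The idea behind embedding-based entity normalisation can be sketched in a few lines. This is a deliberately simplified stand-in, not the method of the cited studies: it uses raw co-occurrence counts instead of a trained dense embedding, and the three-sentence corpus, the window size, and the function names `cooccurrence_vectors` and `cosine` are assumptions made for illustration. Words that appear in similar contexts end up with similar vectors, so synonymous terms can be matched by cosine similarity.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Map each token to a sparse vector of counts of its neighbouring tokens."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[token][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Invented toy corpus: "tio2" and "titania" name the same material and
# appear in identical contexts, while "dye" appears in a different one.
corpus = [
    "the tio2 film was annealed in air",
    "the titania film was annealed in air",
    "the dye solution was stirred overnight",
]
vectors = cooccurrence_vectors(corpus)
sim_synonyms = cosine(vectors["tio2"], vectors["titania"])
sim_unrelated = cosine(vectors["tio2"], vectors["dye"])
print(sim_synonyms, sim_unrelated)
```

In this toy corpus the synonymous pair scores higher than the unrelated pair, so a normaliser could map both terms to one identifier. A learned embedding trained on a large corpus plays the same role but generalises to contexts that are similar rather than identical.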


