Machine Learning in Chemistry by Hugh M Cartwright
Author: Hugh M Cartwright [Cartwright, Hugh M.]
Language: eng
Format: epub
ISBN: 9781839160240
Publisher: Book Network Int'l Limited trading as NBN International (NBNi)
Published: 2020-06-29T00:00:00+00:00
12.2 Extracting Data from the Literature
Using AI approaches to predict new materials and their properties requires a large quantity of high-quality, machine-readable data for training. The scientific literature contains a massive amount of experimental and theoretical data in the form of unstructured text, figures, and tables in journal articles, technical documents, and patents.19,20 Manually collecting and curating these data into a machine-readable database becomes intractable at large scale. Fortunately, the scientific literature is often written in a formulaic way, which allows text-mining algorithms to extract data automatically. However, conventions are ambiguous and vary between scientific domains, which limits the universality of established dictionary and rule-based approaches to text mining.19–21 The current trend for large-scale information extraction is to use machine learning (ML)-based (or hybrid) text-mining workflows, because they are much more flexible than hard-coded methods.19,22–24
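To make the limitation of rule-based text mining concrete, the following is a minimal sketch of a hard-coded extraction rule, written in Python with only the standard library. The pattern, the function name `extract_melting_points`, and the example sentences are all hypothetical illustrations, not taken from any of the cited tools. The rule captures formulaic phrasings of a melting point but silently misses a sentence that reports the same kind of value in different words.

```python
import re

# Hypothetical hand-written rule: "<property keyword> [of] <value> <unit>".
PATTERN = re.compile(
    r"(?P<prop>melting point|m\.p\.|mp)\s*(?:of|=|:)?\s*"
    r"(?P<value>\d+(?:\.\d+)?)\s*°?\s*(?P<unit>C|K)\b",
    re.IGNORECASE,
)

def extract_melting_points(text):
    """Return a (value, unit) tuple for every melting-point mention the rule matches."""
    return [
        (float(m.group("value")), m.group("unit").upper())
        for m in PATTERN.finditer(text)
    ]

# Illustrative sentences (invented for this sketch).
sentences = [
    "The compound has a melting point of 145 °C.",
    "m.p. 98.5 C (from ethanol).",
    "The solid melts at 210 °C.",  # phrasing the rule does not cover
]

results = [extract_melting_points(s) for s in sentences]
for sentence, found in zip(sentences, results):
    print(f"{sentence!r} -> {found}")
```

The first two sentences follow the formulaic wording and are captured; the third is not, because "melts at" was never written into the rule. Covering every such variant by hand is what makes purely rule-based approaches hard to generalise across domains.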
Recently, Cole and co-workers developed ChemDataExtractor (CDE), open-source software for the automatic extraction of chemical information.25 CDE implements supervised and unsupervised ML models throughout the entire natural language processing (NLP) and named entity recognition (NER) workflow (see Figure 12.2). NER is used to find named entities within the text, which can include chemical species, processes, synthetic conditions, material properties, and characterisation methods, and can be performed using ML23 and hybrid ML, dictionary, and rule-based approaches.19 A significant challenge of text mining is reliably creating a unique identifier for each entity, i.e. matching synonymous terms, which is called entity normalisation. Weston et al. performed tokenisation using CDE, NER using a neural network (NN), and entity normalisation using word embeddings (a machine-learned mapping of each word onto a dense vector space based on word co-occurrences in text) on the materials science literature, working toward a structured and accessible database of materials.22 Similarly, Tshitoyan et al. learned a word embedding from the text of scientific abstracts and used it to uncover relationships between materials and their potential properties.24 As a result of its inbuilt flexibility, CDE is useful in many subdomains. In the field of organic materials, Cole and co-workers applied CDE to the discovery of optimal organic dye pairs for co-sensitised solar cells.26 They text-mined 9431 dye candidates into a database that linked chemical structure with optical absorption peak and molar extinction coefficient. A high-throughput screening workflow was then applied to this initial database to identify suitable candidate dye pairs, and the improved power conversion efficiency of the top candidates was validated experimentally.
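The idea of using co-occurrence statistics to match synonymous terms can be illustrated with a small sketch. This is not the method of Weston et al. or Tshitoyan et al., who used learned dense embeddings; as a simplifying assumption, it uses raw co-occurrence counts as word vectors and compares them by cosine similarity. The toy corpus, the window size, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Toy corpus (invented for this sketch): "tio2" and "titania" name the same
# material and appear in the same contexts; "graphene" does not.
corpus = [
    "tio2 films were annealed in air",
    "titania films were annealed in air",
    "tio2 nanoparticles absorb ultraviolet light",
    "titania nanoparticles absorb ultraviolet light",
    "graphene sheets conduct electrons rapidly",
]

def cooccurrence_vectors(sentences, window=2):
    """Map each word to a sparse vector counting the words seen within `window` tokens."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0

vectors = cooccurrence_vectors(corpus)
sim_synonyms = cosine(vectors["tio2"], vectors["titania"])
sim_unrelated = cosine(vectors["tio2"], vectors["graphene"])
print(f"tio2 vs titania:  {sim_synonyms:.2f}")
print(f"tio2 vs graphene: {sim_unrelated:.2f}")
```

In this toy setting the two synonyms share all their contexts and therefore score much higher than the unrelated pair, which is the signal an entity-normalisation step can use to assign both terms the same identifier. A realistic workflow would train dense embeddings on a large corpus rather than rely on raw counts.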