Analytics and Tech Mining for Engineering Managers by Jan H. Kwakkel and Scott W. Cunningham
Author: Jan H. Kwakkel, Scott W. Cunningham
Language: eng
Format: epub
Publisher: Momentum Press
Published: 2018-09-25T16:00:00+00:00
CHAPTER 6
PARSING TREE-STRUCTURED FILES
This chapter has two high-level objectives. The first is to discuss character encoding. Until now, we have discussed text as if there were a single, universal way to represent it on our computers. In a multinational and multilingual world, it is important both to understand the nature of text encoding and to be able to handle various encodings when doing text mining. The second objective is to expand our repertoire of parsing routines. We’ve given widely applicable examples of how to parse row-structured and column-structured files. Now it is time to turn to tree-structured data formats.
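To make the encoding point concrete, here is a minimal sketch using only Python's built-in `str.encode` and `bytes.decode`. The sample string is an assumption chosen for illustration and does not come from the book. It shows that the same characters map to different bytes under different encodings, and that decoding with the wrong encoding either garbles the text silently or fails outright.

```python
# Illustrative sample text (an assumption, not taken from the book).
text = "Müller café"

# The same characters become different byte sequences per encoding.
utf8_bytes = text.encode("utf-8")      # non-ASCII letters take 2 bytes each
latin1_bytes = text.encode("latin-1")  # every character takes 1 byte

# Decoding UTF-8 bytes as Latin-1 raises no error but garbles the text.
mojibake = utf8_bytes.decode("latin-1")

# Decoding Latin-1 bytes as UTF-8 fails, because the bytes are not valid UTF-8.
try:
    latin1_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A tolerant decode substitutes U+FFFD for each undecodable sequence.
recovered = latin1_bytes.decode("utf-8", errors="replace")

print(len(text), len(utf8_bytes), len(latin1_bytes))
print(mojibake)
print(strict_ok, recovered)
```

In practice, the safest habit is probably to state the encoding explicitly whenever a file is opened, rather than relying on the platform default.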
Many richly annotated forms of media are stored in a tree-structured format, and many of them are highly relevant for monitoring science and technology. Web pages, such as news sites and wiki pages, provide valuable information. Home pages of firms are also of strong interest. In addition, a lot of science and technology content is in the form of PDF files. Until recently, the PDF format has not been very accessible for machine reading. This chapter provides tools useful for mining PDF files.
We’ll be discussing the BeautifulSoup, pdfminer, and xmltodict packages in the examples to follow. These are all packages specifically adapted to the needs of reading tree-structured files and formats. Let us first discuss tree-structured files.
Trees have nodes and links. The terminology of trees resembles that of a family: there are parent and child nodes, and two children of the same parent are siblings. Children can also be parents, with their own children, extending the tree to multiple layers. Links between elements are preserved by container elements such as lists or dictionaries. If the children are to be accessed for any special purpose, or if the data is well structured, a dictionary is often used. Otherwise, a list is more commonly seen. More rarely, you may find children whose ordering must be preserved; more specialized structures, including tables, are then seen.
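The parent, child, and sibling vocabulary can be sketched with plain Python containers. The record below is hypothetical, as are the key names `heading` and `sections` and the helper functions `count_nodes` and `depth`. Dictionaries hold children that are looked up by name, and lists hold children that are simply enumerated.

```python
# Hypothetical document record; all names and values are illustrative.
tree = {
    "title": "Example article",
    "authors": ["A. Author", "B. Author"],  # list: children without names
    "sections": [
        {"heading": "Introduction", "sections": []},
        {"heading": "Methods", "sections": [
            {"heading": "Data", "sections": []},   # "Data" and "Model"
            {"heading": "Model", "sections": []},  # are siblings
        ]},
    ],
}

def count_nodes(node):
    """Count this node plus all of its descendants in the section tree."""
    return 1 + sum(count_nodes(child) for child in node.get("sections", []))

def depth(node):
    """Return the number of layers from this node down to its deepest leaf."""
    children = node.get("sections", [])
    return 1 + max((depth(child) for child in children), default=0)

# Named access through a dictionary key, positional access through a list.
methods = tree["sections"][1]
siblings = [child["heading"] for child in methods["sections"]]

print(count_nodes(tree), depth(tree), siblings)
```

Because every child has the same shape as its parent, a short recursive function is enough to visit the whole tree, whatever its depth.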
In another departure, this chapter addresses parsing the whole text instead of just an abstract. Whole texts are usually stored and structured as a tree. Much like an outline, a text tree contains sections and subsections, each embedded in the whole text. In addition, whole texts often contain metadata or other nonreadable elements that are used in rendering the document. Whole texts also warrant new styles of analysis.
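The outline-like structure of a whole text can be sketched with a small XML fragment. This example uses `xml.etree.ElementTree` from the Python standard library, which is not one of the packages discussed in this chapter. The tag names, attributes, and the helper `outline` are assumptions made up for illustration. The sketch recovers the section outline, skips the nonreadable `meta` element, and collects the readable paragraphs.

```python
import xml.etree.ElementTree as ET

# Hypothetical whole-text document; tags and attributes are illustrative.
xml_text = """<article>
  <meta name="generator" content="example"/>
  <section title="Introduction">
    <p>Opening text.</p>
  </section>
  <section title="Methods">
    <section title="Data"><p>Data text.</p></section>
    <section title="Model"><p>Model text.</p></section>
  </section>
</article>"""

root = ET.fromstring(xml_text)

def outline(element, level=0):
    """Return section titles, indented by two spaces per nesting level."""
    lines = []
    for child in element:
        if child.tag == "section":  # metadata and paragraphs are skipped
            lines.append("  " * level + child.get("title"))
            lines.extend(outline(child, level + 1))
    return lines

headings = outline(root)

# iter() visits matching descendants in document order at any depth.
paragraphs = [p.text for p in root.iter("p")]

print("\n".join(headings))
print(paragraphs)
```

Separating the outline from the readable text in this way keeps rendering metadata out of any later text analysis.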
