Applied Computing and Information Technology by Unknown
Author:Unknown
Language: eng
ISBN: 9783030252175
Publisher: Springer International Publishing
2 Word Representation
2.1 Bag of Words
In the bag-of-words representation, the attributes are the unique words extracted from all documents in a data set, normally with stop words excluded. Given a set of attributes, the attribute values for each record (one record per document) can be as simple as the word frequencies in that document, or more elaborate measures such as term frequency-inverse document frequency (TF-IDF) [10, 11]. Some studies used bags of word n-grams instead of unigrams [1, 10].
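A minimal sketch of this representation, assuming a hypothetical three-document corpus, a hypothetical stop-word list, and whitespace tokenization. The TF-IDF weighting shown (raw count times log(N / df)) is one common variant and not necessarily the one used in [10, 11].

```python
import math
from collections import Counter

# Hypothetical toy corpus and stop-word list, for illustration only.
DOCS = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a dog and a cat",
]
STOP_WORDS = {"the", "on", "a", "and"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def bag_of_words(docs):
    """Return (vocabulary, rows); each row holds the raw word counts of one document."""
    tokenized = [tokenize(d) for d in docs]
    # The attributes are the unique words across all documents.
    vocab = sorted({w for tokens in tokenized for w in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[w] for w in vocab])
    return vocab, rows


def tf_idf(rows):
    """Weight raw counts by log(N / df), an assumed TF-IDF variant."""
    n_docs = len(rows)
    n_terms = len(rows[0])
    # Document frequency: number of documents containing each word.
    df = [sum(1 for row in rows if row[j] > 0) for j in range(n_terms)]
    return [
        [row[j] * math.log(n_docs / df[j]) for j in range(n_terms)]
        for row in rows
    ]


vocab, counts = bag_of_words(DOCS)
weights = tf_idf(counts)
```

In this toy corpus the word cat occurs in every document, so its TF-IDF weight is zero everywhere, while a word confined to a single document keeps a positive weight.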
Among the extracted words, some may be inflectional forms of others, such as go, went, gone, and going. Taken individually, their frequencies are usually quite low, which leads to high data variation. Word normalization reduces this variation by aggregating the frequencies of inflected words into the frequencies of their roots. There are two approaches to word normalization: stemming and lemmatization.
Stemming strips prefixes and/or suffixes from inflected words to obtain their roots, or stems. For example, the stem of going is go. The stripping is performed without regard to vocabulary, word context, or part of speech. As a result, a stem need not be a meaningful word. A notable rule-based stemmer for English is the Porter stemmer. Porter later developed a framework called Snowball [12], which allows stemming algorithms to be written as high-level Snowball scripts and compiled into mainstream programming languages including C, Java, and Python. The framework also provides a collection of stemmers for various natural languages, including Porter2, his own improved version of the Porter stemmer. Other well-known English stemmers are Lovins and Paice/Husk [13].
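The idea of context-free suffix stripping can be sketched as follows. This is a deliberately naive illustration and not the Porter, Porter2, Lovins, or Paice/Husk algorithm; the suffix list, the minimum stem length, and the name naive_stem are all assumptions made for the example.

```python
# Hypothetical suffix list, checked in order; NOT the Porter rule set.
SUFFIXES = ("ing", "ed", "es", "s")
# Assumed guard so that very short words are left untouched.
MIN_STEM_LENGTH = 2


def naive_stem(word):
    """Strip the first matching suffix, ignoring vocabulary and context."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


stems = {w: naive_stem(w) for w in ["going", "goes", "studies", "went"]}
```

Here going and goes both reduce to go, studies reduces to studi (a stem that is not a meaningful word), and the irregular form went is left unchanged because no suffix rule applies to it.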
In contrast, lemmatization makes use of word context and part of speech. One simple method is to scan a lexical database, such as WordNet, to find all variants of a word and determine the root, or lemma, from them. Consequently, every lemma is a meaningful word. Unlike stemming, lemmatization can find the lemma of an irregular inflection. For example, go would be found as the lemma, but not the stem, of went.
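A minimal sketch of dictionary-based lemmatization, assuming a tiny hand-written lexicon as a stand-in for a real lexical database such as WordNet. The LEXICON table, the part-of-speech tags, and the function names are hypothetical. The last function also illustrates how normalization aggregates the frequencies of inflected forms into the frequency of their lemma.

```python
from collections import Counter

# Hypothetical miniature lexicon: (word form, part of speech) -> lemma.
# A real system would query a lexical database instead of this table.
LEXICON = {
    ("go", "VERB"): "go",
    ("goes", "VERB"): "go",
    ("went", "VERB"): "go",
    ("gone", "VERB"): "go",
    ("going", "VERB"): "go",
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
}


def lemmatize(word, pos):
    """Look up the lemma for a word and its part of speech.

    Forms missing from the lexicon are returned unchanged (lowercased).
    """
    key = (word.lower(), pos)
    return LEXICON.get(key, word.lower())


def lemma_counts(tagged_words):
    """Aggregate the frequencies of inflected forms into their lemmas."""
    return Counter(lemmatize(word, pos) for word, pos in tagged_words)


tagged = [
    ("went", "VERB"),
    ("going", "VERB"),
    ("gone", "VERB"),
    ("saw", "NOUN"),
    ("saw", "VERB"),
]
counts = lemma_counts(tagged)
```

The three forms of go collapse into a single count, and the two uses of saw map to different lemmas depending on the part of speech, which a context-free stemmer could not distinguish.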
