Applied Computing and Information Technology by Unknown

Author: Unknown
Language: eng
Format: epub
ISBN: 9783030252175
Publisher: Springer International Publishing


2 Word Representation

2.1 Bag of Words

In the bag-of-words representation, the attributes are the unique words extracted from all documents in a data set, normally with stop words excluded. Given this set of attributes, the attribute values for each record, which represents one document, can be as simple as the word frequencies in that document or as elaborate as weighted measures such as term frequency-inverse document frequency (TF-IDF) [10, 11]. Some studies have used bags of word n-grams instead of unigrams [1, 10].
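As a minimal sketch of this representation, the following Python code builds a word-frequency table from a few toy documents and then reweights it with TF-IDF. The stop-word list, the whitespace tokenizer, the sample documents, and the function names are all illustrative assumptions. The weighting tf × log(N/df) is only one common TF-IDF variant and may differ from the definitions used in [10, 11].

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "the", "is", "on", "of", "and"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def bag_of_words(docs):
    """Return the attribute list (vocabulary) and one frequency row per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    rows = [[c[w] for w in vocab] for c in counts]
    return vocab, rows

def tf_idf(rows):
    """Weight raw frequencies by log(N / df), one common TF-IDF variant."""
    n_docs = len(rows)
    n_terms = len(rows[0]) if rows else 0
    # Document frequency: number of documents containing each attribute.
    df = [sum(1 for r in rows if r[j] > 0) for j in range(n_terms)]
    return [[r[j] * math.log(n_docs / df[j]) for j in range(n_terms)]
            for r in rows]

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
vocab, rows = bag_of_words(docs)
weights = tf_idf(rows)
print(vocab)    # ['cat', 'chased', 'dog', 'log', 'mat', 'sat']
print(rows[0])  # [1, 0, 0, 0, 1, 1]
```

Under this weighting, a word that appears in only one document (such as mat) receives a larger weight than a word shared by several documents (such as cat), even when their raw frequencies are equal.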

Among the extracted words, some may be inflectional forms of others, such as go, went, gone, and going. The frequency of each individual form is usually quite low, which leads to high data variation. Word normalization helps reduce this variation by aggregating the frequencies of inflected words into the frequencies of their roots. There are two approaches to word normalization: stemming and lemmatization.

Stemming strips prefixes and/or suffixes from inflected words to obtain their roots, or stems. For example, the stem of going is go. The stripping is performed without regard to vocabulary, word context, or part of speech. As a result, a stem need not be a meaningful word. A notable rule-based stemmer for English is the Porter stemmer. Porter later developed a framework called Snowball [12], which allows stemming algorithms to be written as high-level Snowball scripts and compiled into mainstream programming languages including C, Java, and Python. The framework also provides a collection of stemmers for various natural languages, including Porter2, his own improved version of the Porter stemmer. Other well-known English stemmers are Lovins and Paice/Husk [13].
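The suffix stripping described above, together with the aggregation of frequencies into roots, can be illustrated with a deliberately naive sketch. This is not the Porter, Porter2, Lovins, or Paice/Husk algorithm. The suffix list, the minimum stem length of two characters, and the names toy_stem and aggregate are hypothetical choices made only for this example.

```python
from collections import Counter

# Hypothetical rule list, checked in order (longer suffixes first).
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, ignoring vocabulary and part of speech."""
    for suffix in SUFFIXES:
        # Keep at least two characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

def aggregate(freqs, normalize):
    """Sum the frequencies of all words that normalize to the same root."""
    merged = Counter()
    for word, count in freqs.items():
        merged[normalize(word)] += count
    return merged

freqs = Counter({"go": 3, "going": 2, "goes": 1, "went": 1, "gone": 1})
stem_freqs = aggregate(freqs, toy_stem)
print(stem_freqs["go"])     # 6: go, going, and goes are merged
print(toy_stem("studies"))  # studi, a stem that is not a meaningful word
print(toy_stem("went"))     # went, the irregular form is left unchanged
```

Because the rules look only at the surface form of a word, the irregular forms went and gone keep their own low frequencies instead of being merged into go.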

In contrast, lemmatization takes word context and part of speech into account. One simple method is to scan an entire lexical database, such as WordNet, to find all variants of a word and determine the root, or lemma, from them. Every lemma is therefore a meaningful word. Unlike stemming, lemmatization can find the lemma of an irregularly inflected word. For example, go would be found as the lemma, but not the stem, of went.
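The database-scan method can be sketched with a hand-made lexicon standing in for a real lexical database. The LEXICON contents, its (lemma, part of speech) key layout, and the lemmatize function are hypothetical and do not reflect the structure or API of WordNet.

```python
# Hypothetical miniature lexicon: (lemma, part of speech) -> known word forms.
LEXICON = {
    ("go", "verb"): {"go", "goes", "went", "gone", "going"},
    ("see", "verb"): {"see", "sees", "saw", "seen", "seeing"},
    ("saw", "noun"): {"saw", "saws"},
}

def lemmatize(word, pos):
    """Scan every entry and return the lemma whose forms include the word."""
    for (lemma, lemma_pos), forms in LEXICON.items():
        if lemma_pos == pos and word in forms:
            return lemma
    # Unknown words are returned unchanged in this sketch.
    return word

print(lemmatize("went", "verb"))  # go
print(lemmatize("saw", "verb"))   # see
print(lemmatize("saw", "noun"))   # saw
```

The two results for saw show why part of speech matters: the same surface form maps to different lemmas depending on how it is used in context.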


