Text Mining and Analysis: Practical Methods, Examples, and Case Studies Using SAS by Chakraborty Goutam & Pagolu Murali & Garla Satish

Author: Chakraborty, Goutam & Pagolu, Murali & Garla, Satish
Language: eng
Format: azw3
ISBN: 9781612907871
Publisher: SAS Institute
Published: 2013-10-24T16:00:00+00:00


Rule-based categorizers do not require training documents, so you can skip setting the training path. If you want to keep the documents from all categories and subcategories in a single folder, check Identical Path in Propagate Options. Alternatively, you can create a taxonomy from an existing folder structure on your physical drive: right-click English, and select Create Categorizer from Directories.

Display 7.7: Defining a Categorizer from an Existing Folder Structure
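The idea of deriving a taxonomy from a folder structure can be sketched outside the product. The Python below is only an illustration, not what SAS Content Categorization Studio runs internally: the function name taxonomy_from_directories and the Charts subfolder are hypothetical, and the sketch simply mirrors each subfolder as a category in a nested dictionary.

```python
import os
import tempfile


def taxonomy_from_directories(root):
    """Return a nested dict in which every subfolder of root is a category."""
    tree = {}
    for entry in sorted(os.listdir(root)):
        path = os.path.join(root, entry)
        if os.path.isdir(path):
            # A subfolder becomes a category; its own subfolders become subcategories.
            tree[entry] = taxonomy_from_directories(path)
    return tree


# Build a small throwaway folder structure and read the taxonomy back from it.
with tempfile.TemporaryDirectory() as root:
    for sub in ("Stats", "DataMining", os.path.join("Reports", "Charts")):
        os.makedirs(os.path.join(root, sub))
    taxonomy = taxonomy_from_directories(root)

print(taxonomy)
```

Files inside the folders are ignored here; only the directory names contribute to the category tree.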

Statistical Categorizer

A statistical categorizer helps users automatically build a categorization model without having to write any rules. To train the model, it requires a set of documents pre-identified for each category in the taxonomy. When the model is trained for a particular category, documents from all categories in the taxonomy are considered in the analysis. A statistical categorizer tries to find uniquely identifiable terms that describe a category while making sure they do not match any other category. As a result, any change to the training documents in one category also affects the rules for the other categories, so you need to rebuild the model whenever the training documents change. Once the model is trained, you can test it on other documents to verify its accuracy. The underlying rules of a statistical categorizer cannot be viewed, which is why it is also called a black box model.

The statistical categorizer performs well when the number of categories is limited and the categories differ significantly from each other in content. For example, if the categories are sports, media, events, and entertainment, their content overlaps, and the statistical categorizer might not categorize documents precisely. The biggest advantage of a statistical model is that it is the easiest to develop, requiring just a set of well-collected training documents. Its major drawback is that, because concepts are difficult to match on the basis of a statistical measure, a high level of accuracy is hard to achieve.
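Because the real model is a black box, the following Python is only a toy illustration of the principle described above, not the algorithm the product uses. It assumes that a "uniquely identifiable term" is simply a word that occurs in one category's training documents and in no other category's; the function names and the sample sentences are made up for the example. It shows why adding a document to one category changes the terms available to another.

```python
import re


def tokenize(text):
    """Lowercase a document and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def distinctive_terms(training_docs):
    """Map each category to the words found only in its own training documents."""
    vocab = {
        cat: set().union(*(tokenize(doc) for doc in docs))
        for cat, docs in training_docs.items()
    }
    result = {}
    for cat, terms in vocab.items():
        # Words used by any other category cannot identify this one.
        others = set().union(*(v for c, v in vocab.items() if c != cat))
        result[cat] = terms - others
    return result


# Made-up one-line training documents for two of the section categories.
training = {
    "Stats": ["regression models for survey data"],
    "Reports": ["dashboard charts for survey data"],
}
before = distinctive_terms(training)

# Adding a document to Reports removes "regression" from the Stats terms.
training["Reports"].append("charts of regression output")
after = distinctive_terms(training)

print(before["Stats"], after["Stats"])
```

Only the Reports training set changed, yet the terms for Stats shrank as well, which is the reason the whole model has to be rebuilt after any change to the training documents.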

Here is a brief demonstration of how to build a statistical categorizer-based model. The statistical categorizer needs a training set of documents to build the model. For this purpose, we have extracted 466 SAS Global Forum paper abstracts published in the past three years from five different sections. The five section categories representing the 466 paper abstracts are Stats (Statistics and Data Analysis), DataMining (Data Mining and Predictive Modeling), Reports (Reporting and Information Visualization), BusInt (Business Intelligence), and SysArch (Systems Architecture). These abstracts are split into Test and Train groups. Each group folder contains subfolders representing the five section-based categories, and each subfolder holds the corresponding paper abstracts as raw files. SAS Content Categorization Studio requires input textual comments in .xml or .txt format. If your textual data is in a SAS data set, you need to create a separate .xml or .txt file for each textual comment in the data set. Refer to the section “Appendix” in this chapter for SAS code that writes each observation in the SAS data set to a separate text file.
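The folder layout described above (a group folder, a category subfolder, and one file per abstract) can be sketched as follows. This is not the SAS code from the chapter's appendix; it is a minimal Python illustration that assumes the observations have already been exported as records with hypothetical field names id, group, category, and text. The record contents are placeholders, not real abstracts.

```python
import os
import tempfile


def write_documents(records, out_root):
    """Write each record to <out_root>/<group>/<category>/<id>.txt."""
    paths = []
    for rec in records:
        folder = os.path.join(out_root, rec["group"], rec["category"])
        os.makedirs(folder, exist_ok=True)  # create Train/Stats, Test/BusInt, ...
        path = os.path.join(folder, f"{rec['id']}.txt")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(rec["text"])
        paths.append(path)
    return paths


# Placeholder records standing in for observations of a data set.
records = [
    {"id": "doc001", "group": "Train", "category": "Stats", "text": "placeholder abstract 1"},
    {"id": "doc002", "group": "Train", "category": "DataMining", "text": "placeholder abstract 2"},
    {"id": "doc003", "group": "Test", "category": "Stats", "text": "placeholder abstract 3"},
]

out_root = tempfile.mkdtemp()  # throwaway output folder for the demonstration
written = write_documents(records, out_root)
print(len(written), "files written under", out_root)
```

Using a unique identifier as the file name keeps one file per textual comment, which is the input the tool expects.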

Create a new project in SAS Content Categorization Studio, enable the statistical categorizer, and create these five categories with the same names.


