Text Mining and Analysis: Practical Methods, Examples, and Case Studies Using SAS by Chakraborty Goutam & Pagolu Murali & Garla Satish
Author: Chakraborty, Goutam & Pagolu, Murali & Garla, Satish
Language: eng
Format: azw3
ISBN: 9781612907871
Publisher: SAS Institute
Published: 2013-10-24T16:00:00+00:00
Rule-based categorizers do not require training documents, so you can skip setting the training path. If you would like to keep the documents from all categories and subcategories in a single folder, check Identical Path in Propagate Options. Conversely, you can create a taxonomy from an existing folder structure on your physical drive. Right-click English, and select Create Categorizer from Directories.
Display 7.7: Defining a Categorizer from an Existing Folder Structure
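The idea of deriving a taxonomy from a folder structure can be sketched outside the product. The following Python snippet is only an illustration, not what SAS Content Categorization Studio does internally: it assumes that every subfolder is a category and every nested subfolder is a subcategory. The function name `taxonomy_from_directories` and the subfolder names Charts and Dashboards are hypothetical.

```python
import os
import tempfile


def taxonomy_from_directories(root):
    """Return a nested dict that mirrors the subfolders found under root.

    Assumption: each folder is treated as a category and each nested
    folder as a subcategory. Files are ignored.
    """
    tree = {}
    for entry in sorted(os.listdir(root)):
        path = os.path.join(root, entry)
        if os.path.isdir(path):
            tree[entry] = taxonomy_from_directories(path)
    return tree


# Build a throwaway folder structure so the sketch runs anywhere.
# "Charts" and "Dashboards" are made-up subcategories for illustration.
with tempfile.TemporaryDirectory() as root:
    for parts in [("Stats",), ("Reports", "Charts"), ("Reports", "Dashboards")]:
        os.makedirs(os.path.join(root, *parts), exist_ok=True)
    demo_taxonomy = taxonomy_from_directories(root)

print(demo_taxonomy)
```

A nested dictionary is used here only because it maps naturally onto categories and subcategories; any tree structure would serve.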
Statistical Categorizer
A statistical categorizer lets users build a categorization model automatically, without writing any rules. To train the model, it requires a set of documents pre-identified for each category in the taxonomy. When the model is trained for a particular category, documents from all categories in the taxonomy are considered in the analysis: the categorizer tries to find terms that uniquely identify a category while making sure they do not match any other category. As a result, any change to the training documents in one category also affects the rules for the other categories, and you need to rebuild the model after such a change. Once the model is trained, you can test it on other documents to verify its accuracy. The underlying rules of a statistical categorizer cannot be viewed, which is why it is also called a black box model. The statistical categorizer performs well when the number of categories is limited and the categories differ significantly from each other in content. For example, if the categories are sports, media, events, and entertainment, which are likely to share much of their vocabulary, the statistical categorizer might not categorize documents precisely. The biggest advantage of a statistical model is that it is the easiest to develop, requiring just a set of well-collected training documents. Its major drawback is that matching concepts on the basis of a statistical measure makes it difficult to achieve a high level of accuracy.
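To see why a change in one category's training documents affects the other categories, consider a deliberately simplified Python sketch. It is not the algorithm used by the statistical categorizer, whose rules cannot be viewed. It assumes only that a term is "distinctive" for a category when it occurs in that category's training documents and in no other category's. The function name `distinctive_terms` and the tiny training texts are made up for illustration.

```python
def distinctive_terms(training_docs):
    """Map each category to the terms that occur in its documents only.

    Assumption: whitespace tokenization and exact matching stand in for
    whatever statistical measure a real categorizer would use.
    """
    terms_by_category = {
        category: {word for doc in docs for word in doc.lower().split()}
        for category, docs in training_docs.items()
    }
    result = {}
    for category, terms in terms_by_category.items():
        other_terms = set()
        for other, candidate_terms in terms_by_category.items():
            if other != category:
                other_terms |= candidate_terms
        result[category] = terms - other_terms
    return result


# Hypothetical training documents for two of the section categories.
training = {
    "Stats": ["regression model variance", "mixed model variance"],
    "Reports": ["dashboard chart layout", "chart export"],
}
before = distinctive_terms(training)

# Add one document to Reports only, then rebuild.
training["Reports"].append("chart of model output")
after = distinctive_terms(training)

print(sorted(before["Stats"]))
print(sorted(after["Stats"]))
```

Only the Reports training set changed, yet "model" is no longer distinctive for Stats after the rebuild. This is the dependency between categories that makes rebuilding the whole model necessary.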
Here is a brief demonstration of how to build a model based on the statistical categorizer. The statistical categorizer needs a training set of documents to build the model. For this purpose, we have extracted 466 SAS Global Forum paper abstracts published in the past three years from five different sections. The five section categories representing the 466 paper abstracts are Stats (Statistics and Data Analysis), DataMining (Data Mining and Predictive Modeling), Reports (Reporting and Information Visualization), BusInt (Business Intelligence), and SysArch (Systems Architecture). These abstracts are split into Test and Train groups. Each group folder contains five subfolders, one per section category, and each subfolder holds the corresponding paper abstracts as raw files. SAS Content Categorization Studio requires input textual comments in .xml or .txt format. If your textual data is in a SAS data set, you need to create a separate .xml or .txt file for each textual comment in the data set. Refer to the section “Appendix” in this chapter for SAS code that writes each observation in the SAS data set to a separate text file.
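The appendix code is written in SAS and is not reproduced here. As a rough illustration of the same idea, the following Python sketch writes each record to its own .txt file inside a subfolder named after its category. The function name `write_comments_to_files`, the record layout (identifier, category, text), the file-naming scheme, and the placeholder abstract texts are all assumptions made for this example.

```python
import os
import tempfile


def write_comments_to_files(records, out_dir):
    """Write each (doc_id, category, text) record to out_dir/category/doc_id.txt.

    Returns the list of file paths written, in input order.
    """
    written = []
    for doc_id, category, text in records:
        category_dir = os.path.join(out_dir, category)
        os.makedirs(category_dir, exist_ok=True)
        path = os.path.join(category_dir, f"{doc_id}.txt")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)
        written.append(path)
    return written


# Placeholder records; real input would come from your own data set.
records = [
    ("abs001", "Stats", "Placeholder abstract text about regression."),
    ("abs002", "BusInt", "Placeholder abstract text about dashboards."),
]

base_dir = tempfile.mkdtemp()
paths = write_comments_to_files(records, os.path.join(base_dir, "Train"))
print(paths)
```

Running the function once for the Train group and once for the Test group would give the two group folders, each with one subfolder per category, that the demonstration relies on.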
Create a new project in SAS Content Categorization Studio, enable the statistical categorizer, and create these five categories with the same names.
