Python Natural Language Processing Cookbook by Zhenya Antić
Author: Zhenya Antić
Language: eng
Format: epub
Publisher: Packt Publishing Pvt. Ltd
Published: 2021-04-22
In step 1, we import the necessary packages and functions. In step 2, we initialize the global variables. In step 3, we define the read_in_csv function, which reads in the file and returns the data as a list. In step 4, we define the tokenize_and_stem function, which reads in a sentence, splits it into words, removes tokens that are punctuation, and finally stems the resulting tokens.
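A minimal sketch of these two functions, assuming the CSV is read with Python's built-in csv module and that NLTK's Snowball stemmer is the stemmer in use (both are assumptions, not necessarily the exact choices in the recipe):

import csv
import string
import nltk
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

def read_in_csv(csv_file):
    # Read the CSV file and return its rows as a list
    with open(csv_file, "r", encoding="utf-8") as f:
        data = list(csv.reader(f))
    return data

def tokenize_and_stem(sentence):
    # Split the sentence into word tokens
    tokens = nltk.tokenize.word_tokenize(sentence)
    # Remove tokens that are pure punctuation
    tokens = [t for t in tokens if t not in string.punctuation]
    # Stem the remaining tokens
    return [stemmer.stem(t) for t in tokens]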
In step 5, we define the get_stopwords function, which reads in the stopwords file and returns the stopwords in a list, and in step 6 we use it to get the stopwords list.
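A possible implementation of get_stopwords, assuming a plain text file with one stopword per line (the file name below is hypothetical):

def get_stopwords(path):
    # Read the stopwords file and return its entries as a list
    with open(path, "r", encoding="utf-8") as f:
        stopwords = [line.strip() for line in f if line.strip()]
    return stopwords

stopwords = get_stopwords("stopwords.csv")  # hypothetical file name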
In step 7, we define the get_data function, which turns the CSV input into a dictionary, where the keys are the five topics and the values are lists of texts for that topic.
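A sketch of get_data, under the assumption that each CSV row holds the topic in the first column and the text in the second, with a header row to skip:

def get_data(filename):
    # Map each topic to the list of texts belonging to it
    data = read_in_csv(filename)
    data_dict = {}
    for row in data[1:]:          # skip the header row (assumed)
        topic, text = row[0], row[1]
        data_dict.setdefault(topic, []).append(text)
    return data_dict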
In step 8, we print out the number of texts for each topic. Since business and sports have the most examples, we use these two topics for classification.
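Printing the counts can be as simple as iterating over the dictionary (the file name is hypothetical):

data_dict = get_data("bbc-text.csv")  # hypothetical file name
for topic, texts in data_dict.items():
    print(topic, "\t", len(texts))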
In step 9, we create the get_stats function. In the function, we tokenize the text, then remove all stopwords and words that include characters other than letters of the English alphabet, and finally we create a FreqDist object, which provides information about the most frequent words in a text. In step 10, we get the data and compare business and sports news using the get_stats function. We see that the distributions are quite different for business and sports, although they share some frequent words, such as "world".
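The get_stats function might look like the following sketch; the regular expression used to keep only English-alphabet words and the number of words printed are assumptions:

import re
import nltk
from nltk import FreqDist

def get_stats(text, num_words=200):
    # Tokenize the text
    word_list = nltk.tokenize.word_tokenize(text)
    # Keep only words made of English letters that are not stopwords
    word_list = [w for w in word_list
                 if w.lower() not in stopwords
                 and re.fullmatch(r"[a-zA-Z]+", w)]
    # Build the frequency distribution and show the most common words
    freq_dist = FreqDist(word_list)
    print(freq_dist.most_common(num_words))
    return freq_dist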
In step 11, we define the create_vectorizer function, which will encode our texts as vectors for use with the classifier. In it, we use the TfidfVectorizer class as in Chapter 3, Representing Text: Capturing Semantics. In step 12, we create the split_test_train function, which will split the dataset into training and test sets. The function takes in the data and the percentage of the data to be used for training. It calculates the border where the list needs to be split and uses it to create two lists, one for training and the other for testing.
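The two functions could be sketched as follows; the exact TfidfVectorizer arguments are assumptions (the recipe may pass a different configuration):

from sklearn.feature_extraction.text import TfidfVectorizer

def create_vectorizer(text_list):
    # Fit a TF-IDF vectorizer on the supplied texts
    vectorizer = TfidfVectorizer(tokenizer=tokenize_and_stem,
                                 stop_words=stopwords)
    vectorizer.fit(text_list)
    return vectorizer

def split_test_train(data, train_percent):
    # Compute the index where the list is split into training and test parts
    border = int(train_percent * len(data))
    return data[:border], data[border:]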
In step 13, we split both the business and sports news lists, with 80% of the data reserved for training and 20% for testing. We then create the vectorizer using both the business and sports news training data. The creation of the vectorizer is part of the training process, and thus only training data is used.
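In code, that step could look like this (the dictionary keys "business" and "sport" follow the BBC dataset's topic names and are assumptions here):

business_train, business_test = split_test_train(data_dict["business"], 0.8)
sports_train, sports_test = split_test_train(data_dict["sport"], 0.8)

# The vectorizer is fit on training data only
vectorizer = create_vectorizer(business_train + sports_train)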
In step 14, we create the label encoder that will transform text labels into numbers using the LabelEncoder class. In step 15, we define the create_dataset function, which takes in the vectorizer, the input data dictionary, and the label encoder and creates numpy arrays of vector-encoded text and labels for both business and sports news.
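A sketch of the label encoder setup and of create_dataset; it relies on the create_data_matrix helper shown after the next paragraph, and the label names are assumptions:

import numpy as np
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(["business", "sport"])  # assumed label names

def create_dataset(vectorizer, data_dict, le):
    # Vectorize each topic's texts, encode its label, and stack the results
    business_data, business_labels = create_data_matrix(
        data_dict["business"], vectorizer, "business", le)
    sports_data, sports_labels = create_data_matrix(
        data_dict["sport"], vectorizer, "sport", le)
    X = np.vstack([business_data, sports_data])
    y = np.concatenate([business_labels, sports_labels])
    return X, y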
In step 16, we define the create_data_matrix function, which is a helper function to create_dataset. It takes in a list of texts, the vectorizer, the label being used, and the label encoder. It then creates the vector representation of the text using the vectorizer. It also creates a list of labels and then encodes them using the label encoder.
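A possible implementation of create_data_matrix; converting the sparse TF-IDF output to a dense array here is a simplification to make stacking straightforward:

def create_data_matrix(input_data, vectorizer, label, le):
    # Encode the texts as TF-IDF vectors
    vectors = vectorizer.transform(input_data).toarray()
    # One label per text, encoded as integers by the label encoder
    labels = [label] * len(input_data)
    enc_labels = le.transform(labels)
    return vectors, enc_labels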