Python Natural Language Processing Cookbook by Zhenya Antić
Author: Zhenya Antić
Language: eng
Format: epub
Publisher: Packt Publishing Pvt. Ltd
Published: 2021-04-22
In step 1, we import the necessary packages and functions. In step 2, we initialize the global variables. In step 3, we define the read_in_csv function, which reads in the file and returns the data as a list. In step 4, we define the tokenize_and_stem function, which reads in a sentence, splits it into words, removes tokens that are punctuation, and finally stems the resulting tokens.
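A minimal sketch of these two functions, assuming the CSV is read with Python's built-in csv module and that NLTK's Snowball stemmer is the stemmer in use (both are assumptions, not necessarily the exact choices in the recipe):

import csv
import string
import nltk
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

def read_in_csv(csv_file):
    # Read the CSV file and return its rows as a list
    with open(csv_file, "r", encoding="utf-8") as f:
        data = list(csv.reader(f))
    return data

def tokenize_and_stem(sentence):
    # Split the sentence into word tokens
    tokens = nltk.tokenize.word_tokenize(sentence)
    # Remove tokens that are pure punctuation
    tokens = [t for t in tokens if t not in string.punctuation]
    # Stem the remaining tokens
    return [stemmer.stem(t) for t in tokens]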
In step 5, we define the get_stopwords function, which reads in the stopwords file and returns the stopwords in a list, and in step 6 we use it to get the stopwords list.
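A possible implementation of get_stopwords, assuming a plain text file with one stopword per line (the file name below is hypothetical):

def get_stopwords(path):
    # Read the stopwords file and return its entries as a list
    with open(path, "r", encoding="utf-8") as f:
        stopwords = [line.strip() for line in f if line.strip()]
    return stopwords

stopwords = get_stopwords("stopwords.csv")  # hypothetical file name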
In step 7, we define the get_data function, which turns the CSV input into a dictionary, where the keys are the five topics and the values are lists of texts for that topic.
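A sketch of get_data, under the assumption that each CSV row holds the topic in the first column and the text in the second, with a header row to skip:

def get_data(filename):
    # Map each topic to the list of texts belonging to it
    data = read_in_csv(filename)
    data_dict = {}
    for row in data[1:]:          # skip the header row (assumed)
        topic, text = row[0], row[1]
        data_dict.setdefault(topic, []).append(text)
    return data_dict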
In step 8, we print out the number of texts for each topic. Since business and sports have the most examples, we use these two topics for classification.
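Printing the counts can be as simple as iterating over the dictionary (the file name is hypothetical):

data_dict = get_data("bbc-text.csv")  # hypothetical file name
for topic, texts in data_dict.items():
    print(topic, "\t", len(texts))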
In step 9, we create the get_stats function. In the function, we tokenize the text, then remove all stopwords and words that include characters other than letters of the English alphabet, and finally we create a FreqDist object, which provides information about the most frequent words in a text. In step 10, we get the data and compare business and sports news using the get_stats function. We see that the distributions are quite different for business and sports, although they share some frequent words, such as "world".
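The get_stats function might look like the following sketch; the regular expression used to keep only English-alphabet words and the number of words printed are assumptions:

import re
import nltk
from nltk import FreqDist

def get_stats(text, num_words=200):
    # Tokenize the text
    word_list = nltk.tokenize.word_tokenize(text)
    # Keep only words made of English letters that are not stopwords
    word_list = [w for w in word_list
                 if w.lower() not in stopwords
                 and re.fullmatch(r"[a-zA-Z]+", w)]
    # Build the frequency distribution and show the most common words
    freq_dist = FreqDist(word_list)
    print(freq_dist.most_common(num_words))
    return freq_dist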
In step 11, we define the create_vectorizer function, which will encode our texts as vectors for use with the classifier. In it, we use the TfidfVectorizer class as in Chapter 3, Representing Text: Capturing Semantics. In step 12, we create the split_test_train function, which will split the dataset into training and test sets. The function takes in the data and the percentage of the data to be used for training. It calculates the border where the list needs to be split and uses it to create two lists, one for training and the other for testing.
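The two functions could be sketched as follows; the exact TfidfVectorizer arguments are assumptions (the recipe may pass a different configuration):

from sklearn.feature_extraction.text import TfidfVectorizer

def create_vectorizer(text_list):
    # Fit a TF-IDF vectorizer on the supplied texts
    vectorizer = TfidfVectorizer(tokenizer=tokenize_and_stem,
                                 stop_words=stopwords)
    vectorizer.fit(text_list)
    return vectorizer

def split_test_train(data, train_percent):
    # Compute the index where the list is split into training and test parts
    border = int(train_percent * len(data))
    return data[:border], data[border:]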
In step 13, we split both the business and sports news lists, with 80% of the data reserved for training and 20% for testing. We then create the vectorizer using both the business and sports news training data. The creation of the vectorizer is part of the training process, and thus only training data is used.
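In code, that step could look like this (the dictionary keys "business" and "sport" follow the BBC dataset's topic names and are assumptions here):

business_train, business_test = split_test_train(data_dict["business"], 0.8)
sports_train, sports_test = split_test_train(data_dict["sport"], 0.8)

# The vectorizer is fit on training data only
vectorizer = create_vectorizer(business_train + sports_train)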
In step 14, we create the label encoder that will transform text labels into numbers using the LabelEncoder class. In step 15, we define the create_dataset function, which takes in the vectorizer, the input data dictionary, and the label encoder and creates numpy arrays of vector-encoded text and labels for both business and sports news.
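A sketch of the label encoder setup and of create_dataset; it relies on the create_data_matrix helper shown after the next paragraph, and the label names are assumptions:

import numpy as np
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(["business", "sport"])  # assumed label names

def create_dataset(vectorizer, data_dict, le):
    # Vectorize each topic's texts, encode its label, and stack the results
    business_data, business_labels = create_data_matrix(
        data_dict["business"], vectorizer, "business", le)
    sports_data, sports_labels = create_data_matrix(
        data_dict["sport"], vectorizer, "sport", le)
    X = np.vstack([business_data, sports_data])
    y = np.concatenate([business_labels, sports_labels])
    return X, y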
In step 16, we define the create_data_matrix function, which is a helper function to create_dataset. It takes in a list of texts, the vectorizer, the label being used, and the label encoder. It then creates the vector representation of the text using the vectorizer. It also creates a list of labels and then encodes them using the label encoder.
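A possible implementation of create_data_matrix; converting the sparse TF-IDF output to a dense array here is a simplification to make stacking straightforward:

def create_data_matrix(input_data, vectorizer, label, le):
    # Encode the texts as TF-IDF vectors
    vectors = vectorizer.transform(input_data).toarray()
    # One label per text, encoded as integers by the label encoder
    labels = [label] * len(input_data)
    enc_labels = le.transform(labels)
    return vectors, enc_labels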