Python Machine Learning Cookbook by Prateek Joshi


Author: Prateek Joshi
Language: eng
Format: azw3, pdf
Publisher: Packt Publishing
Published: 2016-06-23T04:00:00+00:00


If you want to split these punctuation marks into separate tokens, you need to use the WordPunct tokenizer:

# Create a new WordPunct tokenizer
from nltk.tokenize import WordPunctTokenizer
word_punct_tokenizer = WordPunctTokenizer()
print("\nWord punct tokenizer:")
print(word_punct_tokenizer.tokenize(text))

The full code is in the tokenizer.py file. If you run this code, you will see the tokenized words printed on your Terminal.
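As a quick sketch of the WordPunct tokenizer's behavior (the sample sentence below is illustrative; the book's actual text variable is not shown in this excerpt):

```python
# Minimal, self-contained demo of WordPunctTokenizer, which splits
# punctuation into separate tokens. The sample sentence is an
# assumption, not the book's original input.
from nltk.tokenize import WordPunctTokenizer

sample = "Don't hesitate; it's easy!"
tokens = WordPunctTokenizer().tokenize(sample)
print(tokens)
# -> ['Don', "'", 't', 'hesitate', ';', 'it', "'", 's', 'easy', '!']
```

Note how the apostrophes and the semicolon become tokens of their own, which is exactly what distinguishes this tokenizer from the standard word tokenizer.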

Stemming text data

When we deal with a text document, we encounter different forms of a word. Consider the word "play". This word can appear in various forms, such as "play", "plays", "player", "playing", and so on. These are basically families of words with related meanings. During text analysis, it's useful to extract the base form of these words, which helps us compute statistics to analyze the overall text. The goal of stemming is to reduce these different forms to a common base form. Stemming uses a heuristic process that cuts off the ends of words to extract the base form. Let's see how to do this in Python.
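As a sketch of what this looks like with NLTK's Porter stemmer (one common heuristic stemmer; the word list here is illustrative, not the book's example code):

```python
# Sketch: reducing word forms to a base form with NLTK's Porter stemmer.
# Because stemming is heuristic suffix-stripping, not every related form
# collapses to the same stem: "player" is left unchanged by this algorithm.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["play", "plays", "playing", "player"]:
    print(word, "->", stemmer.stem(word))
# play -> play, plays -> play, playing -> play, player -> player
```

NLTK also ships other stemmers (Lancaster, Snowball) that apply more or less aggressive rules, so the choice of stemmer affects which forms get merged.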

