Mastering Social Media Mining with Python by 2016

Author:2016
Language: eng
Format: epub, mobi
Publisher: Packt Publishing


Visualizing posts as a word cloud

After analyzing interactions, we move our attention back to the content of the posts.

Word clouds, also called tag clouds (https://en.wikipedia.org/wiki/Tag_cloud), are visual representations of textual data. The importance of each word is usually represented by its size in the image.

In this section, we will use the wordcloud Python package, which provides an extremely easy way to produce word clouds. Firstly, we need to install the library and its dependency (an imaging library) in our virtual environment using the following commands:

$ pip install wordcloud
$ pip install Pillow

Pillow is a fork of the old Python Imaging Library (PIL) project, as PIL has apparently been discontinued. Among its features, Pillow supports Python 3, so after this brief installation, we're good to go.

The following script reads a .jsonl file, such as the one produced to store the posts from PacktPub, and creates a .png file with the word cloud:

# Chap04/facebook_posts_wordcloud.py
import os
import json
from argparse import ArgumentParser
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from wordcloud import WordCloud

def get_parser():
    parser = ArgumentParser()
    parser.add_argument('--page')
    return parser

if __name__ == '__main__':
    parser = get_parser()
    args = parser.parse_args()

    fname = "posts_{}.jsonl".format(args.page)
    all_posts = []
    with open(fname) as f:
        for line in f:
            post = json.loads(line)
            all_posts.append(post.get('message', ''))
    text = ' '.join(all_posts)
    stop_list = ['save', 'free', 'today',
                 'get', 'title', 'titles', 'bit', 'ly']
    stop_list.extend(stopwords.words('english'))
    wordcloud = WordCloud(stopwords=stop_list).generate(text)
    plt.imshow(wordcloud)
    plt.axis("off")
    image_fname = 'wordcloud_{}.png'.format(args.page)
    plt.savefig(image_fname)

As usual, the script uses an instance of ArgumentParser to get the command-line parameter (the Page name or Page ID).
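To experiment with the parser without leaving the interpreter, we can pass a list of arguments to parse_args() directly. The following is a minimal sketch: the required=True flag and the help text are assumptions added here, not part of the original script, and show one possible way to stop early when --page is missing instead of looking for a file named posts_None.jsonl.

```python
from argparse import ArgumentParser


def get_parser():
    parser = ArgumentParser()
    # required=True and help are assumptions added for this sketch;
    # the original script declares --page with no extra options.
    parser.add_argument('--page', required=True,
                        help='Facebook Page name or Page ID')
    return parser


# Passing a list to parse_args() simulates the command line
args = get_parser().parse_args(['--page', 'PacktPub'])
fname = "posts_{}.jsonl".format(args.page)
print(fname)
```

With this variant, running the script without --page exits with a usage message rather than failing later when the file is opened.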

The script creates a list, all_posts, with the textual message of each post. We use post.get('message', '') instead of accessing the dictionary directly, as the message key might not be present in every post (for example, in the case of images without comment), even though this event is quite rare.
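The difference between the two ways of accessing the dictionary can be seen in a short sketch. The two sample lines below are made up for illustration (the id and picture fields are hypothetical); only the message key is taken from the script.

```python
import json

# Two hypothetical lines in the .jsonl format; the second post has no message
lines = [
    '{"id": "1", "message": "New Python titles this week"}',
    '{"id": "2", "picture": "http://example.com/cover.png"}',
]

all_posts = []
for line in lines:
    post = json.loads(line)
    # post['message'] would raise KeyError on the second post;
    # get() falls back to an empty string instead
    all_posts.append(post.get('message', ''))

text = ' '.join(all_posts)
print(repr(text))
```

The empty strings contribute nothing to the final text apart from an extra space, so posts without a message are effectively ignored.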

The list of posts is then concatenated into a single string, text, which will be the main input to generate the word cloud. The WordCloud object takes some optional parameters to define some aspects of the word cloud. In particular, the example uses the stopwords argument to define a list of words that will be removed from the word cloud. The words that we include in this list are the standard English stop words as defined in the Natural Language Toolkit (NLTK) library, as well as a few custom keywords that are often used in the PacktPub account but that do not really carry interesting meaning (for example, links to bit.ly and references to offers for particular titles).
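To get an intuition for the effect of the stop list, the sketch below counts word frequencies in plain Python after discarding the stop words. It is only an illustration of the idea and not how the wordcloud package is implemented. The sample text is made up, and the small english_stop list is an assumed stand-in for stopwords.words('english'), which requires the NLTK stop word corpus to be downloaded.

```python
import re
from collections import Counter

# Custom stop words from the script above
stop_list = ['save', 'free', 'today', 'get', 'title', 'titles', 'bit', 'ly']
# Tiny stand-in for stopwords.words('english') (assumption for this sketch)
english_stop = ['the', 'a', 'on', 'and', 'for', 'your']
stop_set = set(stop_list) | set(english_stop)

# Made-up sample text in the style of a promotional post
text = ("Get your free Python title today! "
        "Save on Python and data titles http://bit.ly/xyz")

# Lowercase, split into alphabetic tokens, drop stop words
tokens = re.findall(r"[a-z]+", text.lower())
kept = [t for t in tokens if t not in stop_set]
counts = Counter(kept)
print(counts.most_common(3))
```

In this sample, the promotional vocabulary disappears and python becomes the most frequent word, while fragments such as http survive, which suggests that the custom list usually needs a few iterations before the cloud looks clean.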

An example of the output image is shown in Figure 4.9.


