21 Recipes for Mining Twitter by Matthew A. Russell
Author: Matthew A. Russell
Language: eng
Format: epub
Tags: COMPUTERS / Data Modeling & Design
ISBN: 9781449303853
Publisher: O'Reilly Media, Inc.
Published: 2011-01-30T16:00:00+00:00
Note
The following discussion is somewhat advanced and explains how the summing_reducer function behaves depending on whether its rereduce parameter is True or False. Feel free to skip this section if you're not interested in homing in on those details just yet.
In short, a mapper takes a tweet and emits normalized entities such as #hashtags and @mentions, and a reducer aggregates the values emitted by the mappers by counting them. The important subtlety in how the reducer is invoked is that the keys and values passed to any single invocation are guaranteed to have matching keys. This turns out to be a very convenient characteristic: for the problem of tabulating frequencies, when the rereduce parameter is False you only need to count the number of values to know the frequency for the key. In other words, if the keys were ['@user', '@user', '@user'], you'd only need to compute the length of that list to get the frequency of @user for that particular invocation of the reduction function.
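The mapping step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the name entity_mapper and the regular expression are hypothetical, the pattern is a rough simplification of how hashtags and mentions are actually tokenized, and lowercasing stands in for whatever normalization the real recipe applies.

```python
import re

# Hypothetical pattern: a rough simplification, not Twitter's official entity rules.
ENTITY_PATTERN = re.compile(r"[#@]\w+")


def entity_mapper(tweet_text):
    """Yield (entity, 1) pairs, lowercasing each entity to normalize it."""
    for match in ENTITY_PATTERN.finditer(tweet_text):
        yield (match.group(0).lower(), 1)


# "@User" and "@user" normalize to the same key, so they can be counted together.
pairs = list(entity_mapper("Thanks @User and @user for the #Python tips"))
print(pairs)
```

Because each emitted value is simply 1, counting the values that share a key gives that key's frequency, which is the property the reducer relies on when rereduce is False.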
The actual number of keys and values passed into each invocation of a reduction function depends on the underlying B-tree used in CouchDB; the example here uses a tiny batch size of 3 for simplicity. The subtlety to note is that multiple calls to the reducer can occur with the same keys, which conceptually means that no single call yields the final aggregated answer. Instead, you'd end up with something like [("@user", 3), ("@user", 3), ("@user", 3), ...], which represents an intermediate result. When this happens, the output of these reductions must itself be reduced again (rereduced), in which case the rereduce flag is set to True. The keys are of no consequence at that point, since the input is already guaranteed to have been produced from the same keys. In the working example, all that needs to happen is a sum of the values, 3 + 3 + 3 + ... + 3, to arrive at a final aggregate value. Rereduce is inherently a slightly advanced topic, but it is fundamental to understanding the map/reduce paradigm. It may bend your brain just a little bit, but manually working through some examples is a good way to get the hang of it.
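To work through an example by hand, the following minimal sketch shows a reducer consistent with the description above, along with a small simulation of the two phases. The function name and the rereduce parameter come from the text; the body, the batch size of 3, and the driver loop are assumptions made for illustration and do not reproduce how CouchDB itself schedules reductions.

```python
def summing_reducer(keys, values, rereduce):
    if rereduce:
        # values are partial counts from earlier reductions; keys are ignored.
        return sum(values)
    # All values belong to the same key, so their count is the frequency.
    return len(values)


# Simulation (assumption): nine values emitted for "@user", reduced in batches of 3.
emitted = [("@user", 1)] * 9
batch_size = 3

partials = []
for start in range(0, len(emitted), batch_size):
    batch = emitted[start:start + batch_size]
    batch_keys = [key for key, _ in batch]
    batch_values = [value for _, value in batch]
    partials.append(summing_reducer(batch_keys, batch_values, False))

# The intermediate results are then rereduced into the final frequency.
total = summing_reducer(None, partials, True)
print(partials, total)
```

The first phase produces one partial count per batch, and the rereduce phase only sums those partial counts, which is why the keys can be ignored there.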
Once the frequency maps are computed, the details for visualizing the entities in a tag cloud amount to little more than scaling the size of each entity and writing out the JSON data structure that the WP-Cumulus tag cloud expects. The HTML_TEMPLATE in the example contains the necessary SCRIPT tag references to pull the JavaScript libraries and other necessary artifacts. Only the data needs to be written to a %s placeholder in the template.
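The scaling and templating step might look roughly like the sketch below. Everything concrete in it is an assumption: the HTML_TEMPLATE shown is a one-line stand-in for the real template, the [term, size] list is a hypothetical output shape rather than the structure WP-Cumulus actually expects, and linear scaling between a minimum and maximum size is just one reasonable choice.

```python
import json

# Hypothetical stand-in for the real template; only the %s placeholder matters here.
HTML_TEMPLATE = "<script>var tagCloudData = %s;</script>"


def scale_weights(frequencies, min_size=10, max_size=40):
    """Linearly map each frequency onto [min_size, max_size].

    Assumes frequencies is a non-empty dict of entity -> count.
    """
    lo = min(frequencies.values())
    hi = max(frequencies.values())
    span = (hi - lo) or 1  # avoid dividing by zero when all counts are equal
    return [
        [term, min_size + (count - lo) * (max_size - min_size) // span]
        for term, count in sorted(frequencies.items())
    ]


frequencies = {"@user": 9, "#python": 3, "#couchdb": 6}
data = scale_weights(frequencies)

# Write the JSON data into the template's single %s placeholder.
html = HTML_TEMPLATE % json.dumps(data)
print(html)
```

The least frequent entity gets the minimum size, the most frequent gets the maximum, and everything else falls proportionally in between.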
