Mining of Massive Datasets by Jure Leskovec & Anand Rajaraman & Jeffrey David Ullman
Author: Jure Leskovec & Anand Rajaraman & Jeffrey David Ullman
Language: eng
Format: epub
Publisher: Cambridge University Press
Published: 2014-09-29T22:00:00+00:00
6.5.4 Exercises for Section 6.5
!! EXERCISE 6.5.1 Suppose we are counting frequent itemsets in a decaying window with a decay constant c. Suppose also that with probability p, a given stream element (basket) contains both items i and j. Additionally, with probability p the basket contains i but not j, and with probability p it contains j but not i. As a function of c and p, what is the fraction of time we shall be scoring the pair {i, j}?
6.6 Summary of Chapter 6
✦ Market-Basket Data: This model of data assumes there are two kinds of entities: items and baskets. There is a many–many relationship between items and baskets. Typically, baskets are related to small sets of items, while items may be related to many baskets.
✦ Frequent Itemsets: The support for a set of items is the number of baskets containing all those items. Itemsets with support that is at least some threshold are called frequent itemsets.
✦ Association Rules: These are implications that if a basket contains a certain set of items I, then it is likely to contain another particular item j as well. The probability that j is also in a basket containing I is called the confidence of the rule. The interest of the rule is the amount by which the confidence deviates from the fraction of all baskets that contain j.
✦ The Pair-Counting Bottleneck: To find frequent itemsets, we need to examine all baskets and count the number of occurrences of sets of a certain size. For typical data, with a goal of producing a small number of itemsets that are the most frequent of all, the part that often takes the most main memory is the counting of pairs of items. Thus, methods for finding frequent itemsets typically concentrate on how to minimize the main memory needed to count pairs.
✦ Triangular Matrices: While one could use a two-dimensional array to count pairs, doing so wastes half the space, because there is no need to count pair {i, j} in both the i-j and j-i array elements. By arranging the pairs (i, j) for which i < j in lexicographic order, we can store only the needed counts in a one-dimensional array with no wasted space, and yet be able to access the count for any pair efficiently.
✦ Storage of Pair Counts as Triples: If fewer than 1/3 of the possible pairs actually occur in baskets, then it is more space-efficient to store counts of pairs as triples (i, j, c), where c is the count of the pair {i, j}, and i < j. An index structure such as a hash table allows us to find the triple for (i, j) efficiently.
✦ Monotonicity of Frequent Itemsets: An important property of itemsets is that if a set of items is frequent, then so are all its subsets. We exploit this property to eliminate the need to count certain itemsets by using its contrapositive: if an itemset is not frequent, then neither are its supersets.
✦ The A-Priori Algorithm for Pairs: We can find all frequent pairs by making two passes over the baskets.
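The summary points above on triangular matrices, monotonicity, and the two-pass A-Priori idea can be tied together in a small sketch. It is an illustration under stated assumptions, not the book's own code. It assumes the usual reading of the two passes (the first counts single items, the second counts only pairs whose members are both frequent), assumes the baskets fit in a list that can be scanned twice, and assumes items are sortable values such as strings. The names triangular_index, apriori_pairs, and confidence_and_interest are hypothetical, as is the toy basket data. The last function applies the definitions of confidence and interest to a rule with a single item on the left.

```python
from itertools import combinations


def triangular_index(i, j, n):
    """Flat-array position of pair (i, j) with 0 <= i < j < n.

    Pairs are laid out in lexicographic order, so the array needs
    exactly n * (n - 1) / 2 slots (0-based variant of the layout).
    """
    return i * (2 * n - i - 1) // 2 + (j - i - 1)


def apriori_pairs(baskets, threshold):
    """Return (item_counts, frequent_pairs) using two passes over baskets."""
    # Pass 1: count the support of every individual item.
    item_counts = {}
    for basket in baskets:
        for item in set(basket):
            item_counts[item] = item_counts.get(item, 0) + 1

    # Renumber only the frequent items 0..n-1. By monotonicity, a pair
    # can be frequent only if both of its items are frequent.
    frequent = sorted(x for x, c in item_counts.items() if c >= threshold)
    new_id = {item: k for k, item in enumerate(frequent)}
    n = len(frequent)

    # Pass 2: count pairs of frequent items in a triangular array.
    counts = [0] * (n * (n - 1) // 2)
    for basket in baskets:
        ids = sorted({new_id[x] for x in basket if x in new_id})
        for i, j in combinations(ids, 2):
            counts[triangular_index(i, j, n)] += 1

    # Keep the pairs whose support reaches the threshold.
    frequent_pairs = {}
    for i, j in combinations(range(n), 2):
        c = counts[triangular_index(i, j, n)]
        if c >= threshold:
            frequent_pairs[(frequent[i], frequent[j])] = c
    return item_counts, frequent_pairs


def confidence_and_interest(i, j, item_counts, pair_counts, num_baskets):
    """Confidence and interest of the rule {i} -> j.

    pair_counts is keyed by (smaller item, larger item), as returned by
    apriori_pairs, so the pair {i, j} must be one of the frequent pairs.
    """
    pair = (i, j) if i < j else (j, i)
    confidence = pair_counts[pair] / item_counts[i]
    interest = confidence - item_counts[j] / num_baskets
    return confidence, interest


# Toy data, invented for illustration only.
example_baskets = [
    ["bread", "milk", "eggs"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "beer"],
    ["bread", "milk", "diapers"],
]
example_items, example_pairs = apriori_pairs(example_baskets, threshold=3)
print(example_pairs)
print(confidence_and_interest("beer", "diapers", example_items,
                              example_pairs, len(example_baskets)))
```

In this sketch the triangular array is sized by the number of frequent items rather than all items, which is where the memory saving of the second pass comes from. When few of the possible pairs actually occur, a dictionary keyed by pair, in the spirit of the triples method, would be the more economical choice.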
