Getting Started with Data Science: Making Sense of Data with Analytics (Michael LaRiviere's Library) by Murtaza Haider
Author: Murtaza Haider
Language: eng
Format: epub
Publisher: IBM Press
Published: 2016-07-11T16:00:00+00:00
Spuriously Correlated
Even when we find a statistically significant correlation between two variables, the two may turn out to be causally unrelated. Consider the case of ice cream sales and drownings. One may find a statistically significant, positive correlation between drownings and the sale of ice cream. Can one conclude that ice cream sales cause drownings? And would one therefore restrict ice cream sales to reduce deaths by drowning?
The preceding example depicts a spurious correlation between two otherwise unrelated variables. During the summer season, hot weather leads to higher ice cream sales. At the same time, people head to pools, lakes, rivers, and beaches to swim. As more people swim, the number of drownings increases. Hence, the positive correlation between ice cream sales and drownings reflects no causal link between the two; it arises only because both are influenced by hot weather.
While one acknowledges the utility of correlation analysis, spurious correlations and the presence of confounding or mitigating factors warn that correlation is not the same as causation. Hence, one has to undertake a more involved and systematic analysis to determine the relationships between behaviors. Regression analysis, which can control for other factors, is more apt for this purpose.
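The confounding argument can be illustrated with a small simulation. The sketch below is not from the book: the data are synthetic, the coefficients are invented for illustration, and the helper names (pearson, residuals) are hypothetical. It uses a partial correlation, computed by regressing each series on temperature and correlating the residuals, as a simple stand-in for the fuller regression analysis mentioned above.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def residuals(y, x):
    """Residuals of y after a simple least-squares regression of y on x."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]


rng = random.Random(42)  # fixed seed so the run is repeatable
n = 2000

# Assumed, made-up data: temperature drives both series,
# and neither series has any effect on the other.
temperature = [rng.gauss(20, 8) for _ in range(n)]
ice_cream = [50 + 3.0 * t + rng.gauss(0, 10) for t in temperature]
drownings = [2 + 0.1 * t + rng.gauss(0, 1) for t in temperature]

# Raw correlation: clearly positive, although there is no causal link.
raw_r = pearson(ice_cream, drownings)

# Partial correlation: strip out the part of each series explained by
# temperature, then correlate what is left.
partial_r = pearson(residuals(ice_cream, temperature),
                    residuals(drownings, temperature))

print(f"raw correlation:                {raw_r:.3f}")
print(f"correlation given temperature:  {partial_r:.3f}")
```

In this setup the raw correlation is strongly positive, while the correlation that remains after accounting for temperature is close to zero. The catch is that this only works when the confounder has been identified and measured.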
Another point to remember regarding spurious correlation is that this challenge will become more pronounced with big data. Very large data sets will almost inevitably show some statistically significant correlations among otherwise unrelated variables. An example of spurious correlation can be found in Varian (2014), where Google Correlate finds a high correlation between new homes sold in the U.S. and oldies lyrics.
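A rough way to see why more variables produce more spurious findings is to generate variables that are independent by construction and count how many pairs still pass a 5% significance test. The sketch below is illustrative only: the sample sizes are arbitrary, and the cutoff 1.96 / sqrt(n) is the large-sample normal approximation to the critical value of the correlation coefficient, not an exact test.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable
n_obs = 100             # observations per variable (arbitrary choice)
n_vars = 40             # number of unrelated variables (arbitrary choice)

# Every variable is pure noise, so any "significant" correlation is spurious.
data = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]

# Approximate two-sided 5% cutoff for |r| when the true correlation is zero.
critical_r = 1.96 / n_obs ** 0.5

pairs = 0
significant = 0
for i in range(n_vars):
    for j in range(i + 1, n_vars):
        pairs += 1
        if abs(pearson(data[i], data[j])) > critical_r:
            significant += 1

print(f"pairs tested:            {pairs}")
print(f"'significant' at the 5% level: {significant}")
```

With 40 variables there are 780 distinct pairs, so roughly 5% of them, a few dozen, should clear the cutoff by chance alone. The number of pairs grows with the square of the number of variables, which is why the problem worsens as data sets widen.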
Hal Varian, Google’s chief economist, while discussing Google Trends data, speaks of the emerging challenges posed by large data sets and spurious correlation. He warns: “The challenge is that there are billions of queries so it is hard to determine exactly which queries are the most predictive for a particular purpose. Google Trends classifies the queries into categories, which helps a little, but even then we have hundreds of categories as possible predictors so that overfitting and spurious correlation are a serious concern.”9
