Getting Started with Data Science: Making Sense of Data with Analytics (Michael LaRiviere's Library) by Murtaza Haider

Author: Murtaza Haider
Language: eng
Format: epub
Publisher: IBM Press
Published: 2016-07-11T16:00:00+00:00


Spuriously Correlated

Even when we find a statistically significant correlation between two variables, the two may turn out to be causally unrelated. Consider the case of ice cream sales and drownings. One may find a statistically significant, positive correlation between drownings and the sale of ice cream. Can one assume that drownings are caused by ice cream sales? And would one therefore restrict ice cream sales to reduce deaths by drowning?

The preceding example depicts a spurious correlation between two otherwise unrelated variables. During the summer season, hot weather leads to higher ice cream sales. At the same time, people head to pools, lakes, rivers, and beaches to swim. As more people swim, the number of drownings increases. Hence, the positive correlation between ice cream sales and drownings reflects no causal link between the two; it arises only because both are influenced by hot weather.

While correlation analysis is useful, spurious correlations and the presence of confounding or mitigating factors are a reminder that correlation is not the same as causation. One therefore has to undertake a more involved and systematic analysis to determine how behaviors are related. Regression analysis is better suited to this task because it can account for the influence of other variables.
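The ice cream example can be illustrated with a small simulation. The following is a minimal sketch, not taken from the book: the data are synthetic, and the variable names, coefficients, and the helper `ols` are all assumptions made for illustration. Temperature drives both series, and neither series affects the other.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000  # hypothetical number of daily observations

# Synthetic data (assumed values): temperature influences both variables,
# while ice cream sales and drownings have no direct effect on each other.
temperature = rng.normal(25, 5, n)
ice_cream_sales = 10 * temperature + rng.normal(0, 20, n)
drownings = 0.2 * temperature + rng.normal(0, 1, n)

# The simple correlation ignores temperature.
raw_corr = np.corrcoef(ice_cream_sales, drownings)[0, 1]


def ols(y, *predictors):
    """Ordinary least squares fit; returns [intercept, slope_1, slope_2, ...]."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Regression of drownings on ice cream sales alone.
naive = ols(drownings, ice_cream_sales)
# Regression that also includes the confounder, temperature.
adjusted = ols(drownings, ice_cream_sales, temperature)

print(f"correlation(ice cream, drownings): {raw_corr:.3f}")
print(f"ice cream slope, temperature omitted:  {naive[1]:.4f}")
print(f"ice cream slope, temperature included: {adjusted[1]:.4f}")
```

In this setup the simple correlation is clearly positive, yet the slope on ice cream sales shrinks toward zero once temperature enters the regression, which is the pattern one would expect when a third variable drives both series.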

Another point to remember is that the challenge of spurious correlation becomes more pronounced with big data. Very large data sets will almost inevitably show some statistically significant correlations among unrelated variables, simply because so many pairs of variables can be compared. An example can be found in Varian (2014), where Google Correlate finds a high correlation between new homes sold in the U.S. and oldies lyrics.
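A short simulation shows why wide data sets produce significant-looking correlations by chance alone. This is a sketch under stated assumptions: the columns are independent random noise, the sizes `n_obs` and `n_vars` are arbitrary, and the cutoff `1.96 / sqrt(n_obs)` is a large-sample approximation to the 5 percent two-sided critical value for a correlation coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_vars = 100, 200  # assumed sizes, chosen only for illustration

# Every column is independent noise, so no true relationships exist.
data = rng.normal(size=(n_obs, n_vars))

# Pairwise correlations between columns; keep each pair once.
corr = np.corrcoef(data, rowvar=False)
upper = np.triu_indices(n_vars, k=1)
r = corr[upper]

# Approximate 5% two-sided cutoff for |r| under the null of no correlation.
threshold = 1.96 / np.sqrt(n_obs)

n_pairs = r.size
n_significant = int(np.sum(np.abs(r) > threshold))

print(f"pairs tested: {n_pairs}")
print(f"pairs 'significant' at about the 5% level: {n_significant}")
print(f"largest absolute correlation: {np.abs(r).max():.3f}")
```

Roughly one pair in twenty clears the cutoff even though the variables are unrelated by construction, so the count of false discoveries grows with the number of pairs examined.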

Hal Varian, Google’s chief economist, discussing Google Trends data, describes the emerging challenges that very large data sets and spurious correlation pose. He warns: “The challenge is that there are billions of queries so it is hard to determine exactly which queries are the most predictive for a particular purpose. Google Trends classifies the queries into categories, which helps a little, but even then we have hundreds of categories as possible predictors so that overfitting and spurious correlation are a serious concern.”9


