Principles of Big Data by Jules J. Berman

Author: Jules J. Berman
Language: eng
Format: epub
ISBN: 9780124047242
Publisher: Elsevier Inc.
Published: 2013-05-23T16:00:00+00:00


Normalizing and Adjusting Data

When extracting data from multiple sources, recorded at different times, and collected for different purposes, the data values may not be directly comparable. The Big Data analyst must contrive a method to normalize or harmonize the data values.

1. Adjusting for population differences. Epidemiologists are constantly reviewing large data sets on large populations (e.g., local, national, and global data). If epidemiologists did not normalize their data, they would be in a constant state of panic. Suppose you are following long-term data on the incidence of a rare childhood disease in a state population. You notice that the number of people with the disease has doubled in the past decade. You are about to call the New York Times with the news when one of your colleagues taps you on the shoulder and explains that the population of the state has doubled in the same time period. The incidence, expressed as cases per 100,000 population, has remained unchanged. You calm yourself down and continue your analysis, only to find that the number of reported cases of the disease has doubled in a different state that has had no corresponding increase in state population. You are about to call the White House with the news when your colleague taps you on the shoulder and explains that the overall population of the state has remained unchanged, but the population of children in the state has doubled. The incidence, expressed as cases per 100,000 children (the target population), has remained unchanged.
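The arithmetic behind the colleague's correction can be sketched in a few lines of Python. The case counts and population sizes below are hypothetical numbers chosen only to mirror the story (cases double, population doubles); the function name `incidence_per_100k` is likewise an illustrative choice, not something taken from the book.

```python
def incidence_per_100k(cases: int, population: int) -> float:
    """Return the number of cases per 100,000 persons in the population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return cases * 100_000 / population


# Hypothetical figures: raw case counts double over the decade,
# but so does the population at risk.
rate_start = incidence_per_100k(cases=50, population=5_000_000)
rate_end = incidence_per_100k(cases=100, population=10_000_000)

print(rate_start, rate_end)  # the two rates are equal
```

The same function covers the second scenario if `population` is taken to be the number of children in the state rather than the total number of residents; the choice of denominator is what decides whether the doubling looks alarming.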

An age-adjusted rate combines the rates of a disease within each age category, weighting each rate by the proportion of persons in that age group in a standard population. When we age-adjust rates, we cancel out the changes in the rates of disease that result from differences in the proportion of people in different age groups.
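One common way to compute such a rate is direct standardization: multiply each age-specific rate by the standard population's share for that age group and sum the products. The sketch below assumes that method. The age groups, rates, and population shares are all hypothetical, and the function name `weighted_rate` is an illustrative choice.

```python
def weighted_rate(age_specific_rates: dict, weights: dict) -> float:
    """Weighted sum of age-specific rates; weights are rescaled to sum to 1."""
    if age_specific_rates.keys() != weights.keys():
        raise ValueError("rates and weights must cover the same age groups")
    total = sum(weights.values())
    return sum(age_specific_rates[g] * weights[g] / total
               for g in age_specific_rates)


# Hypothetical age-specific rates (cases per 100,000), identical in both states.
rates = {"0-17": 10.0, "18-64": 50.0, "65+": 400.0}

# Hypothetical age structures: state B has a larger share of older residents.
state_a_mix = {"0-17": 0.30, "18-64": 0.60, "65+": 0.10}
state_b_mix = {"0-17": 0.20, "18-64": 0.55, "65+": 0.25}
standard_mix = {"0-17": 0.25, "18-64": 0.60, "65+": 0.15}

# Crude rates weight by each state's own age mix, so they differ.
crude_a = weighted_rate(rates, state_a_mix)
crude_b = weighted_rate(rates, state_b_mix)

# Age-adjusted rates weight by the shared standard mix, so they agree.
adjusted_a = weighted_rate(rates, standard_mix)
adjusted_b = weighted_rate(rates, standard_mix)

print(crude_a, crude_b, adjusted_a, adjusted_b)
```

In this made-up example the crude rates differ only because state B is older, not because the disease behaves differently there; weighting both states against the same standard mix removes that difference.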

Some of the most notorious observations on nonadjusted data come from the field of baseball. In 1930, Bill Terry maintained a batting average of 0.401, the best batting average in the National League. In 1968, Carl Yastrzemski led his league with a batting average of 0.301. You might conclude from these numbers that Terry's lead over his fellow players was greater than Yastrzemski's. Actually, both had averages that were 27% higher than the average of their fellow ballplayers of the year. Normalized against all the players for the year in which the data was collected, Terry and Yastrzemski tied.
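The normalization described here amounts to dividing a player's average by the average of all players in the same season. A minimal sketch follows; the player and league figures are hypothetical round numbers, not the historical values for either season, and `relative_average` is an illustrative name.

```python
def relative_average(player_avg: float, league_avg: float) -> float:
    """Ratio of a player's batting average to the league-wide average."""
    if league_avg <= 0:
        raise ValueError("league average must be positive")
    return player_avg / league_avg


# Hypothetical seasons: a high-scoring year and a low-scoring year.
high_scoring_year = relative_average(player_avg=0.400, league_avg=0.320)
low_scoring_year = relative_average(player_avg=0.300, league_avg=0.240)

# Both ratios come out to about 1.25, i.e., 25% above the league,
# even though the raw averages differ by 100 points.
print(round(high_scoring_year, 3), round(low_scoring_year, 3))
```

The ratio is a pure number, so it can be compared across seasons in which league-wide hitting conditions differed.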

2. Rendering data values dimensionless. Histograms express data distributions by binning data into groups and displaying the bins in a bar graph (see Figure 9.2). A histogram of an image may have bins (bars) whose heights consist of the number of pixels in a black-and-white image that fall within a certain gray-scale range.
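As a rough illustration of the binning step, the sketch below counts how many pixels of a gray-scale image fall into each of a few equal-width intensity ranges. It assumes 8-bit pixel values (0 to 255) supplied as a flat list; the pixel values, the bin count, and the function name `gray_histogram` are all hypothetical choices made for the example.

```python
def gray_histogram(pixels, n_bins=4, max_value=255):
    """Count pixels falling into n_bins equal-width gray-scale ranges."""
    bin_width = (max_value + 1) / n_bins
    counts = [0] * n_bins
    for value in pixels:
        if not 0 <= value <= max_value:
            raise ValueError(f"pixel value out of range: {value}")
        counts[int(value // bin_width)] += 1
    return counts


# Hypothetical pixel intensities from a tiny black-and-white image.
pixels = [0, 12, 63, 64, 100, 130, 200, 255, 255]

# Bins cover 0-63, 64-127, 128-191, and 192-255.
counts = gray_histogram(pixels)
print(counts)  # [3, 2, 1, 3]
```

Each entry in `counts` is the height of one bar in the histogram.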


