Principles of Big Data by Jules J. Berman
Author: Jules J. Berman
Language: eng
Format: epub
ISBN: 9780124047242
Publisher: Elsevier Inc.
Published: 2013-05-23T16:00:00+00:00
Normalizing and Adjusting Data
When data are extracted from multiple sources, recorded at different times, and collected for different purposes, the data values may not be directly comparable. The Big Data analyst must devise a method to normalize or harmonize the data values.
1. Adjusting for population differences. Epidemiologists are constantly reviewing large data sets on large populations (e.g., local, national, and global data). If epidemiologists did not normalize their data, they would be in a constant state of panic. Suppose you are following long-term data on the incidence of a rare childhood disease in a state population. You notice that the number of people with the disease has doubled in the past decade. You are about to call the New York Times with the news when one of your colleagues taps you on the shoulder and explains that the population of the state has doubled in the same time period. The incidence, described as cases per 100,000 population, has remained unchanged. You calm yourself down and continue your analysis to find that the number of reported cases of the disease has doubled in a different state that has had no corresponding increase in state population. You are about to call the White House with the news when your colleague taps you on the shoulder and explains that the overall population of the state has remained unchanged, but the population of children in the state has doubled. The incidence, expressed as cases occurring in the target population, has remained unchanged.
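The two scenarios above can be sketched in a few lines of Python. This is a minimal illustration, not a method taken from the book: the function name `incidence_per_100k` and every case count and population figure below are hypothetical, chosen only so that the raw counts double while the rate in the relevant population stays flat.

```python
def incidence_per_100k(cases: int, population: int) -> float:
    """Return the number of cases per 100,000 persons in `population`."""
    if population <= 0:
        raise ValueError("population must be positive")
    return cases * 100_000 / population


# Scenario 1 (hypothetical numbers): cases double, and so does the state population.
rate_start = incidence_per_100k(cases=50, population=5_000_000)
rate_end = incidence_per_100k(cases=100, population=10_000_000)

# Scenario 2 (hypothetical numbers): the total population is unchanged,
# but the number of children (the target population) doubles.
total_population = 8_000_000
by_total_start = incidence_per_100k(40, total_population)
by_total_end = incidence_per_100k(80, total_population)
by_children_start = incidence_per_100k(40, 1_000_000)
by_children_end = incidence_per_100k(80, 2_000_000)

print("Scenario 1:", rate_start, rate_end)
print("Scenario 2, per total population:", by_total_start, by_total_end)
print("Scenario 2, per child population:", by_children_start, by_children_end)
```

In the second scenario the rate computed against the total population appears to double, while the rate computed against the child population does not change. The choice of denominator decides whether there is any news to report.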
An age-adjusted rate is the rate of a disease within an age category, weighted against the proportion of persons in the age groups of a standard population. When we age-adjust rates, we cancel out the changes in the rates of disease that result from differences in the proportion of people in different age groups.
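One common way to compute an age-adjusted rate is direct standardization: each age-specific rate is multiplied by that age group's share of a standard population, and the products are summed. The sketch below assumes that approach; the text does not name a specific procedure. The age groups, case counts, and population sizes are all hypothetical. The two populations are built to have identical age-specific rates but different age structures.

```python
def crude_rate(cases: dict, population: dict) -> float:
    """Cases per 100,000 over the whole population, ignoring age structure."""
    return sum(cases.values()) * 100_000 / sum(population.values())


def age_adjusted_rate(cases: dict, population: dict, standard: dict) -> float:
    """Directly standardized rate per 100,000.

    Each age-specific rate is weighted by the share of the standard
    population that falls in the same age group.
    """
    standard_total = sum(standard.values())
    return sum(
        (cases[group] * 100_000 / population[group])
        * (standard[group] / standard_total)
        for group in standard
    )


# Hypothetical standard population and two hypothetical study populations.
standard = {"0-17": 250_000, "18-64": 600_000, "65+": 150_000}

population_a = {"0-17": 300_000, "18-64": 600_000, "65+": 100_000}
cases_a = {"0-17": 30, "18-64": 300, "65+": 200}

population_b = {"0-17": 200_000, "18-64": 500_000, "65+": 300_000}
cases_b = {"0-17": 20, "18-64": 250, "65+": 600}

crude_a = crude_rate(cases_a, population_a)
crude_b = crude_rate(cases_b, population_b)
adjusted_a = age_adjusted_rate(cases_a, population_a, standard)
adjusted_b = age_adjusted_rate(cases_b, population_b, standard)

print("Crude rates:   ", crude_a, crude_b)
print("Adjusted rates:", adjusted_a, adjusted_b)
```

The crude rates differ only because population B has more people in the oldest age group. Once both populations are weighted against the same standard, the adjusted rates agree.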
Some of the most notorious observations on nonadjusted data come from the field of baseball. In 1930, Bill Terry maintained a batting average of 0.401, the best batting average in the National League. In 1968, Carl Yastrzemski led his league with a batting average of 0.301. You might think these figures prove that Terry’s lead over his fellow players was greater than Yastrzemski’s. Actually, both had averages that were 27% higher than the average of their fellow ballplayers of the year. Normalized against all the players for the year in which the data was collected, Terry and Yastrzemski tied.
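The baseball comparison amounts to dividing each player's average by the average of his peers for the same season. The sketch below shows the arithmetic only. The two player averages come from the text, but the league averages are hypothetical placeholders, back-calculated from the 27% figure quoted above; they are not actual league statistics.

```python
def relative_average(player_average: float, league_average: float) -> float:
    """Return the player's average as a multiple of the league average."""
    if league_average <= 0:
        raise ValueError("league_average must be positive")
    return player_average / league_average


# Player averages are from the text. League averages are HYPOTHETICAL
# placeholders implied by the 27% figure, not historical values.
terry_1930 = relative_average(player_average=0.401, league_average=0.316)
yastrzemski_1968 = relative_average(player_average=0.301, league_average=0.237)

print(f"Terry, 1930:       {terry_1930:.2f} times the league average")
print(f"Yastrzemski, 1968: {yastrzemski_1968:.2f} times the league average")
```

Expressed as a ratio, each value is a pure number with no units, so seasons with very different overall levels of hitting can be compared on the same scale.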
2. Rendering data values dimensionless. Histograms express data distributions by binning data into groups and displaying the bins in a bar graph (see Figure 9.2). A histogram of a black-and-white image may have bins (bars) whose heights represent the number of pixels that fall within a certain gray-scale range.
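The binning step described above can be sketched as follows. The example assumes 8-bit gray-scale values (0 to 255) and eight equal-width bins; neither choice comes from the text, and the pixel list is made up. Dividing each bin count by the total number of pixels is shown as one possible way to obtain unitless bin heights, which is an assumption about where the discussion is heading rather than the book's stated method.

```python
def gray_histogram(pixels, bin_count=8, levels=256):
    """Count how many pixel values fall into each equal-width gray-scale bin."""
    if levels % bin_count != 0:
        raise ValueError("levels must be divisible by bin_count")
    width = levels // bin_count
    counts = [0] * bin_count
    for value in pixels:
        if not 0 <= value < levels:
            raise ValueError(f"pixel value out of range: {value}")
        counts[value // width] += 1
    return counts


# Hypothetical pixel values from a tiny gray-scale image.
pixels = [0, 12, 31, 32, 100, 128, 200, 255, 255, 64]

counts = gray_histogram(pixels)

# Assumption: express each bin as a fraction of all pixels, so that bin
# heights no longer depend on the size of the image.
fractions = [count / len(pixels) for count in counts]

print("Counts:   ", counts)
print("Fractions:", fractions)
```

With fractions in place of raw counts, histograms from images of different sizes can be laid side by side without one dwarfing the other.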
