Algorithms for Data Science by Brian Steele John Chandler & Swarna Reddy

Author: Brian Steele, John Chandler & Swarna Reddy
Language: English
Format: EPUB
Publisher: Springer International Publishing, Cham


6.8 Analysis of Residuals

The purpose of residual analysis is to investigate model adequacy. In some analyses involving small data sets, the analysis of residuals may also seek to investigate the origins of specific unusual observations. This discussion focuses on the different, though related, question of whether the conditions discussed at the beginning of the chapter are appropriate. First and foremost, the analysis investigates whether the model is an adequate approximation of the true relationship between the mean of the response variable and the predictor variables.

When hypothesis testing is conducted, the constant variance and independence conditions should be investigated, and if the sample size is small, the normality condition should also be examined. The investigation aims to determine whether the residuals are realizations of independent and normally distributed random variables with mean zero and constant variance. The constant variance condition was described earlier in the statement that all residuals have a common variance σ_ε². The normality condition is relatively easy to confirm or refute in most applications. However, the normality condition recedes in importance when n is much larger than p because the Central Limit Theorem implies that the parameter estimator will be nearly normal in distribution when n is relatively large, say, n > 100p. Thorough investigations of the constant variance and independence conditions are often difficult when p is large.
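A sketch of how these checks might be carried out in practice (the simulated data, the model, and the choice of three fitted-value groups are illustrative assumptions, not taken from the text): fit a least-squares model, extract the residuals, compare the residual variance across low, middle, and high fitted values, and compute the sample skewness and excess kurtosis as crude normality diagnostics.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data with one predictor and normal errors (illustrative only)
n = 500
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, n)

# Least-squares fit and residuals
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat
resid = y - fitted

# Constant-variance check: compare residual variances across the
# low, middle, and high thirds of the fitted values
order = np.argsort(fitted)
groups = np.array_split(resid[order], 3)
group_vars = [g.var(ddof=1) for g in groups]

# Normality check: for normal residuals, sample skewness should be
# near 0 and excess kurtosis near 0
z = (resid - resid.mean()) / resid.std(ddof=1)
skewness = (z ** 3).mean()
excess_kurtosis = (z ** 4).mean() - 3.0

print(group_vars, skewness, excess_kurtosis)
```

Roughly equal group variances and small skewness/kurtosis are consistent with the constant variance and normality conditions; large disparities or a marked pattern in the residuals versus the fitted values would cast doubt on them.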

Examining individual residuals, specifically outliers, is fruitful when n is small. When the sample size is small, an analyst may be able to glean information about the population or process by identifying a few individual residuals that are large in magnitude and examining the origin of the data pairs (y_j, x_j), j = 1, …, r. It may be that these data pairs possess characteristics related to Y that had not been previously recognized. For example, in an analysis of red blood cell counts using the Australian athletes data set, there may be unusual residuals attributable to the consumption of performance-enhancing drugs. In principle, the analyst may be able to identify the athlete by name. Furthermore, when sample sizes are small, individual data pairs may have a disproportionate influence on the calculation of the parameter estimates and hence on the fitted model. When n is large, undertaking an examination of influence is generally pointless because the influence of one or a few data pairs among many will be negligible. We omit a discussion of influence and focus on residuals originating from large data sets. James et al. [29] and Ramsey and Schafer [48] provide accessible discussions of influence.
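Flagging the few residuals largest in magnitude is straightforward. A minimal sketch (the function name and the cutoff r are illustrative choices, not from the text):

```python
import numpy as np

def largest_residuals(resid, r=3):
    """Return the indices of the r residuals largest in absolute value,
    ordered from largest to smallest magnitude."""
    resid = np.asarray(resid)
    return np.argsort(np.abs(resid))[::-1][:r]

# Example: the residuals at indices 2 and 4 stand out
resid = np.array([0.2, -0.1, 3.5, 0.4, -2.8])
print(largest_residuals(resid, r=2))  # [2 4]
```

The returned indices identify the data pairs (y_j, x_j) whose origins the analyst would then examine.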

As a contrast to the Australian athletes data, consider the bike share data. Identifying the individual residual associated with a particular day and hour as unusual has little practical value. Residual analysis is nonetheless important. We will soon see that a model of registered users containing only hour of the day as a predictor variable produces a preponderance of negative-valued residuals on weekend days. This observation about the residuals suggests that registered users are mostly commuting between home and work when they borrow bikes.
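The kind of pattern just described can be illustrated on synthetic data (the data-generating mechanism below is an assumption for illustration, not the book's bike share data): generate hourly counts that are lower on weekends, fit a model using hour of day as the only predictor, and average the residuals by day type.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic hourly counts over two weeks: weekday demand has morning
# and evening commute peaks; weekend demand is uniformly lower
# (illustrative assumption, not the actual bike share data)
hours = np.tile(np.arange(24), 14)
weekend = np.repeat([d % 7 >= 5 for d in range(14)], 24)
weekday_mean = (50
                + 40 * np.exp(-0.5 * ((hours - 8) / 2) ** 2)
                + 40 * np.exp(-0.5 * ((hours - 17) / 2) ** 2))
mean = np.where(weekend, 0.4 * weekday_mean, weekday_mean)
counts = rng.poisson(mean)

# Model with hour of day as the only predictor: the fitted value for
# an observation is the mean count for that hour across all days
hour_means = np.array([counts[hours == h].mean() for h in range(24)])
resid = counts - hour_means[hours]

# Weekend residuals average negative, weekday residuals positive,
# because the hour-only model pools both day types
print(resid[weekend].mean(), resid[~weekend].mean())
```

Because the hour-only fitted values pool weekdays and weekends, they overshoot weekend counts and undershoot weekday counts, producing exactly the preponderance of negative weekend residuals described above.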
