Cody's Data Cleaning Techniques Using SAS, Third Edition by Ron Cody

Author: Ron Cody
Language: eng
Format: azw3
Publisher: SAS Institute
Published: 2017-03-15T04:00:00+00:00


This plot should clearly send up a red flag. You see a deposit close to $100,000 where the median deposit is around $10,000.

Using Regression Techniques to Identify Possible Errors in the Banking Data

You would expect deposits (or withdrawals) to be correlated with the account balance. It would be suspicious to deposit very large amounts into an account with a small balance. It would be even more unusual to withdraw amounts as large as or larger than the account balance.

You should, therefore, be able to run a regression between account balances and deposits and look for outliers. There are a number of ways to detect influential data points when performing simple or multiple regression. Most of these methods involve running the regression with all the data points included and then running it again with each data point removed, looking for differences either in the predicted values or the slopes (coefficients of the independent variables).
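For reference, three common deletion diagnostics can be written in terms of the ordinary residual and the leverage, so the "remove one point and refit" comparison does not require literally refitting the model once per observation. The notation below is the usual textbook one and is an assumption, not something defined in this chapter: e_i is the residual, h_i the leverage, p the number of parameters (including the intercept), s^2 the mean squared error from the full fit, and s_(i)^2 the mean squared error with observation i removed.

```latex
\begin{align*}
\text{RSTUDENT}_i &= \frac{e_i}{s_{(i)}\sqrt{1-h_i}} \\
\text{DFFITS}_i   &= \frac{\hat{y}_i-\hat{y}_{(i)}}{s_{(i)}\sqrt{h_i}}
                   = \text{RSTUDENT}_i\,\sqrt{\frac{h_i}{1-h_i}} \\
D_i               &= \frac{e_i^{2}}{p\,s^{2}}\cdot\frac{h_i}{(1-h_i)^{2}}
\end{align*}
```

The first measures how far an observation lies from the line fitted without it, the second measures the scaled change in that observation's own predicted value, and the third (Cook's D) summarizes the change in all fitted values.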

Let's use PROC REG to regress Deposit against Balance. Keep in mind that, statistically, this is a bit unorthodox because you are using a data set (Banking) where there are multiple observations per account number. We can still obtain useful information by performing the regression in this manner.

After running the regression, you can use some of the diagnostic plots and tables to attempt to identify influential data points that might be data errors. Here are the PROC REG statements with requests for almost all of the diagnostic plots that are available:

Program 6.10: Using PROC REG to Regress Deposits Against the Account Balance

title "Regression of Deposit by Balance";
proc reg data=Clean.Banking(where=(Deposit is not missing)
                            keep=Account Deposit Balance)
         plots(only label)=(diagnostics(unpack)
                            residuals(unpack)
                            rstudentbypredicted dffits fitplot observedbypredicted);
   id Account;
   model Deposit=Balance / influence;
   output out=Diagnostics rstudent=Rstudent cookd=Cook_D
                          dffits=DFfits;
run;
quit;
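Because the OUTPUT statement saves the diagnostics in the data set Diagnostics, one possible follow-up is to list only the observations that exceed some cutoff. The sketch below is not from the text. The data set name Flagged, the variables Cutoff_CookD and Cutoff_DFfits, and the cutoffs themselves (absolute Rstudent above 2, Cook's D above 4/n, absolute DFFITS above 2*sqrt(p/n)) are assumptions based on widely used rules of thumb. It also uses the number of observations in Diagnostics as n, which slightly overstates the sample size if any Balance values are missing.

```sas
/* Hypothetical follow-up: keep observations flagged by rule-of-thumb cutoffs. */
/* The cutoffs are conventional choices, not values given in the text.         */
data Flagged;
   set Diagnostics nobs=N_obs;            /* N_obs = observations in Diagnostics */
   P = 2;                                 /* parameters: intercept and Balance   */
   Cutoff_CookD  = 4 / N_obs;
   Cutoff_DFfits = 2 * sqrt(P / N_obs);
   /* Missing diagnostics compare as false, so those rows are not kept */
   if abs(Rstudent) > 2 or
      Cook_D > Cutoff_CookD or
      abs(DFfits) > Cutoff_DFfits;
   drop P;
run;

title "Observations Flagged by Influence Diagnostics";
proc print data=Flagged;
   id Account;
   var Deposit Balance Rstudent Cook_D DFfits;
run;
```

With a data set as large as a bank's transaction file, these cutoffs can flag many rows. Sorting Flagged by descending Cook_D or tightening the Rstudent cutoff is a reasonable way to focus on the most extreme cases first.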


