Cody's Data Cleaning Techniques Using SAS, Third Edition by Ron Cody
Author: Ron Cody
Language: eng
Publisher: SAS Institute
Published: 2017-03-15
This plot should clearly send up a red flag. You see a deposit close to $100,000 where the median deposit is around $10,000.
Using Regression Techniques to Identify Possible Errors in the Banking Data
You would expect deposits (or withdrawals) to be correlated with the account balance. It would be suspicious to deposit very large amounts into an account with a small balance. It would be even more unusual to withdraw amounts as large as or larger than the account balance.
You should, therefore, be able to run a regression between account balances and deposits and look for outliers. There are a number of ways to detect influential data points when performing simple or multiple regression. Most of these methods involve running the regression with all the data points included and then running it again with each data point removed in turn, looking for differences in either the predicted values or the slopes (coefficients of the independent variables).
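To make the leave-one-out idea concrete, here is a minimal sketch that fits a regression twice, once with all observations and once with a single observation removed, and then compares the slopes. The data set Toy, its values, and the names All_Fit, Drop_Fit, Compare, Slope_All, Slope_Drop, and Slope_Change are all hypothetical and chosen only for illustration; they are not part of the Banking data.

```sas
* Hypothetical toy data with one suspicious deposit (account A005);
data Toy;
   input Account $ Balance Deposit;
datalines;
A001 1000 100
A002 2000 210
A003 3000 290
A004 4000 405
A005 1500 9000
;

* Fit with all observations and save the coefficients;
proc reg data=Toy outest=All_Fit noprint;
   model Deposit=Balance;
run;
quit;

* Refit with the suspicious observation removed;
proc reg data=Toy(where=(Account ne 'A005')) outest=Drop_Fit noprint;
   model Deposit=Balance;
run;
quit;

* In an OUTEST= data set the variable Balance holds the slope estimate;
data Compare;
   merge All_Fit(keep=Balance rename=(Balance=Slope_All))
         Drop_Fit(keep=Balance rename=(Balance=Slope_Drop));
   Slope_Change = Slope_All - Slope_Drop;
run;

title "Slope With and Without Account A005";
proc print data=Compare noobs;
run;
```

A large value of Slope_Change relative to the slope itself suggests that the removed observation is influential. Repeating this by hand for every observation is impractical, which is why the influence statistics computed by PROC REG are normally used instead.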
Let's use PROC REG to regress Deposit against Balance. Keep in mind that, statistically, this is a bit unorthodox because you are using a data set (Banking) where there are multiple observations per account number. We can still obtain useful information by performing the regression in this manner.
After running the regression, you can use some of the diagnostic plots and tables to attempt to identify influential data points that might be data errors. Here are the PROC REG statements with requests for almost all of the diagnostic plots that are available:
Program 6.10: Using PROC REG to Regress the Account Balance Against Deposits
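title "Regression of Deposit by Balance";
/* Use only observations with a deposit; keep just the variables needed */
proc reg data=Clean.Banking(where=(Deposit is not missing)
                            keep=Account Deposit Balance)
         plots(only label)=(diagnostics(unpack)
                            residuals(unpack)
                            rstudentbypredicted dffits fitplot observedbypredicted);
   /* Identify observations in plots and tables by account number */
   id Account;
   /* INFLUENCE requests the table of influence statistics */
   model Deposit=Balance / influence;
   /* Save selected diagnostics for later inspection */
   output out=Diagnostics rstudent=Rstudent cookd=Cook_D
                          dffits=DFfits;
run;
quit;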
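One possible way to follow up on Program 6.10 is to list the observations in the Diagnostics data set whose statistics exceed commonly cited rule-of-thumb cutoffs. The sketch below is an illustration only: the data set name Flagged, the macro variable N_Obs, and the cutoffs themselves (an absolute studentized residual above 2, Cook's D above 4/n, and an absolute DFFITS above 2*sqrt(p/n), with p = 2 parameters) are assumptions, and you may prefer different thresholds for your data.

```sas
* Count the observations that were used in the regression (n);
proc sql noprint;
   select count(*) into :N_Obs trimmed
   from Diagnostics
   where Rstudent is not missing;
quit;

* Keep observations that exceed any of the assumed cutoffs;
* p = 2 parameters in this model (intercept and slope);
data Flagged;
   set Diagnostics;
   where abs(Rstudent) > 2
      or Cook_D > 4 / &N_Obs
      or abs(DFfits) > 2 * sqrt(2 / &N_Obs);
run;

title "Observations Flagged by Influence Diagnostics";
proc print data=Flagged;
   id Account;
   var Deposit Balance Rstudent Cook_D DFfits;
run;
```

Observations listed here are candidates for review, not confirmed errors. A large but legitimate deposit can be flagged just as easily as a data entry mistake.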
