The Ultimate Guide to Cleaning Data in Excel and Google Sheets: Proven techniques and best practices for cleaning business data. by Christopher Rafter
Author: Christopher Rafter
Language: eng
Format: epub, pdf
Published: 2019-03-30T23:00:00+00:00
Problems that could fall into either category
(“Uh. I’ll need to get back to you.”)
Partial or incomplete values
Duplicates
These are general guidelines; nothing with data is ever absolute. However, misformatted data can be easy to correct with a few keystrokes, provided the data is otherwise complete and you know what the correct formatting should look like.
Example:
Phone Numbers
(555)-###-####
And
555-###-####
Both are complete and accurate phone numbers; they are just formatted differently. Simply pick one format and change all of the values to conform to it (don’t worry, we’ll show you how in a later chapter). This can be done without pursuing additional information: everything you need to fix it is in your source data.
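As a rough illustration outside the spreadsheet, here is a minimal Python sketch of the same idea. It assumes 10-digit US numbers and picks ###-###-#### as the target format; the function name `normalize_phone` and the sample values are made up for this example and are not from the book.

```python
import re


def normalize_phone(raw: str) -> str:
    """Return a 10-digit phone number formatted as ###-###-####.

    Values that do not contain exactly 10 digits are returned unchanged,
    because they cannot be fixed from the source data alone.
    """
    digits = re.sub(r"\D", "", raw)  # strip everything that is not a digit
    if len(digits) != 10:
        return raw
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"


# Hypothetical sample values in mixed formats.
phones = ["(555)-123-4567", "555-123-4567", "555.123.4567", "123-4567"]
cleaned = [normalize_phone(p) for p in phones]
```

The last value has only seven digits, so it is left alone: that is an incomplete value, not a formatting problem.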
However, if a customer name record has the LastName but is missing the FirstName, and you don’t know their first name, there’s really nothing you can do to fix that. You’ll have to find somebody who knows the correct first name - which can be a challenge indeed.
You might also need to make some informed decisions through this process about what is worth fixing and what is not. If your analysis plans for the data later on don’t really rely on the First Name value, then great news: Your work is done!
With cleaning data there is always a trade-off between effort and results. Your goal should never be perfection; “good enough” is certainly good enough.
You may have noticed I slipped a third category in there: problems that may or may not be fixable with just the source data. Well, it really does depend on your data and the nature of the problems.
Duplicate records are a perfect example. If every field in the duplicate record is the same value as the other duplicate record, meaning they are 100% identical, then you can usually delete one, keep the other, and move on with life.
However, what if the two records are identical except for one value, say the email address?
Now, which one do you delete?
Without more information, you run the risk of deleting the “good” one. So again, time to go find someone who knows which email address is correct.
Now let’s throw another angle in.
What if your records contain a “Last Updated” date stamp? One of the records was last updated 2 months ago; the other was last updated 6 years ago. Does that make it easier to decide? It sure does.
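The duplicate scenarios above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`name`, `email`, `last_updated`) and the sample records are invented, and it assumes that a matching name really does mean the same customer. The most recent record is usually the better bet, but it is not guaranteed to be the correct one.

```python
from datetime import date

# Hypothetical records: one exact duplicate pair, one pair differing by email.
records = [
    {"name": "Ann Lee", "email": "ann@example.com", "last_updated": date(2019, 1, 15)},
    {"name": "Ann Lee", "email": "ann@example.com", "last_updated": date(2019, 1, 15)},
    {"name": "Bob Roy", "email": "bob@old.example", "last_updated": date(2013, 3, 1)},
    {"name": "Bob Roy", "email": "bob@new.example", "last_updated": date(2019, 1, 30)},
]


def dedupe(rows, key="name", stamp="last_updated"):
    """Keep one row per key, preferring the most recently updated row."""
    best = {}
    for row in rows:
        current = best.get(row[key])
        # Exact duplicates tie on the date stamp, so the first copy seen is kept.
        if current is None or row[stamp] > current[stamp]:
            best[row[key]] = row
    return list(best.values())


deduped = dedupe(records)
```

Without a date stamp there is nothing for the comparison to work with, and you are back to finding someone who knows which record is right.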
Partial or incomplete values are similar: it depends on how much data is missing. Did someone just forget to type in a few letters of a State Name, so you have a value like:
“Oklahom”
You’d probably be justified in changing that to “Oklahoma” and getting on with life.
However, if the missing data is not something you can logically interpolate or infer, then you need to go get more data.
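One cautious way to automate the “Oklahom” kind of fix is to complete a value only when it matches exactly one entry in a list of valid values. The sketch below assumes the partial value is a prefix of the correct one; the `STATES` list is a short sample for illustration, and `complete_state` is a hypothetical helper name.

```python
# A short sample of valid values; a real list would hold all the state names.
STATES = ["Ohio", "Oklahoma", "Oregon", "New Jersey", "New Mexico", "New York"]


def complete_state(value, states=STATES):
    """Complete a truncated state name only when the match is unambiguous."""
    text = value.strip().lower()
    matches = [s for s in states if s.lower().startswith(text)]
    # Zero matches or several matches means we cannot infer the answer safely.
    return matches[0] if len(matches) == 1 else value


fixed = complete_state("Oklahom")    # one candidate, so it is completed
unclear = complete_state("New")      # three candidates, so it is left as-is
```

Anything the helper leaves unchanged is a value you cannot safely infer, which puts it in the “go get more data” pile.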
There are two other techniques that you can use to fix Type 2 problems without obtaining additional data, if you’re lucky: Interpolation & Inference. These are by no means going to be 100% accurate, but you can infer and sometimes interpolate missing values to produce a complete dataset.
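To make interpolation concrete, here is a minimal sketch that fills gaps in an ordered series of numbers by drawing a straight line between the known values on either side. It assumes the values are evenly spaced (monthly figures, say) and that a straight line is a reasonable guess; the function name and the sample data are invented for illustration, and the filled values are estimates, not facts.

```python
def interpolate_gaps(values):
    """Fill None entries that sit between two known values, linearly."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


# Hypothetical monthly sales with three missing months in the middle
# and one at the end.
monthly_sales = [100.0, None, 140.0, None, None, 200.0, None]
filled = interpolate_gaps(monthly_sales)
```

The final gap stays empty because there is no later value to interpolate toward; guessing it would be extrapolation, which is a riskier step.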
