The Ultimate Guide to Cleaning Data in Excel and Google Sheets: Proven techniques and best practices for cleaning business data. by Christopher Rafter

Author:Christopher Rafter [Rafter, Christopher]
Language: eng
Format: epub, pdf
Published: 2019-03-30T23:00:00+00:00


Problems that could fall into either category

(“Uh. I’ll need to get back to you.”)

Partial or incomplete values

Duplicates

These are general guidelines; nothing with data is ever absolute. However, misformatted data can be easy to correct with a few keystrokes, provided the data is otherwise complete and you know what the correct formatting should look like.

Example:

Phone Numbers

(555)-###-####

And

555-###-####

Both are complete and accurate phone numbers; they are just formatted differently. Simply pick one format and change all of the values to conform to it (don’t worry, we’ll show you how in a later chapter). This can be done without pursuing additional information: everything you need to fix it is within your source data.
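To make the idea concrete outside of a spreadsheet, here is a minimal Python sketch of "pick one format and conform everything to it." It is not the method from the later chapter; the function name `normalize_phone` and the sample values are made up for illustration, and it assumes 10-digit numbers with no country code.

```python
import re


def normalize_phone(raw: str) -> str:
    """Return a 10-digit phone number formatted as ###-###-####.

    Values that do not contain exactly 10 digits are returned unchanged,
    so they can be reviewed by hand instead of being silently altered.
    """
    digits = re.sub(r"\D", "", raw)  # strip everything that is not a digit
    if len(digits) != 10:
        return raw  # incomplete or unusual value: leave it alone
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"


# Hypothetical sample values in mixed formats.
phones = ["(555)-123-4567", "555-123-4567", "555.123.4567", "123-4567"]
cleaned = [normalize_phone(p) for p in phones]
print(cleaned)
# ['555-123-4567', '555-123-4567', '555-123-4567', '123-4567']
```

Notice the last value: it is missing an area code, so no amount of reformatting will fix it. That one needs more information, not more keystrokes.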

However, if a customer name record has the LastName but is missing the FirstName, and you don’t know their first name, there’s really nothing you can do to fix that. You’ll have to find somebody who knows the correct first name - which can be a challenge indeed.

You might also need to make some informed decisions through this process about what is worth fixing and what is not. If your analysis plans for the data later on don’t really rely on the First Name value, then great news: Your work is done!

With cleaning data there is always a trade-off between effort and results. Your goal should never be perfection; “good enough” is certainly good enough.

You may have noticed I slipped a third category in there: problems that may or may not be fixable with just the source data. Well, it really does depend on your data and the nature of the problems.

Duplicate records are a perfect example. If every field in the duplicate record is the same value as the other duplicate record, meaning they are 100% identical, then you can usually delete one, keep the other, and move on with life.
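As a rough illustration of the "100% identical" case, the following Python sketch keeps the first copy of each record and drops later copies only when every field matches. The function name `drop_exact_duplicates`, the field names, and the sample rows are all hypothetical.

```python
def drop_exact_duplicates(records):
    """Keep the first copy of each record whose fields ALL match."""
    seen = set()
    unique = []
    for rec in records:
        # Build a comparison key from every field, so a single
        # differing value makes the records count as different.
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


# Hypothetical customer rows: the first two are identical,
# the third differs only in the email address.
customers = [
    {"first": "Ann", "last": "Lee", "email": "ann@example.com"},
    {"first": "Ann", "last": "Lee", "email": "ann@example.com"},
    {"first": "Ann", "last": "Lee", "email": "a.lee@example.com"},
]
deduped = drop_exact_duplicates(customers)
print(len(deduped))  # 2
```

The near-duplicate with the different email address survives on purpose. Deciding which of those two to keep is a separate question that the data alone may not answer.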

However, what if the two records are identical except for one value, say the email address?

Now, which one do you delete?

Without more information, you run the risk of deleting the “good” one. So again, time to go find someone who knows which email address is correct.

Now let’s throw another angle in.

What if your records contain a “Last Updated” date stamp? One of the records was last updated 2 months ago; the other was last updated 6 years ago. Does that make it easier to decide? It sure does.
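Here is one way that decision could be sketched in Python: group records by the fields that identify the same customer, then keep whichever copy was updated most recently. The names `keep_most_recent`, `key_fields`, and `last_updated` are assumptions for this example, as are the sample rows, and "newest wins" is only a rule of thumb.

```python
from datetime import date


def keep_most_recent(records, key_fields, stamp_field="last_updated"):
    """For records sharing the same key fields, keep the newest one."""
    latest = {}
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        # Replace the stored record only if this one is newer.
        if key not in latest or rec[stamp_field] > latest[key][stamp_field]:
            latest[key] = rec
    return list(latest.values())


# Hypothetical near-duplicates that differ only in email and date stamp.
customers = [
    {"first": "Ann", "last": "Lee", "email": "ann@old-mail.example",
     "last_updated": date(2013, 1, 15)},
    {"first": "Ann", "last": "Lee", "email": "ann@example.com",
     "last_updated": date(2019, 1, 20)},
]
kept = keep_most_recent(customers, key_fields=["first", "last"])
print(kept[0]["email"])  # ann@example.com
```

A more recent date stamp is strong evidence, not proof. If the stakes are high, it is still worth confirming with someone who knows.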

Partial or incomplete values are similar: it depends on how much data is missing. Perhaps someone just forgot to type a few letters of a state name, so you have a value like:

“Oklahom”

You’d probably be justified in changing that to “Oklahoma” and getting on with life.
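A cautious way to automate that kind of fix, sketched in Python, is to complete a partial value only when it matches exactly one entry in a list of valid values. The function name `complete_state` is hypothetical, and the state list is deliberately cut down to three entries for the example; a real list would include every valid value.

```python
# Shortened list for illustration only.
STATES = ["Ohio", "Oklahoma", "Oregon"]


def complete_state(value, states=STATES):
    """Complete a truncated state name if the match is unambiguous."""
    prefix = value.strip().lower()
    if not prefix:
        return value  # nothing to go on
    matches = [s for s in states if s.lower().startswith(prefix)]
    # Only change the value when exactly one valid name fits.
    return matches[0] if len(matches) == 1 else value


print(complete_state("Oklahom"))  # Oklahoma
print(complete_state("O"))        # O  (ambiguous, left for review)
```

Leaving ambiguous values untouched matters here: "O" could be any of the three, and guessing would trade a visible problem for a hidden one.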

However, if the missing data is not something you can logically interpolate or infer, then you need to go get more data.

If you’re lucky, there are two other techniques you can use to fix Type 2 problems without obtaining additional data: interpolation and inference. Neither is going to be 100% accurate, but you can infer, and sometimes interpolate, missing values to produce a complete dataset.
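To give a feel for interpolation, here is a small Python sketch that fills gaps in a numeric series by drawing a straight line between the known values on either side. Linear interpolation is only one possible approach and is assumed here for illustration; the function name `interpolate_gaps` and the sales figures are made up.

```python
def interpolate_gaps(values):
    """Fill None entries that have known neighbors on both sides."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    # Walk through each pair of consecutive known positions.
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


# Hypothetical monthly sales with some months missing.
monthly_sales = [100.0, None, 120.0, None, None, 150.0, None]
print(interpolate_gaps(monthly_sales))
# [100.0, 110.0, 120.0, 130.0, 140.0, 150.0, None]
```

The final gap stays empty because there is no known value after it to interpolate toward. The filled values are estimates, so it is worth flagging them as such if the data will be shared.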


