Data cleaning is a waste of time.
If the data had been collected properly in the first place there wouldn’t be any cleaning to do, and you wouldn’t now be faced with the prospect of weeks of cleaning to get your dataset analysis-ready.
Worse still, your boss won’t understand why your analysis report isn’t on his desk yet, a mere 48 hours after he’s asked for it. Bless him, he doesn’t understand – he thinks that cleaning data is just about clicking a few buttons in Excel and – ta da! – it’s all done. Even a monkey can do that, right?
And you won’t get much help from statistics books either. Data is messy and cleaning it can be difficult, time-consuming and costly. Not to mention it’s the least sexy thing you can do with a dataset.
Yet you’ve still got to do it, because, well, someone has to…
But it doesn’t have to be so difficult. If you're organised and follow a few simple data cleaning steps in Excel, your data cleaning can be simple, fast and effective.
Not to mention fun!
Well, not fun exactly, just not quite as coma-inducing.
It doesn’t matter whether you’re a scientist or an entrepreneur, in academia or in business, if you’re collecting data to try to answer some questions then you need to understand the fundamentals.
You’ll likely spend a lot of time observing, measuring, counting, classifying and quantifying what you see, and once you’ve collected your data you’re going to have to analyse it.
But hold on a minute. Don't try to run before you can walk - there is a process that you need to follow if you're to get your data analysis-ready in double-quick time (lest The Boss loses patience and sits on your head - that's never a good thing).
Your Data Cleaning Roadmap
So what you need is a plan of action, a roadmap that tells you in which order you need to tackle things, and here it is:
This all may seem a bit obvious, but I wish I had a pound for every time a researcher has brought me a dataset and tells me "I've already started doing some analyses", and when I look at the data there are lots of missing data points, there are errors and the data are not (yet) fit-for-purpose.
You wouldn't analyse your data before you've cleaned it, would you? And yet so many do exactly that - and then wonder why the results are not what they expect!
So I've put together a list of the 19 most important tips I can give you about cleaning your data and getting it ready for analysis.
Incidentally, these 19 tips are discussed in more detail in my FREE eBook Practical Data Cleaning - now in its fifth edition. If you don't have a copy you can get one right here:
It may seem a bit strange to start talking about data collection when the real issue is about data cleaning, but good data collection practices will ensure that your data cleaning process runs smoothly, and actually, 11 of the 19 tips are concerned with good data collection strategies:
Tip #1: Record Data on Paper First
This is where you get to design how you’re going to collect and store your data, which gets you to thinking in detail about your study.
Tip #2: Transfer Your Data to a Spreadsheet
If you’d have entered your data directly into a spreadsheet without having a paper copy, mistakes made would be permanent.
Tip #3: Enter Your Data on a Single Worksheet
Enter your data in a single worksheet. The amount of data you’re going to collect is unlikely to be a problem for Excel, and if it is, you should probably be collecting your data by more automated means.
Tip #4: Use a Unique ID Column
It’s absolutely crucial that you have the ability to restore the original order, and for this we start by using column A as a Unique ID column.
Tip #5: One Column per Variable
Don’t enter more than one piece of information into a single cell.
Tip #6: Row 1 is the Variable Name
The standard for pretty much all statistics programs is for the first row to be reserved for the name of the variable, so you should follow this too.
Tip #7: Every Cell Should Have Something In It
If you don’t have any data for a particular cell, don’t just leave it empty. There’s no information in a blank cell - use codes to tell you why there is no data in a cell.
Tip #8: Keep Great Notes
When you put codes in your cells you’ll need to keep notes about what they mean. This is perhaps the biggest and best tip I can give you – KEEP GREAT NOTES!!!
Tip #9: Be Consistent
A lot of the problems with data cleaning are caused because the data is not entered correctly or consistently in the first place.
Tip #10: Don’t Guess
Data accuracy is really important, so when entering your data don’t guess, approximate, round up or down – just enter the value exactly as registered on paper.
Tip #11: Zero is a Real Number
Zeros are real numbers and will be included in any calculations you make on your data, so don’t use the number zero as a code to mean ‘No data’.
Stage 2: Data Cleaning
Now that we've set a few ground rules on how to collect data, we should have a dataset that is in pretty good shape to start with. It won't be perfect though - a perfectly clean and analysis-ready dataset is a rare thing. It's likely that you'll have made a few errors when collecting and entering the data, so it's time to take a look at the essential principles of data cleaning.
Data Cleaning: "the least sexy thing you can do with a dataset" @eelrekab #data #cleandata
Tip #12: Make a Copy
Make a copy of your dataset so if you make any mistakes you’ll be able to check back and make corrections.
Tip #13: Clean Your Data in a Separate Worksheet
Cleaning data is a messy process and you WILL make mistakes so do it in a separate worksheet!
Tip #14: Report Errors Back to the Original Source
Next time you have to analyse some more data from the same source you’ll have a lot less cleaning to do.
Tip #15: Use Excel Functions to do the Hard Work...
Entry errors happened because someone entered data manually, so why use the same method that got you into trouble to get you out of it?
Tip #16: ...And Use Excel Formulae to do the Even Harder Work
I guarantee you won’t regret the time investment made to learn the Excel formulae you need.
Stage 3: Coding and Classifying Your Data
So you've collected and cleaned your data. Then, having checked your favourite stats program you discover that it won't accept data categories in text format (like Small, Medium, Large). So now you're going to have to code your data as integers (1, 2, 3) - and you're going to need to make a note of what all of your codes mean. After all, it's not a secret is it?!??
Tip #17: Keep a Code Sheet
For each variable you should list the codes that you’ve used and explain what they mean.
Tip #18: Identify Your Data Types
Take a little time to decide which of your variables are Ratio, and which are Interval, Ordinal or Nominal.
Stage 4: Data Integrity
A man who has committed a mistake and does not correct it is committing another mistake.
Real life follows rules, and your data must too.
When you collected and entered your data, did you check that the values were sensible? How old is the oldest patient in your dataset? If it's over 300 years then you have an error somewhere (hint: check the Date of Birth entered into the dataset). What about negative numbers or zeros? Take a count to check for the most obvious errors in your data:
Tip #19: Check That Your Data are Sensible
Test each variable for things like the minimum and maximum values, the number of zeros, positive and negative numbers, text entries, counts in categories. These kinds of simple descriptive statistics will help you identify the most common errors in your data.
Well, I hope you've enjoyed reading about our top 19 tips for data cleaning and will put them into practice with your own data.
The bottom line here is that effective data cleaning isn't a random act of "I'll clean the errors as I spot them". It is an organised process that starts with thinking about your study deeply before collecting data, proceeds through a set of pre-planned steps and ends with an analysis-ready dataset that is fit for purpose.
Most of all, effective data cleaning is a process that gives you, your boss and your statistician confidence that you really do know how to handle data correctly and efficiently.
Practical Data Cleaning Resources
This blog post is an accompaniment to the FREE eBook Practical Data Cleaning, and is here to help you take the next steps.
Below you'll find the best resources on learning about cleaning and preparing data that we've found on the web, and we update it frequently with new books, video courses, software and whatever else we can find (and create ourselves), so feel free to bookmark us, share us on the web and call in regularly to top up your data ninja skills.
Disclosure: This post may contain affiliate links. This means that if you click one of the links and make a purchase we may receive a small commission at no extra cost to you.
You can find further details in our TCs
Video Courses in The Hive
The Hive is our online learning portal where you can find courses on data analysis, statistics and machine learning.
Unlike other online course platforms, we actively encourage networking and collaboration, and The Hive is a place where you can chat and make friends in our exclusive members-only built-in social media community. You can also contact me and tell me what courses you need - I am in the habit of taking requests and creating personalised video courses!
In The Hive there are plenty of FREE courses (get a free plan to access these), and you can access the most in-depth courses for a small monthly or annual subscription (or purchase courses individually). Better still, each of the premium courses has a shortened free version so you can try-before-you-buy.
If you want access to exclusive, personalised content, then The Hive is the place to be!
Udemy is a great place to learn new stuff, not just about data, stats and AI, but about making model trains, how to apply make-up and, oh, just about anything else you can think of.
Courses (when they're on sale, which is very often) are typically priced at about 10-15 £/$/Euro. The upside is that the courses are very cheap, and usually very good. The downside is that courses aren't part of any formal programme, so you won't get any kind of certification.
If you want to fill gaps in your education or even learn whole topics, then Udemy is a great place to go.
*NOTE - the prices listed below are the full price, and are not automatically updated when a sale is on. If you want to find out the sale price, just click through!
Coursera Video Courses
Coursera offer a more considered approach to learning and offer individual and full degree courses, and you get certificates too that you can display on your LinkedIn profile to impress your future boss.
Prices are usually around the 50 £/$/Euro mark per course.
The great thing about Coursera is that all courses are taught by University professionals, so you know that these guys are the best and brightest in their field. On the downside, these courses are quite intensive and you need to be able to set aside a fair chunk of your time over a few weeks to complete the course. Unless you enjoy pulling all-nighters...