The previous blog post in this series 11 Essential Tips for Effective Data Collection dealt with practical data collection methods.
This blog post is the second in the series and deals with data cleaning (aka data cleansing or data scrubbing).
Data cleaning may not be everyone's favourite part of working with data, but it's definitely one of the most important - without clean data you can't even get started on anything else.
Yet despite data cleaning taking up around 60-80% of the typical data analyst's time it seems that it's still done in a mostly haphazard way.
Here I'm going to give you 5 great data cleaning techniques, show you how to improve data quality and help you build a simple data cleansing strategy that is quick, easy to follow and really works.
OK, so you’ve followed the advice in our previous blog about how to collect your data and if you’ve been very careful you might just have a perfect dataset.
Personally I’ve never seen a perfect dataset – it is the rarest of creatures. Most likely you will have to clean your data before you can start to analyse it.
Yet again the textbooks will give you little practical advice here, so let’s dive in and set a few ground-rules that will help you save time and keep your boss happy…
1. Make a Copy
You’ve got a ‘raw’ dataset that is essentially an electronic copy of all the paper-based data you have collected. If you have made an entry error in the electronic copy you can always check back to the original paper copy.
When you move on to the data cleaning you’re going to be changing the data and you need to be able to undo any cleaning mistakes you’ve made, and trust me – you’re going to make a few.
So create a duplicate worksheet of your dataset.
Call the original one ‘Raw Data’ and the new one ‘Cleaning In Progress’ until you’ve finished cleaning, then you can change the name to ‘Clean Data’, like this:
This is known as version controlling your dataset, and - believe it or not - this is one of the most important steps in data cleaning.
"Version controlling your dataset is one of the most important steps in data cleaning" @eelrekab @chi2innovations #datascience #gooddata
Oh yes – and make sure both worksheets have got the Unique ID column.
2. Clean Your Data in a Separate Worksheet
When cleaning an individual column of data you’ll use a variety of different tools built into Excel, like ‘Find And Replace’.
When you use ‘Find And Replace’ will it operate only on the selected column or on the whole worksheet? Are you sure? Really, really sure?
Do all of the in-built functions work in the same way?
Get the answer wrong and you’ll find that you’ve just introduced errors across your entire dataset with no easy way to undo them (hitting ‘Undo’ doesn’t work here).
So when you want to clean a single column of data, copy that column into a spare worksheet and work on it there. Rename it ‘Spare Sheet’ or something similar. When you’re done you can copy back, replacing the previous uncleaned column.
This process is known as Extract - Transform - Load, or ETL, and is the Holy Trinity of Data Cleaning:
It may take you a little more time, but it’s worth it – mistakes can be very costly and it pays to head them off at the pass.
Oh, I do hate that cliché…
3. Report Errors Back to the Original Source
It makes no sense cleaning your data if the same data has to be cleaned in exactly the same way time and time again.
If you’re using a shared dataset, such as a departmental database, make sure you report back to the original source any errors that you’ve found. Then, next time you have to analyse some more data from the same source you’ll have a lot less cleaning to do.
4. Use Excel Functions to do the Hard Work…
Whenever possible, try not to clean data manually.
One of the biggest sources of spelling errors, typos and incorrect entries comes from manual entry, so why use the same method that got you into trouble in the first place?
Excel has a shed-load of functions that can help with data cleaning, so use them.
For example, if you have a text-based column, you can use Excel’s ‘Remove Duplicates’ function to help you find and correct typos.
Here’s how to use the Remove Duplicates function in Excel:
- Copy your selected variable to your Spare Sheet TWICE, in adjacent columns
- Select the right-most column
- In the Data tab, in the Data Tools group, click Remove Duplicates
- You’ll get a warning asking you to ‘Expand the selection’ or ‘Continue with the current selection’. We need to Continue with the current selection
- You’ll then be asked whether your data has headers. Click inside the tick-box, then OK
- Excel will now collapse the column into all its unique elements
You now have a list of all the unique elements in your data.
It might be useful at this stage to see how many of each of the unique elements are present.
To do this we need to do a tiny little bit of programming. Don’t panic! It’s really quite simple:
- If your variable is in column E and your list of unique elements is in column F, then in the cell to the right of your first unique element (cell G2), enter the following code:
- Click Enter
- The number of times that that unique element appears in your list will be counted by Excel and will appear in the cell.
You’re done! In column E you should have the original data, in column F a list of all the unique elements in your data and in column G will be the number of times each element appears in your data.
It should look something like this:
You can now use ‘Find and Replace’ to correct misspelled entries, including correcting entries with the wrong case, like ‘case’, ‘Case’ or ‘CASE’. Do this with all the misspelled entries in your unique elements list until all errors have a count of zero.
So how long did that take? A couple of minutes?
How long would it have taken you to go through each cell by eye to find all the errors and correct them individually? Hours?
Learn all these little tips and tricks in Excel and your data cleaning processes might not feel like sticking pins in your eyes!
5. …And Use Excel Formulae to do the Even Harder Work
I cannot tell you how many weeks of my life I have lost – that I will never get back – trying to find the source of error that turn out to be a space at the beginning or end of the data in a cell.
You can’t see it, but it’s still there and it can wreak havoc when you start to do analyses.
Excel ignores spaces, so they can be incredibly difficult to detect, but other analysis and stats packages don’t ignore them and they treat the entry as something different.
Spaces are the bane of my life!!!
So what to do?
Excel has a few different formulae that can be used to detect and trim spaces and other unwanted characters, like:
So learn how to do simple coding in Excel and use these – and other – formulae.
I promise – it will definitely be time well spent!
Bonus Tip: DataKleenr – Automated Data Cleaning
While it’s great to get your hands dirty and learn the basics of data cleaning there’s really no substitute for cleaning your data quickly, accurately and automatically, so let us do the heavy lifting for you.
We’ve created DataKleenr, an automated data cleaning tool that cleans and classifies your data for you effortlessly, saving you time and money. Check it out – it might not be the worst thing that you do today!
Your Next Step
Let's face it - you didn't visit this blog post because you're passionate about data cleaning, did you?
You didn't leave college or University saying "now that I'm free to forge my own path, I'm going to be a ... a data cleaner!"
That sounds like a line straight out of a Monty Python sketch!
No, you're here because you need to clean your data and you want to know what the next steps are. Well, we're here to help.
I hope you've found this blog post useful, and to help you take your career to the next level we've created a series of video courses dedicated to data collection, data cleaning and data preparation to get your data analysis-ready in double quick time.
You can get the data cleaning video course right here: