5 Productivity Tips for Efficient Data Cleansing

Hand holding pocket watch

The previous blog post in this series 11 Essential Tips for Effective Data Collection dealt with practical data collection methods.

This blog post is the second in the series and deals with data cleaning (aka data cleansing or data scrubbing).

Data cleaning may not be everyone's favourite part of working with data, but it's definitely one of the most important - without clean data you can't even get started on anything else.

Yet despite data cleaning taking up around 60-80% of the typical data analyst's time it seems that it's still done in a mostly haphazard way.

Here I'm going to give you 5 great data cleaning techniques, show you how to improve data quality and help you build a simple data cleansing strategy that is quick, easy to follow and really works.


​The video below will give you a nice introduction to what's coming up, then when you're ready we'll go into a little more detail in the text.


OK, so you’ve followed the advice in our previous blog about how to collect your data and if you’ve been very careful you might just have a perfect dataset.

Well done!

Personally I’ve never seen a perfect dataset – it is the rarest of creatures. Most likely you will have to clean your data before you can start to analyse it.

Yet again the textbooks will give you little practical advice here, so let’s dive in and set a few ground-rules that will help you save time and keep your boss happy…

5 Productivity Tips for Efficient Data Cleaning

1. Make a Copy

You’ve got a ‘raw’ dataset that is essentially an electronic copy of all the paper-based data you have collected. If you have made an entry error in the electronic copy you can always check back to the original paper copy.

When you move on to the data cleaning you’re going to be changing the data and you need to be able to undo any cleaning mistakes you’ve made, and trust me – you’re going to make a few.

So create a duplicate worksheet of your dataset.

Call the original one ‘Raw Data’ and the new one ‘Cleaning In Progress’ until you’ve finished cleaning, then you can change the name to ‘Clean Data’, like this:

Duplicate worksheet

This is known as version controlling your dataset, and - believe it or not - this is one of the most important steps in data cleaning.

​"Version controlling your dataset is one of the most important steps in data cleaning" @eelrekab @chi2innovations #datascience #gooddata

Click to Tweet

Oh yes – and make sure both worksheets have got the Unique ID column.​

​2. Clean Your Data in a Separate Worksheet

When cleaning an individual column of data you’ll use a variety of different tools built into Excel, like ‘Find And Replace’.

When you use ‘Find And Replace’ will it operate only on the selected column or on the whole worksheet? Are you sure? Really, really sure?

Do all of the in-built functions work in the same way?

Get the answer wrong and you’ll find that you’ve just introduced errors across your entire dataset with no easy way to undo them (hitting ‘Undo’ doesn’t work here).

So when you want to clean a single column of data, copy that column into a spare worksheet and work on it there. Rename it ‘Spare Sheet’ or something similar. When you’re done you can copy back, replacing the previous uncleaned column. 

Spare sheet

This process is known as Extract - Transform - Load, or ETL, and is the Holy Trinity of Data Cleaning​:

The Holy Trinity of Data Cleaning: Extract - Transform - Load

It may take you a little more time, but it’s worth it – mistakes can be very costly and it pays to head them off at the pass.

Oh, I do hate that cliché…​

​3. Report Errors Back to the Original Source

It makes no sense cleaning your data if the same data has to be cleaned in exactly the same way time and time again.

If you’re using a shared dataset, such as a departmental database, make sure you report back to the original source any errors that you’ve found. Then, next time you have to analyse some more data from the same source you’ll have a lot less cleaning to do. 

Report errors

4. Use Excel Functions to do the Hard Work…

Whenever possible, try not to clean data manually.

One of the biggest sources of spelling errors, typos and incorrect entries comes from manual entry, so why use the same method that got you into trouble in the first place?

Excel has a shed-load of functions that can help with data cleaning, so use them.

For example, if you have a text-based column, you can use Excel’s ‘Remove Duplicates’ function to help you find and correct typos. 

Remove duplicates

Here’s how to use the Remove Duplicates function in Excel:

  1. Copy your selected variable to your Spare Sheet TWICE, in adjacent columns
  2. Select the right-most column
  3. In the Data tab, in the Data Tools group, click Remove Duplicates
  4. You’ll get a warning asking you to ‘Expand the selection’ or ‘Continue with the current selection’. We need to Continue with the current selection
  5. You’ll then be asked whether your data has headers. Click inside the tick-box, then OK
  6. Excel will now collapse the column into all its unique elements

You now have a list of all the unique elements in your data.

It might be useful at this stage to see how many of each of the unique elements are present.

To do this we need to do a tiny little bit of programming. Don’t panic! It’s really quite simple:

  1. If your variable is in column E and your list of unique elements is in column F, then in the cell to the right of your first unique element (cell G2), enter the following code:
    • =COUNTIF(E:E,F2)
    • Click Enter
  2. The number of times that that unique element appears in your list will be counted by Excel and will appear in the cell.

You’re done! In column E you should have the original data, in column F a list of all the unique elements in your data and in column G will be the number of times each element appears in your data.

It should look something like this: 

Remove Duplicates

You can now use ‘Find and Replace’ to correct misspelled entries, including correcting entries with the wrong case, like ‘case’, ‘Case’ or ‘CASE’. Do this with all the misspelled entries in your unique elements list until all errors have a count of zero.

So how long did that take? A couple of minutes?

How long would it have taken you to go through each cell by eye to find all the errors and correct them individually? Hours?

Learn all these little tips and tricks in Excel and your data cleaning processes might not feel like sticking pins in your eyes!

5. …And Use Excel Formulae to do the Even Harder Work

I cannot tell you how many weeks of my life I have lost – that I will never get back – trying to find the source of error that turn out to be a space at the beginning or end of the data in a cell.

You can’t see it, but it’s still there and it can wreak havoc when you start to do analyses.

Excel ignores spaces, so they can be incredibly difficult to detect, but other analysis and stats packages don’t ignore them and they treat the entry as something different.

Spaces are the bane of my life!!!

So what to do?

Excel has a few different formulae that can be used to detect and trim spaces and other unwanted characters, like:

  • TRIM()
  • CLEAN()
Clean Formula

So learn how to do simple coding in Excel and use these – and other – formulae.

I promise – it will definitely be time well spent!

Bonus Tip: DataKleenr – Automated Data Cleaning

While it’s great to get your hands dirty and learn the basics of data cleaning there’s really no substitute for cleaning your data quickly, accurately and automatically, so let us do the heavy lifting for you.

We’ve created DataKleenr, an automated data cleaning tool that cleans and classifies your data for you effortlessly, saving you time and money. Check it out – it might not be the worst thing that you do today!

Completely automated - DataKleenr is the fastest data cleaning program on the planet!


​Your Next Step

​Let's face it - you didn't visit this blog post because you're passionate about data ​cleaning, did you?

You didn't leave college or University saying "now that I'm free to forge my own path, I'm going to be a ... ​​a data ​cleaner!"

That sounds like a line straight out of a Monty Python sketch!

No, you're here because you need to ​clean your data​ and you want to know what the next steps are. Well, we're here to help.

I hope you've found this blog post useful, and to help you take your career to the next level we've created a series of video courses dedicated to data collection, data cleaning and data preparation to get your data analysis-ready in double quick time.

You can get the ​data cleaning video course right here:

Want to speed up your data cleaning? Check out these 5 productivity tips and spend less time doing it. #datacleaning #datatips Yet despite data cleaning taking up around 60-80% of the typical data analyst's time it seems that it's still done in a mostly haphazard way. Here I'm going to give you 5 great data cleaning techniques, show you how to improve data quality and help you build a simple data cleansing strategy that is quick, easy to follow and really works. #datacleaning #datatips
Do NOT follow this link or you will be banned from the site!