October 23

From Dirty Data to Clean Data – The Ultimate Practical Guide to Data Cleaning

Discover Data

Clean data doesn’t just happen, and neither does dirty data.

Dirty data is the result of poor data collection methods, and if you don’t know how to collect data cleanly and efficiently, you’ll have a nightmare of a time when it comes to data cleaning.

Similarly, if you don’t have good data cleaning methods, you’ll likely introduce more errors into your data, and ultimately, you’ll waste huge amounts of time and money.

In this post you’ll learn about the 6 common types of dirty data, and what you can do about each of them.

You’ll also learn about the 5 data collection methods in research that you’re likely to encounter.

You’ll then learn how important it is to have a robust data cleaning process that you can follow time and time again, and how to clean data in Excel using the 5 steps of data cleaning.

Finally, you’ll learn how the 4 data preprocessing steps using summary statistics in Excel will help you get your data fit-for-purpose and analysis-ready in double-quick time!



Table Of Contents

This post forms part of a series on getting your data fit-for-purpose and analysis-ready.

For more detail, choose from the options below:

What is Data Cleaning?

Data cleaning is the process of transforming dirty data into clean data.

That is, identifying inaccurate, incomplete and inconsistent data to improve the quality of your dataset until it is clean enough to be stored (in a database) or to be analysed (in a spreadsheet or a stats program).

The term ‘clean data’ refers to the correctness and validity of the data themselves, but not to how useful the data are to your study.

Why Is It Important to Clean Data?

It is not usually possible to analyse dirty data, and your chosen analysis package will usually moan and groan at you when you try to – it will throw an error. Without clean data, at best you will not be able to do any analysis. At worst you will get results that are inaccurate or even completely wrong, and you may not even be aware of it!

You clean data to identify and fix each of the 6 flaws found in dirty data, transforming them from dirty data to clean data. Each of these flaws needs its own individualised clean-up step: there is no one-size-fits-all!

By contrast, when you have clean data (and fit-for-purpose data, which is an entirely different issue), you can be confident that whatever analyses you perform, the results will likely match the real world. You will then be able to extract valuable, actionable insights that will help make the world a better place.

What Are the Benefits of Clean Data?

Having clean data means you gain better insights into the needs of your end users (customers, patients, etc.).

Here are 10 benefits of clean data:

  • Increased Use of Data
  • Reduced Data Costs
  • More Effective Analysis
  • More Accurate Results
  • Better Decision Making
  • Increased Performance and Productivity
  • Save Time and Money
  • Reduce Waste
  • Protect Reputation
  • Happier End Users

Having clean data prevents you from wasting time and money following strategies based on flawed analyses that could lead to faulty conclusions and ultimately loss of reputation.

By contrast, clean data allow you to run analyses faster, with more clarity and greater accuracy, and that’s what you’re looking for, isn’t it?


How Long Does It Take to Clean Data?

According to our good friend Gil Press in an article he wrote for Forbes, Data Scientists spend 20% of their time collecting data and 60% cleaning data – meaning that around 80% of their time is spent just preparing and managing data before they can analyse it!

Data Scientists Spend 60% of Their Time Cleaning Data

The 60% figure for the data cleaning stage, though, is largely the result of poor data collection and data entry practices. Just imagine how much time you could save if you got your data collection right first time!

*Psst

Data collection is the most important part of data cleaning.

 If you want to get reliable, fit-for-purpose, analysis-ready data, then stay tuned for more…

What Are the Consequences of Not Cleaning Dirty Data?

Obviously, the consequences of not cleaning dirty data are the opposite of the benefits listed above, but The 1-10-100 Rule illustrates quite eloquently the exponential cost of failing to clean your dirty data.

The 1-10-100 Rule:

  • $1 – Prevention
  • $10 – Correction
  • $100 – Failure

Simply put, according to the 1-10-100 rule, for every dollar that it costs you to prevent dirty data from entering your dataset, it will cost you $10 for cleaning data after the fact, and $100 to fix the problems arising from doing nothing.

The 1-10-100 Rule – The Cost of Dirty Data Clean Up

From Dirty Data to Clean Data – The 4 Phases of Data Cleaning

If you’ve been tasked with the job of going out and collecting data, analysing them and presenting your results in a clean, crisp, coherent way to your colleagues, your boss and the wider world, then you’re going to need a plan.

Actually, you’re going to need a few plans, because clean data doesn’t just happen.

A fit-for-purpose, analysis-ready, clean dataset doesn’t appear magically. Not unless it comes over the hill riding a fluffy pink unicorn.

In fact, there are 4 phases of data cleaning, and here they are:

  1. Understand the Causes of Dirty Data
  2. How to Collect Data
  3. How to Clean Data
  4. Data Preprocessing (using summary statistics)
The 4 Phases of Data Cleaning: Dirty Data, Data Collection, Data Cleaning, Data Preprocessing (Summary Statistics)

Your first task in achieving clean data is to understand the causes of dirty data, and then having a plan to deal with each of these causes (spoiler alert – there are 6 of them).

Once you have a plan for each of these causes of dirty data, you will know how to collect data quickly, cleanly and efficiently.

And yet, you won’t yet have clean data – your data will still have errors.

Why?

Because you are human, and humans make mistakes.

It doesn’t matter whether you’re collecting data manually (by human hand) or automatically (by machine), you’re going to get errors because it’s not possible to anticipate every type of data issue that the world can throw at you.

Dirty data is inevitable.

Resistance is futile.

Except, it isn’t, because you will have plans to:

  1. Stop most types of error getting into your data in the first place, and
  2. Deal with every type of error that sneaks past you and does get into your data

With good data collection methods, you can avoid most mistakes entering your data, so when you come to data cleaning, your task will be that much smaller.

You will still need a plan for data cleaning, though (another spoiler alert – this will be a 5-point plan).

And with good data cleaning methods, will you then get clean data?

Yes.

But even when you have wonderfully perfect, clean data, they will still not likely be fit-for-purpose or analysis-ready, and you will need another plan for this (final spoiler alert – it’s a 4-point plan).

In the final phase of data cleaning, you turn your clean data into fit-for-purpose, analysis-ready data by running a set of summary statistics that tell you where the last issues in your data are hiding (so they can be cleaned).

We’re going to go through each of these 4 phases individually, so you get the broad strokes of the kind of depth you need to go into to get an amazing dataset that you can analyse and keep your boss happy.

Let’s start with understanding the causes of dirty data, because this is the first step on the path to getting clean data.

*Psst – did you know that around 60% of a data analyst’s time is spent on data cleaning?

It doesn’t have to be this way, though – you can get clean data in a matter of minutes.

 If only you knew how…

For Clean Data – Understand the Causes of Dirty Data

Clean data are data that are accurate, complete and consistent, and there are 6 basic requirements. Clean data must be:

  1. Correct
  2. Complete
  3. Consistent
  4. Unique
  5. Valid
  6. Correctly Formatted

Conversely, dirty data are data that are inaccurate, incomplete or inconsistent, and there are 6 common types of dirty data. Dirty data exhibit at least one of these flaws:

  1. Incorrect
  2. Incomplete
  3. Inconsistent
  4. Duplicated
  5. Invalid
  6. Incorrectly Formatted
What Are Clean Data and Dirty Data?

Let’s take a brief look into each of these flaws found in dirty data. You will find more detail on all these points in the sister post All Dirty Data Have a Single Cause – How to Fix It.

Dirty Data Flaw #1: Incorrect Data

Data are correct when they are sufficiently accurate for your study.

For example, human height cannot be measured to the nearest metre – it is too imprecise. Similarly, spelling mistakes are examples of dirty data that need to be cleaned before analysis.

Dirty Data Flaw #2: Incomplete Data

Data are complete when there are sufficient data for you to accurately reflect what you are measuring in your study.

For example, missing data points can cause problems, particularly with small datasets. Decide whether you have enough data to confidently test your research hypothesis.

Dirty Data Flaw #3: Inconsistent Data

Data are consistent when they make sense across your dataset.

For example, for each patient, date of birth must always precede date of death. If they don’t (i.e., they were entered the wrong way round), your calculation of patient age will yield a negative number!
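A consistency check like this takes only a couple of lines to sketch. Here is a quick Python illustration (the patient dates are invented) of how a swapped pair of dates produces a negative age:

```python
from datetime import date

# Hypothetical patient record where the two dates were entered
# the wrong way round.
date_of_birth = date(2004, 3, 17)
date_of_death = date(1931, 6, 2)

# A negative age immediately flags the inconsistency.
age_years = (date_of_death - date_of_birth).days / 365.25
print(age_years < 0)  # True, so the dates were swapped
```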

Dirty Data Flaw #4: Duplicate Data

Data are unique when there is only one entry per sample (or customer or patient, etc.).

Duplicate data risk biasing your analyses, leading to the wrong conclusions. You need to find and eliminate duplicate data, both by column and by row.

Dirty Data Flaw #5: Invalid Data

Data are valid when they are correct in the real world right now.

Circumstances in the real world change, and your dataset must be kept up-to-date to reflect that. Things that change periodically might include addresses, phone numbers and marital status.

Dirty Data Flaw #6: Incorrectly Formatted Data

Data are correctly formatted when they are entered in an agreed format and to an agreed standard.

For example, dates in the US system (mm-dd-yyyy) are different to those in the UK system (dd-mm-yyyy). Make sure you have consistent data entry standards, otherwise you won’t know whether 09-11-2001 is a date in September or in November!
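To see just how ambiguous that string is, here is a quick Python sketch (not part of the Excel workflow, just an illustration): the same text parses to two different dates depending on which convention you assume.

```python
from datetime import datetime

raw = "09-11-2001"

us_date = datetime.strptime(raw, "%m-%d-%Y")  # US convention: month first
uk_date = datetime.strptime(raw, "%d-%m-%Y")  # UK convention: day first

print(us_date.month)  # 9  (September)
print(uk_date.month)  # 11 (November)
```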

Examples of Dirty Data

Here are a few examples of dirty data (and their clean data equivalents):

Examples of Dirty Data and Clean Data

Don’t forget – you will find more detail on all these points in the sister post All Dirty Data Have a Single Cause – How to Fix It.

*Psst, it’s me again – all these issues can be dealt with quickly and easily, and with a minimum of fuss.

 If you want to know how, keep reading…

How to Collect Data: The 5 Data Collection Methods

To understand how to collect data, there are five data collection methods in research:

  • Surveys and Questionnaires
  • Interviews
  • Direct Observation
  • Focus Groups
  • Existing Documents and Records
How to Collect Data: 5 Data Collection Methods in Research

Whichever of these data collection methods you use, you need to be able to collect data effectively, avoiding unnecessary errors from entering your dataset.

So, how do you do this?

Well, for this, you need to remember the 6 basic requirements of clean data from above. Let’s have a brief look at each of them to see how you can avoid getting dirty data into your dataset. You will find more detail on all these points in the sister post How to Collect Data – 5 Data Collection Methods + Tips, Techniques, Practical Steps.

Data Flaw #1: Incorrect Data

Data are correct when they are sufficiently accurate for your study.

Incorrect text entries can often be corrected easily by, for example, running a spell checker, but setting up data validation in Excel can save valuable time. On the other hand, incorrect numerical entries can be impossible to spot. For these, you need a double entry system to avoid them getting into your dataset.
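The idea behind double entry can be sketched in a few lines. In this hypothetical Python example (the column values are invented), the same forms are typed in twice and any disagreement flags a likely typo:

```python
# The same column of ages typed in twice, by two different people.
first_pass = ["34", "28", "32", "41"]
second_pass = ["34", "28", "320", "41"]

# Rows where the two passes disagree are likely typos and need
# checking against the original paper forms.
mismatches = [i for i, (a, b) in enumerate(zip(first_pass, second_pass)) if a != b]
print(mismatches)  # [2]
```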

Data Flaw #2: Incomplete Data

Data are complete when there are sufficient data for you to accurately reflect what you are measuring in your study.

If cells in your Excel dataset are empty, you need to understand why to avoid this happening. Running a pilot project will give valuable insights into why certain datapoints aren’t being collected.

Data Flaw #3: Inconsistent Data

Data are consistent when they make sense across your dataset.

It can be quite difficult to stop inconsistent data getting into your dataset (like date of birth and date of death being entered the wrong way round), and it is best to deal with these using summary statistics at the data preprocessing stage.

Data Flaw #4: Duplicate Data

Data are unique when there is only one entry per sample (or customer or patient, etc.).

The easiest way to avoid duplicate data getting into your dataset is by having a checklist of all the forms that have been entered into your spreadsheet, but it’s very easy to delete the duplicates using summary statistics at the data preprocessing stage.

Data Flaw #5: Invalid Data

Data are valid when they are correct in the real world right now.

This is a problem for databases, but not so much for datasets. Mostly, databases are meant to store data that are perpetually valid, whereas datasets are meant to store data that have been collected at a specific snapshot in time.

Data Flaw #6: Incorrectly Formatted Data

Data are correctly formatted when they are entered in an agreed format and to an agreed standard.

Data cleaning is seriously difficult when standards haven’t been set from the beginning! I can’t stress how important it is to get your data collection and data entry standards and processes right in the first place.

Quick reminder – you can find more detail on all these points in the sister post How to Collect Data – 5 Data Collection Methods + Tips, Techniques, Practical Steps.

How to Clean Data in Excel: 5 Data Cleaning Steps

Learning how to clean data in Excel is perhaps the single biggest time-saving thing you can do with your data.

When it comes to data cleaning in Excel, the process and the order are very important.

Here are the 5 data cleaning steps – in this order – that you should follow for quick and easy data cleaning in Excel:

  1. Remove Unwanted Spaces
  2. Remove Unwanted Characters
  3. Case Standardisation
  4. Clean Text Data
  5. Clean Numerical Data
How to Cleanse Data in Excel – 5 Data Cleansing Steps

So, how do you clean data in Excel?

Well, let’s briefly go through each of these 5 steps in order. You will find more detail on all these points in the sister post How to Clean Data in Excel – The 5 Data Cleaning Steps You Need to Know.

Data Cleaning Step 1: Remove Unwanted Spaces

Unwanted spaces strewn throughout the dataset are a rookie data analyst’s worst nightmare! Excel, though, does give you a solution – you can use the TRIM formula to strip out all the unwanted spaces from your data.
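If you ever need the same behaviour outside Excel, a rough Python equivalent of TRIM looks like this (note one small difference: str.split() also splits on tabs and newlines, not just spaces):

```python
def trim(text):
    # Mirrors Excel's TRIM: removes leading/trailing spaces and
    # collapses runs of internal spaces to a single space.
    return " ".join(text.split())

print(trim("  Mrs   Jane  Smith "))  # "Mrs Jane Smith"
```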

Data Cleaning Step 2: Remove Unwanted Characters

When you download data from a webpage, you might find unwanted characters dotted throughout your data. You can clean most (but not all) of these using the Excel CLEAN formula.
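Excel’s CLEAN strips the non-printing characters with ASCII codes 0 to 31; a rough Python equivalent of that behaviour:

```python
def clean(text):
    # Excel's CLEAN removes non-printing characters (ASCII codes 0-31);
    # this mirrors that behaviour for plain strings.
    return "".join(ch for ch in text if ord(ch) >= 32)

print(clean("Widget\x07 A\x1f"))  # "Widget A"
```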

Data Cleaning Step 3: Case Standardisation

Excel is largely case insensitive, so inconsistently cased entries can easily slip through unnoticed. Excel does, though, give you a few tools to deal with this: you can use the LOWER, UPPER or PROPER formulae.
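Outside Excel, the same three transformations map onto Python’s string methods (str.title is only a rough stand-in for PROPER; the two differ on some edge cases):

```python
name = "jOHN mCdONALD"

print(name.lower())  # "john mcdonald"   (like LOWER)
print(name.upper())  # "JOHN MCDONALD"   (like UPPER)
print(name.title())  # "John Mcdonald"   (roughly like PROPER)
```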

*Psst, it’s you-know-who…

Did you know there’s a way you can cleanse data of ALL unwanted spaces, ALL unwanted characters AND case control your ENTIRE dataset in ONE amazing step?

This used to take me 2 weeks to do, but now data cleaning in Excel takes me less than 60 seconds!

 Stay tuned for more…

Data Cleaning Step 4: Clean Text Data

Typos and spelling mistakes are some of the easiest errors in data to find and correct – if you know how! The easiest way (but most time-consuming) is using Excel’s built-in Spell Checker, but my favourite – and by far the quickest – is Excel’s VLOOKUP formula.
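The principle behind the VLOOKUP trick is simply to flag every entry that does not match an approved list. Sketched in Python (the category values are hypothetical):

```python
# Hypothetical lookup list of valid spellings for a category column.
valid = {"Male", "Female"}
entries = ["Male", "Femal", "female", "Male "]

# Anything not found in the lookup list is flagged, just as an
# unmatched VLOOKUP returns #N/A in Excel.
flagged = [e for e in entries if e not in valid]
print(flagged)  # ['Femal', 'female', 'Male ']
```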

Data Cleaning Step 5: Clean Numerical Data

Numerical entries contaminated with text can be identified and corrected quite easily. There are a few ways of doing this, and my personal favourite is the ISNUMBER formula. You use this to get a list of all the contaminated entries, which can then be corrected.
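The same idea in Python (a sketch with invented entries): try to convert each entry to a number and flag the ones that fail, which is roughly what ISNUMBER does for you in Excel.

```python
def is_number(value):
    # Rough stand-in for Excel's ISNUMBER applied to text entries.
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False

ages = ["32", "45", "unknown", "29 yrs", "51"]
contaminated = [a for a in ages if not is_number(a)]
print(contaminated)  # ['unknown', '29 yrs']
```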

Don’t forget, you can find much more detail on all these points in the sister post How to Clean Data in Excel – The 5 Data Cleaning Steps You Need to Know.

Summary Statistics in Excel – 4 Data Preprocessing Steps

Summary statistics are used after data cleaning to find problems in your data that are not obvious, before you proceed with your analysis.

Here are the 4 data preprocessing steps you will need to follow using summary statistics:

  1. Duplicate data
  2. Invalid entries in text data
  3. Invalid entries in numerical data
  4. Outliers

Let’s go through these briefly to get an idea of how to do it, but you can find much more detail on all these points in the sister post Summary Statistics in Excel: The 4 Essential Steps for Fit-For-Purpose, Analysis-Ready Data.

Data Preprocessing With Summary Statistics: Duplicate Data

Duplicate data are redundant data in the sense that you may have the same data for a single person repeated in your dataset.

You need to eliminate the duplicate rows and columns, but before you do, you might like to inspect them first, and you do that by using Excel’s Find Duplicates function.

This will highlight all the rows and columns that are identical, and then you can use Excel’s Remove Duplicates function to eliminate the duplicates.
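The same inspect-then-remove workflow can be sketched in Python (the patient rows are invented): collect the repeats for review first, then keep only the first occurrence of each row.

```python
rows = [
    ("P001", "Jane Smith", 32),
    ("P002", "John Doe", 45),
    ("P001", "Jane Smith", 32),  # duplicate entry of the first row
]

seen, duplicates, unique_rows = set(), [], []
for row in rows:
    if row in seen:
        duplicates.append(row)   # inspect these before deleting them
    else:
        seen.add(row)
        unique_rows.append(row)

print(duplicates)        # [('P001', 'Jane Smith', 32)]
print(len(unique_rows))  # 2
```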

Data Preprocessing With Summary Statistics: Invalid Entries in Text Data

At this stage you are looking for entries that do not make sense, so you can either correct or eliminate them.

Here are the summary statistics you will need for your categorical columns of data:

  • Counts
  • Number of Categories
  • Counts in Each Category
  • Blank cells
What are Summary Statistics in Excel – Text Data
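These four statistics take only a few lines to compute. Here is a Python sketch with an invented Smoker column, where the counts immediately expose inconsistent spellings and a blank cell:

```python
from collections import Counter

smoker = ["Yes", "No", "no", "Yes", "", "N"]

counts = Counter(smoker)
print(len(smoker))  # 6: total count
print(len(counts))  # 5: number of categories (far too many for a Yes/No column!)
print(counts)       # counts in each category
print(counts[""])   # 1: blank cells
```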

Data Preprocessing With Summary Statistics: Invalid Entries in Numerical Data

Let’s look at an example. In an Age column, a fat-fingered person might have clumsily typed the number 320 instead of 32. While 320 is a perfectly legitimate number, it is clearly not correct in the context of the age of a human being.

To find invalid entries like these, you use summary statistics, and here are some of the summary statistics you will need for your numerical columns of data:

  • Counts
  • Minimum & Maximum
  • Mean & Median
  • Negative numbers
  • Zeros & Blank cells
What are Summary Statistics in Excel – Numerical Data
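Here is a Python sketch of those statistics on an invented Age column containing the fat-fingered 320. The gap between the maximum and the rest, and between the mean and the median, gives the game away:

```python
import statistics

ages = [34, 28, 320, 41, 0, 56]  # 320 is a mistyped 32

print(len(ages))                  # count
print(min(ages), max(ages))       # 0 320, and both extremes look suspicious
print(statistics.mean(ages))      # dragged up to ~79.8 by the outlier
print(statistics.median(ages))    # 37.5
print(sum(a < 0 for a in ages))   # 0: negative numbers
print(sum(a == 0 for a in ages))  # 1: zeros (is an age of 0 valid here?)
```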

Data Preprocessing With Summary Statistics: Outliers

An outlier in numerical data is a data point that differs substantially from the rest of the data points; in other words, it is unusually large or small compared with the other values.

Outliers skew your results, meaning that you may get a biased result that doesn’t accurately represent the real world.

So, you will need to identify your outliers and then deal with them in one of three ways:

  • Correct them (they might be incorrect)
  • Delete them (because they skew your data)
  • Leave them in (because they are real, accurate data)
Summary Statistics – Handling Outliers

Identifying outliers in Excel can be tricky and involves the QUARTILE formula, but if you know how to do it, you will get a list of outliers in seconds, which you can then assess and make qualified decisions about.
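The same idea sketched in Python (the ages are invented): statistics.quantiles with method="inclusive" matches the interpolation used by Excel's QUARTILE, and Tukey's 1.5 × IQR fences do the rest.

```python
import statistics

ages = [28, 31, 34, 36, 38, 41, 43, 45, 47, 320]

# "inclusive" matches the interpolation used by Excel's QUARTILE.
q1, _, q3 = statistics.quantiles(ages, n=4, method="inclusive")
iqr = q3 - q1

# Tukey's rule: values beyond 1.5 * IQR from the quartiles are outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [a for a in ages if a < lower or a > upper]
print(outliers)  # [320]
```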

Just a quick reminder – you can find much more detail on all these points in the sister post Summary Statistics in Excel: The 4 Essential Steps for Fit-For-Purpose, Analysis-Ready Data.

*Psst, I’m back…

An outlier analysis can be a bit long-winded if you’re coding it yourself in Excel. There is a faster way, though – and we know exactly how to do it!

If you want to know how to do all the different methods of data cleaning and preparation in Excel in minutes – and get to know all the pros and cons – keep reading…

Data Cleaning in the Dirty Data Dojo

Did you know that data analysts spend a whopping 60% of their time on data cleaning and data preparation?

Well, it doesn’t have to be this way.

Over several years I created a method of cleaning and preprocessing data in Excel that will cut your data preparation time from 2 weeks to around 2 hours – or even less!

This method has been honed with hundreds of researchers who have worked on thousands of datasets, and, after many years of being pestered, I finally relented and created a series of video courses to teach you how to clean and prepare data in Excel (and in Python and R).

This series of courses is called the Dirty Data Dojo, and you can get the Data Cleaning course right here:

Dirty Data Dojo - Data Cleaning

UNIQUE VIDEO COURSE

In less than 2 hours

your data can be:

  • Clean
  • Fit-For-Purpose
  • Analysis Ready

From Dirty Data to Clean Data – The Ultimate Practical Guide to Data Cleaning: Summary

In this post you’ve learnt that clean data and dirty data don’t just happen.

Dirty data are the result of poor data collection and data cleaning methods, whereas clean data result from good data collection and data cleaning methods.

In this post you’ve had a little snapshot of what it takes to get clean data and avoid the 6 common types of dirty data.

You’ve learnt about the 5 data collection methods in research that you’re likely to encounter.

And you’ve learnt how important it is to have a robust data cleaning process that you can follow time and time again, and we showed you how to clean data in Excel by introducing you to the 5 steps of data cleaning.

Finally, you learnt how the 4 data preprocessing steps using summary statistics in Excel can help you get your data fit-for-purpose and analysis-ready in double-quick time!

I hope you learnt something useful in this post, and that you’ll check out our more in-depth articles on the subjects (the jump-station to the other articles is below).

This post forms part of a series on getting your data fit-for-purpose and analysis-ready.

For more detail, choose from the options below:


About the Author

Lee Baker is an award-winning software creator that lives behind a keyboard in a darkened room. Illuminated only by the light from his monitor, he aspires to finding the light switch.

With decades of experience in science, statistics and artificial intelligence, he has a passion for telling stories with data.

His mission is to help you discover your inner Data Ninja!
