October 26

How to Clean Data in Excel – The 5 Data Cleaning Steps You Need to Know

Discover Data

Did you know that up to 60% of a data analyst’s time is spent on data cleaning?

Well, if you’ve ever had to clean data in Excel (or in any other medium), then you’ll know that data cleaning (aka data cleansing) takes up a ridiculous amount of time.

It doesn’t have to be like this, though. Data cleaning isn’t actually about cleaning data – it’s about being organised.

That’s what this post is about – data cleaning in Excel and having a robust process that you can follow time and time again.

If you’ve ever wondered how to clean data in Excel rapidly, repeatedly and reliably, then this is the post for you.

You’ll learn about the 3 Rs of Data Cleaning, and the 5 data cleaning steps you need to follow to cleanse data in Excel.

I’ll also teach you a process that I call The Holy Grail of Data Cleaning.

More...

This post forms part of a series on getting your data fit-for-purpose and analysis-ready.

For more detail, choose from the options below:

What is Data Cleaning?

Data cleaning (aka data cleansing) is the process of transforming dirty data into clean data.

That is, identifying inaccurate, incomplete and inconsistent data, and – by using the 3 Rs of Data Cleaning: Repair, Replace, and Remove – gradually improve the quality of your dataset until you reach a point where you can say that ‘it’s good enough’.

In reality, ‘clean data’ is the rarest of things, and no matter how much work you do with it, there will inevitably be errors, bias, inconsistencies and all manner of dirt left in your supposedly clean data.

That’s just a fact of working with data.

And in any case, even if you’ve collected data and they all follow the 6 requirements of clean data, that doesn’t necessarily mean that your data are useful or fit-for-purpose.

For example, if you’re doing a study on how many ice cream sales are made at different times of the year it wouldn’t be fit or proper to collect sales data on haemorrhoid cream. Such data might be accurate, complete and consistent, but they are wholly irrelevant.

The term ‘clean data’ refers to the correctness and validity of the data themselves, but not to how useful the data are to your study.

What Data Do You Need to Clean?

When it comes to data that you’ll find (or collect) in a spreadsheet, there are 3 types:

  • Qualitative data (text data)
  • Quantitative data (numerical data)
  • Mixed data (both text and numerical)
Data Cleaning in Excel – 3 Types of Spreadsheet Data

Qualitative Data

Qualitative data are entries that are entirely made up of text elements, such as:

T-shirt Size:

  • Small
  • Medium
  • Large

Quantitative Data

Quantitative data are those entries that are wholly composed of numbers, such as:

Height:

  • 1.56
  • 1.71
  • 1.84

Mixed Data

And Mixed data are those that are a mix of text and numbers, such as:

Tumour Grade:

  • Grade 1
  • Grade 2
  • Grade 3

How to Cleanse Data in Excel: ETL – The Holy Grail of Data Cleansing

Most data rookies when confronted with their first dataset will typically go about their data cleaning in Excel manually. They will go cell by cell trying to spot errors by eye, then correct them manually.

This is the slowest, most inefficient way to clean data in Excel, and will often introduce new errors along the way too.

There is a better way to cleanse data in Excel, though.

The most important part about data cleansing is understanding that you need to have a process that is:

  1. 1
    Robust
  2. 2
    Repeatable
  3. 3
    Won’t introduce new errors

Below I will teach you how to clean data in Excel by showing you 5 data cleaning steps to help you get organised and build a robust data cleansing process, but first I want to introduce you to a process known as ETL – what I call The Holy Grail of Data Cleansing!

What Does ETL Mean?

ETL is a 3-stage process known as:

  • Extract
  • Transform
  • Load

ETL - The Holy Grail of Data Cleaning #etl #datacleansing #analysis @chi2innovations

Click to Tweet

What Is the Purpose of ETL in Data Cleaning?

The purpose of ETL in data cleaning is to ensure that in the process of data cleansing you don’t accidentally introduce new errors to your dataset.

To do this, you never cleanse data in Excel within your live worksheet.

How to Cleanse Data in Excel Using the ETL Process

When you’re ready to make an operation on a portion of your dataset (quite often an individual variable or a grouping of similar variables), you:

  1. 1
    Extract from your live worksheet the data you wish to work with, moving them to a spare worksheet,
  2. 2
    Transform these data according to your data cleansing protocols until you are satisfied that they are correct
  3. 3
    Load these data back into your live worksheet.

If, along the way, you make a mistake in your data cleaning steps, you will be introducing errors into the small, copied sample of your dataset, not the whole dataset.

Mistakes at this stage can be recovered from very quickly and easily without losing much time, and without losing any data, so knowing how to cleanse data in Excel using ETL is incredibly important!

How to Cleanse Data in Excel – ETL is the Holy Grail of Data Cleansing

So now that you understand how ETL can be used to help with data cleansing, it’s time to move on to the 5 data cleaning steps…

How to Clean Data in Excel: 5 Data Cleaning Steps

Learning how to clean data in Excel is perhaps the single biggest time-saving thing you can do with your data.

When it comes to data cleaning in Excel, the process and the order are very important.

If you don’t have a process, you’re stumbling about in the dark, and your data cleansing will take you weeks.

And if you do your data cleaning in Excel in the wrong order you will introduce new errors into your data that you will then waste time cleaning up all over again.

FYI – it should take you no longer than a couple of hours to cleanse data in Excel to get your data analysis-ready and fit-for-purpose.

Here are the 5 data cleaning steps – in this order – that you should follow for quick and easy data cleaning in Excel:

  1. 1
    Remove Unwanted Spaces
  2. 2
    Remove Unwanted Characters
  3. 3
    Case Standardisation
  4. 4
    Clean Text Data
  5. 5
    Clean Numerical Data
How to Clean Data in Excel – 5 Data Cleaning Steps

Data Cleaning Step 1: Remove Unwanted Spaces

Unwanted spaces strewn throughout the dataset are a rookie data analyst’s worst nightmare!

The problem is that Microsoft Excel ignores spaces, so in Excel, the following are identical:

  • My data (no unwanted spaces)
  •              My data (unwanted leading spaces)
  • My         data (unwanted interstitial spaces)
  • My data              (unwanted trailing spaces)
  •              My                 data              (unwanted spaces everywhere!)

Upload them to an analysis program, though, and each of these entries is seen as completely different data.

The worst thing is, though, that you usually can’t see where the unwanted spaces are and Excel won’t tell you whether they exist or where to find them.

Like I said – nightmare!

I honestly wish I had a pound/dollar/euro for every unwanted space that I’ve had to clean out of data. I could have retired years ago!

Excel, though, does give you a solution – you can use the TRIM formula to strip out all the unwanted spaces from your data*, like this:

Data Cleansing Step 1 – Use TRIM to Remove Unwanted Spaces

*Psst – if you want a pro-tip, you can use TRIM to cleanse data of all unwanted spaces from your entire dataset in one awesome data ninja move!

Data Cleaning Step 2: Remove Unwanted Characters

An issue that you’ll find when you download data from a webpage is that of unwanted characters dotted throughout your data.

Each of these characters has a unique character code, and if you don’t know what the character code is, you can’t find erroneous cells and you can’t clean them…

…unless you know how to clean data using the Excel CLEAN formula.

With this you can strip out most (but not all) of the unwanted characters*, like this:

Data Cleaning Step 2 – Use CLEAN to Remove Unwanted Characters

*Psst, it’s me again – you can also use CLEAN to cleanse data of ALL the unwanted characters from your entire dataset in the same awesome data ninja move as the one before.

 If only you knew how…

Data Cleaning Step 3: Case Standardisation

Did you know that Excel is mainly case insensitive?

It’s true, so whatever mistakes you make with the case of your text entries (Case, cASE, case, CAse, etc.), Excel won’t help you out in the slightest!

Data Cleaning – Excel is Case Insensitive

Fortunately, though, Excel does at least give you a few tools to help you deal with this – you can use the LOWER, UPPER or PROPER formulae, like this:

Cleaning Step 3 – Standardise Case of Text Data

*Psst, it’s you-know-who…

Did you know there’s a way you can cleanse data of ALL unwanted spaces, ALL unwanted characters AND case control your ENTIRE dataset in ONE amazing step?

This used to take me 2 weeks to do, but now data cleaning in Excel takes me less than 60 seconds!

 Stay tuned for more…

Data Cleaning Step 4: Clean Text Data

Typos and spelling mistakes are some of the easiest errors in data to find and correct – if you know how! Obviously, you’re not going to be doing manual data cleaning in Excel, so you need a way to find typos and clean them automatically (or, at least, semi-automatically), and there are a number of ways of doing this – each of which has various pros and cons.

Perhaps the easiest way (and also the most time-consuming with larger datasets) is by using Excel’s built-in Spell Checker. The Find & Replace function can also be used, but my favourite – and by far the quickest – is Excel’s VLOOKUP formula. It may be Excel’s biggest secret that VLOOKUP can be used for data cleaning in Excel – that’s not its purpose – but it’s also one of its most powerful data cleansing tools!

Data Cleaning Step 4 – Clean Text Data

*Psst, I’m back…

Choosing the right method for cleaning text data in Excel can make the difference between pulling all-nighters and taking afternoons off to go to the beach.

 If you want to know how to do all the different methods of data cleaning in Excel – and get to know all the pros and cons – keep reading…

Data Cleaning Step 5: Clean Numerical Data

Ultimately, there are only 2 types of error you will find in numerical data:

  • Incorrect entries
  • Numbers contaminated with text

If your entry is incorrect (such as 3.14 incorrectly entered as 3.41), you won’t be able to identify that it is incorrect, and there is nothing you can do about it, apart from going back through all the data and checking every single entry in your entire dataset. Which is incredibly expensive.

On the other hand, when you have a numerical entry that is contaminated with text, these entries can be identified and corrected quite easily.

There are a few ways of doing this, including Excel’s built-in Filter function, Custom Filter and – my personal favourite – the ISNUMBER formula.

Each of these methods can help you identify when numbers are not formatted as numbers, such as when they have text items within their structure, like this:

Data Cleaning Step 5 – Clean Numerical Data

The following video (from our exclusive video course Dirty Data Dojo: Data Cleaning) will show you how to use Excel’s built-in Filter function, Custom Filter and – my personal favourite – the ISNUMBER formula:

Data Cleaning in the Dirty Data Dojo

Did you know that data analysts spend up to 20% of their time collecting data and a whopping 60% on data cleansing and preparation?

Well, it doesn’t have to be this way.

Over several years I created a method of cleaning data in Excel that will cut your data cleaning time from 2 weeks to around 2 hours – or even less!

This method has been honed over several years with hundreds of researchers that have worked on thousands of datasets, and – after many years of being pestered – I finally relented and created a series of video courses to teach you how to clean data in Excel (and in Python and R).

This series of courses is called the Dirty Data Dojo, and you can get the entire Data Cleaning course right here:

Dirty Data Dojo - Data Cleaning

UNIQUE VIDEO COURSE

In less than 2 hours

your data can be:

  • Clean
  • Fit-For-Purpose
  • Analysis Ready

Summary – Data Cleaning in Excel

Data cleaning in Excel (and in other media) is, unfortunately, a necessary evil – without it you can’t analyse your data.

It doesn’t have to be so difficult and time-consuming, though – you can actually cleanse data in Excel in a matter of just a couple of hours (or even minutes), if you know how.

Knowing how to clean data by simply following our 5 data cleaning steps means that you can clean data in Excel in a fraction of the time compared to manual data cleansing.

This post forms part of a series on getting your data fit-for-purpose and analysis-ready.

For more detail, choose from the options below:


Tags


Lee Baker

Follow me:

About the Author

Lee Baker is an award-winning software creator that lives behind a keyboard in a darkened room. Illuminated only by the light from his monitor, he aspires to finding the light switch.

With decades of experience in science, statistics and artificial intelligence, he has a passion for telling stories with data.

His mission is to help you discover your inner Data Ninja!

You may also like

45+ Awesome Gifts for Data Scientists, Statisticians and Other Geeks
Computational Statistics is the New Holy Grail – Experts
3 Crucial Tips for Data Processing and Analysis
Correlation Is Not Causation – Pirates Prove It!
Summary Statistics in Excel: The 4 Essential Steps for Fit-For-Purpose, Analysis-Ready Data
How to Clean Data in Excel – The 5 Data Cleaning Steps You Need to Know
Remember Me
Chi-Squared Innovations
Success message!
Warning message!
Error message!