Data preprocessing is one of the most important stages of data preparation, and involves using summary statistics to find and fix errors in your data.

In this post you’ll discover the answers to the questions ‘what is data profiling?’, ‘what is data preprocessing?’, and ‘what is data preparation?’. Crucially, you’ll learn that they are not the same thing!

You’ll also discover the answer to the question ‘what are summary statistics?’, and you’ll learn which summary statistics in Excel will help you understand your data so you can correct errors and inconsistencies in double-quick time!

Let’s start by taking a look at data profiling to see just what it is…


This post forms part of a series on getting your data fit-for-purpose and analysis-ready.

For more detail, choose from the options below:

## What is Data Profiling?

Data profiling is the process of interrogating your dataset to create useful summaries of it, which can then be used to discover anomalies and issues of data quality. Data profiling is commonly used during data cleaning (which is the process of applying the data summaries to clean the data and prepare them for analysis).

Typically, data are profiled in two ways:

- **Single-column profiling**, which gives you an understanding of the data within each column
- **Across-columns profiling**, which uncovers dependencies between columns

For example, running a single column data profile will allow you to identify and correct the following error:

However, the following error can only be identified by running an across-columns data profile:

Did you spot the error?

Both are legitimate dates, so on their own (within their own data columns) they appear perfectly fine, but the date of death precedes the date of birth, which cannot be correct. When the data were collected, they were probably recorded the wrong way round.
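An across-columns check like this is easy to automate. Here is a minimal sketch in Python (the record layout and column names are hypothetical): each date is valid on its own, but comparing the two columns within each row exposes the impossible pairing.

```python
from datetime import date

# Hypothetical records with Date of Birth and Date of Death columns.
records = [
    {"id": 1, "dob": date(1950, 3, 14), "dod": date(2019, 8, 2)},
    {"id": 2, "dob": date(2001, 6, 30), "dod": date(1998, 1, 5)},  # death before birth!
]

# Across-columns profile: within each row, death must not precede birth.
def invalid_date_rows(rows):
    return [r["id"] for r in rows if r["dod"] < r["dob"]]

print(invalid_date_rows(records))  # [2]
```

A single-column profile of either date column would pass both rows; only the row-wise comparison flags record 2.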

## What are Summary Statistics?

Summary statistics *are used to transform your data into information* that gives a quick and simple description of the values in your dataset, and are particularly useful for comparing your data before and after various data cleaning stages.

Such summary statistics might include the minimum, maximum, mean, median and counts of your data values.

Summary statistics are the tools that you use in data profiling, to get an understanding of the quality (or otherwise) of your data.

For example, when calculating summary statistics in Excel, you could use the **MIN formula** to find the smallest value in the Age column. Any values less than 0 can then be found and corrected very quickly and easily. By contrast, without summary statistics such values can be difficult to find and correct – especially with manual data cleaning.
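The same check can be sketched outside Excel. This minimal Python example (the Age values are made up) does what `=MIN(range)` does, then lists every impossible value so it can be corrected or removed:

```python
# Hypothetical Age column containing a data-entry error (-32).
ages = [34, 27, -32, 45, 61, 29]

# Equivalent of Excel's =MIN(range): the smallest value flags the problem at once.
print(min(ages))  # -32

# List every impossible (negative) age for correction or removal.
invalid_ages = [a for a in ages if a < 0]
print(invalid_ages)  # [-32]
```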


## What is Data Preprocessing (aka Data Preparation)?

Data preprocessing is a set of techniques for improving the quality of your dataset by cleaning and preparing your data prior to some other step, such as statistical analysis or modelling.

Data preprocessing involves transforming raw data into well-behaved data that are fit-for-purpose and analysis-ready.

**Hot Tip**

The term ‘data preprocessing’ is often used interchangeably with ‘data preparation’, but they are not quite the same.

Data preprocessing involves the preparation of data *prior to* some other step, while data preparation encompasses all preparatory steps.

It’s a subtle difference, but in practical terms it has no impact on the way you handle data.

In reality, data preprocessing is where you act upon the anomalies and issues of data quality found by your summary statistics and correct as many of them as is feasible.

For example, if using Excel’s MIN formula reveals an Age of -32, the next step is to correct the error: either replace it with the correct value or, if the correct value cannot be found, remove it from the dataset.

By correcting anomalies like these, you increase the quality of your dataset, and confidence in the results of your subsequent analyses will be higher.

## Data Profiling v Summary Statistics v Data Preprocessing

So, in summary, data profiling is where you check your data for errors and inconsistencies, summary statistics are (some of) the tools you use to find the errors, and data preprocessing (or data preparation) is where you fix the errors, thereby improving the quality of your dataset.

Now that we’re all on the same page in terms of the terminology (it wasn’t *that* difficult, was it?), let’s move on to the practical stuff – the summary statistics – which is where all the magic happens!

## Summary Statistics in Excel – 4 Data Preprocessing Steps

Let’s get one thing perfectly clear before we start. When you get a fresh, raw dataset, you don’t start with summary statistics. There will probably be too many issues in your dataset right off the bat, such as:

1. Unwanted spaces
2. Unwanted characters
3. Case standardisation issues (Case, case, cASE, etc.)
4. Typos and spelling mistakes (Large, Llarge, Larg, etc.)
5. Numbers contaminated with text (42ish, 39.3x, 39.3.3, etc.)

These issues need to be dealt with at an earlier stage of data cleaning.
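To give a flavour of that earlier stage, here is a minimal Python sketch (the category values are made up) that fixes the first and third issues above – unwanted spaces and case standardisation:

```python
# Hypothetical categorical values with stray spaces and inconsistent case.
raw = ["  Large", "large ", "LARGE", "Llarge"]

# Trim unwanted spaces, then standardise the case.
cleaned = [v.strip().capitalize() for v in raw]
print(cleaned)  # ['Large', 'Large', 'Large', 'Llarge']
```

Note that the typo 'Llarge' survives – spelling mistakes need a separate step, such as checking against a list of valid categories.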

Summary statistics, on the other hand, are used to find problems in your data that you will have to rectify before you can proceed with your analysis.

Here are the 4 data preprocessing steps you will need to follow using summary statistics:

1. Duplicate data
2. Invalid entries in text data
3. Invalid entries in numerical data
4. Outliers

We’re going to deal with these one by one…

### Data Preprocessing With Summary Statistics: Duplicate Data

Duplicate data are redundant data in the sense that you may have the same data for a single person repeated in your dataset, like this:

Why is duplicate data a problem?

Well, if you were to take the average (the mean) of either the Height or Weight column, the duplicate data would be included in the calculation, and that would bias your result, perhaps leading you to the wrong conclusion. Duplicate data bias your analyses.

So, you need to eliminate the duplicate rows, but before you do, you might like to inspect them first. You can do that with Excel’s **Duplicate Values** rule – you’ll find it under Conditional Formatting ▸ Highlight Cells Rules on the Home tab, like this:

This will highlight all the rows that are identical (bear in mind that they might not be next to each other – they could be thousands of rows apart), and then you can use Excel’s Remove Duplicates function to eliminate the duplicates (it will leave the uppermost instance intact, removing the rest of the duplicates). You will find it on the ribbon under the Data tab, like this:

Strictly speaking, removing duplicate rows of data isn’t part of summary statistics, but it really ought to be, and you need to do it anyway because it’s really important…
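The keep-the-uppermost-instance behaviour of Remove Duplicates can be sketched in a few lines of Python (the rows here are hypothetical):

```python
# Hypothetical dataset: the same person recorded twice.
rows = [
    ("Alice", 165, 62),   # (name, height, weight)
    ("Bob", 180, 81),
    ("Alice", 165, 62),   # duplicate - could be thousands of rows away in real data
]

# Mimic Excel's Remove Duplicates: keep the first instance of each row, drop the rest.
seen = set()
deduped = []
for row in rows:
    if row not in seen:
        seen.add(row)
        deduped.append(row)

print(deduped)  # [('Alice', 165, 62), ('Bob', 180, 81)]
```

Because a set is used for the lookup, this works in a single pass even when the duplicates are far apart.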

*Psst – if you want a pro tip, you can use Remove Duplicates with VLOOKUP to find and clean all sorts of problems in your data in a matter of seconds.*

If only you knew how…

### Data Preprocessing With Summary Statistics: Invalid Entries in Text Data

First of all, you need to understand what we mean by ‘valid data’:

**Data are valid when they are correct in the real world right now.**

In other words, at this stage you are looking for data that do not make sense, so you can either correct or eliminate them.

Here are the summary statistics you will need for your categorical columns of data:

And when you run each of these summary statistics on your categorical data, you’ll get something that looks like this:

Again, look at these numbers with a critical eye:

When the summary statistics give you results that you don’t expect, dive into the data and find out why – then replace them with the correct values or remove them.
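For categorical data, frequency counts are the workhorse summary statistic: rare or unexpected categories leap out. A minimal Python sketch (the column values are made up):

```python
from collections import Counter

# Hypothetical categorical column with invalid entries.
sizes = ["Small", "Medium", "Large", "Larg", "Medium", "medium"]

# Frequency count of each category - rare entries like 'Larg' stand out immediately.
counts = Counter(sizes)
print(counts)

# Check every entry against the list of valid categories.
valid = {"Small", "Medium", "Large"}
invalid = [v for v in sizes if v not in valid]
print(invalid)  # ['Larg', 'medium']
```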

*Psst, it’s me again – summary statistics are superb for finding errors in your data, but they are just part of a larger suite of data preprocessing steps that you need to take, and some can be quite tricky.*

We have you covered, though – we have every step you’ll ever need.

Keep reading to find out more…

### Data Preprocessing With Summary Statistics: Invalid Entries in Numerical Data

Now that you’ve dealt with categorical data, it’s time to move on to numerical data.

Let’s look at an example. In an Age column, a fat-fingered person might have clumsily typed the number 320 instead of 32. While 320 is a perfectly legitimate number, it is clearly not correct in the context of the age of a human being. Excel will happily accept it – and it is then your job to find and fix it!

So, how do you find invalid entries like these?

By eye?

Absolutely not!

To find invalid entries, summary statistics come to your rescue, and here are the *minimum* summary statistics you will need for your numerical columns of data (you might need more, like checking the difference between Dates of Birth and Dates of Death, etc.):

And when you run each of these summary statistics on your numerical data, you’ll get something that looks like this:

Look at each of these numbers with a critical eye:

These summary statistics will help you find invalid data so you can replace them with the correct values or remove them.
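Here is a minimal Python sketch of those numerical summary statistics (the Age values are hypothetical). Notice how the maximum exposes the fat-fingered 320 straight away, and the mean sitting far above the median is a second clue that something is wrong:

```python
from statistics import mean, median

# Hypothetical Age column with a fat-fingered 320 instead of 32.
ages = [25, 31, 47, 320, 52, 38]

summary = {
    "min": min(ages),
    "max": max(ages),                # 320 - clearly invalid for a human age
    "mean": round(mean(ages), 1),    # 85.5 - dragged upwards by the error
    "median": median(ages),          # 42.5 - far below the mean: a warning sign
}
print(summary)
```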

The following video (from our exclusive video course **Dirty Data Dojo: Data Validation**) will show you how to use summary statistics to identify invalid entries in your numerical data:

### Data Preprocessing With Summary Statistics: Outliers

Now that your numerical data are clean and analysis-ready, they might not yet be fit-for-purpose.

Why?

Because there might be outliers that will skew your results, meaning that you may get a biased result that doesn’t accurately represent the real world.

An outlier in numerical data is a data point that differs substantially from the rest of the data points – in other words, it is unusually large or small compared with the other values.

So, you will need to identify your outliers and then deal with them in one of three ways: keep them (and note their influence), correct them, or remove them.

What to do is very much the researcher’s call, and whatever you choose, you need to make sure you can justify it to the satisfaction of whoever is judging your research.

Identifying outliers in Excel can be tricky, and involves using the **QUARTILE formula**, but if you know how to do it you will get a list of outliers in seconds that you can assess and make qualified decisions on, like this:
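The QUARTILE approach boils down to the standard interquartile-range (IQR) rule: anything more than 1.5 × IQR beyond the quartiles is flagged. A minimal Python sketch (the Height values are made up; `method="inclusive"` matches Excel's QUARTILE.INC calculation):

```python
from statistics import quantiles

# Hypothetical Height column (cm) with one suspiciously large value.
heights = [150, 152, 155, 157, 160, 162, 165, 210]

# Quartiles, computed the same way as Excel's QUARTILE.INC.
q1, _, q3 = quantiles(heights, n=4, method="inclusive")
iqr = q3 - q1

# The 1.5 * IQR fences: values outside them are flagged as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [h for h in heights if h < lower or h > upper]
print(outliers)  # [210]
```

Whether 210 cm is an error or a genuinely tall person is exactly the judgement call described above – the statistic only flags it for your attention.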

*Psst, I’m back…*

An outlier analysis can be a bit long-winded if you’re coding it yourself in Excel. There is a faster way, though – and we know exactly how to do it!

If you want to know how to do **all** the different methods of data cleaning and preparation in Excel in minutes – and get to know all the pros and cons – keep reading…

## Data Preparation in the Dirty Data Dojo

Did you know that data analysts spend a whopping **60% of their time on data cleaning and data preparation**?

Well, it doesn’t have to be this way.

Over several years I created a method of cleaning and preprocessing data in Excel that will **cut your data preparation time from 2 weeks to around 2 hours** – or even less!

This method has been honed over several years with hundreds of researchers who have worked on thousands of datasets, and – after many years of being pestered – I finally relented and created a series of video courses to teach you how to clean and prepare data in Excel (and in Python and R).

This series of courses is called the **Dirty Data Dojo**, and you can get the **Data Validation** course right here:

### UNIQUE VIDEO COURSE

In **minutes** your data can be:

## Summary: Data Preprocessing With Summary Statistics

Data preprocessing is one of the most important stages of data preparation, and involves using summary statistics to find and fix errors in your data.

Taking your dataset from ‘dirty’ to ‘clean’ involves far more than just spotting a few errors by eye and correcting them.

There are 6 types of dirty data, and you need a strategy to deal with each of them – in the correct order – and in this post we’ve looked at how to use summary statistics to identify where flaws are in your data.

You’ve discovered the answers to the questions ‘what is data profiling?’, ‘what is data preprocessing?’, and ‘what are summary statistics?’, and you’ve also learned which summary statistics in Excel will help you understand your data to correct errors and inconsistencies.

I hope this will help you get your data clean, analysis-ready and fit-for-purpose in double quick time!
