Cleaning numerical data can take a ridiculous amount of time – if you’re cleaning it manually, that is.
Learning to use a few simple formulae in Excel can save you time, hassle and having to explain to your boss why that analysis report isn’t on his desk yet, a mere 2 weeks after he gave you the dataset.
Bless him, he doesn’t understand. He thinks that cleaning and analysing data is a matter of clicking a few buttons here and there in Excel. Even a monkey can do that, right?
In this blog post I’m going to show you which formulae to use and give you a plan of action to clean your numerical data.
Your data won’t know what’s hit it – and your boss will wonder just how you managed to get that report on his desk early while everyone else is still struggling…
On its own, data is useless.
You can look at it for hour after hour, day after day, and it will show you few of its secrets.
On the other hand, by using a few simple Excel functions and formulae you can learn an enormous amount about your data, and that’s without doing any fancy shmancy stats.
Descriptive statistics can give you a really good ‘feel’ for your data, but they can also show you where you might find some problems, errors, typos and all sorts of crap in your data – which you’ll need to clean up before you can do your ‘real’ stats.
In this blog post we’re going to take a look at just 2 Excel formulae – COUNT and COUNTIF – to show what they can tell us about numerical data.
We’ll see how accompanying these with other equally simple formulae – SUM, COUNTBLANK, ISBLANK, AND, ISTEXT and IF-THEN-ELSE – can really help get your numerical data clean in double quick time. And get a good sense of whether your data are sensible and fit-for-purpose too.
There's also a sister to this blog post for when you need to learn how to clean text data too.
The Data – and its Problems…
Let’s take a sample of Age data and see what obvious problems we can spot. As always I’ve included a UniqueID column (consecutive integers starting from 1), and we’ll use this column to help us interrogate the data.
The below left table shows what our data looks like. By eye, we can see that there are a number of issues with it (below right), including missing data, zeros, negative numbers and text entries (all highlighted with coloured cells).
There might be other kinds of problems we might encounter in these data if we scroll down, but we don’t want to do that – there might be tens of thousands of rows and I tend to get rather upset when my eyes burst into flames.
What we need is to do some simple descriptive statistics on these data so that we can see – at a glance – what we’re dealing with. In other words, we’re going to semi-automate our data cleaning. It’ll take a bit of work, but it’ll be worth it and I get to keep my eyeballs cool.
Using COUNT and COUNTIF to Count All The Numerical Entries
In Excel, data falls into 3 categories – numbers, text and empty cells. You can get mixed data in cells (numbers and text), but they’re just categorised as text.
We can distinguish between positive numbers, negative numbers and zeros by using just one formula – COUNTIF.
The form of COUNTIF is like this:
To count all the positive values in our Age data we type:
This means ‘look in column B and return a count of all the cells that contain a number greater than zero’.
Similarly, we can count all the negatives by using “<0” instead of “>0”, and count all the zeros by typing the number 0 in the criterion part of the formula (note: you don’t need to use quotation marks when using a number rather than a range).
Here are the results of running COUNTIF formulae on the numerical entries in our data:
As you can see, we have 49 positive numbers, 9 negative numbers and 4 zeros. We can calculate a subtotal of numerical entries in the data by summing, 49 + 9 + 4 = 62.
I always like to double-check the results by finding alternative ways to arrive at the same result, so I’ve used COUNT(B:B) to give me a count of all the numbers in column B, also 62.
It looks like this isn’t going to be a large dataset after all, but never mind, it’s the journey that counts, not the destination.
So clearly there are some errors in our data – Age cannot be zero or negative, so something has gone wrong with our data collection or data entry.
Twas ever thus…
Using COUNTIF and SUM to Count All The Text Entries
So onto the text data.
We know that we used codes in our data collection to tell us things about our empty cells, such as:
- a = data not collected
- b = waiting for lab
- c = data incorrect
- d = something else
There could be other text entries too, but let’s wait and see.
Using COUNTIF to count our coded data is precisely the same as using it for numerical data, we simply substitute letters in the place of the comparison within COUNTIF, like this:
This means ‘look in column B and return a count of all the cells that contain the letter ‘a’. Note that it is not case sensitive.
As you can see, we have 1 cell containing the code ‘a’, and 2 cells for each of codes b, c and d.
Summing these using SUM, we have 7 cells that contain coded text data.
We can also use wild-card entries in COUNTIF, so we can tell COUNTIF to return a count of all the cells in column B that contain text, like this:
=COUNTIF(B:B, “*”) – 1
Since column B has a header row, we need to reduce the count by 1, so COUNTIF tells us that there are 7 text entries in our Age column.
This is precisely what we got when we asked for a count of all our codes, so we know that there are no other text entries in column B.
Using COUNT and COUNTBLANK to Count All The Empty Cells
Now that we know how many number cells and how many text cells we have in our data column, what remains must be empty cells (and/or cells populated only with spaces, but that’s a subject best left for a whole other blog post!).
Firstly, we need to figure out how many entries we have in our Age column, and for that we use our UniqueID column (I bet you’d forgotten we were going to use that…).
That’s simple enough, just use COUNT on column A, and we find there are a total of 72 entries in our dataset:
Next, we subtract from that the number of numerical and text entries (62 + 7 = 69):
- 72 – 69 = 3
So there are 3 empty cells in our Age column.
Of course, we need to check this, so we use COUNTBLANK on column B:
This time we can’t select the entire column (B:B) because Excel will count all the blank cells down to the bottom of the spreadsheet – over 1 million of them – so we have to select the correct range.
Applied correctly, COUNTBLANK tells us that there are 3 blank cells in our data, which is the result we expected.
And finally, a confirmation that we have successfully accounted for every cell in column B:
Of course, all we have done so far is find out what we have in our Age data. We still don’t know where to find each of the cells that need attention.
For this we’re going to use the IF-THEN-ELSE construct.
If you’re not familiar with it, it might sound complicated, but it’s not so scary. Honest. Why are you running – come back!
Diagrammatically, IF-THEN-ELSE looks like this:
The basic syntax structure of IF-THEN-ELSE looks like this:
IF (condition is true)
THEN (do something)
ELSE (do something else)
A real world example of this might be:
IF (Cloudy = TRUE)
THEN (Get Umbrella)
ELSE (Wear Sunglasses)
In Excel, only the IF part of the statement is typed, while the THEN and ELSE parts are separated by commas, so the example above, written in Excel syntax, would look like this:
=IF(B2=“Cloudy”, “Get Umbrella”, “Wear Sunglasses”)
OK, got it?
Simple, really, isn’t it? Well, it can get more complicated, with nested IF statements and multiple conditions, but we’re not going there in this blog post – we’ll leave that for another day.
Using IF-THEN-ELSE to Locate Negative Numbers
Right. Back to our data.
We need to find out which cells contain negative numbers, and for this we’re going to use IF-THEN-ELSE. First we need to identify the negative cells, then tag them in a way that makes it easy for us to detect them either by eye or by search.
In cell C2, we type the following:
=IF(B2<0, “NEGATIVE”, “”)
This means ‘IF the value in cell B2 is less than zero, THEN return the text NEGATIVE, ELSE leave the cell empty’.
This formula is copied to the bottom of the data to populate the entire column. The result will look like this:
As you can see, most cells remain blank, except those adjacent to negative cells. This is exactly what we’re looking for. It is now very simple to scroll down and see by eye which cells contain negative values, and it is equally simple to do a search for cells containing the text NEGATIVE.
Using IF-THEN-ELSE to Locate Empty Cells
Similarly, we can identify empty cells:
=IF(ISBLANK(B2), “EMPTY”, “”)
This means ‘IF cell B2 is blank, THEN return the text EMPTY, ELSE leave the cell blank’, and the result looks like this:
Using IF-THEN-ELSE to Locate Zeros
Identifying zeros is a little more complicated. To locate cells with zeros in them we can use =IF(B2=0, “ZERO”, “”), but Excel will return a zero when the reference cell is blank.
Oops. How silly of Microsoft. Did no one ever tell them that zero is a real number and an empty cell contains no information?
Anyway, to get around this we need a double-barreled condition:
=IF(AND(ISBLANK(B2)=FALSE, B2=0), “ZERO”, “”)
This reads as ‘IF cell B2 is not blank AND cell B2 is zero, THEN return the text ZERO, ELSE leave the cell empty’. The result will look like this:
Using IF-THEN-ELSE to Locate Text Cells
Locating text-coded cells and other text items in our data is pretty simple. Using the ISTEXT formula, in cell C2 we type:
=IF(ISTEXT(B2), B2, “”)
This means ‘IF cell B2 contains text, THEN return the contents of cell B2, ELSE leave the cell empty’. The result will look like this:
Putting together everything we’ve just learnt, our Excel worksheet will now look like this:
All the positive data are ignored, while all the negative cells, empty cells, cells containing zeros and cells containing text have been tagged. We can now very quickly focus on any cell that contains an entry that is incorrect.
As always, we should check that what we have found is correct and that nothing has been omitted from our analysis, so we can use COUNTIF once again on our metadata (our ‘new’ data columns):
All our numerical entries have been verified, the text entries have been confirmed and the empty cells have been counted. Everything is present and correct!
Use Filter to Isolate Incorrect Cells
Now that we have everything bagged and tagged, it’s a simple matter to filter our data to gather together all the incorrect cells so we can give them some much-needed TLC.
On the right of the Home tab, click the Sort & Filter button, then click Filter and deselect the blank cells in the Negative Cells column, like this:
This hides the rows containing blank cells and shows only the rows containing the text NEGATIVE, corresponding to all the negative values in our Age data, like this:
From here we can identify the UniqueID values of these data, look them up in our original data and see where any errors might have crept in.
The same process can be followed to follow up on empty cells, cells with zeros in them and cells containing text.
We can then make decisions as to whether time and effort should be spent to clean the data, re-capture the information (which might involve going back into the archives or re-interviewing – a process that could be quite expensive) or exclude it from analysis.
Cleaning data is a messy business. It’s not sexy and it can sometimes make you want to lose the will to live, but by learning a few simple Excel formulae and having a standard procedure to follow, even the most unruly of data will soon roll over and ask you to tickle its tummy.
Excel is a powerful program and gives you enough functions and formulae to mean that data cleaning doesn’t have to be manual, difficult or lengthy.
I still doubt that you’ll enjoy it though…