Let’s face it, we’ve all been there. We’ve had a dataset that needed cleaning before we could start analysing it, so we started eyeing the data and fixing errors as we spot them. We convinced ourselves that it wouldn’t take long before we could get it done, then we can get on to the real stuff. A few hours later, when our eyes were bleeding, we finally decided to get organised and treat the data with a bit more respect.
OK, so what does ‘get your data organised’ really mean?
In this blog post I’m going to give you 7 steps to get you going in your quest to getting your data more organised. They’re not the only steps you could use, but I reckon they’re the most important and will get you most of the way there.
Disclosure: this blog post may contain affiliate links. As an Amazon Associate we earn from qualifying purchases.
OK, first things first – most people still use Microsoft Excel to store, clean and analyse their data, so this is our starting point.
When collecting data, most people enter them in a single Excel worksheet, then clean them and process them right there in the same worksheet. This really isn’t a very good idea.
If you select a single column of data and do a find & replace operation, will Excel do what you ask and replace items only in that column, or will it operate across the entire worksheet? Are you sure? Really, really sure? You need to be, because if Excel does something different to what you expect you’re likely to enter errors across your entire dataset. Worse still, some of Excel’s operations don’t have an ‘undo’, so correcting the errors that you’ve just introduced can quickly lead you down a rabbit hole.
No, the way to handle data is to be organised and use multiple worksheets to introduce a chronological flow to your data processing and cleaning, and this is what we’re going to learn here.
1. Collect and Store Your Data
When you collect your data, you should store them in a single Excel worksheet. These data are a copy of the paper-based forms that you used in data collection (if you collected data on paper). If you spot any errors as you’re entering the data, DO NOT CORRECT THEM! Enter the data exactly as you found them on the paper forms. Make a note of the errors, you’ll correct them later, but this first worksheet should be an identical copy of the paper-based forms. When your data have been entered this sheet should be named ‘Raw Data’, and should then remain untouched. Forever!
Of course, these data will likely contain errors, typos and other issues that will need your attention, but this sheet is your starting point. From here you’re going to use other worksheets to get organised and build a flow into your processes.
2. Make a Copy – Clean Your Data
From here you make a copy of ‘Raw Data’, and name it ‘In Progress’, where you’re going to clean your data. So now you have two worksheets that are identical, and you can safely do find/replace operations without the risk of entering errors, can’t you?
Au contraire! You NEVER do data cleaning operations in the worksheet where your data are stored. Instead, you extract the data you want to clean (i.e. a single column), clean and process it in a separate worksheet, then once you’re sure that these data are perfectly clean, you copy it back into your In Progress worksheet.
This procedure is called Extract-Transform-Load, or ETL for short, and ensures that you never introduce new errors into your data.
You can save various versions of this worksheet at various points if you wish (this is a good thing to do), and when you’re happy that your data are perfectly clean, you can rename your final ‘In Progress’ worksheet to ‘Clean Data’.
3. Make a Copy – Do Your Data Calculations
Now you have two (or more) worksheets, containing your data at different stages of preparation – one with the original (dirty) data, perhaps some with dirty data at various stages of cleaning and a final one with clean data. As you move to the next phase, you create a copy of the Clean Data worksheet and name it ‘Calculated Data’.
Some data are collected, like date of birth, while others need to be calculated, such as age – typically the difference between dates. Here is where you create, calculate and add in these new data.
Do you see why you clean the data before doing calculations on them? If you do it the other way round, you’re making calculations on data that contain errors. Once you’ve cleaned the errors you’re going to have to do the calculations all over again.
4. Understand Your Data – Descriptive Statistics
So now you have at least three worksheets, and I hope by now you’re starting to understand how you’re introducing a chronological flow to your data cleaning processes. When you discover an error – and you will – you can go back through your worksheets to see when, where and how the errors were introduced, and this gives you an opportunity to improve your data collection, cleaning and processing practices so that you’ll have fewer errors in future.
Talking about errors, computing descriptive statistics on each column of data is where you start to get a real understanding of your data and find the less obvious errors that you didn’t know existed.
The descriptive stats you’ll need to compute are different for categorical data and numerical data, but these are the most useful measures:
For Numerical Columns:
- all the entries
- all the positive entries
- all the negative entries
- all the entries that are zero
- all the empty cells
- The minimum value
- The maximum value
For Categorical Columns:
- all the entries
- all the entries in each category
- all the empty cells
Checking through these results will highlight more errors in your data. For example, are there ages that are negative or zero? There shouldn’t be any. Do the number of entries tally correctly? Are there any missing data points? Are the minimum and maximum values sensible?
If you’re using Excel, learn how to use the formulae COUNT, MIN, MAX, and AVERAGE for your numerical entries. For categorical entries, COUNTIF can tell you how many entries of each category you have in your variable. For empty cells, COUNTBLANK is a very useful formula to use.
Whenever you discover an error, trace it back through the worksheets until you reach the source of the error. I guarantee you’ll feel a flush of satisfaction when you realise that your new-found data organisational skills have just saved you!
5. Keep a ‘Codes’ Sheet
Some statistics programs don’t accept text data, so sometimes it can be useful to store categorical data as integers [1, 2, 3] rather than [Small, Medium, Large]. So if you’ve coded your categories as integers, will you remember what they signify? You shouldn’t need to remember, and this is why we keep a codes sheet - after all, they're not a secret, are they?
You need to remember that you may not be the only person that uses these data. You want those that come after you to say nice things about you, not moan about how disorganised you were, so keeping a note of the codes not only helps you, but it also helps others.
And don’t forget to keep a note of the measurement units in this sheet too!
6. Keep a ‘Notes’ Sheet
During your project you will make lots of decisions, and at some point you will need to describe and explain them to someone else. Keeping a notes sheet allows you to document everything about your study.
There’s nothing wrong with writing notes in your lab book, but if someone else needs access to your data you’re unlikely to hand this over to someone else or make a photocopy of it, so I recommend that you keep notes in a separate worksheet – that way your notes will remain with your data. If, at some point later, you need to hand your dataset over to a statistician for analysis, you’ll be glad you kept a notes sheet – it will save you hours of explanations!
7. Karma Banking
If your data were extracted from a departmental database, it is good practice to report all errors back to the original source. You’re not just being good to others, though – you’re being good to yourself.
It may be that at some point in the future you need to extract the same or similar data again. Only this time you won’t have to waste time cleaning errors you’ve already cleaned before.
Identifying errors and cleaning data might not be your idea of a good time (trust me, it’s not mine either), but if you get yourself organised and learn a few simple, repeatable processes, your time spent cleaning and preparing your data for analysis can be shorter – and less painful – than you expect.
Learning – even before you’ve begun collecting data – how to set up your Excel workbook to utilise multiple worksheets and introduce ‘flow’ into your data will help you reach the story of your data much quicker, and that usually goes down well with the boss!