We live in an increasingly rich world of data – the amount of data that currently exists doubles every 18 months. That’s a phenomenal rate of growth and we’re just at the beginning of an incredible journey creating awesome intelligent applications that can handle these unimaginable amounts of data automatically.
This Big Data movement is happening at one end of the scale.
At the other, there are millions of researchers around the globe collecting and working with Small Data – data that is small enough to fit in an Excel spreadsheet and store on a floppy disc (remember those?).
It doesn’t matter whether you’re a scientist or an entrepreneur, in academia or in business, if you’re collecting data to try to answer some questions then you need to understand the fundamentals.
You’ll likely spend a lot of time observing, measuring, counting, classifying and quantifying what you see, and once you’ve collected your data you’re going to have to analyse it.
But let’s not get too far ahead of ourselves.
The brief video below will give you quick overview of this blog post, so you know what's coming.
Before you can get any answers you’re going to have to:
- Record & Store
- Clean & Classify
The textbooks tend not to dwell on the practical issues of these steps because, well, to be honest, it can get quite messy, but these are vitally important steps and you really do need to know how to do them properly if you’re going to get the most out of your data.
So let’s rewind to the beginning and see what we can do to get you off to a good start...
This blog post will deal with data collection methods and data management and will give you some great tips on how to get organised, how to get started and how to avoid making the kind of mistakes that will make you prematurely grey!
Over the next couple of weeks I will deal with other important issues such as data cleaning and data classification, but for the moment let’s focus on the most important – data collection. Mess this one up and there’s no point moving on to the others, so focus guys!
Here are 3 rules to start off with. Repeat after me:
- Don’t Panic
- Start thinking about the data before you start collecting it
- Make a personal vow to understand the basics of data
I've put together an infographic that might help you a little:
1. Record Data on Paper First…
So you’ve got your hypothesis (theory, idea or hunch). Once you’ve decided what data you need to collect, the first thing you should do is design a paper-based form to store all your data (assuming that at least some of your data is going to be recorded by hand).
Keep it simple, print it out, then manually record your data with pen and paper. One form per case/patient/customer/test-tube, etc..
2. …Then Transfer it to an Electronic Medium (Excel is most common)
We may be living in an electronic world, but ultimately you need a system where you (or anyone else) can follow the data trail from beginning to end and – more crucially – from end to beginning.
From time to time you WILL make a mistake with the data, so it is vitally important that you design a method that will let you spot and rectify the mistake by going back through all the steps until you find the error.
So now you have your data recorded on paper you need to transfer it into an electronic system. More than likely this will be either Microsoft Excel or Access.
In general, Excel is more common and easier to use, and has the added advantage that you can manipulate the data and do some simple analyses right there without having to export your data.
Most data is stored in Excel (in 7 years as a medical statistician I was only once given data in Access – all the other times it was in Excel), so we’ll go with that from here on in…
3. Enter Your Data on a Single Worksheet Whenever Possible
Trying to sort your data when it is spread across multiple worksheets can lead to all sorts of problems, so try to avoid it whenever you can – keep your data on a single worksheet.
Excel 2003 has limits of 65,536 rows by 256 columns. That’s large enough for most datasets, but if you need higher limits you can use Excel 2010 or 2013 (1,048,576 rows by 16,384 columns).
4. Use a Unique ID column
You’ll likely have to sort your data many times and by different columns, so you’re going to need a way of restoring the original order.
Use column A as a unique identifier to insert consecutive numbers starting from 1. It may be simple, but it’s very effective.
When you’ve put your Unique IDs into column A, go back to your original paper sheets and write the Unique ID there as well. Trust me – you’ll thank me for this tip later…
5. One Column per Variable
Each variable should have… oh, hold on a minute, what’s a variable?
Well, simply put, these are the things that can change or can be changed as part of your study. In short, these are all the pieces of information that you are observing, measuring, counting and collecting, like age, gender, distance, temperature, etc..
You can find more information in my blog on different data types.
Where were we? Ah yes…
Each variable should have its own column, and each variable should correspond to just one piece of information.
If you’re entering the age of a patient, then just enter their age, don’t enter their date of birth in the same column or cell. If you want to record their age and DOB, then use 2 separate columns.
If you’re recording a composite variable made up of 2 or more constituent parts, like Body Mass Index – made up of Height and Weight – then record them in separate columns. You can always combine them into a single variable later.
6. Row 1 is the Variable Name
Eventually you’ll need to analyse your data and you may need to export it to a statistical program.
The standard for pretty much all commercial stats programs is for the first row to be reserved for the name of the variable and all other rows for the data. So don’t be tempted to use rows 2, 3 and 4 as well as row 1 for the variable name. It might keep everything looking nice and tidy in Excel, but it will only create more work for you later.
7. Every Cell Should Have Something In It
What do empty cells tell you?
- waiting for more information?
- data not recorded?
- original data incorrect?
An empty cell is just a great big question mark and tells you nothing.
Worse still, incomplete datasets give reviewers a reason to whack you about the head with a metaphorical stick (and believe me they will – I’ve been there many times…).
So make sure that something is entered in every cell.
It is quite common to use ‘illegal’ numbers as codes to give you information, so where the entries for a variable can only be positive values (like age or height), we can use codes such as:
If negative numbers aren’t useful, then use letters a, b, c, etc..
If you’re not comfortable entering something in cells that strictly shouldn’t be there (after all, you are going to have to clean them up later before you can analyse your data), then use Excel’s Comment feature. I tend to use this sparingly, but that’s just me.
8. Keep Great Notes
When using codes you’ll need to keep notes to tell you what the codes mean. Keep the codes and notes in a different spreadsheet.
While we’re on the subject, it’s really important to
KEEP GREAT NOTES !!!
You’re likely not the only person that will ever work with this dataset, so get used to writing stuff down.
Explain what the project is all about, the question you’re trying to answer, why you’re collecting this data and how you’re going to get the answers you’re looking for. Explain how you measured things and under what conditions. If more than one person is collecting data, then explain who, what, where, when, why and how.
This will be the document that explains all the important stuff about your dataset, so write it down.
If there’s too much information to comfortably put into an Excel spreadsheet, then a Microsoft Word document will be just fine – and keep it in the same folder as the dataset.
9. Be Consistent
There’s nothing worse than getting a dataset that takes a fortnight to clean because data entry has not been consistent.
By that I mean make sure that if the entry for a variable should be ‘Positive’, then make it ‘Positive’ and not some other variation:
It’s hard enough correcting speeling missteakes and typos without also having to correct things that were deliberately entered differently.
Restrict the number of people that can enter data to cut down on this kind of issue, and make it clear what your data entry standards are.
10 Don’t Guess
Data should be entered as accurately as possible.
Don’t guess, approximate, round up or down.
Enter the value exactly as registered on paper.
If you need the data to be rounded up or down you can use Excel’s functions to achieve this, but if you’re doing calculations in your head, on paper or in a calculator you’ll make mistakes which can be difficult – if not impossible – to spot later.
11. Zero is a Real Number
Don’t enter the number Zero into a cell unless what has been measured, counted or calculated results in the answer Zero.
I’ve often received datasets with lots of zeros and when I asked, the zeros meant ‘I don’t have data for this’.
The problem is that if you want to calculate something, like the mean, then all the zeros will be used in the calculation and you will get an inaccurate answer – or one that is just plain wrong!
Well, this just about wraps it up for this blog post. I hope you found something valuable here.