Top 3 Tips for Data Classification

This blog post is all about data coding, and will give you our top 3 tips on how best to code your data.

It is the third blog post in the series, following hot on the heels of 11 Essential Tips for Effective Data Collection and 5 Data Cleaning Techniques You Need To Learn.

Here we deal with coding data, also known as data classification, and I’m going to give you my Top 3 Tips.

More...

OK, so we learnt how to collect, record and store our data (in Excel) and we set up a way of working that allows us to follow the trail of data from the beginning to the end and, more importantly, from end to beginning so that we can identify and rectify mistakes.

We learnt how to clean our data, use the functions and formulae built into Excel to ward off ‘bleeding eye syndrome’ and report any errors back to the original source to save time, effort and money later on.

So you now have a perfectly clean dataset, but you still have some work to do before you start analysing it. You need to learn how to code and classify your data, and that is what we’ll learn here.

Quick Navigation

1. Data Coding - Keep a Code Sheet

2. Data Coding - Identify Your Data Types

Data Cleaning Bootcamp for Analysts and Researchers

3. Data Coding - Check That Your Data are Sensible

Bonus Tip: DataKleenr – Automated Data Cleaning & Classification

1. Data Coding - Keep a Code Sheet

OK, so what is data coding or data classification?

Well, simply put, it’s the process of organising your data into categories for its most effective and efficient use.

But you already knew that, didn’t you?

You’ve already classified your data at the collection and cleaning stages. You’ve probably used data codes to represent categories, like 1 for male and 2 for female, but what’s important right now is that you note what your data means.

After all, your data codes aren't a secret are they?

If you go away on holiday for a month and come back to your dataset, will you still understand it? What if you hand it to someone else to work with. Will they understand it?

To make sure that the dataset can be easily understandable by anyone at any time, we need a code sheet.

Say you’ve entered the data for a variable as 1, 2 or 3. What does that mean?

Small, Medium or Large?
Pig, Sheep or Goat?

It matters because you shouldn’t be expected to remember all the details of how and why you coded your data that way.

Keep your codes in a separate worksheet and name it ‘Codes’ or 'Data Coding'. For each column make a note of what you've done to code your data and what your codes really mean.

If you’ve used additional data codes using ‘illegal’ entries such as negative numbers or letters, make a note of what they mean too, and you’ll have a code sheet that might look something like this:

When you come back to the dataset after a couple of weeks away from it, you’ll be glad you got organised like this. You’ll also make your boss, colleagues and local friendly statistician happy too, and that’s never a bad thing…

2. Data Coding - Identify Your Data Types

When you get to the analysis stage you’ll need to know your data types – Ratio, Interval, Ordinal and Nominal – so take a little time to decide which of these are appropriate for each variable, and note this down in your data coding sheet.

When you have a variable that has more than 2 categories, check whether there is some kind of order or progression to the data. If the categories are descriptive (Nominal), like ‘Pig’, ‘Sheep’ or ‘Goat’, rather than ordered (Ordinal), like ‘Small’, ‘Medium’ or ‘Large’, you might need to create a new variable for each category, like this:

It may seem like you’re just creating more work for yourself, but it’ll save you time later on – analysis of variables with only 2 categories is much easier than with 3 or more categories because the question of whether there is a natural order to the data becomes irrelevant if your data has only 2 categories.

UNIQUE VIDEO COURSE

In less than 2 hours

your data can be:

Clean
Fit-For-Purpose
Analysis Ready

READY TO GET STARTED?Try it free - no obligation

3. Data Coding - Check That Your Data are Sensible

Just because you’ve got a perfectly clean, classified, codified and organised dataset, it doesn’t mean that the data are correct.

Real life follows rules, and your data must too.

I once discovered that we had the oldest man in the world currently being treated in the hospital. At well over 300 years old it could be said that he’d had ‘a good innings’.

In the dataset I was analysing, the difference between his date of birth (somewhere in the 18th century) and date of hospital admission (21st century) meant that he was very old indeed.

Or perhaps his DOB wasn’t quite right…

"Real life follows rules, and your data must too" @eelrekab #gooddata #datascience #dataanalysis

Click to Tweet

The error in his DOB couldn’t be detected by standard error-checking in Excel because it was a perfectly legitimate date.

Sometimes, putting together 2 or more pieces of data can reveal errors that otherwise can be difficult to find, so it is sensible to do a few simple calculations on each variable to check that the data conform to sensible rules, such as:

Calculate the minimum, maximum and mean
Keep a count for each variable and each category
Check differences between dates

Making these checks (in a separate worksheet!) lets you find outliers, such as people who have a negative age or are several hundred years old, and gives you a good feel for your data.

Data Coding - Check That Your Data are Sensible

Something doesn’t feel right about the answers?

Then dive back in and take a look.

Don't be afraid of getting your hands dirty!

Bonus Tip: DataKleenr – Automated Data Cleaning & Classification

While it’s great to get your hands dirty and learn the basics of data cleaning and classification there’s really no substitute for cleaning and classifying your data quickly, accurately and automatically, so let us do the heavy lifting for you.

We’ve created DataKleenr, an automated data cleaning tool that cleans and classifies your data for you effortlessly, saving you time and money. Check it out – it might not be the worst thing that you do today.