If you want to have clean data, you need to understand the causes of dirty data.
Not only will that help you to prevent dirty data happening in the first place, but it will help you to build a robust data clean up strategy.
In this post you’re going to get an insight into the 6 common types of dirty data, and what you can do about each of them.
Finally, I’ll reveal to you the single cause of dirty data and break it down so you can learn how to fix it and get clean data.
This post forms part of a series on getting your data fit-for-purpose and analysis-ready.
For more detail, choose from the options below:
What Are Clean Data and Dirty Data?
Clean data are data that are accurate, complete and consistent, and there are 6 basic requirements. Clean data must be:
1. Correct
2. Complete
3. Consistent
4. Unique
5. Valid
6. Correctly Formatted
Conversely, dirty data are data that are inaccurate, incomplete or inconsistent, and there are 6 common types of dirty data. Dirty data are flawed in at least one of:
1. Incorrect
2. Incomplete
3. Inconsistent
4. Duplicate
5. Invalid
6. Incorrectly Formatted
Let’s take a deeper look into each of these flaws found in dirty data.
Dirty Data Flaw #1: Incorrect Data
Data are correct when they are sufficiently accurate for your study.
For example, if you are collecting data on Height of seedlings, you will likely need to measure them to within the nearest millimetre, but probably not more accurately than that. For trees you may only need to measure them to the nearest centimetre, depending on the requirements of your study.
Data are incorrect when they are not sufficiently accurate, such as when you measure the weight of new born babies to the nearest kilogram – all babies would then have a weight of 2, 3 or 4kg. How could you tell if any of the babies are under- or over-weight with such a crude scale?
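To see why a too-crude scale destroys information, here's a minimal Python sketch (the weights are hypothetical, purely for illustration):

```python
# Hypothetical birth weights in kilograms, recorded to two decimal places
weights_kg = [1.6, 2.4, 2.45, 3.12, 3.98]

# Recording to the nearest kilogram throws that precision away
rounded = [round(w) for w in weights_kg]

# A severely underweight 1.6kg baby and a mildly underweight 2.4kg baby
# both become simply "2" - the distinction is lost for good
```

Once the data have been recorded at the coarser precision, no amount of cleaning can recover what was thrown away.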
Similarly, data are incorrect when they are collected using different units. In 1628, the newly-built Swedish ship, the Vasa, left its Stockholm port for the first time and promptly sank less than a mile into her maiden voyage, killing 30 people. A recent investigation into the sinking of what was considered to be the most powerful warship in the world has discovered that the ship is thinner on the port side than the starboard side. Apparently, the workers on the starboard side used rulers that were calibrated in Swedish feet (12 inches), while workers on the port side used rulers calibrated in Amsterdam feet (11 inches).
Dirty Data Flaw #2: Incomplete Data
Data are complete when there are sufficient data for you to accurately reflect what you are measuring in your study.
That doesn’t necessarily mean that every cell in your spreadsheet must be complete (although that is desirable); rather, there must be enough data to answer your research hypothesis with sufficient confidence.
For example, if you have a dataset with 20,000 patients in it and your study only needs 200 patients for meaningful results, it is of little consequence that 10% of the datapoints are missing – simply extract the data for 300 patients rather than 200!
So how do you deal with missing data?
Ultimately, you have 4 choices:
1. Accept that there are empty cells scattered across your dataset (which can bias your dataset and anger reviewers)
2. Delete the entire record from your analysis (which can bias your dataset)
3. Find the values for the missing data (which can be expensive)
4. Estimate (in some way) the values for the missing data (which can be complex, expensive and could also bias your dataset)
Choose your poison!
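If you work in Python, the choices above map onto a few one-liners. Here's a minimal pandas sketch with a made-up patient table (the column names are my own, purely illustrative):

```python
import pandas as pd
import numpy as np

# Hypothetical patient data with some gaps
df = pd.DataFrame({"patient": [1, 2, 3, 4],
                   "weight_kg": [3.2, np.nan, 2.9, np.nan]})

# Choice 1: accept the empty cells - do nothing
# Choice 2: delete every record that has a missing value
complete_only = df.dropna()

# Choice 4: estimate the missing values, here with the column mean
imputed = df.fillna({"weight_kg": df["weight_kg"].mean()})
```

Note that mean imputation is only one (simple) estimation strategy, and it carries exactly the bias risk described in choice 4.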
Dirty Data Flaw #3: Inconsistent Data
Data are consistent when they make sense across your dataset.
So, for example, if you have the same data in different tables in a database, they must be the same. A change of name, for instance, might be updated in one table but missed in another.
Similarly, if you have collected dates of birth and dates of death of patients, the dates of death should be (a) later than their corresponding dates of birth, and (b) within approximately 100 years.
In other words, consistent data make sense in the real world, and not just in a spreadsheet.
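These real-world sanity checks are easy to automate. Here's a short Python sketch of the birth/death checks just described (the records are invented for illustration):

```python
from datetime import date

# Hypothetical patient records
records = [
    {"id": 1, "born": date(1950, 3, 14), "died": date(2020, 7, 1)},
    {"id": 2, "born": date(1990, 1, 5),  "died": date(1985, 6, 2)},   # death before birth
    {"id": 3, "born": date(1601, 9, 9),  "died": date(2003, 4, 18)},  # implausibly old
]

def inconsistent(rec, max_years=100):
    """Flag records whose dates cannot make sense in the real world."""
    span_days = (rec["died"] - rec["born"]).days
    return span_days < 0 or span_days > max_years * 365.25

flagged = [r["id"] for r in records if inconsistent(r)]
```

The check is deliberately approximate (100 years, 365.25 days per year) - the point is to flag records for inspection, not to judge them automatically.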
Dirty Data Flaw #4: Duplicate Data
Data are unique when there is only one entry per sample (or customer or patient, etc.).
For example, if a questionnaire is entered into a spreadsheet twice, it will no longer be unique. Such data will likely skew your analyses and could give you misleading results.
Duplicated data often occur when different databases are combined into a single, larger database without stringent checks being made beforehand. The data guy responsible for writing the code to amalgamate the datasets needs to account for all eventualities, which, by the way, is pretty much impossible – or at least too expensive to be practical (which is a fallacy – see the 1-10-100 rule further down the page…).
Dirty Data Flaw #5: Invalid Data
Data are valid when they are correct in the real world right now.
Data validation is the process of verifying the data that are collected before they can be used. These data must be accurate, current and in the correct format. Since the accuracy and format of the data are also included on this list, we are only concerned here with whether the data are current or out-of-date.
Data often change as circumstances change; for example, your current mobile phone number may be different from the one you had a year ago.
In other words, valid data become invalid when they go out of date.
This is an issue that needs to be dealt with in the data collection stage. When you’re in the process of data cleaning you are unlikely to be able to establish whether data are valid or not.
Dirty Data Flaw #6: Incorrectly Formatted Data
Unlike humans, computers and data analysis software cannot use experience to interpret the data they see, so data need to be formatted consistently.
For example, dates may be entered in a dataset as dd-mm-yyyy (UK standard) or alternatively as mm-dd-yyyy (US standard). The data can only be correctly formatted when (a) a standard has been agreed upon, and (b) that standard has been fully implemented.
Similarly, data entered with abbreviations, mixed case or even symbols can be considered as incorrectly formatted.
For example, there are an almost infinite number of ways in which a user can enter the term ‘Positive’ into a dataset: Positive, positive, POSITIVE, Pos, pos., +ve, + and so on.
A human being can see that these entries are (probably) all the same thing, but a computer or data analysis program has absolutely no chance!
I have lost soooo many hours of my life correcting entry errors exactly like these. It’s enough to make a grown man cry…
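The usual fix is to normalise every entry and map the known spellings onto one canonical label. A minimal Python sketch (the synonym table is an assumption - in practice you build it up as you discover new variants):

```python
# Hypothetical raw entries for a 'Positive' test result
raw = ["Positive", "positive", " POSITIVE ", "Pos", "pos.", "+ve"]

# Assumed synonym table: every cleaned-up spelling we've seen so far
SYNONYMS = {"positive", "pos", "pos.", "+ve"}

def canonicalise(entry):
    """Strip whitespace, lowercase, then map known variants to one label."""
    cleaned = entry.strip().lower()
    return "Positive" if cleaned in SYNONYMS else entry

cleaned = [canonicalise(e) for e in raw]
```

Any entry that isn't in the synonym table passes through untouched, so nothing is silently mangled - you just extend the table when a new variant turns up.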
This is the biggest area for data cleaning, and there are many ways in which data may be incorrectly formatted – inconsistent dates, abbreviations, mixed case and stray symbols among them.
You will find that most of the types of errors that we discussed above can be effectively dealt with in just 5 data cleaning steps!
*Psst – did you know that around 60% of a data analyst’s time is spent on data cleaning?
It doesn’t have to be this way, though – you can get clean data in a matter of minutes.
If only you knew how…
What Are Examples of Dirty Data?
Whether the errors are in the form of typos, transpositions or variations in spelling, errors will creep into your data.
Here are a few examples of dirty data (and their clean data equivalents):
Clean Data Example #1: Incorrect/Correct Data
Some incorrect data can be rectified quite easily. For example, when a text entry has a typo or spelling variation (UK/US), a spell checker should be able to pick this up quite easily – variations in spelling such as Colour/Color or Grey/Gray, for instance.
On the other hand, when an entry of “Small” should have been entered into a cell where “Medium” was entered unwittingly, this error may never be found. Both “Small” and “Medium” are legitimate entries for the variable “T-Shirt Size”, so it can be very difficult – if not impossible – to spot and correct them.
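A validity check against the allowed categories shows exactly why. Here's a hypothetical sketch in Python:

```python
# Allowed categories for the variable "T-Shirt Size" (illustrative)
VALID_SIZES = {"Small", "Medium", "Large"}

entries = ["Small", "Medium", "Midium", "Large"]

# A validity check catches spellings outside the allowed set...
invalid = [e for e in entries if e not in VALID_SIZES]

# ...but a "Medium" entered where "Small" was meant passes silently,
# because both are perfectly legitimate values
```

The misspelt "Midium" is trivially caught; the wrong-but-legal entry is invisible to any automated check.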
It can be similarly difficult to identify errors in numerical data, such as an age entered as 340 or -2.
Running some simple summary statistics will pick up many errors, so in the case of age, use the MIN and MAX functions (in Excel) to find the limits of your data. The age of most human beings is in the range [0,100], so negative numbers and ages larger than 100-ish will be easy to identify in a data clean up.
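The same MIN/MAX range check takes two lines in Python (the ages and the 110-year cut-off are my own illustrative choices):

```python
# Hypothetical ages from a dataset
ages = [34, 0, 67, -2, 41, 340, 58]

# The Python equivalents of Excel's MIN and MAX
lo, hi = min(ages), max(ages)

# Flag anything outside a plausible human age range
suspect = [a for a in ages if not 0 <= a <= 110]
```

A glance at the two limits (-2 and 340) is often enough to tell you the column needs attention before you even list the offenders.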
It can be fiendishly difficult, though, to detect the transposition of a numerical entry, such as 72.4 entered as 74.2 or 27.4.
Which – if any – is the correct one?
In these cases, the only way of finding them is to have a Double Entry system in place so that each entry is checked for correctness – which is very expensive.
Clean Data Example #2: Incomplete/Complete Data
If you have lots of empty cells in your dataset, you give reviewers an excuse to beat you about the head with a metaphorical stick (been there, done that…), but they are a fact of life in the world of data!
You can use Excel’s COUNT, COUNTA and COUNTBLANK functions to check how many entries you have for each variable (column).
If you get the same number for all variables, then you have a complete dataset and you’re ready to ride off into the sunset.
If there are a small number of empty cells, it’s not the end of the world (unless your boss says that it is – been there, done that…).
A large number of empty cells should ring alarm bells, though.
How many is too many? Dunno – it all depends on what you need to do with your data!
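If your data are in Python rather than Excel, the same completeness check is a one-liner per direction. A minimal pandas sketch with invented data:

```python
import pandas as pd
import numpy as np

# Hypothetical dataset with one gap in each column
df = pd.DataFrame({"age": [34, np.nan, 67, 41],
                   "sex": ["F", "M", None, "F"]})

# Number of non-empty entries per variable (column)
counts = df.count()

# Number of empty cells per variable
missing = df.isna().sum()
```

If every column reports the same count as the number of rows, your dataset is complete and you’re ready to ride off into the sunset.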
Clean Data Example #3: Inconsistent/Consistent Data
Quite often, inconsistencies can be found by doing a few simple checks and using summary statistics to help you out.
For example, when doing some analysis on a medical dataset I discovered that in our hospital we were treating the oldest person in the world – he was over 300 years old! Either that or his date of birth was recorded incorrectly (spoiler alert – it was). The database didn’t have any issue with his date of birth being somewhere in the 16th Century and didn’t raise a red flag (it should have). It only became apparent that the date of birth was incorrect when I took the difference between date of diagnosis and date of birth to determine his age at diagnosis.
I could only have discovered this if I had:
1. calculated the difference between dates, and
2. used MIN and MAX to find errors (negative numbers) and outliers (unlikely large numbers).
If you have extremely large numbers, it is likely that the date of birth was recorded incorrectly. When you have negative ages, that usually indicates that the date of birth and date of (some event) were entered the wrong way round.
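Here's a short Python sketch of that age-at-diagnosis calculation (the dates are made up to mirror the anecdote):

```python
from datetime import date

def age_at(event, born):
    """Whole years between date of birth and an event date."""
    years = event.year - born.year
    # Subtract one if the birthday hasn't happened yet that year
    if (event.month, event.day) < (born.month, born.day):
        years -= 1
    return years

# A 16th-century 'date of birth' slips through quietly
# until you actually do the arithmetic
age = age_at(date(1998, 5, 20), date(1598, 5, 19))
```

A negative result means the two dates were entered the wrong way round; an absurdly large one means the date of birth is wrong.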
*Psst, it’s me again – all these issues can be dealt with quickly and easily, and with a minimum of fuss.
If you want to know how, keep reading…
Clean Data Example #4: Duplicate/Unique Data
Whether you’re working with a database or a single spreadsheet, duplicate data is quite common.
When Debbie Smith of 13 Canoga Drive takes out home insurance, then takes out another home insurance policy a couple of years later when she’s moved to 73 Mulholland Way, the smart solution is to check whether this is the same person or a different one. Get the answer wrong and you might just have created duplicate data!
It’s even easier to enter duplicate data into a spreadsheet. Either 2 people enter the same data separately or the same person enters it twice.
Deleting (or fixing) duplicate data is one of the most common data clean up steps, particularly in businesses that have customer databases.
It’s quite simple to delete duplicate entries from a spreadsheet – Excel has a built-in Remove Duplicates feature that will do it for you. It’s much harder in a database, though – either the process is done manually (which is expensive) or you’ll need expert coding to help with automating it (also expensive)!
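In Python, pandas gives you the spreadsheet equivalent of Excel's Remove Duplicates in one call. A hypothetical sketch:

```python
import pandas as pd

# Hypothetical policy table with one record entered twice
df = pd.DataFrame({"name":   ["Debbie Smith", "Debbie Smith", "Raj Patel"],
                   "policy": ["H-1001",       "H-1001",       "H-2002"]})

# Drop rows that are exact copies of an earlier row
deduped = df.drop_duplicates()
```

Note this only catches exact duplicates – the harder problem of the same person appearing under two different addresses still needs human judgement or fuzzy matching.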
Clean Data Example #5: Invalid/Valid Data
The example above of Debbie Smith changing address is both an example of duplicate data and invalid data. She may well end up with more than a single profile in the dataset, but at least one piece of data is invalid. Her old address is now out-of-date, and her address in the dataset will only become valid when the new address is added.
Her old address and new address are both correct in the sense that both houses still exist, but only one of them is valid because she only lives in one of them at any given time.
Similarly, phone numbers used to have a text prefix denoting which exchange you were connected to, so you could have had a number that looked like WHItehall 1212.
Phone numbers like these don’t exist anymore, and all numbers were moved to be number-only and in a specific format, depending on which country they are in, such as the (fictitious) +44 20 7946 0123 or (202) 555-0123.
The point is that all your old phone numbers are examples of invalid data, whereas your current numbers are valid data.
Clean Data Example #6: Incorrectly/Correctly Formatted Data
Dates are classic examples of how data can be formatted incorrectly.
As an example, consider the following date in your dataset: 9/11/2001. Does this date correspond to the 11th of September or to the 9th of November? There is no way to tell. If you have a pair of researchers entering data into a dataset, from the US and the UK, you’d better be really sure that they are following precisely the same standards…
In fact, dates such as 9/11/2001, 11/9/2001, 2001-09-11 and 11 September 2001 can all be the same, and all correct, but may all be incorrectly formatted for your dataset.
You need really good data guys or coders to sort this mess out – and even then, you’ll still have to go back to the original source to check that they are correct!
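Python's standard library makes the ambiguity concrete – the same string parses to two different days depending on which standard you assume:

```python
from datetime import datetime

entry = "9/11/2001"

# US standard: month first -> 11 September
us = datetime.strptime(entry, "%m/%d/%Y")

# UK standard: day first -> 9 November
uk = datetime.strptime(entry, "%d/%m/%Y")

# Both parses succeed without complaint, two months apart
```

Neither parse raises an error, which is exactly why you have to go back to the source: the string alone cannot tell you which reading is correct.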
Data clean up is hard! I can’t stress how important it is to get your data collection and data entry standards and processes right in the first place!
What Are the Causes of Dirty Data?
Dirty data can arise in any part of your project; they result from poor project design, poor measurement – including poorly calibrated instruments – and poor data collection and data entry.
Ultimately, there is only one cause of dirty data – human intervention.
Whatever system you have for data collection and data entry, it is created and designed by humans, and all resulting errors occur because somebody did something wrong.
We can, though, categorise how dirty data ends up in a dataset:
How Do You Prevent Dirty Data?
With the best will in the world, human beings are notoriously fat-fingered, and when you enter data manually into a dataset, mistakes WILL happen – no matter how careful you are.
Human Error (Entry Error)
As we know, typos, transpositions and variations in spelling are all too common in datasets, as well as many, many other types of entry error.
So, clearly, the first solution to preventing dirty data is to switch away from manual data collection and manual data entry and towards more automated means.
Poor Data Strategy
When it comes to the automated methods of data collection and entry, though, you will still likely get errors. After all, the computer code that collects and enters the data was written by a human, and it is unlikely that every possible scenario has been thought of.
Who knew that data collection and data entry would be so hard to get right?!??
You can mitigate some of this by having a strong data strategy, such as ensuring that different systems have consistent data entry standards and procedures, and datasets are kept up-to-date with changing requirements.
Moreover, if your data are stored in more than one place, for example multiple databases across different departments, then communication across these data depositories becomes critical in maintaining clean data.
For example, if you have the same data in different tables in a database and a change occurs in one of them, the code governing the database must be written so that the corresponding data in the other tables update automatically too.
Data Cleaning in the Dirty Data Dojo
It takes most people weeks to clean their data before they can start their analyses.
Well, it doesn’t have to be this way.
Over several years I created a method of cleaning data in Excel that will cut your data cleaning time from 2 weeks to around 2 hours – or even less!
This method has been honed over several years with hundreds of researchers that have worked on thousands of datasets, and – after many years of being pestered – I finally relented and created a series of video courses to teach you how to clean data in Excel (and in Python and R).
This series of courses is called the Dirty Data Dojo, and you can get the entire Data Cleaning course right here:
Summary: Data Clean Up – From Dirty Data to Clean Data
Clean data don’t just happen.
Nor do dirty data.
Dirty data happens because human beings are flawed and cannot think of every possibility in advance. I mean, sure, over time you can gradually put in place procedures that mean that data can increasingly be gathered cleanly and dirty data can be cleaned automatically, accurately and robustly.
But that takes time, money and a huge commitment.
When that time, money and commitment are not invested, clean data are a virtual impossibility and dirty data are inevitable.
And then you have to face the consequences of a huge data clean up programme to transform dirty data into clean data.
Which will cost even more time and money.
The bottom line here is that the cheapest and most effective way to get clean data is to have world-class data collection procedures.
That doesn’t necessarily mean that you have to have almost infinite resources and a huge department dedicated to data.
It just means taking the time to think deeply about the data that you plan to collect and what you’re going to do with them.
Are there simple but effective procedures you can put in place that minimise errors getting into your data in the first place? Don’t forget – the most important part of data cleaning is data collection!
Think about it, plan it, then implement it.
I guarantee you’ll be in a much better place when it comes to your data clean up phase if you’ve taken the time to plan your data collection!