December 19

What is Data Integrity, and Why Does it Matter?

Without proper procedures, sharing your data can put your data integrity at risk. Just 3 vital steps will save your dataset, though - and we reveal them here...

Disclosure: This post contains affiliate links. This means that if you click one of the links and make a purchase we may receive a small commission at no extra cost to you. As an Amazon Associate we may earn an affiliate commission for purchases you make when using the links in this page.

You can find further details in our TCs

I’m sure you’ve all heard about the party game Chinese Whispers, where a message is given to a person at one end of a line and is then whispered to the next, then the next, and so on until it reaches its eventual destination at the other end. Typically, small changes in the message occur at each stopping point until the end message is something that bears little or no resemblance to what it started life as.

World War One is said to have given us a staggering real-world example of what can happen when a message suffers from Chinese Whispers. As the (possibly apocryphal) story goes, a message sent from the trenches to British headquarters started as:

Send reinforcements, we’re going to advance

By the time the message had reached HQ it had become:

Send three and fourpence, we're going to a dance

We can laugh about it now, but I can’t help but wonder just how many lives could have been saved if the message had reached HQ unmolested.

Data Integrity - Don't Let Chinese Whispers Kill Your Data

So what has this got to do with data integrity, I hear you ask. Well, if you work with shared datasets, your data can suffer the same fate as the British message in the trenches. Over time, as your dataset is passed around, small changes and errors introduced – accidentally or otherwise – can kill the accuracy of your data, until what started out as a perfectly reasonable dataset is no longer fit for purpose.

In this blog post, we’re going to take a look at data integrity in the context of shared data to see if we can introduce procedures that will guard against Chinese Whispers. I’ve also interviewed a few experts in the field to get their take on things.

What is Data Integrity?

The principle of data integrity is that data should be recorded exactly as intended and, when later retrieved, should be the same as when it was recorded.

Spot The Difference

To do this, any data handling procedures we put in place must ensure the accuracy, reliability and consistency of data over its entire life cycle.
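
One common way to check that a file is still the same as when it was recorded is to store a checksum alongside it and compare later. The following is a minimal sketch of that idea, not a procedure prescribed in this post; the function names `sha256_of_file` and `is_unchanged` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large datasets need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path: Path, recorded_digest: str) -> bool:
    """True if the file still matches the digest taken when it was recorded."""
    return sha256_of_file(path) == recorded_digest
```

Record the digest when the dataset is first saved, keep it somewhere separate from the file, and call `is_unchanged` whenever the file comes back from a colleague or a transfer.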

ALCOA - Data Integrity Guardian

In an attempt to ensure data integrity, the FDA uses the acronym ALCOA to define its standards, under which data is expected to be:

  • Attributable – Data should clearly demonstrate who observed and recorded it, when it was observed and recorded, and who or what it is about
  • Legible – Data should be easy to understand and recorded permanently, and original entries should be preserved
  • Contemporaneous – Data should be recorded at the same time as it was observed
  • Original – Source data should be accessible and preserved in its original form
  • Accurate – Data should be free from errors

Why is Data Integrity Important?

Central to data integrity is the principle that there should be one definitive, authoritative dataset that serves as a single source of truth.

Data Integrity Best Practices

Here, we need to think SMaRRT, where having a single, well-controlled and well-defined data integrity system increases:

  • Stability – all data integrity operations are performed in one centralised system, ensuring consistency and repeatability of operations
  • Maintainability – one centralised system makes all data integrity administration simpler, more streamlined and less costly
  • Reusability – all applications benefit from a single centralised data integrity system
  • Recoverability – a single, centralised data source can be backed up regularly, ensuring easy recovery in case of data corruption or accidental loss
  • Traceability – every data point should be traceable back to its origin

What is Data Version Control?

This last point – traceability – is particularly important in terms of data integrity. When we think of a database or dataset, we typically think in terms of something that is static and unchanging, but it isn’t. When we first collect data, perhaps on paper, there will be errors in those data. It’s unavoidable – we’re only human, after all.

Principles of data integrity state that we need to maintain those data in the original state. And then we clean them. This means changing the data from the original state to something else. So how can we keep a dataset in its original state, change it, and still have access to the original?

The answer is that we need to keep multiple copies of the same dataset in various stages of handling (cleaning, preparation, etc.). This is called data version control, and allows us to keep a chronological record of everything that has been done to the dataset right from the original to the current data. This also has the benefit of being a natural back-up system, although you will also need a separate back-up policy.

Backup Everything

How is Data Version Control Implemented?

“Data version control is extremely important,” says Dr Deans Buchanan, Palliative Medicine Consultant and Clinical Lead at NHS Tayside. “You need to know what’s updated and when”. He recommends naming files with date prefixes in this format:

year.month.day – name of file


Examples might be:

2018.08.28 – Version Control Your Files.docx

2018.08.31 – Cancer Dataset.xlsx
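
If you would rather not type the prefix by hand, the naming scheme above can be generated automatically. This is a small sketch under that assumption; the helper name `versioned_name` is hypothetical.

```python
from datetime import date


def versioned_name(file_name: str, on: date) -> str:
    """Prefix a file name with the date in year.month.day form."""
    return f"{on:%Y.%m.%d} – {file_name}"


print(versioned_name("Cancer Dataset.xlsx", date(2018, 8, 31)))
# → 2018.08.31 – Cancer Dataset.xlsx
```

Because the year, month and day are zero-padded and ordered from largest unit to smallest, an ordinary alphabetical sort of the file names is also a chronological sort.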


Dr Steven Wall, Director of SJW Bio-Consulting Limited, echoes these sentiments, maintaining that you should “version control every data change, and with each change highlight what was changed and who performed the up-revision”. He also insists that “full transparency and openness is key to building trust with all partners”, including reporting all decisions and operations “whether bad or good”.

A Lie Hurts Forever

Following all this advice, every time you create a new version of a file you should immediately make at least one backup copy. In this way, every file has both a history and a backup, and they are all listed in chronological order.
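
The backup step can be sketched in a few lines. This is an illustration only: the function name `back_up` and the idea of a dedicated backup folder are assumptions, and a real back-up policy would also cover off-site copies.

```python
import filecmp
import shutil
from pathlib import Path


def back_up(source: Path, backup_dir: Path) -> Path:
    """Copy a file into backup_dir and confirm the copy is byte-identical."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    if target.exists():
        # Never overwrite an earlier backup: each version keeps its own copy.
        raise FileExistsError(f"refusing to overwrite existing backup: {target}")
    shutil.copy2(source, target)  # copy2 also preserves timestamps where possible
    if not filecmp.cmp(source, target, shallow=False):
        raise OSError(f"backup of {source} does not match the original")
    return target
```

Refusing to overwrite is a deliberate choice here: a backup that can be silently replaced is no longer a record of what the file looked like at that point in its history.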

Data Sharing - What Can Go Wrong?

As demonstrated earlier, if you don’t have an effective data integrity system in place, your data might suffer from the effects of Chinese Whispers, changing over time until it bears little or no resemblance to the dataset that it started as.

Chinese Whispers

Briefly, data integrity may be compromised through:

  • Human error – whether malicious or unintentional (spelling errors, transcription errors, entering incorrect data, entering correct data into the wrong place, etc.)
  • Transfer errors – including unintended alterations or data compromise during transfer from one device to another
  • Bugs – also viruses, malware, hacking, and other cyber threats
  • Compromised hardware – such as a device or disk crash
  • A lack of metadata – the data may be available, but the information needed to understand and use those data may be missing, making the data useless for anyone outside the immediate research team

“If you put rubbish in, you get rubbish out” Deans told me, “but most dangerous of all” he added, “is when you have good data that is turned to rubbish via error – if you don’t recognise that error then your rubbish out is potentially believed”.

How to Avoid Data Errors and Data Corruption

As mentioned above, maintaining a single dataset as an authoritative source of truth is critical to data integrity. If you have any data, whether it is a departmental database, a Microsoft Excel spreadsheet, a list of clients, passwords, documents or any other type of data, there should be a single source of truth for these data.

A Single Source of Truth For Our Times - Honest

Data Integrity Principles

As such, these are the minimum you should have in order to maintain data integrity:

  • A single authoritative data source
  • Version control
  • Back-up systems (both on-site and off-site)
  • A gate-keeper (a source of responsibility for the data)
  • Maintenance procedures, including adequate training for those involved
  • Documentation of procedures, ensuring that data handling standards are upheld
  • An access policy which determines who may access the data (including the correct level of access for each user based on their training and role), and the assignment of read/write privileges
  • A user record-keeping strategy, detailing who gained access, for what purpose and what they did with the data
  • A reporting system, where users of the data can report errors back to the original source
  • An auditing procedure, so that individuals can be held accountable for any inaccurate data entered into the system
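
For a small dataset, the record-keeping and auditing items in the list above could be as simple as an append-only log file. The sketch below assumes a JSON Lines file and a hypothetical `log_access` helper; the field names are illustrative, not a standard.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def log_access(log_file: Path, user: str, purpose: str, action: str) -> dict:
    """Append one access record (who, when, why, what) to the log file."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "purpose": purpose,
        "action": action,
    }
    # Mode "a" only ever adds to the end, so earlier records are left untouched.
    with log_file.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return entry
```

Each line in the resulting file is one self-contained record, which makes the log easy to read back when an error needs to be traced to its source.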

How to Protect Your Data - Small and Large

While this may seem a lot to ask in terms of maintaining data integrity, these procedures can be, how can I put this – elastic – depending on the importance of your data. For example, documents detailing data handling standards might be little more than a ‘cheat-sheet’ for a small Excel dataset, but could be more akin to ‘War and Peace’ for a large departmental database. An access policy might be a list in a Word document for a small dataset, but for a large, important database could be an entire security system with password control, retina scanning and entering need-to-know information, such as your boss’ inside leg measurement.

Data Security

The size and cost of the procedures you put in place should be proportional to the perceived value of the dataset that you wish to protect.

A Good Example of Data Integrity in Action

Dr Catherine Paterson, a Lecturer in the School of Nursing and Midwifery at Robert Gordon University, gave me some insights into how they collected and shared data in an international, multi-centre study across the UK and Australia:

“We developed an agreed coding book for the whole research team and a master data file with precisely the same variables and labels. This was distributed to all those involved in data entry, and facilitated easy merging of the UK and Australian datasets”. Although they did experience some inevitable disparities, “the clear and transparent coding of the variables from the outset minimised problems and data failure”.
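
An agreed coding book also lends itself to automated checking before datasets are merged. The sketch below is not the study team's actual process; the variable names, codes and the `check_against_codebook` function are all hypothetical and only illustrate the idea.

```python
# Hypothetical codebook: variable name -> set of agreed codes (None = free text).
CODEBOOK = {
    "patient_id": None,
    "site": {"UK", "AU"},
    "sex": {"F", "M"},
}


def check_against_codebook(rows, codebook):
    """Return a list of human-readable problems; an empty list means all is well."""
    problems = []
    expected = set(codebook)
    for index, row in enumerate(rows):
        for name in sorted(expected - set(row)):
            problems.append(f"row {index}: missing variable '{name}'")
        for name in sorted(set(row) - expected):
            problems.append(f"row {index}: unexpected variable '{name}'")
        for name in sorted(expected & set(row)):
            allowed = codebook[name]
            if allowed is not None and row[name] not in allowed:
                problems.append(f"row {index}: {name}={row[name]!r} is not an agreed code")
    return problems
```

Running a check like this on each site's file before merging means disparities are caught while the people who entered the data can still explain them.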

On the issue of data entry, Steven said that “if possible, implement electronic data recording, rather than manual recording”, and in terms of ensuring the data are correct, he told me that you should “have all data checked and signed off as accurate by an operator, then verified by a second operator”.
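
One way to support a second operator's verification is to compare two independently entered copies of the same records and list where they differ. This is a sketch of that approach, which may not be exactly what Steven has in mind; `find_discrepancies` and the record layout are assumptions.

```python
def find_discrepancies(first_entry, second_entry):
    """Compare two {record_id: record} mappings and list every mismatch.

    Returns (record_id, first_value, second_value) tuples; a value of None
    means that operator did not enter the record at all.
    """
    discrepancies = []
    for record_id in sorted(set(first_entry) | set(second_entry)):
        first = first_entry.get(record_id)
        second = second_entry.get(record_id)
        if first != second:
            discrepancies.append((record_id, first, second))
    return discrepancies
```

An empty result means both operators agree on every record; anything else is a list of records to check against the original source before sign-off.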

Electronic Data Entry

Deans, speaking from bitter personal experience, insists that you should “never have only one person who knows the password”. After all, passwords are data too, and backup copies need to be kept. Unless you like losing access to entire datasets that you’ve carefully built and maintained for several years…

Summary - 3 Critical Aspects of Data Integrity

The three most important take-away messages when it comes to maintaining integrity in any data that is shared are that:

  1. There should be a single source of truth for that data
  2. All changes should be traceable back to the origin via version control
  3. Every version should be backed up

All other procedures are really only created to maintain these three critical aspects of data integrity.

I’ll leave you with a little advice from Deans Buchanan, who recognises that shared wisdom itself is also version controlled and backed up:

“You’ve got to plan ahead. Seek advice from those who have experience and made all the usual mistakes. Listen to them! Then commit to data entry and sharing being the foundation of the work”.

Wise words indeed…
