Data Integrity – Don’t Let Chinese Whispers Kill Your Data

I’m sure you’ve all heard about the party game Chinese Whispers, where a message is given to a person at one end of a line and is then whispered to the next, then the next, and so on until it reaches its eventual destination at the other end. Typically, small changes in the message occur at each stopping point until the end message is something that bears little or no resemblance to what it started life as.

World War One gave us a famous, though possibly apocryphal, example of what can happen when a message suffers from Chinese Whispers. A message sent from the trenches to British headquarters started as:

Send reinforcements, we’re going to advance

By the time the message had reached HQ it had become:

Send three and fourpence, we're going to a dance

We can laugh about it now, but I can’t help but wonder just how many lives could have been saved if the message had reached HQ unmolested.


So what has this got to do with data integrity, I hear you ask. Well, if you work with shared datasets, your data can suffer the same fate as the British message from the trenches. Over time, as your dataset is passed around, small changes and errors introduced – accidentally or otherwise – can kill the accuracy of your data, until what started out as a perfectly reasonable dataset is no longer fit for purpose.

In this blog post, we’re going to take a look at data integrity in the context of shared data to see if we can introduce procedures that will guard against Chinese Whispers. I’ve also interviewed a few experts in the field to get their take on things.


What is Data Integrity?

The principle of data integrity is that data should be recorded exactly as intended and, when later retrieved, should be the same as when it was recorded.

To do this, any data handling procedures we put in place must ensure the accuracy, reliability and consistency of data over its entire life cycle.

As an attempt to ensure integrity in its data, the FDA uses the acronym ALCOA to define data integrity standards, where data is expected to be:

  • Attributable – Data should clearly demonstrate who observed and recorded it, when it was observed and recorded, and who or what it is about
  • Legible – Data should be easy to understand, recorded permanently and original entries should be preserved
  • Contemporaneous – Data should be recorded at the same time as it was observed
  • Original – Source data should be accessible and preserved in its original form
  • Accurate – Data should be free from errors


Why is Data Integrity Important?

Central to data integrity is the principle that there should be a single, definitive source of data: one authoritative dataset that serves as the single source of truth. Here, we need to think SMaRRT, where having a single, well-controlled and well-defined data integrity system increases:

  • Stability – all data integrity operations are performed in one centralised system, ensuring consistency and repeatability of operations
  • Maintainability – one centralised system makes all data integrity administration simpler, more streamlined and less costly
  • Reusability – all applications benefit from a single centralised data integrity system
  • Recoverability – a single, centralised data source can be backed up regularly, ensuring easy recovery in case of data corruption or accidental loss
  • Traceability – every data point should be traceable back to its origin


This last point – traceability – is particularly important in terms of data integrity. When we think of a database or dataset, we typically think of something static and unchanging, but it isn’t. When we first collect data, perhaps on paper, there will be errors in those data. It’s unavoidable – we’re only human, after all.

Principles of data integrity state that we need to maintain those data in their original state. And then we clean them, which means changing the data from their original state to something else. So how can we keep a dataset in its original state, change it, and still have access to the original?

The answer is that we need to keep multiple copies of the same dataset, one for each stage of handling (cleaning, preparation, etc.). This is called version control, and it allows us to keep a chronological record of everything that has been done to the dataset, right from the original to the current data. It also works as a natural back-up system, although you will still need a separate back-up policy.

“Version control is extremely important,” says Dr Deans Buchanan, Palliative Medicine Consultant and Clinical Lead at NHS Tayside. “You need to know what’s updated and when”. He recommends naming files with date prefixes in this format:

  • “year.month.day – name of file”


Examples might be:

  • “2018.08.28 – Version Control Your Files.docx”
  • “2018.08.31 – Cancer Dataset.xlsx”
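If you’d rather not type those prefixes by hand, a few lines of Python can build them for you. This is only a minimal sketch: the function name `versioned_name` is my own invention rather than anything Deans uses, and it assumes you want exactly the “year.month.day – name” layout shown above.

```python
from datetime import date


def versioned_name(name: str, on: date) -> str:
    """Prefix a file name with its date in year.month.day form."""
    # Zero-padded, year-first dates sort chronologically when sorted as text.
    return f"{on:%Y.%m.%d} – {name}"


print(versioned_name("Cancer Dataset.xlsx", date(2018, 8, 31)))
# prints: 2018.08.31 – Cancer Dataset.xlsx
```

Because the year comes first and the month and day are zero-padded, sorting a folder by file name also sorts it by date, which is what makes this format so handy.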


Dr Steven Wall, Director of SJW Bio-Consulting Limited, echoes these sentiments, maintaining that you should “version control every change, and with each change highlight what was changed and who performed the up-revision”. He also insists that “full transparency and openness is key to building trust with all partners”, including reporting all decisions and operations “whether bad or good”.

Following all this advice, every time you create a new version of a file you should immediately make at least one backup copy. In this way, every file has both a history and a backup, and they are all listed in chronological order.
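The backup step is easy to forget, so it can be worth scripting. Here is a rough sketch of how that might look in Python, assuming each version lives in its own file and backups go into a separate folder. The name `back_up` and the folder layout are hypothetical choices of mine, not a prescribed method.

```python
import filecmp
import shutil
from pathlib import Path


def back_up(version_file: Path, backup_dir: Path) -> Path:
    """Copy one version of a file into backup_dir and confirm the copy matches."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / version_file.name
    # Refuse to overwrite: an existing backup is part of the file's history.
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    shutil.copy2(version_file, target)  # copy2 also keeps timestamps where it can
    # shallow=False compares the actual bytes, not just size and modified time.
    if not filecmp.cmp(version_file, target, shallow=False):
        raise OSError(f"backup of {version_file} does not match the original")
    return target
```

The refusal to overwrite is deliberate. If a backup with that name already exists, something has gone wrong with your versioning, and it’s better to find out than to quietly lose a copy.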


What Can Go Wrong?

As demonstrated earlier, if you don’t have an effective data integrity system in place, your data might suffer from the effects of Chinese Whispers, changing over time until it bears little or no resemblance to the dataset that it started as.

Briefly, data integrity may be compromised through:

  • Human action – whether malicious or unintentional (spelling errors, transcription errors, entering incorrect data, entering correct data into the wrong place, etc.)
  • Transfer errors – including unintended alterations or data compromise during transfer from one device to another
  • Software problems – including bugs, viruses, malware, hacking and other cyber threats
  • Compromised hardware – such as a device or disk crash
  • A lack of metadata – the data may be available, but the information needed to understand and use those data may be missing, making the data useless for anyone outside the immediate research team
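Several of these problems, transfer errors in particular, can be caught by recording a checksum (a short fingerprint of the data) before the data leaves your hands and comparing it on arrival. The sketch below uses SHA-256 from Python’s standard `hashlib` module. The `fingerprint` helper is just an illustrative name, and I’ve borrowed the message from the trenches to stand in for a dataset.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 checksum of some data as a hex string."""
    return hashlib.sha256(data).hexdigest()


sent = b"Send reinforcements, we're going to advance"
received = b"Send three and fourpence, we're going to a dance"

# Record the fingerprint at the source, before the data is passed on.
recorded = fingerprint(sent)

print(fingerprint(received) == recorded)  # prints: False
print(fingerprint(sent) == recorded)      # prints: True
```

A checksum only tells you that something changed, not what changed or who changed it, so it complements version control rather than replacing it.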


“If you put rubbish in, you get rubbish out”, Deans told me, “but most dangerous of all”, he added, “is when you have good data that is turned to rubbish via error – if you don’t recognise that error then your rubbish out is potentially believed”.


How to Safeguard Against Errors and Data Corruption

As mentioned above, maintaining a single dataset as an authoritative source of truth is critical to data integrity. If you have any data, whether it is a departmental database, a Microsoft Excel spreadsheet, a list of clients, passwords, documents or any other type of data, there should be a single source of truth for these data.

As such, these are the minimum you should have in order to maintain data integrity:

  • A single authoritative data source
  • Version control
  • Back-up systems (both on-site and off-site)
  • A gate-keeper (a source of responsibility for the data)
  • Maintenance procedures, including adequate training for those involved
  • Documentation of procedures, ensuring that data handling standards are upheld
  • An access policy which determines who may access the data (including correct level of access to users based on their training and role), and assignment of read/write privileges
  • A user record-keeping strategy, detailing who gained access, for what purpose and what they did with the data
  • A reporting system, where users of the data can report errors back to the original source
  • An auditing procedure, so that individuals can be held accountable for any inaccurate data entered into the system
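For a small dataset, the user record-keeping item on this list needn’t be complicated. As a minimal sketch, assuming a plain CSV file is an acceptable log for your purposes, something like the following would record who did what and why. The column names and the `log_access` function are hypothetical, so adapt them to your own policy.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical columns for the access log.
FIELDS = ["timestamp", "user", "action", "purpose"]


def log_access(log_file: Path, user: str, action: str, purpose: str) -> None:
    """Append one line to the access log, creating the file if needed."""
    is_new = not log_file.exists()
    # Mode "a" only ever adds to the end, so earlier entries are left alone.
    with log_file.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({
            "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
            "user": user,
            "action": action,
            "purpose": purpose,
        })
```

A call such as `log_access(Path("access_log.csv"), "alice", "edit", "corrected row 12")` adds one timestamped line. Bear in mind that a CSV file can itself be edited, so a large or sensitive database would need something sturdier.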


While this may seem a lot to ask in terms of maintaining data integrity, these procedures can be, how can I put this – elastic – depending on the importance of your data. For example, documents detailing data handling standards might be little more than a ‘cheat-sheet’ for a small Excel dataset, but could be more akin to ‘War and Peace’ for a large departmental database. An access policy might be a list in a Word document for a small dataset, but for a large, important database could be an entire security system with password control, retina scanning and entering need-to-know information, such as your boss’ inside leg measurement.

The size and cost of the procedures you put in place should be proportional to the perceived value of the dataset that you wish to protect.

Dr Catherine Paterson, a Lecturer in the School of Nursing and Midwifery at Robert Gordon University, gave me some insights into how they collected and shared data in an international, multi-centre study across the UK and Australia:

“We developed an agreed coding book for the whole research team and a master data file with precisely the same variables and labels. This was distributed to all those involved in data entry, and facilitated easy merging of the UK and Australian datasets”. Although they did experience some inevitable disparities, “the clear and transparent coding of the variables from the outset minimised problems and data failure”.
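To give a flavour of how an agreed coding book can be enforced in practice, here is a small Python sketch that checks a row of data against one. I haven’t seen Catherine’s actual coding book, so the variables and codes below are entirely made up for illustration.

```python
# Hypothetical code book: variable name -> the set of codes everyone agreed on.
CODEBOOK = {
    "site": {"UK", "AU"},
    "sex": {"F", "M"},
    "smoker": {"0", "1"},
}


def check_row(row: dict[str, str]) -> list[str]:
    """Return a list of problems found in one data row (empty if none)."""
    problems = []
    for variable in sorted(CODEBOOK.keys() - row.keys()):
        problems.append(f"{variable}: missing from this row")
    for variable in sorted(row.keys() - CODEBOOK.keys()):
        problems.append(f"{variable}: not in the code book")
    for variable, allowed in CODEBOOK.items():
        if variable in row and row[variable] not in allowed:
            problems.append(f"{variable}: {row[variable]!r} is not an agreed code")
    return problems


print(check_row({"site": "UK", "sex": "F", "smoker": "0"}))
# prints: []
print(check_row({"site": "Scotland", "sex": "F", "smoker": "0"}))
# prints: ["site: 'Scotland' is not an agreed code"]
```

Running a check like this at each site before the files are merged means disparities surface early, while the person who entered the data can still explain them.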

On the issue of data entry, Steven said that “if possible, implement electronic data recording, rather than manual recording”, and in terms of ensuring the data are correct, he told me that you should “have all data checked and signed off as accurate by an operator, then verified by a second operator”.

Deans, speaking from bitter personal experience, insists that you should “never have only one person who knows the password”. After all, passwords are data too, and backup copies need to be kept. Unless you like losing access to entire datasets that you’ve carefully built and maintained for several years…


Summary

The three most important take-away messages when it comes to maintaining integrity in any data that is shared are that:

  1. there should be a single source of truth for that data
  2. all changes should be traceable back to the origin via version control
  3. every version should be backed up


All other procedures are really only created to maintain these three critical aspects of data integrity.

I’ll leave you with a little advice from Deans Buchanan, who recognises that shared wisdom itself is also version controlled and backed up:

“You’ve got to plan ahead. Seek advice from those who have experience and made all the usual mistakes. Listen to them! Then commit to data entry and sharing being the foundation of the work”.

Wise words indeed…
