Without proper procedures, sharing your data can put your data integrity at risk. Just 3 vital steps will save your dataset, though - and we reveal them here...
I’m sure you’ve all heard about the party game Chinese Whispers, where a message is given to a person at one end of a line and is then whispered to the next, then the next, and so on until it reaches its eventual destination at the other end. Typically, small changes in the message occur at each stopping point until the end message is something that bears little or no resemblance to what it started life as.
World War One gave us a staggering real-world example of what can happen when a message suffers from Chinese Whispers. A message sent from the trenches to British headquarters started as:
Send reinforcements, we’re going to advance
By the time the message had reached HQ it had become:
Send three and fourpence, we're going to a dance
We can laugh about it now, but I can’t help but wonder just how many lives could have been saved if the message had reached HQ unmolested.
So what has this got to do with data integrity, I hear you ask. Well, if you work with shared datasets, your data can suffer the same fate as the British message in the trenches. Over time, as your dataset is passed around, small changes and errors introduced – accidentally or otherwise – can kill the accuracy of your data, until what started out as a perfectly reasonable dataset is no longer fit for purpose.
In this blog post, we’re going to take a look at data integrity in the context of shared data to see if we can introduce procedures that will guard against Chinese Whispers. I’ve also interviewed a few experts in the field to get their take on things.
What is Data Integrity?
The principle of data integrity is that data should be recorded exactly as intended and, when later retrieved, should be the same as when they were recorded.
To do this, any data handling procedures we put in place must ensure the accuracy, reliability and consistency of data over its entire life cycle.
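One simple way to check that a file is "the same as when it was recorded" is to note a checksum when you first save it and compare it again later. The sketch below assumes SHA-256 checksums from Python's standard library; the function names are my own, and this is just one possible approach rather than a complete integrity system.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path: Path, recorded_digest: str) -> bool:
    """True if the file still matches the digest noted when it was recorded."""
    return sha256_of(path) == recorded_digest
```

Store the digest somewhere separate from the dataset itself. If `is_unchanged` later returns `False`, you know the file has been altered, although not how or by whom.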
To help ensure integrity in its data, the FDA uses the acronym ALCOA to define data integrity standards, where data are expected to be Attributable, Legible, Contemporaneous, Original and Accurate.
Why is Data Integrity Important?
Central to data integrity is the principle that there should be one definitive, authoritative dataset that serves as a single source of truth. Here, we need to think SMaRRT, where having a single, well-controlled and well-defined data integrity system increases:
This last point – traceability – is particularly important in terms of data integrity. When we think of a database or dataset, we typically think in terms of something that is static and unchanging, but it isn’t. When we first collect data, perhaps on paper, there will be errors in those data. It’s unavoidable – we’re only human, after all.
Principles of data integrity state that we need to maintain those data in the original state. And then we clean them. This means changing the data from the original state to something else. So how can we keep a dataset in its original state, change it, and still have access to the original?
The answer is that we need to keep multiple copies of the same dataset in various stages of handling (cleaning, preparation, etc.). This is called version control, and allows us to keep a chronological record of everything that has been done to the dataset right from the original to the current data. This also has the benefit of being a natural back-up system, although you will also need a separate back-up policy.
“Version control is extremely important,” says Dr Deans Buchanan, Palliative Medicine Consultant and Clinical Lead at NHS Tayside. “You need to know what’s updated and when”. He recommends naming files with date prefixes in this format:
year.month.day – name of file
Examples might be:
2018.08.28 – Version Control Your Files.docx
2018.08.31 – Cancer Dataset.xlsx
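If you would rather not type the prefix by hand, a small helper can build it for you. This is a minimal sketch: the function name `dated_name` is my own invention, and it uses a plain hyphen in place of the dash because that is easier to type in a file name.

```python
from datetime import date


def dated_name(name: str, when: date) -> str:
    """Prefix a file name with the date as year.month.day."""
    return f"{when:%Y.%m.%d} - {name}"


names = [
    dated_name("Cancer Dataset.xlsx", date(2018, 8, 31)),
    dated_name("Cancer Dataset.xlsx", date(2018, 8, 28)),
]
# Zero-padded year.month.day prefixes sort alphabetically in date order.
print(sorted(names))
```

Because the year comes first and the month and day are zero-padded, an ordinary alphabetical sort in your file browser lists the versions chronologically.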
Dr Steven Wall, Director of SJW Bio-Consulting Limited, echoes these sentiments, maintaining that you should “version control every change, and with each change highlight what was changed and who performed the up-revision”. He also insists that “full transparency and openness is key to building trust with all partners”, including reporting all decisions and operations “whether bad or good”.
Following all this advice, every time you create a new version of a file you should immediately make at least one backup copy. In this way, every file has both a history and a backup, and they are all listed in chronological order.
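The routine of "new version, immediate backup, note who changed what" can be automated. The sketch below is one possible arrangement, not a prescribed method: the function name, the folder layout and the `changelog.csv` file are all assumptions of mine, and a plain hyphen stands in for the dash in the file name.

```python
import csv
import shutil
from datetime import date
from pathlib import Path


def save_version(current: Path, archive: Path, backup: Path,
                 when: date, author: str, note: str) -> Path:
    """Store a date-prefixed copy of `current`, back it up, and log the change."""
    archive.mkdir(parents=True, exist_ok=True)
    backup.mkdir(parents=True, exist_ok=True)
    version = archive / f"{when:%Y.%m.%d} - {current.name}"
    if version.exists():
        # Earlier versions are the history, so they are never overwritten.
        raise FileExistsError(f"{version.name} already exists")
    shutil.copy2(current, version)                # the new version
    shutil.copy2(version, backup / version.name)  # its immediate backup
    log = archive / "changelog.csv"
    with log.open("a", newline="", encoding="utf-8") as handle:
        # One row per version: when, which file, who changed it, and what changed.
        csv.writer(handle).writerow([when.isoformat(), version.name, author, note])
    return version
```

In practice the backup folder should live on a different disk or service from the archive, otherwise a single failure takes out both copies.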
What Can Go Wrong?
As demonstrated earlier, if you don’t have an effective data integrity system in place, your data might suffer from the effects of Chinese Whispers, changing over time until it bears little or no resemblance to the dataset that it started as.
Briefly, data integrity may be compromised through:
“If you put rubbish in, you get rubbish out”, Deans told me, “but most dangerous of all”, he added, “is when you have good data that is turned to rubbish via error – if you don’t recognise that error then your rubbish out is potentially believed”.
How to Safeguard Against Errors and Data Corruption
As mentioned above, maintaining a single dataset as an authoritative source of truth is critical to data integrity. If you have any data, whether it is a departmental database, a Microsoft Excel spreadsheet, a list of clients, passwords, documents or any other type of data, there should be a single source of truth for these data.
As such, these are the minimum you should have in order to maintain data integrity:
While this may seem a lot to ask in terms of maintaining data integrity, these procedures can be, how can I put this – elastic – depending on the importance of your data. For example, documents detailing data handling standards might be little more than a ‘cheat-sheet’ for a small Excel dataset, but could be more akin to ‘War and Peace’ for a large departmental database. An access policy might be a list in a Word document for a small dataset, but for a large, important database could be an entire security system with password control, retina scanning and entering need-to-know information, such as your boss’ inside leg measurement.
The size and cost of the procedures you put in place should be proportional to the perceived value of the dataset that you wish to protect.
Dr Catherine Paterson, a Lecturer in the School of Nursing and Midwifery at Robert Gordon University, gave me some insights into how they collected and shared data in an international, multi-centre study across the UK and Australia:
“We developed an agreed coding book for the whole research team and a master data file with precisely the same variables and labels. This was distributed to all those involved in data entry, and facilitated easy merging of the UK and Australian datasets”. Although they did experience some inevitable disparities, “the clear and transparent coding of the variables from the outset minimised problems and data failure”.
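To give a flavour of how a shared coding book makes merging easier, here is a minimal sketch that checks each site's records against the coding book before combining them. The variable names and codes are hypothetical and are not taken from Catherine's study; they only illustrate the idea.

```python
# Hypothetical coding book: variable name -> allowed codes (None = any value).
CODEBOOK = {
    "patient_id": None,
    "site": {"UK", "AU"},
    "sex": {"1", "2"},
}


def check_against_codebook(rows, codebook=CODEBOOK):
    """Return a list of problems found in `rows` (a list of dicts)."""
    problems = []
    for number, row in enumerate(rows, start=1):
        if set(row) != set(codebook):
            problems.append(f"row {number}: variables differ from the coding book")
            continue
        for variable, allowed in codebook.items():
            if allowed is not None and row[variable] not in allowed:
                problems.append(f"row {number}: {variable}={row[variable]!r} is not a valid code")
    return problems


def merge_sites(*datasets, codebook=CODEBOOK):
    """Merge datasets only if every one of them conforms to the coding book."""
    merged = []
    for rows in datasets:
        problems = check_against_codebook(rows, codebook)
        if problems:
            raise ValueError("; ".join(problems))
        merged.extend(rows)
    return merged
```

Refusing to merge until every problem is resolved is a deliberate choice here: it is far easier to fix a miscoded value at the site that entered it than to find it later in the combined dataset.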
On the issue of data entry, Steven said that “if possible, implement electronic data recording, rather than manual recording”, and in terms of ensuring the data are correct, he told me that you should “have all data checked and signed off as accurate by an operator, then verified by a second operator”.
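One way to support that second check is for the second operator to key in the records independently and then compare the two sets. Note that this double-entry approach is my assumption rather than what Steven describes, and the function name and record layout below are hypothetical.

```python
def compare_entries(first, second):
    """List every disagreement between two operators' entries.

    Both arguments map a record ID to a dict of field values. Each result
    is a tuple of (record_id, field, first_value, second_value).
    """
    discrepancies = []
    for record_id in sorted(set(first) | set(second)):
        a = first.get(record_id, {})
        b = second.get(record_id, {})
        for field in sorted(set(a) | set(b)):
            if a.get(field) != b.get(field):
                discrepancies.append((record_id, field, a.get(field), b.get(field)))
    return discrepancies
```

An empty list means the two operators agree on every field. Anything else is a short to-do list to check against the source documents before the data are signed off.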
Deans, speaking from bitter personal experience, insists that you should “never have only one person who knows the password”. After all, passwords are data too, and backup copies need to be kept. Unless you like losing access to entire datasets that you’ve carefully built and maintained for several years…
The three most important take-away messages when it comes to maintaining integrity in any data that is shared are that:
1. There should be a single source of truth for that data
2. All changes should be traceable back to the origin via version control
3. Every version should be backed up
All other procedures are really only created to maintain these three critical aspects of data integrity.
I’ll leave you with a little advice from Deans Buchanan, who recognises that shared wisdom itself is also version controlled and backed up:
“You’ve got to plan ahead. Seek advice from those who have experience and made all the usual mistakes. Listen to them! Then commit to data entry and sharing being the foundation of the work”.
Wise words indeed…