Chapter 02 - Grand Project

Lesson 01 - Grand Project:

Part 1 - Data Cleaning

Download

Download the Excel file below.

This is a much larger dataset than you've been presented with so far, but it doesn't matter - you now have all the skills you need to clean it and render it fit-for-purpose and analysis-ready.

Take 20-30 minutes completing the exercises in Excel.

If you're coding in Python/R, you will probably want to set aside an equal amount of time to complete these exercises.

Stopwatch - Start Timing

Go and get a stopwatch (or an app on your phone or whatever) and time all of your data cleaning operations.

Trust me - you'll get a real buzz out of how quickly you can clean your data now that you know how!

Go ahead, go and get one. I'll wait...

Practicing in Excel (and Python/R - optional)

Hot Tip

Data preparation isn't an academic exercise - you have to do it and practice it to gain experience.


Don't skip these exercises!


This is your chance to really take your data prep skills to the next level!

Exercise 1: Remove Unwanted Spaces

In sheet 'Raw Data', apply the Grand Unified Theory of Data Cleaning to these data.

Remember what this does?

It removes all unwanted spaces, printing characters and case-controls your entire dataset in one awesome ninja move.

Don't forget the basic principles - never clean your dataset in your latest worksheet. Always create a separate worksheet and work on it there.

And remember to make copies of your live worksheet at various stages to version control your processes!

If you need a refresher, you'll find the relevant lesson here:

Exercise 2: Clean Your Text Data

Now clean your text data using your specially-created VLOOKUP worksheet.

You should have a translation matrix already created - all you need to do is adjust it for each text variable.

The codes for what should be in the dataset are in worksheet 'Codes'.

Go through each text variable one at a time.

If you see an entry that is not listed on the code sheet then it shouldn't be there - clean it out!

If you need a refresher, you'll find the relevant lesson here:

Exercise 3: Clean Your Numerical Data

Now clean your numerical data using your specially-created ISNUMBER worksheet.

Find all the cells that are contaminated and clean them up.

Not sure what to do with the contaminated data?

Inspect each error and make a decision. Decide whether to fix it or delete it.

If you need a refresher, you'll find the relevant lesson here:

This size of dataset is very typical of the datasets I used to receive every day as a Consultant Medical Statistician. At the beginning it used to take me about 2 weeks to clean it manually, until I found a better way.

You should be capable of cleaning this entire dataset of 19 columns and 1953 rows in less than 30 minutes!

Stopwatch - Stop Timing

So how did you do?

How long did it take you to go through each exercise and overall?

Make a note, and add them to the next set of exercises coming up - you'll end up with a total time for all your data operations!


Submit your timings to the relevant forum page:

Remember Me
Success message!
Warning message!
Error message!