
Data Cleaning Bootcamp - in Excel, Python and R

Chapter 1 - Introduction to Data Cleaning

Lesson 3 - Synopsis

Setting The Scene

Data analysts, statisticians and data scientists typically spend 60-80% of their time cleaning and preparing data before they can get started on the analysis. Depending on the dataset, this process can take up to two weeks to complete!

With a solid process and the right tricks of the trade, though, datasets – irrespective of their size – can be rendered analysis-ready in just a couple of hours.

In this course, you’re a statistician in a busy research-active teaching hospital.

Most days, researchers will come into your office with a new dataset in Microsoft Excel that they want you to analyse, and they often have short deadlines, usually within two weeks.

The problem is that no dataset you’re given is ever analysis-ready. You need to clean and prepare each one before you can get started, and doing that manually often takes up to two weeks, for every dataset!

At this stage, you’re no doubt thinking to yourself that this is an impossible, unrealistic situation.

Au contraire! This was the precise situation your author found himself in, and in 8 years in the job he never missed a deadline. EVER!

The key is to automate your data handling processes as much as possible. This course is designed to give you all the Excel tools you need to fast-track those processes and get any dataset analysis-ready in just a couple of hours.

Moreover, you will be encouraged to grow beyond Excel by writing your own code in R, or in Python using the pandas library, and submitting it to the forum, so you will have data cleaning and preparation skills to die for!

Why do I want you to submit your code to the forum?

So that we can all help each other become better coders and faster data cleaners!
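To give a flavour of the kind of script you might share on the forum, here is a minimal sketch in Python with pandas. It is only an illustration, not the course's own method: the function name basic_clean, the column names and the sample values are all made up for this example, and the steps shown (tidying headers, trimming whitespace, dropping empty and duplicate rows) are just a few generic clean-up actions.

```python
import pandas as pd


def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply a few generic tidy-up steps to a raw dataset (illustrative only)."""
    out = df.copy()

    # Tidy the column headers: trim whitespace, lower-case, spaces to underscores
    out.columns = [str(c).strip().lower().replace(" ", "_") for c in out.columns]

    # Trim stray whitespace from text cells; non-text values are left untouched
    for col in out.columns:
        out[col] = out[col].map(lambda v: v.strip() if isinstance(v, str) else v)

    # Drop rows that are completely empty, then rows that are exact duplicates
    out = out.dropna(how="all").drop_duplicates()

    return out.reset_index(drop=True)


# Hypothetical raw data of the sort a researcher might hand over
raw = pd.DataFrame({
    " Patient ID ": [1, 2, 2, None],
    "Sex": [" F", "M ", "M ", None],
})

clean = basic_clean(raw)
print(clean)
```

Because the steps live in one function, the same clean-up can be rerun on the next dataset rather than repeated by hand, which is the point of automating.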

Chi-Squared Innovations