The Hive - Learn, Help, Connect

Data Cleaning Bootcamp - in Excel, Python and R

Chapter 2 - Removing Unwanted Spaces

Lesson 6 - Practice Session - Excel, Python & R

Practicing in Excel

Here's where you get chance to put into practice what you've just learnt.

Hot Tip

Data cleaning isn't an academic exercise - you have to do it and practice it to gain experience.


Don't skip these exercises!


This is your chance to really take your data cleaning skills to the next level!

Before you get started, download the following Excel files to your computer:

Exercise 1: Practical Data Cleaning Excel Template

Go through the Excel file and get familiar with each of the worksheets.

Note how data flows chronologically through the workbook.

Take a sample of your own data and add it into this file (use the data collection best practices from before to guide you).

If you do this correctly, all the descriptive statistics should update automatically.

Exercise 2: Remove Unwanted Spaces

All the Excel code you need is contained within the Excel Practice File file.

Get familiar with the equation for removing unwanted spaces – it's useful to understand how it works.

Practice removing unwanted spaces using the steps outlined in the video lecture, in particular, note the advantages and disadvantages of using it.

Under what circumstances would you use it, and when would you use some other technique?

This is a critically important technique – practice it and master it!

Coding in Python and/or R

In the last lesson you learnt how to use the Excel formulae TRIM, CLEAN and SUBSTITUTE.

Investigate what alternatives exist in your choice of Python and/or R.

Exercise 3: TRIM()

What similar functions exist as an alternative to Excel's TRIM()?

Python hint:

strip()

R hint:

str_trim() and str_squish()

Do these functions trim all spaces from the string or just some of them?

Does this function trim interstitial spaces (more than one space inbetween words)?

Is there some other code you need to write to achieve this?

What else does the function do? Does it trim any non-printing characters or line breaks?

It is important to understand what your chosen function can and cannot do, so you can make any necessary adjustments.

Create an object in Python and/or R that replicates Excel's TRIM() function and submit it to the relevant forum page:

Exercise 4: CLEAN()

What similar functions exist as an alternative to Excel's CLEAN()?

Python hint:

encode("ascii", "ignore") and decode("ascii")

(to encode your data into ascii format to strip out the non-ASCII characters, then decode it back to the original format)

R hint:

iconv(mydata, "latin1", "ASCII", sub="")

Do these functions clean all non-ASCII (non-printing) characters from the string or just some of them?

Which non-printing characters are excluded (if any)?

Is there some other code you need to write to achieve this?

What else does the function do? Does it clean any standard characters or anything else?

It is important to understand what your chosen function can and cannot do, so you can make any necessary adjustments.

Create an object in Python and/or R that replicates Excel's CLEAN() function and submit it to the relevant forum page:

Exercise 5: SUBSTITUTE()

What similar functions exist as an alternative to Excel's SUBSTITUTE()?

Python hint:

replace("old", "new")

R hint:

replace(mydata, "old", "new")

Do these functions substitute the characters that you need to from the string or just some of them?

Which characters are excluded (if any)?

Is there some other code you need to write to achieve this?

What else does the function do?

It is important to understand what your chosen function can and cannot do, so you can make any necessary adjustments.

Create an object in Python and/or R that replicates Excel's SUBSTITUTE() function and submit it to the relevant forum page:

Remember Me
Success message!
Warning message!
Error message!