8 Things You Need To Know About Data Wrangling in R

If you've been following along (or not), I've been learning to program in R with a little help from a friend - Matt Dancho - and writing my findings and experiences into a series of blog posts so you can follow along with my progress and maybe try out R for yourself.

This is the third post in the series, and in this episode I'm taking my first steps in R in data wrangling, aka data munging, or - as I call it - data pre-processing.

You don't have to start at the beginning to learn something from this blog post, but if you'd like to, you can find the first post here:

Getting Started With R Programming For Data Science

Disclosure: we may earn an affiliate commission for purchases you make when using the links to products on this page. As an Amazon Affiliate we earn from qualifying purchases.

Catch Up

For reasons I explained at the beginning, I'm using RStudio Cloud rather than a local install of R and RStudio.

I've installed a collection of R packages designed for Data Science called Tidyverse, imported three data tables and joined them together using Pipes.

In this blog post I'm talking about working with the data to convert them from what we have into what we need, and there are a few new things to learn.

If you're in a hurry and want to learn how to program in R for Data Science in 7 weeks, Matt's course will teach you just how to do that.

And Matt's been kind enough to offer our readers a 15% discount.

Basic Principles of Data Wrangling

Data Scientists (or analysts of any ilk) typically spend 60-80% of their time cleaning and pre-processing data before they can start to analyse them, so it's well worth learning these critical tools of the trade.

Data wrangling is all about preparing your data for analysis by transforming them from their raw form, which can't be analysed, into an analysis-ready form. The basic principles of data wrangling - ones that every self-respecting Data Scientist should know intimately - revolve around creating and destroying data, moving them around and changing them into different forms.

Ultimately, in whatever medium you're working (Excel, R, Python, Minitab or whatever), you should know how to do, at a bare minimum, the following:

Add variables to the dataset
Remove variables from the dataset
Rearrange variables
Create new variables (by calculation)
Separate data into individual columns (parsing)
Select specific variables (create a subset of variables)
Filter rows (create a subset of rows)
Sort rows (ascending and descending)

More advanced techniques would include cleaning data (including typos and mis-spellings), identifying and removing erroneous spaces, and correcting contaminated data, but for now let's focus on the more basic elements of data wrangling.

Parsing Data Into Individual Columns

Parsing is just the technical term for separating data into its logical component parts. It's a vitally important tool in any Data Science toolbox, and it's the first data wrangling lesson in Matt's course.

In R we can use the function 'separate' to do this.

Let's say that we have a column in a T-shirt dataset, 'Color_Size_Price', that holds each shirt's colour, size and price in a comma-separated form.

Admittedly, these data are actually really well-behaved in that they have consistent order, spelling, case, a consistent separator and there aren't any missing datapoints, but they provide a good starting point.

We can split this column of data into its three components using the 'separate' function (this is the R version of Excel's Text-to-Columns), like this:

myData %>%
  separate(Color_Size_Price,
           into = c("Color", "Size", "Price"),
           sep = ", ",
           remove = FALSE)

Note how easy it is to read the syntax - it's almost human:

"Separate my variable into its component parts this, that, and the other, using 'comma-space' as the separator, but don't remove the original data column from the dataset"

I don't know about you, but I like this simplicity and I'm quite enjoying learning R - unlike some of the other languages I've learned. Some of them are like pulling teeth!
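
If you'd like to try the 'separate' call for yourself, here is a minimal sketch. The T-shirt values are made up for illustration (they are not from Matt's course), and 'convert = TRUE' is an optional extra I've assumed you might want, so that 'Price' ends up as a number rather than text.

```r
library(dplyr)  # provides the %>% pipe
library(tidyr)  # provides separate()

# Hypothetical T-shirt data: the values are invented for this example
myData <- data.frame(
  Color_Size_Price = c("Red, Small, 10", "Blue, Medium, 12", "Green, Large, 15"),
  stringsAsFactors = FALSE
)

parsed <- myData %>%
  separate(Color_Size_Price,
           into = c("Color", "Size", "Price"),
           sep = ", ",
           remove = FALSE,  # keep the original combined column
           convert = TRUE)  # assumption: we want Price as a number, not text

parsed
```

The result should have four columns: the original 'Color_Size_Price' plus the new 'Color', 'Size' and 'Price'.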

Creating New Variables in R

In the next lesson, Matt showed me how to create new variables in R using the function 'mutate'.

Basically, this function is built to allow for computations that create new variables, like this:

myData %>%
  mutate(totalPrice = price * quantity)

There are lots of different mathematical operations you can do with mutate, but ultimately it takes a variable (or more than one), performs a calculation and then creates a new variable as the result.
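
As a rough, self-contained sketch of 'mutate' in action (the order data and the 'isBigOrder' variable are hypothetical, invented just for this example), note that a new variable can be used straight away by the next calculation in the same call:

```r
library(dplyr)

# Hypothetical order data: the values are invented for this example
myData <- data.frame(
  item = c("T-shirt", "Hoodie", "Cap"),
  price = c(10, 25, 8),
  quantity = c(3, 1, 2),
  stringsAsFactors = FALSE
)

withTotals <- myData %>%
  mutate(totalPrice = price * quantity,   # new variable from two existing ones
         isBigOrder = totalPrice > 20)    # can reuse totalPrice immediately

withTotals
```

The original 'myData' is left untouched; the extra columns only exist in 'withTotals' because the result was assigned to a new name.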

Adding, Deleting and Reordering Variables in R

Next up, Matt showed me how to delete unwanted variables in R using the 'select' function, and how to add columns using 'bind_cols'.

To delete variables from the dataset, all you need to do is add '-' before the variable name inside the 'select' function. So if I wanted to remove 'price' and 'quantity' from the dataset I would do this:

myData %>%
  select(-price, -quantity)

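And to add the 'price' variable back in after removing it, you use 'bind_cols'. The pipe doesn't change 'myData' itself, so the original 'price' column is still there to bind from, like this:

myData %>%
  select(-price, -quantity) %>%
  bind_cols(myData["price"])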

To add both 'price' and 'quantity' back in after removing them, you would use 'bind_cols' with the combine function 'c', like this:

myData %>%
  select(-price, -quantity) %>%
  bind_cols(myData[c("price", "quantity")])
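
Here is a minimal sketch of removing columns and binding them back on, using made-up data and hypothetical names ('trimmed', 'restored') for the intermediate results. One thing worth knowing: 'bind_cols' matches rows purely by position, so both pieces need to have their rows in the same order.

```r
library(dplyr)

# Hypothetical order data: the values are invented for this example
myData <- data.frame(
  item = c("T-shirt", "Hoodie", "Cap"),
  color = c("Red", "Blue", "Green"),
  price = c(10, 25, 8),
  quantity = c(3, 1, 2),
  stringsAsFactors = FALSE
)

# Assign the result to a new name; myData itself is not modified by the pipe
trimmed <- myData %>%
  select(-price, -quantity)

# Bind the two columns back on, taking them from the untouched myData
restored <- trimmed %>%
  bind_cols(myData[c("price", "quantity")])

names(trimmed)
names(restored)
```

If you bind a column onto a dataset that already has a column of the same name, 'bind_cols' will rename the duplicates rather than overwrite them, which is why the columns are removed first here.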

Now for reordering variables in R, where we do something that's actually quite cool (can you tell I'm enjoying myself?): we use a helper called 'everything', like this:

myData %>%
  select(price, quantity, everything())

What we're doing here is saying "I want to move 'price' to the front, then 'quantity', then have everything else after that". That is just awesome! Have you ever seen anything like that in another programming language?!??

Usually, you would have had to list the names of all the columns you want after 'price' and 'quantity' (and there could be dozens of them), or else create a data subset with 'price' and 'quantity', then another one without 'price' and 'quantity', then concatenate them, but this way is just so cool...
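
To see the reordering on a small scale, here is a sketch with made-up data. I've also included 'relocate', which wasn't part of the lesson; it is available in dplyr 1.0.0 and later, and by default moves the named columns to the front.

```r
library(dplyr)

# Hypothetical order data: the values are invented for this example
myData <- data.frame(
  item = c("T-shirt", "Hoodie", "Cap"),
  color = c("Red", "Blue", "Green"),
  price = c(10, 25, 8),
  quantity = c(3, 1, 2),
  stringsAsFactors = FALSE
)

# price first, then quantity, then all remaining columns in their original order
reordered <- myData %>%
  select(price, quantity, everything())

# Alternative (assumes dplyr >= 1.0.0): relocate() moves columns to the front by default
relocated <- myData %>%
  relocate(price, quantity)

names(reordered)
names(relocated)
```

Both approaches should give the same column order here, so which one you use is mostly a matter of taste.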

Summary

At the beginning of this post I listed a number of data wrangling operations that every Data Scientist - trainee or otherwise - should be able to do.

So how did Matt's course fare with this list?

Well, he showed us how to add, remove and rearrange variables to/from the dataset, create new variables and parse variables into their individual constituents, so he gets a tick for the first 5 things on the list.

He didn't explain how to do the final 3 - create a subset of variables, filter rows or sort rows - but in fairness, so far I've only done 10% of the course, so perhaps we'll go over these in another lesson.

We'll see...

Additional reading

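Most of the functions used here come from the package 'dplyr', which is included in the Tidyverse collection of packages. The main exception is 'separate', which comes from 'tidyr', another Tidyverse package.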

The dplyr package is a set of functions for data manipulation. It contains many functions, but it's built around 5 core ones (often called 'verbs'):

mutate
select
filter
summarise
arrange

I suspect that 'filter' will let you filter by row, 'select' will let you create a subset of variables, and 'arrange' will let you sort your data. I'm off to find out if this is correct!
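
For what it's worth, those guesses match how the dplyr documentation describes these three functions. A minimal sketch, using made-up data and a hypothetical price cut-off of 20, chains all three together:

```r
library(dplyr)

# Hypothetical order data: the values are invented for this example
myData <- data.frame(
  item = c("T-shirt", "Hoodie", "Cap", "Socks"),
  price = c(10, 25, 8, 4),
  quantity = c(3, 1, 2, 6),
  stringsAsFactors = FALSE
)

result <- myData %>%
  filter(price < 20) %>%    # keep only the rows where price is under 20
  select(item, price) %>%   # keep only these two variables
  arrange(desc(price))      # sort rows from highest to lowest price

result
```

Dropping 'desc' from the last step would sort the rows in ascending order instead.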

If you're serious about learning how to manipulate data in R, go through this list of 5 functions and learn how to use them. Then practice, practice, practice until you dream of the Tidyverse...

Check Out The Course

If you're interested in learning R programming for business and following along with me in this blog series, coding as you go, I highly recommend that you check out Matt's course. It's called Business Analysis With R, and you can check it out below.