2 Data wrangling

In this section we will learn how to handle data in a very efficient way. We will learn to

filter a data set based on variable values,
select variables,
create new variables,
group data based on variables,
dummarise the data, and
join different data sets.

We will obviously use R to solve these problems, but we do have the choice between different coding styles to do it. One way is to only use functions that already ship with R, or we can use functions from additional packages to solve the same problems. We choose the latter, and not only that: We will thoughout this course use a specific set of packages, an ecosystem if you wish (or even a philosophy of data work), called the tidyverse.

The tidyverse (tidyverse.org) is a set of R packages that work very well together, follows a consistent logic, and that enables us to write extremely clean code. We have already touched upon the difference between data wrangling using base R and the tidyverse in Chapter 1.9.

See the video below for some more details regarding the tidyverse. Some formulations in the video gives the impression that this material should be consumed on a specific day, but that are just some residue from a time when this part of the course was given intensively. You can find the cheat sheet here. See also this webpage for further information about one of the most central packages in tidyverse, dplyr.

2.1 The basics of `dplyr`

First, we warm up with a first few functions from dplyr. You can download the data set here: data-geilo.xlsx.

# Load libraries
library(dplyr)
library(tidyr)
library(readxl)

# Import data from Excel-file
sales <- read_excel("data-geilo.xlsx", sheet = "Sales")

# Look at data: 
head(sales)

# A tibble: 6 × 5
  trans customer orange cocoa  swix
  <dbl>    <dbl>  <dbl> <dbl> <dbl>
1     1        6      1     1     1
2     2        3      1     5     0
3     3        9      0     2     1
4     4       NA      2     2     2
5     5       NA      3     1     1
6     6        1      1     3     3

# All sales where cocoa > 0: 
sales[sales$cocoa > 0,]

# A tibble: 18 × 5
   trans customer orange cocoa  swix
   <dbl>    <dbl>  <dbl> <dbl> <dbl>
 1     1        6      1     1     1
 2     2        3      1     5     0
 3     3        9      0     2     1
 4     4       NA      2     2     2
 5     5       NA      3     1     1
 6     6        1      1     3     3
 7     7       NA      1     1     1
 8     8       NA      8     2     1
 9     9       NA      2     1     1
10    10       10      2    14     0
11    11       NA      3     2     2
12    13       NA      2     2     0
13    14       NA      3     4     0
14    15        3      3     3     0
15    16        3      3     1     0
16    17        5      0     5     1
17    18        1      1     3     1
18    20       NA      1     1     2

# Similarly with dplyr has a function for this: 
filter(sales, cocoa > 0)

# A tibble: 18 × 5
   trans customer orange cocoa  swix
   <dbl>    <dbl>  <dbl> <dbl> <dbl>
 1     1        6      1     1     1
 2     2        3      1     5     0
 3     3        9      0     2     1
 4     4       NA      2     2     2
 5     5       NA      3     1     1
 6     6        1      1     3     3
 7     7       NA      1     1     1
 8     8       NA      8     2     1
 9     9       NA      2     1     1
10    10       10      2    14     0
11    11       NA      3     2     2
12    13       NA      2     2     0
13    14       NA      3     4     0
14    15        3      3     3     0
15    16        3      3     1     0
16    17        5      0     5     1
17    18        1      1     3     1
18    20       NA      1     1     2

# or even better: combine dplyr and magrittr: 
sales |> 
  filter(cocoa > 0)

# A tibble: 18 × 5
   trans customer orange cocoa  swix
   <dbl>    <dbl>  <dbl> <dbl> <dbl>
 1     1        6      1     1     1
 2     2        3      1     5     0
 3     3        9      0     2     1
 4     4       NA      2     2     2
 5     5       NA      3     1     1
 6     6        1      1     3     3
 7     7       NA      1     1     1
 8     8       NA      8     2     1
 9     9       NA      2     1     1
10    10       10      2    14     0
11    11       NA      3     2     2
12    13       NA      2     2     0
13    14       NA      3     4     0
14    15        3      3     3     0
15    16        3      3     1     0
16    17        5      0     5     1
17    18        1      1     3     1
18    20       NA      1     1     2

# Some governing principles in with dplyr/tidyverse functions: 
# 
#  - The first argument is a data frame.
#  - The subsequent arguments describe what to do with the data frame. You 
#    can refer to columns in the data frame directly without using $.
#  - The result is a new data frame

# Sorting the data with dplyr: 
sales |> 
  arrange(cocoa) |> 
  head()

# A tibble: 6 × 5
  trans customer orange cocoa  swix
  <dbl>    <dbl>  <dbl> <dbl> <dbl>
1    12       10      3     0     0
2    19        7      4     0     2
3    21       42      5     0     7
4     1        6      1     1     1
5     5       NA      3     1     1
6     7       NA      1     1     1

# Sorting the data with several columns: 
sales |> 
  arrange(cocoa, orange, swix) |> 
  head()

# A tibble: 6 × 5
  trans customer orange cocoa  swix
  <dbl>    <dbl>  <dbl> <dbl> <dbl>
1    12       10      3     0     0
2    19        7      4     0     2
3    21       42      5     0     7
4     1        6      1     1     1
5     7       NA      1     1     1
6    20       NA      1     1     2

Exercise: Report the top of the dataset, sorted by

cocoa (increasing)
swix (decreasing)
orange (decreasing)

You should obtain a result equal to the data frame below. Note we only show the first six entries of the result.

# A tibble: 6 × 5
  trans customer orange cocoa  swix
  <dbl>    <dbl>  <dbl> <dbl> <dbl>
1    21       42      5     0     7
2    19        7      4     0     2
3    12       10      3     0     0
4    20       NA      1     1     2
5     5       NA      3     1     1
6     9       NA      2     1     1

Expand for solution

sales |> 
  arrange(cocoa, desc(swix), desc(orange)) |> 
  head()

# Alternatively: 
 sales |>
   arrange(cocoa, -swix, -orange) |>
   head()
 
# Note however that (-) requires num. vectors. desc can take e.g. factors as well.)

2.2 Selecting and creating new variables, and summarising a data frame

# Selecting some variables: 
head(sales[,c("swix", "cocoa")])

sales |> 
  select(cocoa, swix) |> 
  head()

# create new variables: 
sales |> 
  mutate(items = cocoa+swix+orange) |> 
  head()

sales |> 
  transmute(
    items = cocoa+swix+orange,
    trans = trans) |> 
  head()

# Summarise data: 
sales |> 
  summarise(sum_cocoa = sum(cocoa))

# Assignment 1: how many items were sold in total?
# Assignment 2: What was the max and min number of items bought by
# the people that also bought cocoa?

Exercise:

How many items were sold in total?
What is the min. and max. number of items purchased by customers that also bought at least one cocoa?

Expand for solution

# Exercise 1: 

sales |> 
  transmute(items = cocoa + swix + orange) |> 
  summarise(sum_items = sum(items))

# Exercise 2: 
sales |> 
  filter(cocoa>0) |> 
  transmute(
    items = cocoa + swix + orange) |> 
  summarise(
    max_items = max(items),
    min_items = min(items)
  )

2.3 Groups and summarise

# Compare the output from this command: 
sales

# ..to this one: 
sales  |> group_by(customer)

# Note how we in the second command have added some meta-information on groups
# to the data frame. Groups change the results we get when applying
# functions to the data frame. See below: 

sales |> 
  summarise(
    sum.orange = sum(orange),
    no.trans = n()) 

sales |> 
  group_by(customer) |> 
  summarise(
    sum.orange = sum(orange),
    no.trans = n()) 

# When we apply "summarise" to a data frame we are reducing it 
# to the summary statistics that we list in the call to the function. 
# In the command above, this is a sum and a count. There are many
# such functions we can use - just keep in mind that the function
# should return one item per group (or just one item if you don't
# have groups). 

# Make sure you are aware of the difference between mutate and 
# summarise: mutate *adds* a variable to the data frame, 
# summarise aggregates it. Look at the difference below on 
# the result when we change from "summarise" to "mutate":

sales |> 
  group_by(customer) |> 
  mutate(
    sum_orange = sum(orange),
    no_transactions = n()
  )

# Note also that the tibble we get after the mutate is still 
# grouped by customer. That means that if we apply new
# transformations to the tibble, dplyr-functions will in 
# general still obey the grouping. If you want to remove
# the grouping you can do that by applying ungroup to the 
# data frame: 

sales |> 
  group_by(customer) |> 
  mutate(
    sum_orange = sum(orange),
    no_transactions = n()
  ) |> 
  ungroup()

Exercise: How many cocoas were bought in total by customers who also bought more than two oranges? (Hint: How does mutate work on a grouped data frame?)

Expand for solution

sales |> 
  group_by(customer) |> 
  mutate(sum.orange = sum(orange)) |> 
  filter(sum.orange > 2) |> 
  ungroup() |>
  summarise(sum.cocoa = sum(cocoa))

2.4 Persistent or temporary `group_by`

Applying a group_by to a data frame keeps the data frame grouped until either you have summarized the data frame or you apply ungroup to the data frame. This is important to be aware of, many of the Tidyverse-functions behave differently on grouped and ungrouped data sets, and it can be easy to forget to ungroup-it after it is no longer needed.

Recently however, dplyr-functions support a separate .by-argument (or in some cases, by), that allows you to group the data frame directly in a function call instead of grouping in a separate step before. The grouping will then only apply to the called function, and doesn’t have to be undone with a ungroup afterwards.

As an example - two ways below of summing the number of cocoas bought by customer with more than two transactions. Both give the same results, but maybe in this case the .by-argument makes the code both more readable and condensed.

sales |>
  group_by(customer) |>
  filter(n() > 2) |>
  ungroup() |>
  summarise(sum(cocoa))

# A tibble: 1 × 1
  `sum(cocoa)`
         <dbl>
1           25

sales |>
  filter(n() > 2, .by = customer) |>
  summarise(sum(cocoa))

# A tibble: 1 × 1
  `sum(cocoa)`
         <dbl>
1           25

2.5 Joining data frames

Finally, we will join data frames. See the R4DS-book. Joins are necessary when we want to analyse two or more data sets jointly. In our simple example, we have two data sets: one with transactions, and one with customers:

# Read in both sheets
sales <- read_excel("data-geilo.xlsx", sheet="Sales")
customers <- read_excel("data-geilo.xlsx", sheet="Customers")

# See that sales stores transactions, as well as a link to a 
# customer number: 
head(sales)

# A tibble: 6 × 5
  trans customer orange cocoa  swix
  <dbl>    <dbl>  <dbl> <dbl> <dbl>
1     1        6      1     1     1
2     2        3      1     5     0
3     3        9      0     2     1
4     4       NA      2     2     2
5     5       NA      3     1     1
6     6        1      1     3     3

# In the "customers"-file we find info on each customer: 
head(customers)

# A tibble: 6 × 3
  customer hotel    room
     <dbl> <chr>   <dbl>
1        1 Vestlia   394
2        2 Vestlia   489
3        3 Holms     219
4        4 Holms     155
5        5 Holms     204
6        6 Holms       6

dplyr has six join functions, but of these left_join() and inner_join() are probably the most useful. Both functions take need two data frames as arguments, and joins them together into a single data frame. By default, the functions will match together rows based on column names that occur in both data sets.

left_join() returns all values from the first data frame, with all columns and values from the second data frame where there is a match between the two data sets. This means that we don’t keep any records from the second data set that are unmatched to the first data set.
inner_join() keeps only records that are matched in both data sets.

See examples below, as well as the output produced.

x <- tibble(id=c(1,2,3))
y <- tibble(id=c(1,2,4), value=c("a","b","c"))

x |> left_join(y)

Joining with `by = join_by(id)`

# A tibble: 3 × 2
     id value
  <dbl> <chr>
1     1 a    
2     2 b    
3     3 <NA>

x |> right_join(y)

Joining with `by = join_by(id)`

# A tibble: 3 × 2
     id value
  <dbl> <chr>
1     1 a    
2     2 b    
3     4 c

x |> full_join(y)

Joining with `by = join_by(id)`

# A tibble: 4 × 2
     id value
  <dbl> <chr>
1     1 a    
2     2 b    
3     3 <NA> 
4     4 c

x |> inner_join(y)

Joining with `by = join_by(id)`

# A tibble: 2 × 2
     id value
  <dbl> <chr>
1     1 a    
2     2 b

# Filtering joins: 
x |> semi_join(y)

Joining with `by = join_by(id)`

# A tibble: 2 × 1
     id
  <dbl>
1     1
2     2

x |> anti_join(y)

Joining with `by = join_by(id)`

# A tibble: 1 × 1
     id
  <dbl>
1     3

When joining, you should pay careful attention to the join conditions. In the joins above are equivalent to writing the join condition explicitly - which in the case below means that id from the first data set should be equal to id in the second data set to get a match. These join conditions may be exanded to join on e.g. inequalities and multiple columns with different names in the different data sets.

x |> left_join(y, by=join_by(id==id))

# A tibble: 3 × 2
     id value
  <dbl> <chr>
1     1 a    
2     2 b    
3     3 <NA>

Exercise: Use the dplyr join verbs to create the following four different results:

A dataframe with transactions and customer info filled in
A dataframe with transactions for all registered customers
A dataframe with transactions for customers that are not registered
A dataframe that combines all the information in both data sets

Expand for solution

# 1: 
sales |> 
  left_join(customers, by=join_by(customer==customer)) |> 
  arrange(customer) |> 
  head()

# A tibble: 6 × 7
  trans customer orange cocoa  swix hotel    room
  <dbl>    <dbl>  <dbl> <dbl> <dbl> <chr>   <dbl>
1     6        1      1     3     3 Vestlia   394
2    18        1      1     3     1 Vestlia   394
3     2        3      1     5     0 Holms     219
4    15        3      3     3     0 Holms     219
5    16        3      3     1     0 Holms     219
6    17        5      0     5     1 Holms     204

# 2: 
sales |> 
  semi_join(customers, by=join_by(customer==customer)) |> 
  arrange(customer) |> 
  head()

# A tibble: 6 × 5
  trans customer orange cocoa  swix
  <dbl>    <dbl>  <dbl> <dbl> <dbl>
1     6        1      1     3     3
2    18        1      1     3     1
3     2        3      1     5     0
4    15        3      3     3     0
5    16        3      3     1     0
6    17        5      0     5     1

# 3: (why does this give different res. than is.na()?)
sales |> 
  anti_join(customers, by=join_by(customer==customer)) |> 
  arrange(customer) |> 
  head()

# A tibble: 6 × 5
  trans customer orange cocoa  swix
  <dbl>    <dbl>  <dbl> <dbl> <dbl>
1    21       42      5     0     7
2     4       NA      2     2     2
3     5       NA      3     1     1
4     7       NA      1     1     1
5     8       NA      8     2     1
6     9       NA      2     1     1

# 4:
full <- 
  sales  |> 
  full_join(customers, by=join_by(customer==customer))

2.6 Wrapping up

A word of caution on tidyverse: dplyr is somewhat contested among R-users. Some claim it is very easy to learn (although I’m not aware of any studies). Critics argue that it is slow compared to data.table. data.tablecan (sometimes) beat Python/Pandas in terms of speed not, but dplyr can not do that. Note also that the tidyverse ecosystem in general is continually being updated. This means that if you try to run the commands from this course on a fresh install of R and Tidyverse in a few years, it might not work unless you update the syntax.

Think about your usage of programming: If it is occasional scripting, ad-hoc reports etc, then speed is often not important, and occasionally changing syntax might not be an issue.

However, in my experience the tidyverse is significantly “faster” than alternatives for important use cases - which involve many exploratory analyses, data visualizations, and trying out many different ways of modelling a problem (i.e. usually what constrains your time may be how fast you can express what you want - not how long you need to wait for results).

Let’s wrap up this chapter with a final exercise:

Exercise: Create a summary statistics with the following properties:

Customer on rows, with all customers as well as non-registered customers (non-registered can be in a single row)
In addition to customer numbers, four columns:
- Number of transactions in total
- Total sales of each type of item

Expand for solution

# Why doesn't this work!?
#(check e.g. customer nr 2!):
full |>  
  arrange(customer) |> 
  mutate(customer=replace_na(as.character(customer), "Unregistered")) |> 
  group_by(customer) |> 
  summarise(count         = n(),
            sale.orange   = sum(orange  , na.rm=T),
            sale.cocoa    = sum(cocoa   , na.rm=T),
            sale.swix     = sum(swix    , na.rm=T))

# A tibble: 12 × 5
   customer     count sale.orange sale.cocoa sale.swix
   <chr>        <int>       <dbl>      <dbl>     <dbl>
 1 1                2           2          6         4
 2 10               2           5         14         0
 3 2                1           0          0         0
 4 3                3           7          9         0
 5 4                1           0          0         0
 6 42               1           5          0         7
 7 5                1           0          5         1
 8 6                1           1          1         1
 9 7                1           4          0         2
10 8                1           0          0         0
11 9                1           0          2         1
12 Unregistered     9          25         16        10

# A better way: 
full |> 
  arrange(customer) |> 
  mutate(customer=replace_na(as.character(customer), "Unregistered")) |> 
  group_by(customer) |> 
  summarise(
    count         = sum(!is.na(trans)),
    sale.orange   = sum(orange  , na.rm = T),
    sale.cocoa    = sum(cocoa   , na.rm = T),
    sale.swix     = sum(swix    , na.rm = T)
  )

# A tibble: 12 × 5
   customer     count sale.orange sale.cocoa sale.swix
   <chr>        <int>       <dbl>      <dbl>     <dbl>
 1 1                2           2          6         4
 2 10               2           5         14         0
 3 2                0           0          0         0
 4 3                3           7          9         0
 5 4                0           0          0         0
 6 42               1           5          0         7
 7 5                1           0          5         1
 8 6                1           1          1         1
 9 7                1           4          0         2
10 8                0           0          0         0
11 9                1           0          2         1
12 Unregistered     9          25         16        10

2 Data wrangling

2.1 The basics of dplyr

2.2 Selecting and creating new variables, and summarising a data frame

2.3 Groups and summarise

2.4 Persistent or temporary group_by

2.5 Joining data frames

2.6 Wrapping up

2.1 The basics of `dplyr`

2.4 Persistent or temporary `group_by`