In this section, we’ll discuss the contents of R4DS chapter 27. We have already covered writing loops. In this section we’ll expand on that topic using specialized functions for iterating over objects. These functions are perhaps slightly more abstract than explicit loops. However, by using them we can write scripts that may be easier to understand than the bulkiness we would have had with explicit loops.
We’ll first get accustomed to some of the purrr-functions in a few simple examples, before moving on to applying the functions on a more interesting data set.
7.1 Introductory examples
7.1.1 Iterating over columns
Lets start with creating a data set with some random numbers:
library(purrr) library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Say we would like to calculate the median over all the columns in the data set. We could of course do this by typing up the following
median(df$a)
[1] -0.2549904
median(df$b)
[1] 0.1156531
median(df$c)
[1] -0.7929189
median(df$d)
[1] 0.1210732
median(df$e)
[1] 0.4921389
In addition to being verbose, we would need to make changes to these lines if we were to add or subtract columns to our data frame. And further, if we wanted different metrics (e.g. column means as well as medians), we would have to write even more code.
We can however write a function for iterating over columns. Consider the function below:
col_summary <-function(df, fun) { out <-vector("double", length(df))for (i inseq_along(df)) { out[i] <-fun(df[[i]]) } out}
Note a few items:
The loop iterates over seq_along(df), which in our case evaluates to 1,2,3,4,5. This is slightly more robust than iterating over 1:ncols(df), as also works for edge-cases where there are no columns in the data set.
The second argument is a function name. The function applies the function given to each of the columns
We can now call this function a few times, using different summary functions:
With this method we can get summaries of all columns in the data frame, without needing to changing these lines if we add or subtract columns.
7.1.2map-functions
The package purrr comes with functions specifically designed for iterations. The example above could be solved with the function map. See how this also preserves the names of the columns
The output from map-functions is a list. If you expect the return from the map call to be boolan, integer, double or character vectors, you can ensure you do indeed get such a vector in return by using map_lgl, map_int, map_dbl or map_chr-respectively.
If we want the results as e.g. a data frame, we can combine them using an appropriate function - below all the column means are combined into a data frame with one row. Not also how we can add arguments to the function applied by map by adding arguments to the map-call:
df |>map(mean, trim=.1) |>bind_cols()
# A tibble: 1 × 5
a b c d e
<dbl> <dbl> <dbl> <dbl> <dbl>
1 -0.321 -0.00464 -0.591 0.259 0.403
When we call map, we need to supply a single function that will be applied to each of the objects we are iterating over. If we want to apply multiple functions at the same time, we could store this as a new function, and call this once. However, we can also call an anonymous function (see also the “Function” chapter of the compendium”):
df |>map( { \(x) mean(x) /sd(x) } ) |>bind_cols()
# A tibble: 1 × 5
a b c d e
<dbl> <dbl> <dbl> <dbl> <dbl>
1 -0.382 0.0604 -0.762 0.193 0.377
With map functions we can do a lot more interesting stuff than making column summaries. Run the code below yourself line by line. In the example below, we
Use a built-in data frame with cars
Split the data set into a list of data frames, split by the values of the cyl-column
Apply a linear regression model to each of the data frames separately
Summarize each of the regression models
Extract the R^2-statistic from each regression model
The lambda function \(x) x$r.squared returns named element r.squared in x. This is a common operation, so the map-function can also be supplied with a character string instead of a function. The example below does the same as the example below, but with a name-argument instead of a function:
The map-functions also support integer values, in which case e.g. map(x,2) returns the second value of each item of x.
7.1.3 Dealing with errors
The map-functions will generally give an error if they receive an error when applied to any of the items it is applied to. In the example below we get an error due to a character-entry in the list. This might seem frustrating, but it forces you to actively make a choice of how you want to deal with the errors.
x <-list(1, 10, "a")y <- x |>map(log)
Error in `map()`:
ℹ In index: 3.
Caused by error:
! non-numeric argument to mathematical function
We can wrap the function call in the function safely. This may be a useful tool when you are developing code, as it lets you see which records caused errors:
y <- x |>map(safely(log))y |>map("error") |>map(is_null)
[[1]]
[1] TRUE
[[2]]
[1] TRUE
[[3]]
[1] FALSE
possibly is a different option. This function will insert a default value (NA_real below) in case the function call fails.
x |>map(possibly(log, NA_real_))
[[1]]
[1] 0
[[2]]
[1] 2.302585
[[3]]
[1] NA
7.1.4 Iterating over multiple lists
If we want to iterate over two lists (of the same length) simultaneously, we can do so by using map2:
mu <-list(-10000, 0, 10000)sigma <-list(1, 5, 10)map2(mu,sigma,rnorm,n=5)
We can also iterate over more than two lists simultaneously. The lists then need to be combined in a single list, and the names of each list must match the names of the arguments in the function we want to apply.
Let’s apply the purrr-functions on a more realistic data set. We will play with an API from Vegvesenet, the Norwegian governmental road authority. Vegvesenet has an API we can query for data on traffic volumes at many sensor stations in Norway.
The API uses graphQL for requests. Let’s define a function where we can submit queries to an external API (we will not spend time discussing the query language or the API any further..).
library(httr)library(jsonlite)library(ggplot2)library(DescTools)library(tidyverse)library(magrittr)library(rlang)library(lubridate)library(anytime)GQL <-function(query, ...,.token =NULL,.variables =NULL,.operationName =NULL,.url = url) { pbody <-list(query = query,variables = .variables,operationName = .operationName)if (is.null(.token)) { res <-POST(.url, body = pbody, encode ="json", ...) } else { auth_header <-paste("bearer", .token) res <-POST( .url,body = pbody,encode ="json",add_headers(Authorization = auth_header), ... ) } res <-content(res, as ="parsed", encoding ="UTF-8")if (!is.null(res$errors)) {warning(toJSON(res$errors)) } res$data}# The URL we will use is stored below: url <-"https://www.vegvesen.no/trafikkdata/api/"# Let's figure out which sensor stations that are operable. # The query below extracts all the stations, with a date for # when the station was in operation as well as a long/latitude. qry <-'{ trafficRegistrationPoints { id name latestData { volumeByDay } location { coordinates { latLon { lat lon } } } }}'# Allright - let's try submitting the query: stations <-GQL(qry)
We now have the a long list in memory - 14mb! - with just a little information on each station. We can note that this is a list, not a dataframe. For our purposes, it would be better if the list was instead a data frame, with one row pr. sensor station.
Note that the list itself only contains one entry:
length(stations)
[1] 1
…however, this first entry contains 6436 data points. You might get a slightly different answer, as Vegvesenet is changing the number of sensors. Note that when we subset a list, using [[i]] selects the contents of the item i, and [i] is a list.
length(stations[[1]])
[1] 8197
7.2.2 Transforming list-columns
Let’s look at the first entry of this long list. We can see there is a station ID, station name, date time of latest recording from the station and coordinates. This looks like something that could fit well within a data frame, with columns id, name, latestdata, lat, and lon. The question is how! You might want to refer to chapter 24 of R4DS for more on hierarchical data.
We could perhaps hope that we can force this list into a data frame. For this we will use as_tibble:
stations[[1]][[1]] |>as_tibble()
# A tibble: 1 × 4
id name latestData location
<chr> <chr> <named list> <named list>
1 52742V2282262 ØRBEKK EV6 <chr [1]> <named list [1]>
Exercise:
We now want to apply this as_tibble transformation to each of the stations, and combine them in a single data frame. Transform the list into a data frame, with at id and name as columns, and one row per station. We can fix the date time and locations columns later, but use one of the map-functions from purrr.
Expand for solution
Using the map-function we traverse all the entries in the stations list, and transform these lists to data frames. We can then combine the list of data frames into a single data frame by calling list_rbind.
stations[[1]] |>map(as_tibble) |>list_rbind()
# A tibble: 8,197 × 4
id name latestData location
<chr> <chr> <named list> <named list>
1 52742V2282262 ØRBEKK EV6 <chr [1]> <named list [1]>
2 41517V704478 BASTERUD <chr [1]> <named list [1]>
3 01050V1126115 Trollvika <chr [1]> <named list [1]>
4 24748V22148 Pinesund <chr [1]> <named list [1]>
5 11446V1175840 Melsomvik <chr [1]> <named list [1]>
6 99015V249528 Hjelset øst <chr [1]> <named list [1]>
7 78755V249572 Sogge <chr [1]> <named list [1]>
8 83652V319725 Strandgata nord <chr [1]> <named list [1]>
9 51934V2523546 Snarud nord <chr [1]> <named list [1]>
10 65271V443150 Vestby syd ny <chr [1]> <named list [1]>
# ℹ 8,187 more rows
There is still some work left to do with the date time and location columns. As you can see below, they are still in a list format. We can try to pull out the insides of the contents of the latestData-column. It is formatted as a list, but actually only contains one date time entry.
Mutate the contents of the latestData-columns, such that it is in a character format. You don’t have to format it to a proper date time (yet..). This task has two complications: one is to apply transformation to all entries of the list - another is to deal with missing values - that might cause errors.
Expand for solution
We can use map_chr. This function will try its best to return a character vector (other variants support different return types).Below, we are asking map_chr to return the first item of each sub list in latestData. However, this will fail if it meets an entry that does not have anything stored under latestdata!
Error in `mutate()`:
ℹ In argument: `latestData = map_chr(latestData, 1)`.
Caused by error in `map_chr()`:
ℹ In index: 481.
ℹ With name: volumeByDay.
Caused by error:
! Result must be length 1, not 0.
We could write a custom “unlisting”-function. The function below unlists the elements of latestData - if there are any elements there. If it the content is null, the function just returns an empty character string.
# A tibble: 8,197 × 4
id name latestData location
<chr> <chr> <chr> <named list>
1 52742V2282262 ØRBEKK EV6 2024-10-27T00:00:00+02:00 <named list [1]>
2 41517V704478 BASTERUD 2017-11-14T00:00:00+01:00 <named list [1]>
3 01050V1126115 Trollvika 2024-10-27T00:00:00+02:00 <named list [1]>
4 24748V22148 Pinesund 2024-10-27T00:00:00+02:00 <named list [1]>
5 11446V1175840 Melsomvik 2024-10-27T00:00:00+02:00 <named list [1]>
6 99015V249528 Hjelset øst 2023-05-09T00:00:00+02:00 <named list [1]>
7 78755V249572 Sogge 2024-10-27T00:00:00+02:00 <named list [1]>
8 83652V319725 Strandgata nord 2024-08-20T00:00:00+02:00 <named list [1]>
9 51934V2523546 Snarud nord 2024-10-27T00:00:00+02:00 <named list [1]>
10 65271V443150 Vestby syd ny 2024-10-27T00:00:00+02:00 <named list [1]>
# ℹ 8,187 more rows
Alternatively, we can use the defaults in map_chr. It will now have a safe fallback value it can use if it doesn’t find the element we are looking for in latestData. A simple solution is to use the .default-argument, and set this to missing:
# A tibble: 8,197 × 4
id name latestData location
<chr> <chr> <chr> <named list>
1 52742V2282262 ØRBEKK EV6 2024-10-27T00:00:00+02:00 <named list [1]>
2 41517V704478 BASTERUD 2017-11-14T00:00:00+01:00 <named list [1]>
3 01050V1126115 Trollvika 2024-10-27T00:00:00+02:00 <named list [1]>
4 24748V22148 Pinesund 2024-10-27T00:00:00+02:00 <named list [1]>
5 11446V1175840 Melsomvik 2024-10-27T00:00:00+02:00 <named list [1]>
6 99015V249528 Hjelset øst 2023-05-09T00:00:00+02:00 <named list [1]>
7 78755V249572 Sogge 2024-10-27T00:00:00+02:00 <named list [1]>
8 83652V319725 Strandgata nord 2024-08-20T00:00:00+02:00 <named list [1]>
9 51934V2523546 Snarud nord 2024-10-27T00:00:00+02:00 <named list [1]>
10 65271V443150 Vestby syd ny 2024-10-27T00:00:00+02:00 <named list [1]>
# ℹ 8,187 more rows
7.2.3 Transforming time
Next, let’s format the date format. Date formats can be tricky, but is an obstacle you just have to learn to work with. We can reformat the latestData column into a date by simply using as.Date - however - we now have lost information on the time of day. Let’s see if we can retain all the information in the column.
# A tibble: 8,197 × 4
id name latestData location
<chr> <chr> <date> <named list>
1 52742V2282262 ØRBEKK EV6 2024-10-27 <named list [1]>
2 41517V704478 BASTERUD 2017-11-14 <named list [1]>
3 01050V1126115 Trollvika 2024-10-27 <named list [1]>
4 24748V22148 Pinesund 2024-10-27 <named list [1]>
5 11446V1175840 Melsomvik 2024-10-27 <named list [1]>
6 99015V249528 Hjelset øst 2023-05-09 <named list [1]>
7 78755V249572 Sogge 2024-10-27 <named list [1]>
8 83652V319725 Strandgata nord 2024-08-20 <named list [1]>
9 51934V2523546 Snarud nord 2024-10-27 <named list [1]>
10 65271V443150 Vestby syd ny 2024-10-27 <named list [1]>
# ℹ 8,187 more rows
There are several functions we can use to transform the string into a date time variable. as_datetime in lubridate works in this case. Note that the interpretation of dates may be dependent on the time zone settings on your laptop. Here, we are explicitly stating that we want the a Europe/Berlin tz on the variable:
# A tibble: 8,197 × 4
id name latestData location
<chr> <chr> <dttm> <named list>
1 52742V2282262 ØRBEKK EV6 2024-10-27 00:00:00 <named list [1]>
2 41517V704478 BASTERUD 2017-11-14 00:00:00 <named list [1]>
3 01050V1126115 Trollvika 2024-10-27 00:00:00 <named list [1]>
4 24748V22148 Pinesund 2024-10-27 00:00:00 <named list [1]>
5 11446V1175840 Melsomvik 2024-10-27 00:00:00 <named list [1]>
6 99015V249528 Hjelset øst 2023-05-09 00:00:00 <named list [1]>
7 78755V249572 Sogge 2024-10-27 00:00:00 <named list [1]>
8 83652V319725 Strandgata nord 2024-08-20 00:00:00 <named list [1]>
9 51934V2523546 Snarud nord 2024-10-27 00:00:00 <named list [1]>
10 65271V443150 Vestby syd ny 2024-10-27 00:00:00 <named list [1]>
# ℹ 8,187 more rows
Exercise: Finalizing the transformation
Let’s take on the final location variable. Complete the operation by unpacking the location column into two columns: lat and lon. You may use the functions you have already seen, or see of you can find mode specialized functions.
Note: This a nested list i.e. the contents of a cell in location is a list with one entry. This list contains two other lists..
The script should return a data frame similar to the one below (only the first few entries shown).
id
name
latestData
lat
lon
52742V2282262
ØRBEKK EV6
2024-10-27
60.41426
11.241171
41517V704478
BASTERUD
2017-11-14
60.76916
11.171156
01050V1126115
Trollvika
2024-10-27
69.23921
17.981692
24748V22148
Pinesund
2024-10-27
58.79845
9.054728
11446V1175840
Melsomvik
2024-10-27
59.22947
10.338938
99015V249528
Hjelset øst
2023-05-09
62.78733
7.521145
Expand for solution
We can use a similar solution we used before. First we use unlist to remove one level from the list, and then extract the contents using map_dbl - remember these are numbers, not text.