7 Iterations in R

In this section, we’ll discuss the contents of R4DS chapter 27. We have already covered writing loops. In this section we’ll expand on that topic using specialized functions for iterating over objects. These functions are perhaps slightly more abstract than explicit loops. However, by using them we can write scripts that may be easier to understand than the bulkiness we would have had with explicit loops.

We’ll first get accustomed to some of the purrr-functions in a few simple examples, before moving on to applying the functions on a more interesting data set.

7.1 Introductory examples

7.1.1 Iterating over columns

Lets start with creating a data set with some random numbers:

library(purrr) 
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
df <-
  tibble(
    a = rnorm(10),
    b = rnorm(10),
    c = rnorm(10),
    d = rnorm(10),
    e = rnorm(10)
  )

Say we would like to calculate the median over all the columns in the data set. We could of course do this by typing up the following

median(df$a) 
[1] -0.2549904
median(df$b) 
[1] 0.1156531
median(df$c) 
[1] -0.7929189
median(df$d) 
[1] 0.1210732
median(df$e)
[1] 0.4921389

In addition to being verbose, we would need to make changes to these lines if we were to add or subtract columns to our data frame. And further, if we wanted different metrics (e.g. column means as well as medians), we would have to write even more code.

We can however write a function for iterating over columns. Consider the function below:

col_summary <- function(df, fun) {
  out <- vector("double", length(df))
  for (i in seq_along(df)) {
    out[i] <- fun(df[[i]])
  }
  out
}

Note a few items:

  • The loop iterates over seq_along(df), which in our case evaluates to 1,2,3,4,5. This is slightly more robust than iterating over 1:ncols(df), as also works for edge-cases where there are no columns in the data set.
  • The second argument is a function name. The function applies the function given to each of the columns

We can now call this function a few times, using different summary functions:

col_summary(df, mean)
[1] -0.35540469  0.07154759 -0.54961928  0.19098639  0.39904506
col_summary(df, median)
[1] -0.2549904  0.1156531 -0.7929189  0.1210732  0.4921389
col_summary(df, sd)
[1] 0.9315551 1.1836507 0.7216736 0.9884983 1.0570810

With this method we can get summaries of all columns in the data frame, without needing to changing these lines if we add or subtract columns.

7.1.2 map-functions

The package purrr comes with functions specifically designed for iterations. The example above could be solved with the function map. See how this also preserves the names of the columns

map(df, mean)
$a
[1] -0.3554047

$b
[1] 0.07154759

$c
[1] -0.5496193

$d
[1] 0.1909864

$e
[1] 0.3990451

The output from map-functions is a list. If you expect the return from the map call to be boolan, integer, double or character vectors, you can ensure you do indeed get such a vector in return by using map_lgl, map_int, map_dbl or map_chr-respectively.

If we want the results as e.g. a data frame, we can combine them using an appropriate function - below all the column means are combined into a data frame with one row. Not also how we can add arguments to the function applied by map by adding arguments to the map-call:

df |> 
  map(mean, trim=.1) |> 
  bind_cols()
# A tibble: 1 × 5
       a        b      c     d     e
   <dbl>    <dbl>  <dbl> <dbl> <dbl>
1 -0.321 -0.00464 -0.591 0.259 0.403

When we call map, we need to supply a single function that will be applied to each of the objects we are iterating over. If we want to apply multiple functions at the same time, we could store this as a new function, and call this once. However, we can also call an anonymous function (see also the “Function” chapter of the compendium”):

df |>
  map(
    {
      \(x) mean(x) / sd(x)
    }
  ) |> 
  bind_cols()
# A tibble: 1 × 5
       a      b      c     d     e
   <dbl>  <dbl>  <dbl> <dbl> <dbl>
1 -0.382 0.0604 -0.762 0.193 0.377

With map functions we can do a lot more interesting stuff than making column summaries. Run the code below yourself line by line. In the example below, we

  1. Use a built-in data frame with cars
  2. Split the data set into a list of data frames, split by the values of the cyl-column
  3. Apply a linear regression model to each of the data frames separately
  4. Summarize each of the regression models
  5. Extract the R^2-statistic from each regression model
  6. Combine the R^2-values into a data frame
mtcars |>                       
  split( ~ cyl) |>              
  map({
    \(x)lm(mpg ~ wt, data = x)
  }) |>
  map(summary) |>
  map({
    \(x) x$r.squared
  }) |>
  bind_cols()
# A tibble: 1 × 3
    `4`   `6`   `8`
  <dbl> <dbl> <dbl>
1 0.509 0.465 0.423

The lambda function \(x) x$r.squared returns named element r.squared in x. This is a common operation, so the map-function can also be supplied with a character string instead of a function. The example below does the same as the example below, but with a name-argument instead of a function:

mtcars |>  
  split(~cyl) |> 
  map({\(x)lm(mpg~wt, data=x) }) |> 
  map(summary) |> 
  map("r.squared")|>
  bind_cols()
# A tibble: 1 × 3
    `4`   `6`   `8`
  <dbl> <dbl> <dbl>
1 0.509 0.465 0.423

The map-functions also support integer values, in which case e.g. map(x,2) returns the second value of each item of x.

7.1.3 Dealing with errors

The map-functions will generally give an error if they receive an error when applied to any of the items it is applied to. In the example below we get an error due to a character-entry in the list. This might seem frustrating, but it forces you to actively make a choice of how you want to deal with the errors.

x <- list(1, 10, "a")
y <- x |> map(log)
Error in `map()`:
ℹ In index: 3.
Caused by error:
! non-numeric argument to mathematical function

We can wrap the function call in the function safely. This may be a useful tool when you are developing code, as it lets you see which records caused errors:

y <- x |> map(safely(log))

y |> 
  map("error") |> 
  map(is_null)
[[1]]
[1] TRUE

[[2]]
[1] TRUE

[[3]]
[1] FALSE

possibly is a different option. This function will insert a default value (NA_real below) in case the function call fails.

x |> map(possibly(log, NA_real_))
[[1]]
[1] 0

[[2]]
[1] 2.302585

[[3]]
[1] NA

7.1.4 Iterating over multiple lists

If we want to iterate over two lists (of the same length) simultaneously, we can do so by using map2:

mu <- list(-10000, 0, 10000)
sigma <- list(1, 5, 10)
map2(mu,sigma,rnorm,n=5)
[[1]]
[1]  -9999.896  -9999.235  -9999.259 -10000.069 -10000.398

[[2]]
[1]  6.4291250 -1.1532396  0.1775573  1.8272161 -7.1082074

[[3]]
[1] 10011.649 10008.968  9996.794  9995.780 10001.380

We can also iterate over more than two lists simultaneously. The lists then need to be combined in a single list, and the names of each list must match the names of the arguments in the function we want to apply.

mu    <- list(-10000, 0, 100)
sigma <- list(     1, 5,  10)
n     <- list(     1, 10, 25)

list(mean = mu,
     sd = sigma,
     n = n) |> 
  pmap(rnorm)
[[1]]
[1] -9999.734

[[2]]
 [1] -0.7574731 -1.1369954  0.5353316 12.6889708  2.3795611 -4.1293553
 [7]  3.5308724  3.2065482 -9.2857618 -1.4602046

[[3]]
 [1] 105.09036 111.03809  89.55327  98.01925 105.96586 109.58224 103.05550
 [8] 111.29103 102.94797  88.68532  97.73465 106.88412 109.12837 101.58881
[15]  89.88079  97.03530 104.71919  99.65387  97.19263  86.82698 116.34584
[22]  83.62085  90.35835  77.57743  97.04066

7.2 purrr and Traffic data

7.2.1 Getting some data

Let’s apply the purrr-functions on a more realistic data set. We will play with an API from Vegvesenet, the Norwegian governmental road authority. Vegvesenet has an API we can query for data on traffic volumes at many sensor stations in Norway.

The API uses graphQL for requests. Let’s define a function where we can submit queries to an external API (we will not spend time discussing the query language or the API any further..).

library(httr)
library(jsonlite)
library(ggplot2)
library(DescTools)
library(tidyverse)
library(magrittr)
library(rlang)
library(lubridate)
library(anytime)

GQL <- function(query,
                ...,
                .token = NULL,
                .variables = NULL,
                .operationName = NULL,
                .url = url) {
  pbody <-
    list(query = query,
         variables = .variables,
         operationName = .operationName)
  if (is.null(.token)) {
    res <- POST(.url, body = pbody, encode = "json", ...)
  } else {
    auth_header <- paste("bearer", .token)
    res <-
      POST(
        .url,
        body = pbody,
        encode = "json",
        add_headers(Authorization = auth_header),
        ...
      )
  }
  res <- content(res, as = "parsed", encoding = "UTF-8")
  if (!is.null(res$errors)) {
    warning(toJSON(res$errors))
  }
  res$data
}

# The URL we will use is stored below: 
url <- "https://www.vegvesen.no/trafikkdata/api/"


# Let's figure out which sensor stations that are operable. 
# The query below extracts all the stations, with a date for 
# when the station was in operation as well as a long/latitude. 
qry <-
  '
{
    trafficRegistrationPoints {
        id
        name
        latestData {
            volumeByDay
        }
        location {
            coordinates {
                latLon {
                    lat
                    lon
                }
            }
        }
    }
}
'

# Allright - let's try submitting the query: 
stations <-GQL(qry)

We now have the a long list in memory - 14mb! - with just a little information on each station. We can note that this is a list, not a dataframe. For our purposes, it would be better if the list was instead a data frame, with one row pr. sensor station.

Note that the list itself only contains one entry:

length(stations)
[1] 1

…however, this first entry contains 6436 data points. You might get a slightly different answer, as Vegvesenet is changing the number of sensors. Note that when we subset a list, using [[i]] selects the contents of the item i, and [i] is a list.

length(stations[[1]])
[1] 8197

7.2.2 Transforming list-columns

Let’s look at the first entry of this long list. We can see there is a station ID, station name, date time of latest recording from the station and coordinates. This looks like something that could fit well within a data frame, with columns id, name, latestdata, lat, and lon. The question is how! You might want to refer to chapter 24 of R4DS for more on hierarchical data.

stations[[1]][[1]]
$id
[1] "52742V2282262"

$name
[1] "ØRBEKK EV6"

$latestData
$latestData$volumeByDay
[1] "2024-10-27T00:00:00+02:00"


$location
$location$coordinates
$location$coordinates$latLon
$location$coordinates$latLon$lat
[1] 60.41426

$location$coordinates$latLon$lon
[1] 11.24117

We could perhaps hope that we can force this list into a data frame. For this we will use as_tibble:

stations[[1]][[1]] |>  
  as_tibble()
# A tibble: 1 × 4
  id            name       latestData   location        
  <chr>         <chr>      <named list> <named list>    
1 52742V2282262 ØRBEKK EV6 <chr [1]>    <named list [1]>

Exercise:

We now want to apply this as_tibble transformation to each of the stations, and combine them in a single data frame. Transform the list into a data frame, with at id and name as columns, and one row per station. We can fix the date time and locations columns later, but use one of the map-functions from purrr.

Using the map-function we traverse all the entries in the stations list, and transform these lists to data frames. We can then combine the list of data frames into a single data frame by calling list_rbind.

stations[[1]] |> 
  map(as_tibble) |> 
  list_rbind()
# A tibble: 8,197 × 4
   id            name            latestData   location        
   <chr>         <chr>           <named list> <named list>    
 1 52742V2282262 ØRBEKK EV6      <chr [1]>    <named list [1]>
 2 41517V704478  BASTERUD        <chr [1]>    <named list [1]>
 3 01050V1126115 Trollvika       <chr [1]>    <named list [1]>
 4 24748V22148   Pinesund        <chr [1]>    <named list [1]>
 5 11446V1175840 Melsomvik       <chr [1]>    <named list [1]>
 6 99015V249528  Hjelset øst     <chr [1]>    <named list [1]>
 7 78755V249572  Sogge           <chr [1]>    <named list [1]>
 8 83652V319725  Strandgata nord <chr [1]>    <named list [1]>
 9 51934V2523546 Snarud nord     <chr [1]>    <named list [1]>
10 65271V443150  Vestby syd ny   <chr [1]>    <named list [1]>
# ℹ 8,187 more rows

There is still some work left to do with the date time and location columns. As you can see below, they are still in a list format. We can try to pull out the insides of the contents of the latestData-column. It is formatted as a list, but actually only contains one date time entry.

stations[[1]] |>  
  map(as_tibble) |> 
  list_rbind() |> 
  head(1) |>  
  select(latestData) |>  
  pull()
$volumeByDay
[1] "2024-10-27T00:00:00+02:00"

Exercise:

Mutate the contents of the latestData-columns, such that it is in a character format. You don’t have to format it to a proper date time (yet..). This task has two complications: one is to apply transformation to all entries of the list - another is to deal with missing values - that might cause errors.

We can use map_chr. This function will try its best to return a character vector (other variants support different return types).Below, we are asking map_chr to return the first item of each sub list in latestData. However, this will fail if it meets an entry that does not have anything stored under latestdata!

stations[[1]] |> 
  map(as_tibble) |> 
  list_rbind() |>
  mutate(latestData = map_chr(latestData, 1))
Error in `mutate()`:
ℹ In argument: `latestData = map_chr(latestData, 1)`.
Caused by error in `map_chr()`:
ℹ In index: 481.
ℹ With name: volumeByDay.
Caused by error:
! Result must be length 1, not 0.

We could write a custom “unlisting”-function. The function below unlists the elements of latestData - if there are any elements there. If it the content is null, the function just returns an empty character string.

unlist_safe <- 
  function(x){
    x <- unlist(x)
    if(is.null(x)){
      return(NA_character_)
  }else{
    return(x)
  }
}

stations[[1]] |> 
  map(as_tibble) |> 
  list_rbind() |> 
  mutate(latestData = map_chr(latestData, unlist_safe))
# A tibble: 8,197 × 4
   id            name            latestData                location        
   <chr>         <chr>           <chr>                     <named list>    
 1 52742V2282262 ØRBEKK EV6      2024-10-27T00:00:00+02:00 <named list [1]>
 2 41517V704478  BASTERUD        2017-11-14T00:00:00+01:00 <named list [1]>
 3 01050V1126115 Trollvika       2024-10-27T00:00:00+02:00 <named list [1]>
 4 24748V22148   Pinesund        2024-10-27T00:00:00+02:00 <named list [1]>
 5 11446V1175840 Melsomvik       2024-10-27T00:00:00+02:00 <named list [1]>
 6 99015V249528  Hjelset øst     2023-05-09T00:00:00+02:00 <named list [1]>
 7 78755V249572  Sogge           2024-10-27T00:00:00+02:00 <named list [1]>
 8 83652V319725  Strandgata nord 2024-08-20T00:00:00+02:00 <named list [1]>
 9 51934V2523546 Snarud nord     2024-10-27T00:00:00+02:00 <named list [1]>
10 65271V443150  Vestby syd ny   2024-10-27T00:00:00+02:00 <named list [1]>
# ℹ 8,187 more rows

Alternatively, we can use the defaults in map_chr. It will now have a safe fallback value it can use if it doesn’t find the element we are looking for in latestData. A simple solution is to use the .default-argument, and set this to missing:

stations[[1]] |> 
  map(as_tibble) |> 
  list_rbind() |> 
  mutate(
    latestData = map_chr(latestData,1, .default=NA_character_)
  ) 
# A tibble: 8,197 × 4
   id            name            latestData                location        
   <chr>         <chr>           <chr>                     <named list>    
 1 52742V2282262 ØRBEKK EV6      2024-10-27T00:00:00+02:00 <named list [1]>
 2 41517V704478  BASTERUD        2017-11-14T00:00:00+01:00 <named list [1]>
 3 01050V1126115 Trollvika       2024-10-27T00:00:00+02:00 <named list [1]>
 4 24748V22148   Pinesund        2024-10-27T00:00:00+02:00 <named list [1]>
 5 11446V1175840 Melsomvik       2024-10-27T00:00:00+02:00 <named list [1]>
 6 99015V249528  Hjelset øst     2023-05-09T00:00:00+02:00 <named list [1]>
 7 78755V249572  Sogge           2024-10-27T00:00:00+02:00 <named list [1]>
 8 83652V319725  Strandgata nord 2024-08-20T00:00:00+02:00 <named list [1]>
 9 51934V2523546 Snarud nord     2024-10-27T00:00:00+02:00 <named list [1]>
10 65271V443150  Vestby syd ny   2024-10-27T00:00:00+02:00 <named list [1]>
# ℹ 8,187 more rows

7.2.3 Transforming time

Next, let’s format the date format. Date formats can be tricky, but is an obstacle you just have to learn to work with. We can reformat the latestData column into a date by simply using as.Date - however - we now have lost information on the time of day. Let’s see if we can retain all the information in the column.

stations[[1]] |> 
  map(as_tibble) |> 
  list_rbind() |> 
  mutate(latestData = map_chr(latestData, 1, .default=NA_character_)) |> 
  mutate(latestData = as.Date(latestData))
# A tibble: 8,197 × 4
   id            name            latestData location        
   <chr>         <chr>           <date>     <named list>    
 1 52742V2282262 ØRBEKK EV6      2024-10-27 <named list [1]>
 2 41517V704478  BASTERUD        2017-11-14 <named list [1]>
 3 01050V1126115 Trollvika       2024-10-27 <named list [1]>
 4 24748V22148   Pinesund        2024-10-27 <named list [1]>
 5 11446V1175840 Melsomvik       2024-10-27 <named list [1]>
 6 99015V249528  Hjelset øst     2023-05-09 <named list [1]>
 7 78755V249572  Sogge           2024-10-27 <named list [1]>
 8 83652V319725  Strandgata nord 2024-08-20 <named list [1]>
 9 51934V2523546 Snarud nord     2024-10-27 <named list [1]>
10 65271V443150  Vestby syd ny   2024-10-27 <named list [1]>
# ℹ 8,187 more rows

There are several functions we can use to transform the string into a date time variable. as_datetime in lubridate works in this case. Note that the interpretation of dates may be dependent on the time zone settings on your laptop. Here, we are explicitly stating that we want the a Europe/Berlin tz on the variable:

stations[[1]] |> 
  map(as_tibble) |> 
  list_rbind() |> 
  mutate(latestData = map_chr(latestData, 1, .default = NA_character_)) |> 
  mutate(latestData = as_datetime(latestData, tz = "Europe/Berlin")) 
# A tibble: 8,197 × 4
   id            name            latestData          location        
   <chr>         <chr>           <dttm>              <named list>    
 1 52742V2282262 ØRBEKK EV6      2024-10-27 00:00:00 <named list [1]>
 2 41517V704478  BASTERUD        2017-11-14 00:00:00 <named list [1]>
 3 01050V1126115 Trollvika       2024-10-27 00:00:00 <named list [1]>
 4 24748V22148   Pinesund        2024-10-27 00:00:00 <named list [1]>
 5 11446V1175840 Melsomvik       2024-10-27 00:00:00 <named list [1]>
 6 99015V249528  Hjelset øst     2023-05-09 00:00:00 <named list [1]>
 7 78755V249572  Sogge           2024-10-27 00:00:00 <named list [1]>
 8 83652V319725  Strandgata nord 2024-08-20 00:00:00 <named list [1]>
 9 51934V2523546 Snarud nord     2024-10-27 00:00:00 <named list [1]>
10 65271V443150  Vestby syd ny   2024-10-27 00:00:00 <named list [1]>
# ℹ 8,187 more rows

Exercise: Finalizing the transformation

Let’s take on the final location variable. Complete the operation by unpacking the location column into two columns: lat and lon. You may use the functions you have already seen, or see of you can find mode specialized functions.

Note: This a nested list i.e. the contents of a cell in location is a list with one entry. This list contains two other lists..

The script should return a data frame similar to the one below (only the first few entries shown).

id name latestData lat lon
52742V2282262 ØRBEKK EV6 2024-10-27 60.41426 11.241171
41517V704478 BASTERUD 2017-11-14 60.76916 11.171156
01050V1126115 Trollvika 2024-10-27 69.23921 17.981692
24748V22148 Pinesund 2024-10-27 58.79845 9.054728
11446V1175840 Melsomvik 2024-10-27 59.22947 10.338938
99015V249528 Hjelset øst 2023-05-09 62.78733 7.521145

We can use a similar solution we used before. First we use unlist to remove one level from the list, and then extract the contents using map_dbl - remember these are numbers, not text.

stations[[1]] |> 
  map(as_tibble) |> 
  list_rbind() |> 
  mutate(latestData = map_chr(latestData, 1, .default = ""))  |> 
  mutate(latestData = as_datetime(latestData, tz = "Europe/Berlin"))  |> 
  mutate(location = map(location, unlist)) |>  
  mutate(
    lat = map_dbl(location, "latLon.lat"),
    lon = map_dbl(location, "latLon.lon")
  ) %>% 
  select(-location)
# A tibble: 8,197 × 5
   id            name            latestData            lat   lon
   <chr>         <chr>           <dttm>              <dbl> <dbl>
 1 52742V2282262 ØRBEKK EV6      2024-10-27 00:00:00  60.4 11.2 
 2 41517V704478  BASTERUD        2017-11-14 00:00:00  60.8 11.2 
 3 01050V1126115 Trollvika       2024-10-27 00:00:00  69.2 18.0 
 4 24748V22148   Pinesund        2024-10-27 00:00:00  58.8  9.05
 5 11446V1175840 Melsomvik       2024-10-27 00:00:00  59.2 10.3 
 6 99015V249528  Hjelset øst     2023-05-09 00:00:00  62.8  7.52
 7 78755V249572  Sogge           2024-10-27 00:00:00  62.5  7.71
 8 83652V319725  Strandgata nord 2024-08-20 00:00:00  58.9  5.74
 9 51934V2523546 Snarud nord     2024-10-27 00:00:00  60.8 11.0 
10 65271V443150  Vestby syd ny   2024-10-27 00:00:00  59.6 10.7 
# ℹ 8,187 more rows

Alternatively, we can use unnest_wider twice. This one does some work for us, and gives the same result:

stations[[1]] |> 
  map(as_tibble) |> 
  list_rbind() |> 
  mutate(latestData = map_chr(latestData, 1, .default = NA_character_)) |> 
  mutate(latestData = as_datetime(latestData, tz = "Europe/Berlin"))  |> 
  unnest_wider(location) |> 
  unnest_wider(latLon)
# A tibble: 8,197 × 5
   id            name            latestData            lat   lon
   <chr>         <chr>           <dttm>              <dbl> <dbl>
 1 52742V2282262 ØRBEKK EV6      2024-10-27 00:00:00  60.4 11.2 
 2 41517V704478  BASTERUD        2017-11-14 00:00:00  60.8 11.2 
 3 01050V1126115 Trollvika       2024-10-27 00:00:00  69.2 18.0 
 4 24748V22148   Pinesund        2024-10-27 00:00:00  58.8  9.05
 5 11446V1175840 Melsomvik       2024-10-27 00:00:00  59.2 10.3 
 6 99015V249528  Hjelset øst     2023-05-09 00:00:00  62.8  7.52
 7 78755V249572  Sogge           2024-10-27 00:00:00  62.5  7.71
 8 83652V319725  Strandgata nord 2024-08-20 00:00:00  58.9  5.74
 9 51934V2523546 Snarud nord     2024-10-27 00:00:00  60.8 11.0 
10 65271V443150  Vestby syd ny   2024-10-27 00:00:00  59.6 10.7 
# ℹ 8,187 more rows