Assignment 2

Note

This assignment is to be handed in as an .R-file through Canvas. Your answer will be reviewed by a teaching assistant as well as two fellow students. Do take the time to condsider the feedback. You will also receive two random answers from fellow students for review. Try to find one positive aspect as well as one suggestion for improvement in each of the answers. You must complete both peer reviews in order to pass the assignment.

The first part of the assignment is intended to give you to practice writing pipes using a Tidyverse data set where there are good online resources. You don’t have to hand in the first part. The second part uses a survey data set, and should be handed in.

Part 1

We’ll use a data set from Tidyverse of all flights departing from New York City. The data set can be installed as a package install.packages("nycflights13"). After calling library(nycflights13) you should have a data set flights available.

You may want to skim over R4DS on transformations. This chapter contains both a good introduction to the flights data as well as a recap on many of the concepts used in class on data wrangling.

library(nycflights13) 
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(tidyr)
flights
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Assignment 1: Complete the first exercise of 4.2.5 in R4DS (repeated below):

In a single pipeline for each condition, find all flights that meet the condition:

  1. Had an arrival delay of two or more hours
  2. Flew to Houston (IAH or HOU)
  3. Were operated by United, American, or Delta
  4. Departed in summer (July, August, and September)
  5. Arrived more than two hours late, but didn’t leave late
  6. Were delayed by at least an hour, but made up over 30 minutes in flight

See also this solution manual for alternative ways of solving the assignments.

# 1
flights |> 
  filter(arr_delay >= 120)
# A tibble: 10,200 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      811            630       101     1047            830
 2  2013     1     1      848           1835       853     1001           1950
 3  2013     1     1      957            733       144     1056            853
 4  2013     1     1     1114            900       134     1447           1222
 5  2013     1     1     1505           1310       115     1638           1431
 6  2013     1     1     1525           1340       105     1831           1626
 7  2013     1     1     1549           1445        64     1912           1656
 8  2013     1     1     1558           1359       119     1718           1515
 9  2013     1     1     1732           1630        62     2028           1825
10  2013     1     1     1803           1620       103     2008           1750
# ℹ 10,190 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
# 2
flights |> 
  filter(dest %in% c("IAH", "HOU"))
# A tibble: 9,313 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      623            627        -4      933            932
 4  2013     1     1      728            732        -4     1041           1038
 5  2013     1     1      739            739         0     1104           1038
 6  2013     1     1      908            908         0     1228           1219
 7  2013     1     1     1028           1026         2     1350           1339
 8  2013     1     1     1044           1045        -1     1352           1351
 9  2013     1     1     1114            900       134     1447           1222
10  2013     1     1     1205           1200         5     1503           1505
# ℹ 9,303 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
# 3
flights |> 
  filter(carrier %in% c("UA", "AA", "DL"))
# A tibble: 139,504 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      554            600        -6      812            837
 5  2013     1     1      554            558        -4      740            728
 6  2013     1     1      558            600        -2      753            745
 7  2013     1     1      558            600        -2      924            917
 8  2013     1     1      558            600        -2      923            937
 9  2013     1     1      559            600        -1      941            910
10  2013     1     1      559            600        -1      854            902
# ℹ 139,494 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
# 4
flights |> 
  filter(month %in% c(7, 8, 9))
# A tibble: 86,326 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     7     1        1           2029       212      236           2359
 2  2013     7     1        2           2359         3      344            344
 3  2013     7     1       29           2245       104      151              1
 4  2013     7     1       43           2130       193      322             14
 5  2013     7     1       44           2150       174      300            100
 6  2013     7     1       46           2051       235      304           2358
 7  2013     7     1       48           2001       287      308           2305
 8  2013     7     1       58           2155       183      335             43
 9  2013     7     1      100           2146       194      327             30
10  2013     7     1      100           2245       135      337            135
# ℹ 86,316 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
# 5
flights |> 
  filter(dep_delay <=0 ) |> 
  filter(arr_delay >= 120)
# A tibble: 29 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1    27     1419           1420        -1     1754           1550
 2  2013    10     7     1350           1350         0     1736           1526
 3  2013    10     7     1357           1359        -2     1858           1654
 4  2013    10    16      657            700        -3     1258           1056
 5  2013    11     1      658            700        -2     1329           1015
 6  2013     3    18     1844           1847        -3       39           2219
 7  2013     4    17     1635           1640        -5     2049           1845
 8  2013     4    18      558            600        -2     1149            850
 9  2013     4    18      655            700        -5     1213            950
10  2013     5    22     1827           1830        -3     2217           2010
# ℹ 19 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
# 6
flights |> 
  filter(dep_delay >= 60) |> 
  mutate(delay_flight_decrease = dep_delay - arr_delay) |> 
  filter(delay_flight_decrease > 30)
# A tibble: 1,844 × 20
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1     2205           1720       285       46           2040
 2  2013     1     1     2326           2130       116      131             18
 3  2013     1     3     1503           1221       162     1803           1555
 4  2013     1     3     1839           1700        99     2056           1950
 5  2013     1     3     1850           1745        65     2148           2120
 6  2013     1     3     1941           1759       102     2246           2139
 7  2013     1     3     1950           1845        65     2228           2227
 8  2013     1     3     2015           1915        60     2135           2111
 9  2013     1     3     2257           2000       177       45           2224
10  2013     1     4     1917           1700       137     2135           1950
# ℹ 1,834 more rows
# ℹ 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>, delay_flight_decrease <dbl>

Assignment 2: Answer the questions below

  1. Speed is distance divided by airtime. Use the planes dataframe (included in the nycflights13-package) to find the fastest and slowest aircraft model measured by speed in average km/h. Your answer should be a tibble containing the fastest and slowest plane
# 1
flights |> 
  inner_join(planes, by=join_by(tailnum)) |> 
  mutate(speed_kmh = 1.609344 * distance / (air_time/60)) |> 
  summarise(avg_speed_kmh = mean(speed_kmh, na.rm=TRUE), .by=model) |> 
  filter(avg_speed_kmh %in% c(max(avg_speed_kmh), min(avg_speed_kmh)))
# A tibble: 2 × 2
  model       avg_speed_kmh
  <chr>               <dbl>
1 CL-600-2B19          501.
2 777-222              777.

Part 2: Survey data

For this homework you will practice your data wrangling skills using a survey data set that you can download here: data-ESS8NO.dta. The data comes from the European Social Survey, and is available at [europeansocialsurvey.org] in various formats. Further, the data from ESS contains many variables with abbreviated/encoded names. The file contains survey responses from Norway in 2016. To get you started with the assignment, you may use the commands below to read in the data set. You may need to install the package foreign first in order to use the read.dta function.

library(foreign)

df <- 
  read.dta("data-ESS8NO.dta") %>%
  transmute(
    party = prtvtbno, 
    age = agea, 
    religiosity = rlgdgr, 
    income_decile = hinctnta)

The variables of interest prtvtbno, agea, rlgdgr and hinctnta are defined in the attached documentation, together with more information on the actual questions. To keep it simple, we rename them to party, age, religiosity and income_decile.

  1. Provide summary statistics of age of respondents split by the party the respondents voted for last election. Who has the oldest/youngest voters? Where is the standard deviation of voters age the largest? Do not report numbers for parties with less than 25 respondents.

  2. The variables religiosity, income_decile and party are encoded as factors (take a look at ?factor for en explanation). Further, they contain some non-responses such as “Don’t know”, “Refusal” and “No answer”. Find a method for filtering out all the non-responses. How many respondents did not provide an eligible response to each of the questions? How many answered both the party and income question?

  3. Filter out all ineligible responses for both income and party. Calculate the average religiosity of each party. Provide your answer as a data frame with one row pr. party, sorted by average religiosity.

  4. (Slightly trickier!) For each party with more than 75 voters, calculate the ratio of the number of respondents in income deciles 9 and 10 over the number of respondents in income deciles 1 and 2. Which party has the highest high- to low-income voters?

For completeness: When working with survey data, we usually have to apply weights to ensure estimates are representative of the population. This is because a survey sample may be a non-random sample of the general population. The survey data set provides the weights. You don’t have to worry about weights in this assignment, but please store the link “survey data -> weights?!?” in your mind for future work.