5 Chapter 5

5.1 Chapter 5.2.4 Exercises

5.1.1 Question 1: Find all flights that:

1.1 Had an arrival delay of two or more hours

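delay_more_2 <- filter(nycflights13::flights, arr_delay >= 120)

10034 flights had an arrival delay of more than 2 hours (arr_delay > 120); the >= comparison above additionally counts flights delayed by exactly 120 minutes.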

1.2 Flew to Houston (IAH or HOU)

flew_to_houston <- filter(nycflights13::flights, dest == "IAH" | dest == "HOU")

9313 flights flew to a Houston airport

1.3 Were operated by United, American, or Delta

un_am_del <- filter(nycflights13::flights, carrier == "AA" | carrier == "DL" | carrier == "UA")

139504 flights were operated by United, American, or Delta

1.4 Departed in summer (July, August, and September)

summer_flight <- filter(nycflights13::flights, month %in% c(7,8,9))

86326 flights departed in summer

1.5 Arrived more than two hours late, but didn’t leave late

arr_2_late_dep_on_time <- filter(nycflights13::flights, dep_delay <= 0 & arr_delay > 120)

29 flights arrived more than 2 hours late, but left on time or early.

1.6 Were delayed by at least an hour, but made up over 30 minutes in flight

made_up_30 <- filter(nycflights13::flights, dep_delay >= 60 & ((dep_delay - arr_delay) > 30))

1844 flights were delayed by at least an hour but made up over 30 minutes in air

1.7 Departed between midnight and 6am (inclusive)

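overnight_flight_dep <- filter(nycflights13::flights, dep_time <= 600 | dep_time == 2400)

dep_time is stored as a clock time in HHMM form, so midnight is 2400 and 6am is 600. Filtering on dep_time %in% c(12,1,2,3,4,5,6) matches only departures at 00:01 to 00:06 and 00:12, which is 177 flights rather than all departures between midnight and 6am.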

5.1.2 Question 2: Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?

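between(x, left, right) is a shortcut for x >= left & x <= right: it keeps the values of a numeric vector that fall within a range, including both endpoints. You could simplify the last exercise (1.7) by:

overnight_dep <- filter(nycflights13::flights, between(dep_time, 1, 600) | dep_time == 2400)

Likewise, between(month, 7, 9) could replace month %in% c(7,8,9) in 1.4.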
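As a quick check of the inclusive endpoints, the sketch below applies between() to a few made-up HHMM values. The dep_time vector here is illustrative and is not taken from flights.

```r
library(dplyr)

# Hypothetical departure times in HHMM form (not real flights data)
dep_time <- c(1, 559, 600, 601, 2400)

# between() is inclusive on both ends, so 1 and 600 are both kept
in_range <- between(dep_time, 1, 600)
in_range

# Midnight is recorded as 2400, so it needs its own condition
overnight <- in_range | dep_time == 2400
overnight
```

Only 601 falls outside the overnight window in this toy vector.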

5.1.3 Question 3: How many flights have a missing dep_time? What other variables are missing? What might these rows represent?

missing_dep_time <- filter(nycflights13::flights, is.na(dep_time))

8255 flights have a missing departure time

These flights are also missing dep_delay, arr_time, arr_delay, and air_time, so they most likely represent cancelled flights.

5.1.4 Question 4: Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)

NA ^ 0 is 1 because any number raised to the 0th power is 1, whatever the missing value turns out to be. NA | TRUE is TRUE because anything OR TRUE is TRUE. FALSE & NA is FALSE because anything AND FALSE is FALSE. The general rule is that the result is not missing whenever it would be the same for every possible value of NA. NA * 0 is NA because the result is not the same for every value: a finite number times 0 is 0, but Inf * 0 is NaN.
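These cases can be checked directly in base R. The variable names below are only labels for this sketch.

```r
# Results that are the same for every possible value of NA are not missing
power_result <- NA ^ 0      # anything to the 0th power is 1
or_result    <- NA | TRUE   # anything OR TRUE is TRUE
and_result   <- FALSE & NA  # FALSE AND anything is FALSE

# Counterexample: x * 0 is not 0 for every x, so NA * 0 stays missing
inf_result   <- Inf * 0     # NaN, not 0
times_result <- NA * 0      # NA

print(list(power_result, or_result, and_result, inf_result, times_result))
```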

5.2 Chapter 5.3.1 Exercises

5.2.1 Question 1: How could you use arrange() to sort all missing values to the start? (Hint: use is.na()).

arrange(flights, desc(is.na(flights)))
## # A tibble: 336,776 × 22
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest 
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>
##  1  2013     1     2       NA           1545        NA       NA           1910        NA AA         133 <NA>    JFK    LAX  
##  2  2013     1     2       NA           1601        NA       NA           1735        NA UA         623 <NA>    EWR    ORD  
##  3  2013     1     3       NA            857        NA       NA           1209        NA UA         714 <NA>    EWR    MIA  
##  4  2013     1     3       NA            645        NA       NA            952        NA UA         719 <NA>    EWR    DFW  
##  5  2013     1     4       NA            845        NA       NA           1015        NA 9E        3405 <NA>    JFK    DCA  
##  6  2013     1     4       NA           1830        NA       NA           2044        NA 9E        3716 <NA>    EWR    DTW  
##  7  2013     1     5       NA            840        NA       NA           1001        NA 9E        3422 <NA>    JFK    BOS  
##  8  2013     1     7       NA            820        NA       NA            958        NA 9E        3317 <NA>    JFK    BUF  
##  9  2013     1     8       NA           1645        NA       NA           1838        NA US         123 <NA>    EWR    CLT  
## 10  2013     1     9       NA            755        NA       NA           1012        NA 9E        4023 <NA>    EWR    CVG  
## # … with 336,766 more rows, and 8 more variables: air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>, speed <dbl>, min_since_mid_dep_time <dbl>, min_since_mid_sched_dep_time <dbl>
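is.na(flights) evaluates to a logical matrix covering every column, so it is usually clearer to sort on the missingness of one named column. A minimal sketch on a made-up tibble (toy and its values are hypothetical):

```r
library(dplyr)

# Hypothetical data: two rows have a missing dep_time
toy <- tibble(
  id       = 1:4,
  dep_time = c(517L, NA, 533L, NA)
)

# is.na(dep_time) is TRUE for missing rows; desc() sorts TRUE before FALSE,
# and dep_time then orders the remaining rows
na_first <- arrange(toy, desc(is.na(dep_time)), dep_time)
na_first
```

The same pattern applied to flights would be arrange(flights, desc(is.na(dep_time)), dep_time).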

5.2.2 Question 2: Sort flights to find the most delayed flights. Find the flights that left earliest.

most_delay <- arrange(flights, desc(dep_delay))

most_delay %>% 
  arrange(dep_time)
## # A tibble: 336,776 × 22
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest 
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>
##  1  2013     4    10        1           1930       271      106           2101       245 UA        1703 N33203  EWR    BOS  
##  2  2013     5    22        1           1935       266      154           2140       254 EV        4361 N27200  EWR    TYS  
##  3  2013     6    24        1           1950       251      105           2130       215 AA         363 N546AA  LGA    ORD  
##  4  2013     7     1        1           2029       212      236           2359       157 B6         915 N653JB  JFK    SFO  
##  5  2013     1    31        1           2100       181      124           2225       179 WN         530 N550WN  LGA    MDW  
##  6  2013     2    11        1           2100       181      111           2225       166 WN         530 N231WN  LGA    MDW  
##  7  2013     3    18        1           2128       153      247           2355       172 B6          97 N760JB  JFK    DEN  
##  8  2013     6    25        1           2130       151      249             14       155 B6        1371 N607JB  LGA    FLL  
##  9  2013     2    24        1           2245        76      121           2354        87 B6         608 N216JB  JFK    PWM  
## 10  2013     1    13        1           2249        72      108           2357        71 B6          22 N206JB  JFK    SYR  
## # … with 336,766 more rows, and 8 more variables: air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>, speed <dbl>, min_since_mid_dep_time <dbl>, min_since_mid_sched_dep_time <dbl>

5.2.3 Question 3: Sort flights to find the fastest (highest speed) flights.

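# hour is the scheduled departure hour, not a duration, so speed is
# computed from air_time (minutes) and converted to miles per hour
flights <- flights %>% 
  mutate(speed = distance / air_time * 60)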

fastest_flights <- arrange(flights, desc(speed))

5.2.4 Question 4: Which flights travelled the farthest? Which travelled the shortest?

far <- arrange(flights, desc(distance))

short <- arrange(flights, distance)

5.3 Chapter 5.4.1 Exercises

5.3.1 Question 1: Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

flights %>% select(matches("^dep_"),matches("^arr_"))
## # A tibble: 336,776 × 4
##    dep_time dep_delay arr_time arr_delay
##       <int>     <dbl>    <int>     <dbl>
##  1      517         2      830        11
##  2      533         4      850        20
##  3      542         2      923        33
##  4      544        -1     1004       -18
##  5      554        -6      812       -25
##  6      554        -4      740        12
##  7      555        -5      913        19
##  8      557        -3      709       -14
##  9      557        -3      838        -8
## 10      558        -2      753         8
## # … with 336,766 more rows
flights %>% select(dep_time, dep_delay, arr_time, arr_delay)
## # A tibble: 336,776 × 4
##    dep_time dep_delay arr_time arr_delay
##       <int>     <dbl>    <int>     <dbl>
##  1      517         2      830        11
##  2      533         4      850        20
##  3      542         2      923        33
##  4      544        -1     1004       -18
##  5      554        -6      812       -25
##  6      554        -4      740        12
##  7      555        -5      913        19
##  8      557        -3      709       -14
##  9      557        -3      838        -8
## 10      558        -2      753         8
## # … with 336,766 more rows
# you can also select by column position number
flights %>%  select(4,6,7,9)
## # A tibble: 336,776 × 4
##    dep_time dep_delay arr_time arr_delay
##       <int>     <dbl>    <int>     <dbl>
##  1      517         2      830        11
##  2      533         4      850        20
##  3      542         2      923        33
##  4      544        -1     1004       -18
##  5      554        -6      812       -25
##  6      554        -4      740        12
##  7      555        -5      913        19
##  8      557        -3      709       -14
##  9      557        -3      838        -8
## 10      558        -2      753         8
## # … with 336,766 more rows

These are the reasonable ways to do this. You could also do impractical things, such as dropping every column except the ones you want.

5.3.2 Question 2: What happens if you include the name of a variable multiple times in a select() call?

flights %>% select(year, year, month,day, year)
## # A tibble: 336,776 × 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # … with 336,766 more rows

select() ignores the repeats: the variable appears only once in the result, at the position of its first mention, regardless of how many times you name it.

5.3.3 Question 3: What does the any_of() function do? Why might it be helpful in conjunction with this vector?

vars <- c("year", "month", "day", "dep_delay", "arr_delay")
flights %>%  select(any_of(vars))
## # A tibble: 336,776 × 5
##     year month   day dep_delay arr_delay
##    <int> <int> <int>     <dbl>     <dbl>
##  1  2013     1     1         2        11
##  2  2013     1     1         4        20
##  3  2013     1     1         2        33
##  4  2013     1     1        -1       -18
##  5  2013     1     1        -6       -25
##  6  2013     1     1        -4        12
##  7  2013     1     1        -5        19
##  8  2013     1     1        -3       -14
##  9  2013     1     1        -3        -8
## 10  2013     1     1        -2         8
## # … with 336,766 more rows

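any_of() selects the variables named in a character vector and, unlike all_of(), does not raise an error when some of those names are missing from the data. This makes it useful with a vector like vars, which can be stored once and reused on data frames that may not contain every listed column.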
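The difference from all_of() shows up when a name is absent. In this sketch, toy is a made-up tibble and "tail_speed" is a deliberately nonexistent column name.

```r
library(dplyr)

# Hypothetical one-row tibble standing in for flights
toy <- tibble(year = 2013L, month = 1L, dep_delay = 2)

# "tail_speed" is a made-up name that is not a column of toy
vars <- c("year", "dep_delay", "tail_speed")

# any_of() keeps the names that exist and skips the rest
picked <- select(toy, any_of(vars))
names(picked)

# all_of() is strict: the same vector raises an error, caught here
strict <- tryCatch(select(toy, all_of(vars)), error = function(e) "error")
strict
```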

5.3.4 Question 4: Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?

select(flights, contains("TIME"))
## # A tibble: 336,776 × 8
##    dep_time sched_dep_time arr_time sched_arr_time air_time time_hour           min_since_mid_dep_time min_since_mid_sched_…
##       <int>          <int>    <int>          <int>    <dbl> <dttm>                               <dbl>                 <dbl>
##  1      517            515      830            819      227 2013-01-01 05:00:00                    317                   315
##  2      533            529      850            830      227 2013-01-01 05:00:00                    333                   329
##  3      542            540      923            850      160 2013-01-01 05:00:00                    342                   340
##  4      544            545     1004           1022      183 2013-01-01 05:00:00                    344                   345
##  5      554            600      812            837      116 2013-01-01 06:00:00                    354                   360
##  6      554            558      740            728      150 2013-01-01 05:00:00                    354                   358
##  7      555            600      913            854      158 2013-01-01 06:00:00                    355                   360
##  8      557            600      709            723       53 2013-01-01 06:00:00                    357                   360
##  9      557            600      838            846      140 2013-01-01 06:00:00                    357                   360
## 10      558            600      753            745      138 2013-01-01 06:00:00                    358                   360
## # … with 336,766 more rows

No. contains() matches names regardless of case by default, so the call returns every variable whose name contains the string “time”. The select helpers default to ignore.case = TRUE, so the capitalization in the code doesn’t affect the output. You can set ignore.case = FALSE to make the match case sensitive.
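A small sketch of both settings, using a made-up tibble (toy is hypothetical) so the effect is easy to see:

```r
library(dplyr)

# Hypothetical one-row tibble with two "time" columns
toy <- tibble(dep_time = 517L, air_time = 227, carrier = "UA")

# Default ignore.case = TRUE: "TIME" matches dep_time and air_time
insensitive <- select(toy, contains("TIME"))
names(insensitive)

# Case-sensitive matching finds no upper-case "TIME" in the names,
# so the result has zero columns
sensitive <- select(toy, contains("TIME", ignore.case = FALSE))
names(sensitive)
```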

5.4 Chapter 5.5.2 Exercises

5.4.1 Question 1: Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.

# %% 1440 maps midnight, recorded as 2400, to 0 rather than 1440
flights <- flights %>% 
  mutate(min_since_mid_dep_time = (dep_time %/% 100 * 60 + dep_time %% 100) %% 1440)

flights <- flights %>% 
  mutate(min_since_mid_sched_dep_time = (sched_dep_time %/% 100 * 60 + sched_dep_time %% 100) %% 1440)
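The arithmetic can be checked on a few values with a small base R helper. hhmm_to_minutes is a hypothetical name introduced for this sketch. The final %% 1440 wraps 2400 to 0 so that midnight does not become 1440.

```r
# Hypothetical helper: convert HHMM clock times to minutes since midnight
hhmm_to_minutes <- function(x) {
  hours   <- x %/% 100          # integer division drops the minute digits
  minutes <- x %% 100           # remainder keeps the minute digits
  (hours * 60 + minutes) %% 1440
}

# Made-up sample values, including both sides of midnight
converted <- hhmm_to_minutes(c(1, 517, 1545, 2359, 2400))
converted
```

For example, 517 (5:17am) becomes 5 * 60 + 17 = 317 minutes.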

5.4.2 Question 2: Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?

head(flights$air_time)
## [1] 227 227 160 183 116 150
airtime2 <- flights$arr_time - flights$dep_time
head(airtime2)
## [1] 313 317 381 460 258 186

The air_time values are smaller than the arr_time - dep_time values. This is because arr_time and dep_time are not stored as minutes since midnight but as clock times with the hour and minute digits run together (e.g. 315 = 3:15), so subtracting them does not give a duration in minutes. air_time is the total time spent in the air, in minutes. Therefore, to fix this, you should convert arr_time and dep_time to minutes since midnight before subtracting.
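A sketch of that conversion for the first flight shown above (dep_time 517, arr_time 830, air_time 227). hhmm_to_minutes is a hypothetical helper, and the outer %% 1440 is an assumption that handles arrivals after midnight.

```r
# Hypothetical helper: HHMM clock time to minutes since midnight
hhmm_to_minutes <- function(x) (x %/% 100 * 60 + x %% 100) %% 1440

# Values for the first flight in the output above
dep_time <- 517
arr_time <- 830

# Naive subtraction treats HHMM as a plain number
naive <- arr_time - dep_time

# Convert first, then subtract; %% 1440 keeps overnight flights positive
elapsed <- (hhmm_to_minutes(arr_time) - hhmm_to_minutes(dep_time)) %% 1440

c(naive = naive, elapsed = elapsed)
```

Even after conversion the result need not equal air_time. dep_time and arr_time are recorded in local time, so flights that cross time zones are shifted, and the gate-to-gate interval also includes time on the ground.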