5 Chapter 5
5.1 Chapter 5.2.4 Exercises
5.1.1 Question 1: Find all flights that:
1.1 Had an arrival delay of two or more hours
10034 flights had an arrival delay of 2 or more hours
1.2 Flew to Houston (IAH or HOU)
9313 flights flew to a Houston airport
1.3 Were operated by United, American, or Delta
139504 flights were operated by United, American, or Delta
1.4 Departed in summer (July, August, and September)
86326 flights departed in summer
1.5 Arrived more than two hours late, but didn’t leave late
29 flights arrived more than 2 hours late, but left on time or early.
1.6 Were delayed by at least an hour, but made up over 30 minutes in flight
1844 flights were delayed by at least an hour but made up over 30 minutes in the air
1.7 Departed between midnight and 6am (inclusive)
177 flights left between midnight and 6am
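The seven conditions above can be sketched as filter() calls. This is a minimal illustration on a made-up toy table (flights_toy and its values are hypothetical, not the real flights data), and it assumes that midnight is recorded as 2400 in dep_time. The same calls should apply to flights, but the counts reported above are not reproduced here.

```r
library(dplyr)

# Toy stand-in for flights; all values are invented for illustration.
flights_toy <- tibble(
  month     = c(1, 7, 8, 12),
  dep_time  = c(517, 2400, 1530, 600),
  dep_delay = c(2, 75, -3, 0),
  arr_delay = c(130, 40, 125, -10),
  carrier   = c("UA", "B6", "DL", "AA"),
  dest      = c("IAH", "BOS", "HOU", "ORD")
)

# 1.1 Arrival delay of two or more hours (arr_delay is in minutes)
late_two_hours <- filter(flights_toy, arr_delay >= 120)
# 1.2 Flew to Houston
to_houston <- filter(flights_toy, dest %in% c("IAH", "HOU"))
# 1.3 Operated by United, American, or Delta
big_three <- filter(flights_toy, carrier %in% c("UA", "AA", "DL"))
# 1.4 Departed in July, August, or September
summer <- filter(flights_toy, month %in% 7:9)
# 1.5 Arrived more than two hours late, but did not leave late
late_but_on_time <- filter(flights_toy, arr_delay > 120, dep_delay <= 0)
# 1.6 Delayed at least an hour, but made up over 30 minutes in flight
made_up_time <- filter(flights_toy, dep_delay >= 60, dep_delay - arr_delay > 30)
# 1.7 Departed between midnight and 6am (assumes midnight is stored as 2400)
midnight_to_six <- filter(flights_toy, dep_time <= 600 | dep_time == 2400)
```

Wrapping any of these in nrow() gives the number of matching flights.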
5.1.2 Question 2: Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?
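between() tests whether each value of a numeric vector falls within a given range, inclusive of both endpoints, so filter() keeps only the rows inside that range. You could simplify the last exercise (1.7) by expressing the range check on dep_time as a single between() call.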
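As a small sketch of what that simplification might look like, the example below compares between() with the equivalent pair of comparisons on a made-up vector of departure times (the values are hypothetical).

```r
library(dplyr)

# Hypothetical departure times in HHMM form, including a missing value.
dep_time <- c(1, 517, 600, 601, 2400, NA)

# between() is inclusive on both ends ...
with_between <- between(dep_time, 0, 600)
# ... so it matches this pair of comparisons.
with_comparisons <- dep_time >= 0 & dep_time <= 600
```

Inside filter(), this would read filter(flights, between(dep_time, 0, 600)). A flight recorded at 2400 falls outside this range and would need a separate condition.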
5.1.3 Question 3: How many flights have a missing dep_time? What other variables are missing? What might these rows represent?
8255 flights have a missing departure time
These flights are also missing dep_delay and arr_time, so they may represent cancelled flights.
5.1.4 Question 4: Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)
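NA ^ 0 is 1 because any number raised to the power 0 is 1, so the result does not depend on the missing value. NA | TRUE is TRUE because an "or" with TRUE is TRUE whatever the other operand is. FALSE & NA is FALSE because an "and" with FALSE is FALSE whatever the other operand is. The general rule is that an operation returns a non-missing result whenever the result would be the same for every possible value of the NA. NA * 0 is the counterexample: it is NA because the missing value could be infinite, and Inf * 0 is NaN rather than 0.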
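The four cases can be checked directly in base R. The variable names below are only for illustration.

```r
# Result is the same whatever value the NA stands for, so it is not missing.
power_zero <- NA^0         # 1
or_true    <- NA | TRUE    # TRUE
and_false  <- FALSE & NA   # FALSE

# Result depends on the unknown value, so it stays missing.
times_zero <- NA * 0       # NA

# The reason NA * 0 cannot safely be 0: the unknown value might be infinite.
inf_times_zero <- Inf * 0  # NaN
```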
5.2 Chapter 5.3.1 Exercises
5.2.1 Question 1: How could you use arrange() to sort all missing values to the start? (Hint: use is.na()).
## # A tibble: 336,776 × 22
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr>
## 1 2013 1 2 NA 1545 NA NA 1910 NA AA 133 <NA> JFK LAX
## 2 2013 1 2 NA 1601 NA NA 1735 NA UA 623 <NA> EWR ORD
## 3 2013 1 3 NA 857 NA NA 1209 NA UA 714 <NA> EWR MIA
## 4 2013 1 3 NA 645 NA NA 952 NA UA 719 <NA> EWR DFW
## 5 2013 1 4 NA 845 NA NA 1015 NA 9E 3405 <NA> JFK DCA
## 6 2013 1 4 NA 1830 NA NA 2044 NA 9E 3716 <NA> EWR DTW
## 7 2013 1 5 NA 840 NA NA 1001 NA 9E 3422 <NA> JFK BOS
## 8 2013 1 7 NA 820 NA NA 958 NA 9E 3317 <NA> JFK BUF
## 9 2013 1 8 NA 1645 NA NA 1838 NA US 123 <NA> EWR CLT
## 10 2013 1 9 NA 755 NA NA 1012 NA 9E 4023 <NA> EWR CVG
## # … with 336,766 more rows, and 8 more variables: air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>, speed <dbl>, min_since_mid_dep_time <dbl>, min_since_mid_sched_dep_time <dbl>
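Output like the above can be produced by sorting on is.na() in descending order, so that TRUE (missing) comes before FALSE. The sketch below uses a made-up toy table (flights_toy is hypothetical), and the choice of dep_time as the column to test is an assumption.

```r
library(dplyr)

# Toy stand-in for flights; values are invented for illustration.
flights_toy <- tibble(
  flight   = c(101, 102, 103, 104),
  dep_time = c(517, NA, 1545, NA)
)

# desc(is.na(dep_time)) puts the missing rows first;
# dep_time then orders the remaining rows.
missing_first <- arrange(flights_toy, desc(is.na(dep_time)), dep_time)
```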
5.2.2 Question 2: Sort flights to find the most delayed flights. Find the flights that left earliest.
## # A tibble: 336,776 × 22
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr>
## 1 2013 4 10 1 1930 271 106 2101 245 UA 1703 N33203 EWR BOS
## 2 2013 5 22 1 1935 266 154 2140 254 EV 4361 N27200 EWR TYS
## 3 2013 6 24 1 1950 251 105 2130 215 AA 363 N546AA LGA ORD
## 4 2013 7 1 1 2029 212 236 2359 157 B6 915 N653JB JFK SFO
## 5 2013 1 31 1 2100 181 124 2225 179 WN 530 N550WN LGA MDW
## 6 2013 2 11 1 2100 181 111 2225 166 WN 530 N231WN LGA MDW
## 7 2013 3 18 1 2128 153 247 2355 172 B6 97 N760JB JFK DEN
## 8 2013 6 25 1 2130 151 249 14 155 B6 1371 N607JB LGA FLL
## 9 2013 2 24 1 2245 76 121 2354 87 B6 608 N216JB JFK PWM
## 10 2013 1 13 1 2249 72 108 2357 71 B6 22 N206JB JFK SYR
## # … with 336,766 more rows, and 8 more variables: air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>, speed <dbl>, min_since_mid_dep_time <dbl>, min_since_mid_sched_dep_time <dbl>
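The output above looks sorted by dep_time and then by decreasing dep_delay, although the original call is not shown, so that reading is an assumption. The sketch below shows that ordering alongside the two other natural interpretations of the question, on a made-up toy table (flights_toy is hypothetical).

```r
library(dplyr)

# Toy stand-in for flights; values are invented for illustration.
flights_toy <- tibble(
  flight    = c(11, 12, 13, 14),
  dep_time  = c(558, 1, 1, 2249),
  dep_delay = c(-2, 266, 271, 72)
)

# Most delayed flights first.
most_delayed <- arrange(flights_toy, desc(dep_delay))

# Earliest clock time first, ties broken by largest delay.
earliest_clock <- arrange(flights_toy, dep_time, desc(dep_delay))

# Earliest relative to schedule (most negative dep_delay first).
earliest_vs_schedule <- arrange(flights_toy, dep_delay)
```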
5.2.3 Question 3: Sort flights to find the fastest (highest speed) flights.
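A minimal sketch of one way to approach this, assuming speed is defined as distance divided by air_time (miles per minute, scaled here to miles per hour). The toy table and the definition of speed are assumptions for illustration, not necessarily how the speed column listed in the outputs above was computed.

```r
library(dplyr)

# Toy stand-in for flights; distance in miles, air_time in minutes (invented values).
flights_toy <- tibble(
  flight   = c(21, 22, 23),
  distance = c(200, 1000, 2475),
  air_time = c(40, 150, 330)
)

# Compute speed in miles per hour, then sort from fastest to slowest.
fastest <- flights_toy |>
  mutate(speed = distance / air_time * 60) |>
  arrange(desc(speed))
```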
5.3 Chapter 5.4.1 Exercises
5.3.1 Question 1: Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.
## # A tibble: 336,776 × 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # … with 336,766 more rows
## # A tibble: 336,776 × 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # … with 336,766 more rows
## # A tibble: 336,776 × 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # … with 336,766 more rows
*These are the reasonable ways to do this. You could also do ridiculous things, like excluding every column name except the ones you want.
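The calls that produced the three outputs are not shown, so the sketch below lists a few candidate ways to select the four columns. It uses a made-up one-row table (flights_toy is hypothetical), and each call returns the same columns in the same order.

```r
library(dplyr)

# Toy stand-in for flights; values are invented for illustration.
flights_toy <- tibble(
  year = 2013, dep_time = 517, dep_delay = 2,
  arr_time = 830, arr_delay = 11, carrier = "UA"
)

# By bare column names.
by_name <- select(flights_toy, dep_time, dep_delay, arr_time, arr_delay)

# By a character vector of names.
by_string <- select(
  flights_toy,
  all_of(c("dep_time", "dep_delay", "arr_time", "arr_delay"))
)

# By shared prefixes.
by_prefix <- select(flights_toy, starts_with("dep_"), starts_with("arr_"))

# By a regular expression.
by_regex <- select(flights_toy, matches("^(dep|arr)_(time|delay)$"))
```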
5.3.2 Question 2: What happens if you include the name of a variable multiple times in a select() call?
## # A tibble: 336,776 × 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # … with 336,766 more rows
select() includes the variable only once, at the position of its first mention, regardless of how many times you name it in the call.
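A small check of this behaviour on a made-up one-row table (flights_toy is hypothetical):

```r
library(dplyr)

# Toy stand-in for flights; values are invented for illustration.
flights_toy <- tibble(year = 2013, month = 1, day = 1, dep_time = 517)

# year is named twice but appears once.
repeated_last <- select(flights_toy, year, month, day, year)

# The first mention decides the position.
repeated_first <- select(flights_toy, day, year, day)
```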
5.3.3 Question 3: What does the any_of() function do? Why might it be helpful in conjunction with this vector?
## # A tibble: 336,776 × 5
## year month day dep_delay arr_delay
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 2 11
## 2 2013 1 1 4 20
## 3 2013 1 1 2 33
## 4 2013 1 1 -1 -18
## 5 2013 1 1 -6 -25
## 6 2013 1 1 -4 12
## 7 2013 1 1 -5 19
## 8 2013 1 1 -3 -14
## 9 2013 1 1 -3 -8
## 10 2013 1 1 -2 8
## # … with 336,766 more rows
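any_of() selects the variables named in a character vector and does not raise an error if some of those names are missing from the data. That makes it helpful with a vector of column names, which can be stored once and reused.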
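The sketch below contrasts any_of() with the stricter all_of(). The contents of vars are inferred from the column names in the output above and are therefore an assumption. The toy table (flights_toy is hypothetical) deliberately lacks arr_delay.

```r
library(dplyr)

# Assumed contents of the vector, inferred from the output columns.
vars <- c("year", "month", "day", "dep_delay", "arr_delay")

# Toy table with no arr_delay column; values are invented for illustration.
flights_toy <- tibble(year = 2013, month = 1, day = 1, dep_delay = 2)

# any_of() silently skips names that are not in the data.
lenient <- select(flights_toy, any_of(vars))

# all_of() raises an error for the same input; capture it instead of stopping.
strict <- tryCatch(
  select(flights_toy, all_of(vars)),
  error = function(e) e
)
```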
5.3.4 Question 4: Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?
## # A tibble: 336,776 × 8
## dep_time sched_dep_time arr_time sched_arr_time air_time time_hour min_since_mid_dep_time min_since_mid_sched_…
## <int> <int> <int> <int> <dbl> <dttm> <dbl> <dbl>
## 1 517 515 830 819 227 2013-01-01 05:00:00 317 315
## 2 533 529 850 830 227 2013-01-01 05:00:00 333 329
## 3 542 540 923 850 160 2013-01-01 05:00:00 342 340
## 4 544 545 1004 1022 183 2013-01-01 05:00:00 344 345
## 5 554 600 812 837 116 2013-01-01 06:00:00 354 360
## 6 554 558 740 728 150 2013-01-01 05:00:00 354 358
## 7 555 600 913 854 158 2013-01-01 06:00:00 355 360
## 8 557 600 709 723 53 2013-01-01 06:00:00 357 360
## 9 557 600 838 846 140 2013-01-01 06:00:00 357 360
## 10 558 600 753 745 138 2013-01-01 06:00:00 358 360
## # … with 336,766 more rows
No. The code returns every variable whose name contains the string “time”. The select helpers ignore case by default (ignore.case = TRUE), so the capitalization in the code does not affect the output. You can set ignore.case = FALSE to make the match case-sensitive.
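The effect of ignore.case can be seen on a made-up one-row table (flights_toy is hypothetical). The upper-case pattern "TIME" is assumed to be the one used in the exercise.

```r
library(dplyr)

# Toy stand-in for flights; values are invented for illustration.
flights_toy <- tibble(dep_time = 517, air_time = 227, carrier = "UA")

# Default: case is ignored, so "TIME" matches dep_time and air_time.
case_insensitive <- select(flights_toy, contains("TIME"))

# Case-sensitive: no column name contains upper-case "TIME".
case_sensitive <- select(flights_toy, contains("TIME", ignore.case = FALSE))
```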
5.4 Chapter 5.5.2 Exercises
5.4.1 Question 1: Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.
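A minimal sketch of the conversion in base R. The helper name to_minutes is hypothetical, and treating 2400 as midnight (minute 0) is an assumption. The results agree with the min_since_mid_dep_time values shown in the earlier output (for example, 517 becomes 317).

```r
# Convert an HHMM clock time (e.g. 517 = 5:17) to minutes since midnight.
# %/% 100 extracts the hours, %% 100 the minutes; the final %% 1440
# maps 2400 (midnight) to 0.
to_minutes <- function(hhmm) {
  (hhmm %/% 100 * 60 + hhmm %% 100) %% 1440
}

minutes <- to_minutes(c(517, 533, 600, 2400, NA))
```

With dplyr, the same helper could be applied to both columns inside mutate(), for example mutate(flights, dep_minutes = to_minutes(dep_time)), where dep_minutes is an illustrative column name.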
5.4.2 Question 2: Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?
## [1] 227 227 160 183 116 150
## [1] 313 317 381 460 258 186
The air_time values are smaller than the arr_time - dep_time values. This is because arr_time and dep_time are not stored as minutes since midnight but as clock times in HHMM form (e.g. 315 = 3:15), while air_time is the total time spent in the air, in minutes. Therefore, to fix this, you should convert both times to minutes since midnight before subtracting.
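The sketch below applies that fix to the first three flights, using dep_time, arr_time, and air_time values taken from the outputs earlier in this chapter. The helper to_minutes is hypothetical. Even after conversion the elapsed time need not equal air_time, because arr_time is recorded in the destination's local time and the interval between the two clock times also includes time on the ground.

```r
# Convert an HHMM clock time to minutes since midnight (2400 maps to 0).
to_minutes <- function(hhmm) {
  (hhmm %/% 100 * 60 + hhmm %% 100) %% 1440
}

dep_time <- c(517, 533, 542)
arr_time <- c(830, 850, 923)
air_time <- c(227, 227, 160)

# Naive difference of HHMM values: not a number of minutes.
naive <- arr_time - dep_time

# Difference in minutes; %% 1440 keeps overnight flights non-negative.
elapsed <- (to_minutes(arr_time) - to_minutes(dep_time)) %% 1440
```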