ETC1010 - ETC5510 - Introduction to data analysis - S1 2025

This question is about tidy temporal data. Below are total daily pedestrian traffic counts for the month of March for 2020 and 2019.

walk_daily_counts %>% 
  arrange(day, month)

## # A tibble: 62 × 6
##    Date       Count  year month   day wday 
##    <date>     <int> <dbl> <ord> <int> <ord>
##  1 2019-03-01 34485  2019 Mar       1 Fri  
##  2 2020-03-01 26840  2020 Mar       1 Sun  
##  3 2019-03-02 33896  2019 Mar       2 Sat  
##  4 2020-03-02 27900  2020 Mar       2 Mon  
##  5 2019-03-03 27036  2019 Mar       3 Sun  
##  6 2020-03-03 28003  2020 Mar       3 Tue  
##  7 2019-03-04 33865  2019 Mar       4 Mon  
##  8 2020-03-04 27949  2020 Mar       4 Wed  
##  9 2019-03-05 34463  2019 Mar       5 Tue  
## 10 2020-03-05 24936  2020 Mar       5 Thu  
## # … with 52 more rows

To compare daily pedestrian counts in March 2019 with those in March 2020, we could use a scatterplot. But first we would need to pivot the data so that the daily counts for each year form their own column.

Fill in the blanks for the following code to get our desired output:

walk_daily_counts_wide <- walk_daily_counts %>%
  # (a) which pivot function
  # (b) which id_cols
  # (c) which column forms names_from
  # (d) which column forms values_from
  pivot_---(id_cols = ---, names_from = ---, values_from = ---)

## # A tibble: 31 × 3
##      day `2019` `2020`
##    <int>  <int>  <int>
##  1     1  34485  26840
##  2     2  33896  27900
##  3     3  27036  28003
##  4     4  33865  27949
##  5     5  34463  24936
##  6     6  33763  33456
##  7     7  35403  30580
##  8     8  43030  27444
##  9     9  40673  25149
## 10    10  36208  26425
## # … with 21 more rows

pivot_longer()
pivot_wider()

day
month
wday

day
year
Date

wday
Count
year
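
As a rough illustration of the reshaping described above, the sketch below rebuilds the first six rows of the printed tibble by hand (dropping the month and wday columns for brevity, which is a simplification and not part of the original question) and spreads the yearly counts into columns with tidyr's pivot_wider(). It is one way to reach the shape of the desired output, not necessarily the marked answer.

```r
library(dplyr)
library(tidyr)

# Toy subset: the first three days of each year, copied from the printout above.
# The month and wday columns are omitted here for brevity (an assumption).
walk_daily_counts <- tibble(
  Date  = as.Date(c("2019-03-01", "2020-03-01", "2019-03-02",
                    "2020-03-02", "2019-03-03", "2020-03-03")),
  Count = c(34485L, 26840L, 33896L, 27900L, 27036L, 28003L),
  year  = c(2019, 2020, 2019, 2020, 2019, 2020),
  day   = c(1L, 1L, 2L, 2L, 3L, 3L)
)

# One row per day, one column per year, cells filled with the daily counts
walk_daily_counts_wide <- walk_daily_counts %>%
  pivot_wider(id_cols = day, names_from = year, values_from = Count)

print(walk_daily_counts_wide)
```

With the data in this shape, the `2019` and `2020` columns can be mapped directly to the two axes of a scatterplot.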

The following question is about tidy data. The table below looks at crime occurrence in different locations across Victoria:

entry_point	Location	crime_type	count
FRONT DOOR	Oakleigh	violent	67
FRONT DOOR	Clayton	violent	53
WINDOW	Oakleigh	burglary	NA
WINDOW	Clayton	burglary	6
Roof	Oakleigh	Others	17
Roof	Clayton	Others	22

If you would like to calculate the proportion of the different crime types by location, which code do you need to use?

Hint: Missing values are typically denoted by "NA" in the dataset; we can ignore these values by passing the option "na.rm = TRUE" to the appropriate R command.

Incorrect answers will be penalised.

crime_data %>% group_by(crime_type, Location) %>% summarise(n = sum(count)) %>% mutate(prop = n / sum(n, na.rm = TRUE))

✅

crime_data %>% group_by(crime_type, Location) %>% summarise(n = sum(count)) %>% mutate(prop = n / sum(n))

❌

crime_data %>% group_by(Location) %>% summarise(n = sum(count)) %>% mutate(prop = n / sum(n, na.rm = TRUE))

❌

crime_data %>% group_by(crime_type) %>% summarise(n = sum(count)) %>% mutate(prop = n / sum(n, na.rm = TRUE))

❌
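
To see what the option marked ✅ actually computes, the sketch below types in the six rows of the table above and runs that pipeline. The only addition is `.groups = "drop_last"`, which spells out dplyr's default grouping after summarise() and silences its message; it is an assumption for tidiness and does not change the result.

```r
library(dplyr)

# The table from the question, entered by hand
crime_data <- tibble(
  entry_point = c("FRONT DOOR", "FRONT DOOR", "WINDOW", "WINDOW", "Roof", "Roof"),
  Location    = c("Oakleigh", "Clayton", "Oakleigh", "Clayton", "Oakleigh", "Clayton"),
  crime_type  = c("violent", "violent", "burglary", "burglary", "Others", "Others"),
  count       = c(67, 53, NA, 6, 17, 22)
)

crime_props <- crime_data %>%
  group_by(crime_type, Location) %>%
  # sum(count) has no na.rm here, so the NA burglary count stays NA
  summarise(n = sum(count), .groups = "drop_last") %>%
  # the result is still grouped by crime_type, so sum(n) is a per-type total;
  # na.rm = TRUE keeps that total from becoming NA
  mutate(prop = n / sum(n, na.rm = TRUE))

print(crime_props)
```

In this sketch the proportions are taken within each crime type across the two locations, and the row with the missing count keeps an NA proportion while the other rows stay usable.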

This question is about visualising temporal data.

The example data is on pedestrian counts in the city of Melbourne. The plot below shows the distribution of pedestrian counts on weekdays in March across 24 hours, comparing 2019 to 2020.

ped %>% 
  ggplot(aes(x=Time, y=Count, group=Date, colour=as.factor(year))) +
    geom_boxplot() +
    facet_wrap(~ year, ncol= 1, scales = "free") + 
    scale_colour_brewer("", palette="Dark2") + 
    theme(legend.position="bottom", legend.title = element_blank())

[Figure not available: boxplots of pedestrian Count by Time, one panel per year]

By looking at the above plots, select all statements that are TRUE.

It would be easier to compare the plots if the y-scales for both years were the same.

100%

The median values of pedestrian counts over weekdays at 3am are similar between 2019 and 2020

100%

The variability in the pedestrian counts for different hours in 2019 and 2020 is similar

The breaks in the x-axis are adequate

The following question is about wrangling data. Here is the table flights from the nycflights13 package that we have wrangled previously in class.

glimpse(flights)

## Rows: 336,776
## Columns: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013,…
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 5…
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 6…
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, …
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, …
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, …
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8,…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6",…
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 57…
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR",…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD",…
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 14…
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 2…
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6,…
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, …
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00…

For the following question, select the verbs and columns you would need to answer it from the flights table. The code structure and a list of possible verbs are provided:

What is the typical daily number of flights that American Airlines flies out of LGA between 7am and 8am?

flights %>% 
  # step 1
  ___(carrier == "AA", origin == "LGA", between(hour, 7, 8)) %>%
  # step 2
  ___(year, month, day) %>% 
  summarise(n =  mean(n))

The verb for the first step is mutate()
The verb for the first step is filter()

The verb for the second step is mutate()
The verb for the second step is count()
The verb for the second step is group_by()
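
A minimal sketch of how such a pipeline behaves, using a hypothetical six-row stand-in for flights (the toy_flights rows are invented for illustration) and one plausible choice of verbs: filter() to keep the matching rows, then count() to produce the per-day column n that the final summarise() averages.

```r
library(dplyr)

# Hypothetical toy rows with the same column names as flights
toy_flights <- tibble(
  year    = 2013L,
  month   = 1L,
  day     = c(1L, 1L, 1L, 2L, 2L, 1L),
  carrier = c("AA", "AA", "UA", "AA", "AA", "AA"),
  origin  = c("LGA", "LGA", "LGA", "LGA", "JFK", "LGA"),
  hour    = c(7, 8, 7, 7, 7, 9)
)

typical <- toy_flights %>%
  # step 1: keep only AA departures from LGA in the requested hours
  filter(carrier == "AA", origin == "LGA", between(hour, 7, 8)) %>%
  # step 2: one row per day, with the number of flights in column n
  count(year, month, day) %>%
  # average the daily counts
  summarise(n = mean(n))

print(typical)
```

Note that between(hour, 7, 8) is inclusive at both ends, so rows with hour equal to 7 or 8 are both kept.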

The following question is about wrangling data. It uses the same flights table from the nycflights13 package shown by glimpse(flights) in the previous question.

What hour of day should you plan to fly if you want to avoid departure delays as much as possible?

flights %>% 
  # step 1
  ___(hour) %>%
  # step 2
  ___(avg_delay = mean(___, na.rm = TRUE)) %>% 
  # step 3
  ___(avg_delay)

The verb for the first step is group_by()
The verb for the first step is mutate()
The verb for the first step is count()

The verb for the second step is mutate()
The verb for the second step is summarise()

The appropriate column to compute the mean from is arr_delay
The appropriate column to compute the mean from is dep_delay

The verb for the third step is sorty()
The verb for the third step is arrange()
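
The three-step structure can be tried out on a hypothetical five-row stand-in for flights (the values below are invented for illustration). This sketch assumes the grouped-summary pattern: group by hour, average the departure delay within each hour, then sort so the smallest average comes first.

```r
library(dplyr)

# Hypothetical toy rows; only the columns the pipeline needs
toy_flights <- tibble(
  hour      = c(5, 5, 6, 6, 7),
  dep_delay = c(2, 4, -6, NA, 10)
)

by_hour <- toy_flights %>%
  # step 1: one group per scheduled departure hour
  group_by(hour) %>%
  # step 2: mean departure delay per hour, ignoring missing delays
  summarise(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
  # step 3: smallest average delay first
  arrange(avg_delay)

print(by_hour)
```

The first row of the result is then the hour with the lowest average departure delay in the toy data.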

The following question is about visualisation.

The data shows the calories of a selection of chocolate bars, in 100g equivalents. Calories are mapped to the vertical axis. For the following statement:

Dark chocolates are higher in calories than milk chocolates.

[Figure not available: calories of chocolate bars, with calories on the vertical axis]

The statement is false.

The statement is true.

100%

The following question is about tidy data. The table below looks at crime data in different locations across New South Wales:

entry_point	lga	crime_type	count
FRONT DOOR	Paddington	arson	100
FRONT DOOR	CBD	arson	60
FRONT DOOR	Newtown	arson	90
WINDOW	Paddington	burglary	65
WINDOW	CBD	burglary	55
WINDOW	Newtown	burglary	100
ROOF	Paddington	burglary	10
ROOF	CBD	burglary	NA
ROOF	Newtown	burglary	NA

What is the total number of arson crime incidents recorded in this data set for Newtown with entry point being ROOF?

None

FALSE

This question is about working with temporal data. The example data is on pedestrian counts in the city of Melbourne. What time periods of Melbourne pedestrian traffic are NOT extracted by the code below?

Select all answers that apply. Incorrect answers will be penalised.

library(lubridate)
library(rwalkr)
ped_2020 <- melb_walk(from=Sys.Date() - 7L)
ped_2019 <- melb_walk(from=Sys.Date() - 30L - years(1), to=Sys.Date() - years(1))

(today - 1 year and 30 days)

(today) through (today - 7 days)

(today - 30 days)

✅

(one year ago) through to (today - 30 days)

✅

(1 year ago) through to (today - 1 year and 30 days)

(today - 7 days)
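
To make the two extraction windows concrete, the sketch below evaluates the same date arithmetic with lubridate, without calling melb_walk() or touching the network. The fixed today_date is a hypothetical stand-in for Sys.Date() so the result is reproducible, and the end of the first window assumes that melb_walk()'s `to` argument defaults to the current date.

```r
library(lubridate)

# Hypothetical fixed stand-in for Sys.Date()
today_date <- as.Date("2020-04-15")

# First call: from = today - 7 days; `to` is assumed to default to today
ped_2020_from <- today_date - 7L
ped_2020_to   <- today_date

# Second call: a 30-day span ending one year ago
ped_2019_from <- today_date - 30L - years(1)
ped_2019_to   <- today_date - years(1)

print(c(ped_2020_from, ped_2020_to))
print(c(ped_2019_from, ped_2019_to))
```

Under these assumptions the code covers the last 7 days and the 30 days leading up to the same date one year earlier; a date such as (today - 30 days) falls in neither window.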

The following question is about workflow and reproducibility. Suppose you are writing a report with R Markdown that will be presented to an important client. You have a time-consuming calculation that is required by downstream chunks for making tables and charts, but that isn’t necessary to show the client.

Which of the following chunks will compute the output but not print the resulting code in the report? Note there may be more than one correct answer. Incorrect answers are penalised.

{r chunk-A, eval = FALSE, echo = FALSE}

{r chunk-B, eval = FALSE, echo = TRUE}

{r chunk-C, eval = TRUE, echo = FALSE}

{r chunk-D, include = FALSE}

chunk-D

chunk-A

chunk-C

chunk-B

This question is about visualising temporal data.

The example data is on pedestrian counts in the city of Melbourne. The plot below shows the pedestrian counts on weekdays in March, comparing 2019 to 2020.

ped %>% 
  ggplot(aes(x=Time, y=Count, group=Date, colour=as.factor(year))) +
    geom_line() +
    facet_wrap(~wday, ncol=7) + 
    scale_colour_brewer("", palette="Dark2") + 
    theme(legend.position="bottom", legend.title = element_blank())

[Figure not available: line plots of pedestrian Count by Time, one panel per weekday, coloured by year]

By looking at the above plots, select all statements that are TRUE. Incorrect answers are penalised.

The patterns for Fridays in 2019 are quite different to the rest of the days of the week both in 2020 and 2019.

100%

Weekdays in 2020 look mostly the same.

100%

The daily counts for 2020 and 2019 look the same.

The hourly counts on weekdays in 2020 are generally smaller than those in 2019.

100%
