The following question is about visualisation.
The data show the calories of a selection of chocolate bars (100 g equivalents), with calories mapped to the vertical axis. If you want the reader to compare the interquartile range of calories for milk and dark chocolates, which part of the plot do you need to observe?
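For context, such a plot is typically drawn as side-by-side boxplots, where the height of each box is the interquartile range. A minimal sketch, assuming a hypothetical tibble chocolate with columns type ("milk"/"dark") and calories:

library(tidyverse)

# Side-by-side boxplots: each box spans Q1 to Q3, i.e. the IQR
chocolate %>%
  ggplot(aes(x = type, y = calories)) +
  geom_boxplot()

Comparing the IQRs then amounts to comparing the vertical extents of the two boxes.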
This question is about tidy temporal data. Below are the data on military expenditure for different countries:
head(expenditure)
## # A tibble: 6 × 4
## Entity Code Year military_expenditure
## <chr> <chr> <dbl> <dbl>
## 1 Afghanistan AFG 1970 5373185
## 2 Afghanistan AFG 1973 6230685
## 3 Afghanistan AFG 1974 6056124
## 4 Afghanistan AFG 1975 6357396
## 5 Afghanistan AFG 1976 8108200
## 6 Afghanistan AFG 1977 8553767
To compare the expenditure of Australia and Germany between 2015 and 2020, we are going to select the observations for those two countries and that time range. Then we want to change the data format so that the expenditure for Australia and the expenditure for Germany across different years appear in two different columns.
Fill in the blanks for the following code to get our desired output:
exp2 <- expenditure %>%
# (a) which pivot function
# (b) which id_cols
# (c) which column forms names_from
# (d) which column forms values_from
dplyr::---(Entity %in% c("Australia", "Germany"),
           Year >= 2015, Year <= 2020) %>%
pivot_---(id_cols = ---,
names_from = ---,
values_from = ---)
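One plausible way to fill the blanks, assuming tidyr's pivot_wider() (a sketch, not an official answer key):

library(tidyverse)

exp2 <- expenditure %>%
  # keep the two countries and the 2015-2020 window
  dplyr::filter(Entity %in% c("Australia", "Germany"),
                Year >= 2015, Year <= 2020) %>%
  # one row per year, one expenditure column per country
  pivot_wider(id_cols = Year,
              names_from = Entity,
              values_from = military_expenditure)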
The following question is about tidy data. The table below looks at crime occurrence in different locations across Victoria:
| entry_point | lga | crime_type | count |
|---|---|---|---|
| FRONT DOOR | Monash | arson | 70 |
| FRONT DOOR | Alpine | arson | 70 |
| WINDOW | Monash | burglary | 30 |
| WINDOW | Alpine | burglary | 45 |
| ROOF | Monash | burglary | 15 |
| ROOF | Alpine | burglary | 10 |
What proportion of the crimes were recorded in the Monash LGA?
Incorrect answers are penalised.
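A worked sketch, assuming the table is loaded as a hypothetical tibble named crime: the Monash counts sum to 70 + 30 + 15 = 115 out of a total of 240, so the proportion is 115 / 240 ≈ 0.479.

library(tidyverse)

# proportion of counted crimes recorded in the Monash LGA
crime %>%
  summarise(prop_monash = sum(count[lga == "Monash"]) / sum(count))
## 115 / 240 ≈ 0.479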
The following question is about tidy data. The table below looks at crime occurrence in different locations across Victoria:
| entry_point | X1 | crime_type | count |
|---|---|---|---|
| FRONT DOOR | Monash | NA | 70 |
| FRONT DOOR | NA | NA | 70 |
| WINDOW | Monash | burglary | 30 |
| WINDOW | NA | burglary | NA |
| NA | Monash | NA | NA |
| NA | NA | NA | NA |
What is the proportion of crimes that entered through the front door?
Hint: If a data set contains "NA" values, it means the entries are missing. You can ignore these missing values here.
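A worked sketch under the hint (drop the NAs), again assuming a hypothetical tibble named crime: the front-door counts are 70 + 70 = 140 and the non-missing counts total 70 + 70 + 30 = 170, so the proportion is 140 / 170 ≈ 0.824.

library(tidyverse)

# proportion of non-missing counts that entered through the front door
crime %>%
  summarise(prop_front = sum(count[entry_point %in% "FRONT DOOR"], na.rm = TRUE) /
              sum(count, na.rm = TRUE))
## 140 / 170 ≈ 0.824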
The following question is about tidy data. The table below looks at crime data in different locations across Victoria:
| entry_point | X1 | crime_type | count |
|---|---|---|---|
| FRONT DOOR | Monash | NA | 70 |
| FRONT DOOR | NA | NA | 70 |
| WINDOW | Monash | burglary | 30 |
| WINDOW | NA | burglary | NA |
| NA | Monash | NA | NA |
| NA | NA | burglary | NA |
As usual, we first need to inspect the variables and observations in this data set. What are the dimensions of the data set?
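Reading the table directly: 6 rows (observations) of 4 columns (variables). A quick check in R, assuming the table is loaded as a hypothetical tibble named crime:

dim(crime)
## [1] 6 4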
This question is about tidy temporal data. Below are total daily pedestrian traffic counts for March in 2019 and 2020.
walk_daily_counts %>%
arrange(day, month)
## # A tibble: 62 × 6
## Date Count year month day wday
## <date> <int> <dbl> <ord> <int> <ord>
## 1 2019-03-01 34485 2019 Mar 1 Fri
## 2 2020-03-01 26840 2020 Mar 1 Sun
## 3 2019-03-02 33896 2019 Mar 2 Sat
## 4 2020-03-02 27900 2020 Mar 2 Mon
## 5 2019-03-03 27036 2019 Mar 3 Sun
## 6 2020-03-03 28003 2020 Mar 3 Tue
## 7 2019-03-04 33865 2019 Mar 4 Mon
## 8 2020-03-04 27949 2020 Mar 4 Wed
## 9 2019-03-05 34463 2019 Mar 5 Tue
## 10 2020-03-05 24936 2020 Mar 5 Thu
## # … with 52 more rows
To compare daily counts of pedestrians in March 2019 with those in March 2020, we could use a scatterplot. But first we would need to pivot the data so that the daily counts for each year appear as separate columns.
Fill in the blanks for the following code to get our desired output:
walk_daily_counts_wide <- walk_daily_counts %>%
# (a) which pivot function
# (b) which id_cols
# (c) which column forms names_from
# (d) which column forms values_from
pivot_---(id_cols = ---, names_from = ---, values_from = ---)
## # A tibble: 31 × 3
## day `2019` `2020`
## <int> <int> <int>
## 1 1 34485 26840
## 2 2 33896 27900
## 3 3 27036 28003
## 4 4 33865 27949
## 5 5 34463 24936
## 6 6 33763 33456
## 7 7 35403 30580
## 8 8 43030 27444
## 9 9 40673 25149
## 10 10 36208 26425
## # … with 21 more rows
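One plausible completion, again assuming tidyr's pivot_wider() (a sketch, not an answer key); it reproduces the wide output shown above, and the scatterplot follows naturally:

library(tidyverse)

walk_daily_counts_wide <- walk_daily_counts %>%
  # one row per day of March, one count column per year
  pivot_wider(id_cols = day, names_from = year, values_from = Count)

# scatterplot comparing the two years day by day
walk_daily_counts_wide %>%
  ggplot(aes(x = `2019`, y = `2020`)) +
  geom_point()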
The following question is about tidy data. The table below looks at crime occurrence in different locations across Victoria:
| entry_point | Location | crime_type | count |
|---|---|---|---|
| FRONT DOOR | Oakleigh | violent | 67 |
| FRONT DOOR | Clayton | violent | 53 |
| WINDOW | Oakleigh | burglary | NA |
| WINDOW | Clayton | burglary | 6 |
| Roof | Oakleigh | Others | 17 |
| Roof | Clayton | Others | 22 |
If you would like to calculate the proportion of the different crime types by location, which code do you need to use?
Hint: Missing values are typically denoted by "NA" in the data set; we can ignore these values by passing the option na.rm = TRUE to the appropriate R command.
Incorrect answers will be penalised.
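One common way to write this with dplyr, assuming the table is loaded as a hypothetical tibble named crime: group by Location, then divide each count by the group total while ignoring the missing window count.

library(tidyverse)

crime %>%
  group_by(Location) %>%
  # within each location, each crime type's share of the observed counts
  mutate(prop = count / sum(count, na.rm = TRUE)) %>%
  ungroup()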
This question is about visualising temporal data.
The example data are pedestrian counts in the city of Melbourne. The plot below shows the distribution of pedestrian counts over weekdays in March across 24 hours, comparing 2019 to 2020.
ped %>%
  ggplot(aes(x = Time, y = Count, group = Date, colour = as.factor(year))) +
  geom_boxplot() +
  facet_wrap(~ year, ncol = 1, scales = "free") +
  scale_colour_brewer("", palette = "Dark2") +
  theme(legend.position = "bottom", legend.title = element_blank())
[Figure: boxplots of hourly pedestrian counts in March, one panel per year (2019, 2020)]
By looking at the above plots, select all statements that are TRUE.
The following question is about wrangling data. Here is the table flights from the nycflights13 package that we have wrangled previously in class.
glimpse(flights)
## Rows: 336,776
## Columns: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013,…
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 5…
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 6…
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, …
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, …
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, …
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8,…
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6",…
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 57…
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "…
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR",…
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD",…
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 14…
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 2…
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6,…
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, …
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00…
For the following question, write down the verbs and columns that you would need to use to do the calculations from the flights table. We provide the code structure and a list of possible verbs for you to select from:
What is the typical daily number of flights that American Airlines flies out of LGA between 7am and 8am?
flights %>%
# step 1
___(carrier == "AA", origin == "LGA", between(hour, 7, 8)) %>%
# step 2
___(year, month, day) %>%
summarise(n = mean(n))
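One plausible completion (a sketch, not an answer key): filter to the subset, then count() per day, which creates the n column that mean(n) summarises.

library(nycflights13)
library(tidyverse)

flights %>%
  # step 1: AA flights out of LGA, scheduled between 7am and 8am
  filter(carrier == "AA", origin == "LGA", between(hour, 7, 8)) %>%
  # step 2: flights per day; count() adds a column named n
  count(year, month, day) %>%
  summarise(n = mean(n))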
The following question is about wrangling data. Here is the table flights from the nycflights13 package that we have wrangled previously in class.
glimpse(flights)
## Rows: 336,776
## Columns: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013,…
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 5…
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 6…
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, …
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, …
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, …
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8,…
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6",…
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 57…
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "…
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR",…
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD",…
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 14…
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 2…
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6,…
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, …
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00…
For the following question, write down the verbs and columns that you would need to use to do the calculations from the flights table. We provide the code structure and a list of possible verbs for you to select from:
What hour of day should you plan to fly if you want to avoid departure delays as much as possible?
flights %>%
# step 1
___(hour) %>%
# step 2
___(avg_delay = mean(___, na.rm = TRUE)) %>%
# step 3
___(avg_delay)
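One plausible completion (a sketch, not an answer key): group by hour, average the departure delays while dropping missing values, and sort ascending so the best hour comes first.

library(nycflights13)
library(tidyverse)

flights %>%
  # step 1: one group per scheduled departure hour
  group_by(hour) %>%
  # step 2: average departure delay, ignoring NA (cancelled flights)
  summarise(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
  # step 3: hour with the smallest average delay first
  arrange(avg_delay)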