Background
In Australia, we experienced extreme heat in 2019. With the inevitable rise of extreme weather events, it is crucial that we better understand their potential impact on our everyday life. Some consequences of extreme weather events and climate change were captured in this article:
https://australasiantransportresearchforum.org.au/wp-content/uploads/2022/03/2007_Rowland_Davey_Freeman_Wishart.pdf
Various weather events may affect road safety. In this assignment, you will use a dataset based on publicly available data to understand the relationship between weather patterns and the number and severity of road traffic accidents. Your analysis could provide crucial knowledge for resource planning of emergency services.
Assignment 1 will focus on the analysis of road traffic accidents data.
Task 1: Road traffic accident dataset (16 points)
The dataset is attached to the assignment. Please download it and place it in a folder that RStudio can access.
· How many rows and columns are in the data? (2 points)
· How many regions are in the data? (2 points)
· What data types are in the data? (Use the data type selection tree and provide a detailed explanation) (2 points for data types, 2 points for explanations)
· What time period does the data cover? (2 points)
· What do the variables FATAL, SERIOUS, … represent? (2 points)
· What’s the difference between “FATAL” and “SERIOUS” accidents? (3 points)
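A minimal sketch of how the structural questions above might be approached in base R. The toy data frame and its column names (DATE, REGION, FATAL, SERIOUS) are assumptions made for illustration only; they are not the layout of the assignment file.

```r
# Toy data standing in for the real file; values and column names are made up.
toy <- data.frame(
  DATE    = as.Date(c("2016-01-01", "2016-01-02", "2016-01-01", "2016-01-02")),
  REGION  = c("A", "A", "B", "B"),
  FATAL   = c(0L, 1L, 0L, 0L),
  SERIOUS = c(3L, 2L, 1L, 4L)
)

n_rows      <- nrow(toy)                    # number of rows
n_cols      <- ncol(toy)                    # number of columns
n_regions   <- length(unique(toy$REGION))   # distinct regions
col_classes <- sapply(toy, class)           # R class of each column
date_range  <- range(toy$DATE)              # first and last date covered

print(dim(toy))
print(col_classes)
print(date_range)
```

The R classes reported here are only a starting point; the data type selection tree still has to be applied to each variable by hand.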
Task 2: Tidy data (20 points)
Task 2.1 Cleaning up columns
You may notice that the road traffic accidents csv file has two heading rows. This is quite common in data generated by BI reporting tools. Let’s clean up the column names.
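library(tidyverse) # provides read_csv, %>% and str_c

cav_data_link <- 'car_accidents_victoria.csv'
# First heading row
top_row <- read_csv(cav_data_link, col_names = FALSE, n_max = 1)
# Second heading row: read as the single data row under the first header
second_row <- read_csv(cav_data_link, n_max = 1)
# Make repeated names unique by appending __1, __2, ...
column_names <- second_row %>%
  unlist(., use.names = FALSE) %>%
  make.unique(., sep = "__") # double underscore
# Give the first group of columns a matching __0 suffix
column_names[2:5] <- str_c(column_names[2:5], '0', sep = '__')
# Skip both heading rows and apply the cleaned column names
daily_accidents <-
  read_csv(cav_data_link, skip = 2, col_names = column_names)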
Now print out a list of regions in the data set. (1 point)
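To see what the column-naming step in Task 2.1 produces, here is a small base R sketch on a made-up header row. The vector second_row_toy is hypothetical, and paste() is used as a base R stand-in for str_c().

```r
# Hypothetical second header row: the same accident types repeat for each group.
second_row_toy <- c("DATE", "FATAL", "SERIOUS", "FATAL", "SERIOUS")

# make.unique() leaves the first occurrence alone and numbers later duplicates.
unique_names <- make.unique(second_row_toy, sep = "__")

# Suffix the first group by hand so every accident column carries a group index.
unique_names[2:3] <- paste(unique_names[2:3], "0", sep = "__")

print(unique_names)
# "DATE" "FATAL__0" "SERIOUS__0" "FATAL__1" "SERIOUS__1"
```

After this step every accident column has the form TYPE__index, which makes the names easy to split later.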
Task 2.2 Tidying data
1. Now we have a data frame. Answer the following questions for this data frame.
· Does each variable have its own column? (1 point)
· Does each observation have its own row? (1 point)
· Does each value have its own cell? (1 point)
2. Use spreading and/or gathering (or their newer equivalents, pivot_wider and pivot_longer) to transform the data frame into tidy data (6 points). The key is to put data from the same measurement source in a column and to put each observation in a row. Please answer the following questions.
· How many spreading (or pivot_wider) operations do you need? (1 point)
· How many gathering (or pivot_longer) operations do you need? (1 point)
· Explain the steps in detail. (3 points)
3. Do the variables have the expected types in R? Clean up the data types. (3 points)
4. Are there any missing values? Fix the missing data. Justify your actions. (2 points)
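As an illustration of gathering and spreading, the sketch below reshapes a toy wide table with tidyr. The data values and the names ACCIDENT_TYPE, REGION_ID and COUNT are assumptions; this is one possible route, not necessarily the sequence of operations the real file requires.

```r
library(tidyr)

# Toy wide data; column names follow the "<TYPE>__<group index>" pattern
# produced in Task 2.1, but the values are made up.
wide <- data.frame(
  DATE       = as.Date(c("2016-01-01", "2016-01-02")),
  FATAL__0   = c(0, 1),
  SERIOUS__0 = c(3, 2),
  FATAL__1   = c(0, 0),
  SERIOUS__1 = c(1, 4)
)

# Gather: one row per date, accident type and group index.
long <- pivot_longer(
  wide,
  cols      = -DATE,
  names_to  = c("ACCIDENT_TYPE", "REGION_ID"),
  names_sep = "__",
  values_to = "COUNT"
)

# Spread: each accident type becomes its own column again.
tidy <- pivot_wider(long, names_from = ACCIDENT_TYPE, values_from = COUNT)

print(tidy)
```

In this toy case the result has one row per date and group index, with FATAL and SERIOUS as separate columns. REGION_ID comes out as a character column, which is the kind of type issue item 3 asks you to check.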
Task 3: Exploratory Data Analysis (20 points)
It is often a good idea to visually check your data before fitting a model. The purpose is to understand the distribution of different measurements and relations between them.
Task 3.1 Select a region
Select a region and create a dataset for only the selected region. (1 point)
Print out the name of the chosen region (1 point), the number of serious road accidents (1 point), and the total number of road accidents in the region (2 points).
Add a "TOTAL_ACCIDENTS" column to the dataset for the selected region. (1 point)
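A minimal base R sketch of the region subset and the TOTAL_ACCIDENTS column. The data frame accidents, the region label "A" and the two accident-type columns are invented for illustration; the real data has its own regions and accident types.

```r
# Toy tidy data; region names and counts are invented.
accidents <- data.frame(
  DATE    = as.Date(c("2016-01-01", "2016-01-02", "2016-01-01", "2016-01-02")),
  REGION  = c("A", "A", "B", "B"),
  FATAL   = c(0, 1, 0, 0),
  SERIOUS = c(3, 2, 1, 4)
)

chosen_region <- "A"
region_data   <- accidents[accidents$REGION == chosen_region, ]

# Row-wise sum over the accident-type columns gives the daily total.
type_cols <- c("FATAL", "SERIOUS")
region_data$TOTAL_ACCIDENTS <- rowSums(region_data[, type_cols])

serious_total <- sum(region_data$SERIOUS)          # 3 + 2 = 5
overall_total <- sum(region_data$TOTAL_ACCIDENTS)  # 3 + 3 = 6

print(chosen_region)
print(serious_total)
print(overall_total)
```

Listing the accident-type columns explicitly in type_cols keeps date and region columns out of the row sum.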
Task 3.2 For the region selected, if we want to compare the number of road accidents across the year, which plot can we use? Show your plot and explain what the plot shows. (3 points)
Task 3.3 How do the road accident numbers change during a week? Show it visually using violin plots (2 points), describe the results (2 points) and provide your interpretation (2 points).
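The following sketch shows one way to draw violin plots by day of the week with ggplot2. The counts are simulated from a Poisson distribution purely to make the example runnable, and the column name WEEKDAY is an assumption.

```r
library(ggplot2)

set.seed(1)
# Simulated daily totals for one year; not real accident data.
daily <- data.frame(DATE = seq(as.Date("2016-01-01"), by = "day", length.out = 366))
daily$TOTAL_ACCIDENTS <- rpois(nrow(daily), lambda = 10)

# Locale-independent weekday labels: wday is 0 for Sunday through 6 for Saturday.
day_labels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
daily$WEEKDAY <- factor(
  day_labels[as.POSIXlt(daily$DATE)$wday + 1],
  levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
)

# One violin per weekday shows the distribution of daily totals.
p <- ggplot(daily, aes(x = WEEKDAY, y = TOTAL_ACCIDENTS)) +
  geom_violin() +
  labs(x = "Day of week", y = "Total accidents per day")

print(p)
```

Setting the factor levels explicitly keeps the days in calendar order on the x axis instead of alphabetical order.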
Task 3.4 Use the skimr and fitdistrplus libraries to answer the following questions. Which distributions are appropriate for modelling the number of accidents? (1 point) Which variables meet the assumptions for the Poisson distribution and why? (2 points) To reduce the dependence between consecutive days, randomly sample 200 records out of the whole dataset (all records for the selected region) for modelling (2 points).
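A short base R sketch of the random sampling step, together with a quick mean-versus-variance check that is often used as a first look at the Poisson assumption. The counts are simulated, and region_counts is a hypothetical stand-in for one region's daily totals.

```r
set.seed(42)
# Simulated overdispersed daily counts standing in for one region's records.
region_counts <- rnbinom(1000, size = 3, mu = 10)

# Draw 200 records at random, without replacement.
sample_idx     <- sample(length(region_counts), size = 200)
sampled_counts <- region_counts[sample_idx]

# A Poisson variable has variance equal to its mean, so a ratio well above 1
# suggests overdispersion.
dispersion_ratio <- var(sampled_counts) / mean(sampled_counts)
print(dispersion_ratio)
```

With a data frame, the same indices can be used to select rows, for example region_data[sample_idx, ]. Fixing the seed makes the sample reproducible for the marker.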
Task 4: Fitting distributions (20 points)
As you may have seen in the previous step, although we are dealing with count data, a Poisson distribution may not provide a good fit. In fact, an unconditional Poisson distribution is too restrictive for most real-world applications. In this task, we will fit a couple of distributions to the TOTAL_ACCIDENTS data using the same sample from Task 3.4.
Task 4.1: Fitting distributions (4 points)
Fit a Poisson distribution and a negative binomial distribution on TOTAL_ACCIDENTS. You may use functions provided by the package fitdistrplus.
Task 4.2: Compare distributions (6 points)
Compare the log-likelihoods of the two fitted distributions.
Which distribution fits the data better? Why?
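The fit-and-compare steps of Tasks 4.1 and 4.2 can be sketched with fitdist() from fitdistrplus. The counts below are simulated from a negative binomial distribution, so the outcome here says nothing about the real accident data; total_accidents is a stand-in for the sampled TOTAL_ACCIDENTS values.

```r
library(fitdistrplus)

set.seed(42)
# Simulated overdispersed counts; a stand-in for the sampled TOTAL_ACCIDENTS.
total_accidents <- rnbinom(200, size = 3, mu = 10)

fit_pois <- fitdist(total_accidents, "pois")    # estimates lambda
fit_nb   <- fitdist(total_accidents, "nbinom")  # estimates size and mu

# A higher log-likelihood (or lower AIC) indicates the better fit on this sample.
comparison <- data.frame(
  distribution = c("Poisson", "negative binomial"),
  loglik       = c(fit_pois$loglik, fit_nb$loglik),
  aic          = c(fit_pois$aic, fit_nb$aic)
)
print(comparison)
```

Because the negative binomial has one more parameter than the Poisson, AIC is reported next to the log-likelihood so that the extra flexibility is penalised in the comparison.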
Task 4.3: Try other distributions (research question 1) (10 points)
Find which distributions the R stats library includes. Try to fit some of them to different accident types. Analyse and explain the results. Write a short report (200 words).
Task 5: Research question 2 (15 points)
There is more than one way to fit a distribution to a set of numbers. Produce a short literature review on different distribution fitting methods, showing the pros and cons of each method. 5 points will be given for the relevance of the literature. 7 points will be given for the quality of the comparative analysis of distribution fitting methods. 3 points will be given for the quality of presentation.
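As a concrete illustration of two fitting methods, the sketch below fits a negative binomial distribution to the same simulated counts by maximum likelihood and by moment matching, using the method argument of fitdist(). The data are simulated and the variable names are hypothetical; other methods exist and are not shown.

```r
library(fitdistrplus)

set.seed(7)
counts <- rnbinom(200, size = 3, mu = 10)   # simulated data, not real accidents

fit_mle <- fitdist(counts, "nbinom", method = "mle")  # maximum likelihood
fit_mme <- fitdist(counts, "nbinom", method = "mme")  # moment matching

# Index by name so the two rows line up parameter by parameter.
estimates <- rbind(
  mle = fit_mle$estimate[c("size", "mu")],
  mme = fit_mme$estimate[c("size", "mu")]
)
print(estimates)
```

On data like these, both methods recover a mean close to the sample mean, while the size estimates can differ more noticeably.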
Task 6: Ethics question (7 points)
During your work, have you identified any issues that have ethical implications? (2 points) Does it concern security or privacy? (2 points) Was the risk mitigated? (3 points)
Task 7: Reflection (2 points)
Answer the following questions:
1. What help did you receive from other students? What did you learn from them? (1 point)
2. Please estimate the mark that you will receive for assignment 1. Please provide both a point estimate and an interval estimate (a confidence interval). You don’t need to provide a mathematical model, but please explain how you used conditional information to reach the estimates. Based on the conditional information, explain what you would have done differently to improve that mark. (1 point)
What to submit
By the due date, you are required to submit the following files to the assignment
1. An MS Word or PDF file containing your answers to all the assignment questions.
2. An R Notebook file Assignment1_submission.Rmd filled in with the script for your calculations. The file should be able to run. Include sufficient comments so that the script can be understood by the marker. Indicate all the packages that need to be installed separately.