COMP 5070 Exam SP5 2018
COMP 5070 Statistical Programming for
Data Science
Take Home Exam
DUE: by 11:55 PM (CST), Friday 23rd November
• The take-home exam is worth 30% of your overall grade. The exam is out of 100 marks.
• The exam is to be submitted online as a compressed file (e.g. .zip, .tar.gz, .gz). This
compressed file should include ALL code needed to run your program and any other files you
created yourself. You do NOT need to include any data files provided to you, as it will be
assumed that I have them too :)
• To obtain the maximum available marks you should aim to:
1. Code all requested components (30%).
2. Use a clear style of code presentation (10%). Code clarity is an important part of your
submission. Thus you should choose meaningful variable names and adopt the use of
comments. You don't need to comment every single line, as this will affect readability;
however, you should aim to comment at least each section of code.
3. Have the code run successfully (5%).
4. Output the information in a presentable manner and present your written analysis of the
output (55%).
• Plagiarism is a specific form of academic misconduct. Although the University encourages
discussing work with others and the Social Forum will support this, ultimately this submission is
to represent your individual work. If plagiarism is found, all parties will be penalised. You should
retain copies of all assignment computer files used during development. These files must remain
unchanged after submission, for the purpose of checking if required.
• For the purpose of this exam, a “paragraph” is considered to consist of approximately 6-8 lines.
You are welcome to exceed this amount :)
• This exam appears longer than it actually is – explanations are given to help you understand
the requested analyses and I have also provided hints.
• You do not need to write specialised code as you did for the assignments. You should be able
to find nearly all the code you need from the R files provided throughout the course, via case
studies and other examples. If you copy/paste code from the R code I have provided, this
should give you nearly 100% of the code needed for this exam, with a few alterations on your
behalf (e.g. filenames, variable names, etc.).
Question 1 (60 Marks)
It’s All in the Taste
Experts vs Amateurs
Who is better at discerning the tastes of
supermarket chocolate? Do you really need
training to know if you like it? Or does it all
just taste really good?
The Experts battle it out against a group of
dedicated chocolate-eating Amateurs!
I would really like to have that job :)
The data for this question are the responses to the sensometric qualities of chocolate that can be purchased in
supermarkets. Two groups were asked to rate the qualities of the chocolates: the first group contained a panel
of sensometric experts with responses recorded over 9 different tasting sessions. The accompanying data is in
chocolate_experts.csv.
The second group contained a panel of volunteers chosen to represent ‘regular shoppers’ who underwent a
three-hour sensometric training session before rating the qualities of the chocolate over 2 different tasting
sessions. The accompanying data is in chocolate_amateurs.csv.
The responses were recorded over a continuous scale from 0 to 10 with 0 indicating the absence of the
sensometric quality and 10 indicating fully present. It is of interest to determine if experts perceive supermarket
chocolate differently to non-experts (the amateurs) using 14 sensometric variables (Chocolate Aroma through
to Granular Texture in the data files).
For this question you need to randomly obtain two session ids for the expert responses only by making a call to
sample as shown below. The two numbers that are returned are your session ids that you need to extract for
your analysis.
sample(9,2)
For the expert data you will only need to analyse the responses corresponding to the two randomly selected
session ids. Amateur data needs to be used in full.
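The session selection described above could be sketched as follows. The data frame experts and its column name session are stand-ins invented for illustration (the real column name in chocolate_experts.csv may differ), and the call to set.seed() is an optional addition that only makes the draw repeatable.

```r
# Toy stand-in for the expert data. In practice this would come from
# read.csv("chocolate_experts.csv"); the column name `session` is an assumption.
experts <- data.frame(
  session = rep(1:9, each = 4),
  aroma   = seq(0, 10, length.out = 36)
)

set.seed(5070)                 # optional: makes the random draw reproducible
session_ids <- sample(9, 2)    # two distinct session ids drawn from 1..9

# Keep only the rows recorded in the two selected sessions
experts_subset <- experts[experts$session %in% session_ids, ]

print(session_ids)
print(table(experts_subset$session))
```

Recording the two ids (and the seed, if one is used) in the write-up makes it easy to show which sessions were analysed.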
You are asked to compare the responses between the two groups as requested in each part below. A partially written
R script is available as part of the exam package. You must use this script for your analysis and follow the instructions
therein. Any line marked with
# ### !!! EXAM TIP !!!
requires you to change that line of code to suit your purposes. Further details are provided in the code comments
around that line.
For the purposes of this exam a paragraph is 8-12 lines of text. Specifically, your analysis should include:
i) Initial Data Discussion: Write a short explanation (approximately 1 paragraph) of the analysis to be
performed and an explanation of the data. Include your session IDs for the expert responses, and any data
manipulation performed prior to analysis should you do so.
ii) Exploratory Factor Analysis: conduct two separate exploratory factor analyses: the first for your selected id
sessions for the expert responses, the other for the full set of amateur responses. You may present the
analyses side-by-side or in sequence, whichever you believe is best. For each Exploratory Factor Analysis you
only need to include the following:
• If appropriate, Cronbach Alpha output and a short discussion (2-3 lines) of whether
the data is trustworthy and why.
• Correlation output of your choosing (graphical and/or numerical) with an
accompanying discussion (3-4 lines). If numerical, round the correlations to 2 digits.
• A single paragraph explaining the outcome of the determinant test, Bartlett’s test of
sphericity and the KMO statistic for both data sets. Do not include R output.
• Your decision regarding the number of factors to estimate (scree plot may be shown,
do not show the R console output).
• The FINAL factor solution. You do not need to discuss results of any of the other solutions,
however you should justify your final factor solution, including loadings, and name the
factors in each analysis. You should also include up to two sentences indicating whether the
test of residuals was passed and whether the factors are correlated.
• All factors should be named and an explanation as to how you came up with these
names should be included.
• Based on the factor analysis results and your chosen factor names, discuss the factors
that have emerged from the study. What types of differences (if any) exist between
the expert and amateur sensometric ratings?
iii) Conclusions: write 2 paragraphs of conclusions based on your analysis.
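As a rough illustration of the diagnostics requested in part ii), the sketch below computes Cronbach's alpha, the determinant of the correlation matrix, Bartlett's test of sphericity and the overall KMO statistic using base R only. The data are simulated and the names ratings and item1 to item6 are invented for the example; the exam script may well rely on package functions instead (for instance alpha(), cortest.bartlett() and KMO() from the psych package), so treat this as a way to see what the numbers mean rather than as the required code.

```r
set.seed(1)
# Simulated ratings: 6 items driven by 2 latent factors (illustrative only)
n  <- 200
f1 <- rnorm(n)
f2 <- rnorm(n)
ratings <- data.frame(
  item1 = f1 + rnorm(n, sd = 0.5),
  item2 = f1 + rnorm(n, sd = 0.5),
  item3 = f1 + rnorm(n, sd = 0.5),
  item4 = f2 + rnorm(n, sd = 0.5),
  item5 = f2 + rnorm(n, sd = 0.5),
  item6 = f2 + rnorm(n, sd = 0.5)
)

R <- cor(ratings)
p <- ncol(ratings)

# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total score)
item_vars      <- sapply(ratings, var)
total_var      <- var(rowSums(ratings))
cronbach_alpha <- (p / (p - 1)) * (1 - sum(item_vars) / total_var)

# Determinant of the correlation matrix (values near 0 signal multicollinearity)
det_R <- det(R)

# Bartlett's test of sphericity: H0 is that R is an identity matrix
chi_sq  <- -(n - 1 - (2 * p + 5) / 6) * log(det_R)
df_b    <- p * (p - 1) / 2
p_value <- pchisq(chi_sq, df = df_b, lower.tail = FALSE)

# Overall KMO: squared correlations relative to squared correlations
# plus squared partial (anti-image) correlations
R_inv   <- solve(R)
partial <- -R_inv / sqrt(outer(diag(R_inv), diag(R_inv)))
off     <- row(R) != col(R)
kmo     <- sum(R[off]^2) / (sum(R[off]^2) + sum(partial[off]^2))

print(round(c(alpha = cronbach_alpha, determinant = det_R,
              bartlett_p = p_value, KMO = kmo), 3))
```

The same calculations can be run once on the selected expert sessions and once on the full amateur data, so that the two sets of statistics can be discussed together in a single paragraph.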
Hints:
• To make the correlation matrix more readable, use the round() command in R, e.g.
round(cor(df), 2)
will compute the correlation matrix of the data in the matrix df, to two decimal places. You can use
this tip for any other matrices too.
• The best solution may or may not be the rotated solution, based on your randomly selected
sessions. Choose your solution based on the principles of a good Exploratory Factor
Analysis (EFA).
• If items are not loading on to a factor, one reason could be that you have not extracted
enough factors from the data. Reconsider your analysis if necessary; however, this may not
solve the problem. Use the principles of EFA to make your final decision.
• While no split loadings are desirable in EFA, a small number may be unavoidable. Again you
should ultimately choose your final solution based on the principles of what constitutes a
good Exploratory Factor Analysis.
• If the correlations between factors suggest an oblique rotation is required, simply note this
in your discussion. Do not re-run the analysis.
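To make the hints above concrete, here is a minimal sketch of comparing an unrotated and a varimax-rotated solution with base R's factanal(). It uses simulated data with invented item names, picks the number of factors with the eigenvalue-greater-than-one rule (only one of several possible criteria), and checks residuals with an assumed 0.05 cut-off; the provided exam script may use different functions and thresholds.

```r
set.seed(1)
# Simulated ratings: 6 items driven by 2 latent factors (illustrative only)
n  <- 200
f1 <- rnorm(n)
f2 <- rnorm(n)
ratings <- data.frame(
  item1 = f1 + rnorm(n, sd = 0.5),
  item2 = f1 + rnorm(n, sd = 0.5),
  item3 = f1 + rnorm(n, sd = 0.5),
  item4 = f2 + rnorm(n, sd = 0.5),
  item5 = f2 + rnorm(n, sd = 0.5),
  item6 = f2 + rnorm(n, sd = 0.5)
)

# Eigenvalues of the correlation matrix: these are the values shown on a scree plot
eigenvalues <- eigen(cor(ratings))$values
n_factors   <- sum(eigenvalues > 1)   # eigenvalue > 1 rule (an assumption here)

# Fit the same model without rotation and with an orthogonal (varimax) rotation
fit_unrotated <- factanal(ratings, factors = n_factors, rotation = "none")
fit_varimax   <- factanal(ratings, factors = n_factors, rotation = "varimax")

# Hide small loadings so split loadings and non-loading items stand out
print(fit_unrotated$loadings, cutoff = 0.3)
print(fit_varimax$loadings, cutoff = 0.3)

# Residuals: observed correlations minus those reproduced by the factor model
L          <- unclass(fit_varimax$loadings)
reproduced <- L %*% t(L) + diag(fit_varimax$uniquenesses)
resid      <- cor(ratings) - reproduced
off        <- row(resid) != col(resid)
prop_large <- mean(abs(resid[off]) > 0.05)   # share of "large" residuals (assumed cut-off)
print(round(prop_large, 3))
```

Comparing the two loading tables side by side is one way to judge whether rotation gives a cleaner structure for a particular pair of sessions.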
Question 2 (40 Marks)
Are We There Yet?
Clustering Cities Around the World
The data for this question are distances between cities in different regions of the world.
You will need to use the data set individually assigned to you.
The file cities.xlsx on the Assignments page indicates the continent assigned to each student.
Each data set contains a distance matrix and can be found on the assignments page, in a file of the form
RegionCitiesClustering.dat. For example, for the European data the file will be called
EuropeanCitiesClustering.dat. For this question, you are asked to conduct clustering analysis using both
hierarchical and partitional clustering techniques.
For the purposes of this exam a paragraph is 8-12 lines of text. Specifically, your analysis should include:
i) Initial Data Discussion: Write a short explanation (approximately 1 paragraph) of the analysis to be
performed and an explanation of the data including any data manipulation performed prior to clustering.
ii) Hierarchical clustering: conduct hierarchical clustering on the data, choosing an appropriate AGNES-
based method based on either single, complete, average-linkage or Ward’s method. Ensure you justify
your choice in your write-up and include the resulting dendrogram, as well as a discussion of the
outcomes of hierarchical clustering on your data.
iii) Partitional clustering: conduct a partitional clustering of your data using K-means. Ensure you explain
and include any relevant R output (including graphics) supporting your choice of k, the number of
clusters.
iv) Discussion: write 1-2 paragraphs discussing your results.
v) Validation: as a form of cluster validation, consider the following:
If there are obvious outliers or distances that should be removed, identify these in your write-up and re-run
your chosen Partitional Clustering algorithm, adjusting k if necessary. Include justification of your choice of
the new value for k.
If there are no obvious outliers/distances that should be removed, then explain this conclusion with
justification. In this case re-run your chosen Partitional Clustering algorithm for a different value of k to that
used in part iii) above. Include justification of your choice for the new value for k.
vi) Conclusions: write 2 paragraphs of conclusions based on your analysis including a statement regarding which
clustering solution is the better one and why.
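For the partitional step, one possible approach is sketched below. Because kmeans() works on coordinates rather than on a distance matrix, the sketch first embeds the distances in two dimensions with classical multidimensional scaling (cmdscale()); this embedding step is an assumption of the example, not something the exam prescribes, and the course materials may handle the distance matrix differently. The eight "cities" A to H are made up.

```r
# Illustrative symmetric distance matrix for 8 made-up cities in two loose groups
coords <- rbind(
  A = c(0, 0),   B = c(1, 0),   C = c(0, 1),   D = c(1, 1),
  E = c(10, 10), F = c(11, 10), G = c(10, 11), H = c(11, 11)
)
dist_matrix <- as.matrix(dist(coords))

# kmeans() expects coordinates, so embed the distances with classical MDS
points <- cmdscale(as.dist(dist_matrix), k = 2)

# Elbow plot: total within-cluster sum of squares for k = 1..5
set.seed(42)
wss <- sapply(1:5, function(k) {
  kmeans(points, centers = k, nstart = 20)$tot.withinss
})
plot(1:5, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster sum of squares")

# Fit the chosen solution (k = 2 for this toy example)
km <- kmeans(points, centers = 2, nstart = 20)
print(km$cluster)
```

Re-running the last two lines with a different value of k, or after dropping a suspected outlier from dist_matrix, gives the second solution needed for the validation step.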
Hint:
• For hierarchical clustering, ensure you define the height of the dendrogram according to the size of the values
in the output.
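One possible reading of this hint is that any cut height should be chosen by looking at the merge heights the algorithm actually reports, rather than picked arbitrarily. The sketch below illustrates that idea with base R's hclust() as a stand-in for agnes() from the cluster package (which offers the same linkage choices); the eight "cities" and the largest-gap rule for choosing the cut are assumptions made for the example.

```r
# Illustrative distances for 8 made-up cities in two loose groups
coords <- rbind(
  A = c(0, 0),   B = c(1, 0),   C = c(0, 1),   D = c(1, 1),
  E = c(10, 10), F = c(11, 10), G = c(10, 11), H = c(11, 11)
)
city_dist <- dist(coords)   # with real data: as.dist() applied to the distance matrix

# Agglomerative clustering with average linkage
hc <- hclust(city_dist, method = "average")

# Inspect the merge heights before choosing where to cut
heights <- sort(hc$height)
print(round(heights, 2))

# Cut halfway across the largest jump between successive merge heights
gaps  <- diff(heights)
cut_h <- mean(heights[which.max(gaps) + 0:1])

plot(hc, hang = -1, main = "Average linkage", ylab = "Merge height")
abline(h = cut_h, lty = 2)   # dashed line marks the chosen cut height

groups <- cutree(hc, h = cut_h)
print(groups)
```

Swapping method = "average" for "single", "complete" or "ward.D2" allows the linkage methods to be compared on the same data.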