Glenn Fulford
30 marks
Assessment Task 2: Application of a classifier
Randomly split the dataset into a training subset and a test subset containing 80% and 20% of
the data, respectively. Provide your R code.
a classifi to classify
Implement classifier Question 2 the training data subset
Question 1.
Interpret and discuss the relationships between the predictors and
response variables.
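A random 80/20 split can be sketched in base R as follows. This is only an illustration under stated assumptions: the data frame blood is simulated as a stand-in for HBblood.csv, the column names HbA1c, SBP and Ethno are taken from the task description rather than from the file itself, and the seed value is arbitrary.

```r
# Sketch of a random 80/20 train/test split in base R.
# 'blood' is a simulated stand-in for HBblood.csv; the column names
# HbA1c, SBP and Ethno are assumptions based on the task description.
set.seed(5810)  # arbitrary seed so the split is reproducible
n <- 1200
blood <- data.frame(
  HbA1c = rnorm(n, mean = 5.5, sd = 0.5),
  SBP   = rnorm(n, mean = 130, sd = 15),
  Ethno = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)

# Draw 80% of the row indices without replacement for training
train_idx <- sample(seq_len(n), size = round(0.8 * n))
train_set <- blood[train_idx, ]
test_set  <- blood[-train_idx, ]

# Check the subset sizes and the class balance in each subset
nrow(train_set)
nrow(test_set)
prop.table(table(train_set$Ethno))
prop.table(table(test_set$Ethno))
```

With the real data, blood would instead come from read.csv() applied to HBblood.csv. Comparing the two class-proportion tables is a quick way to see whether the random split left the three Ethno groups similarly represented in both subsets.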
Marking Criteria and Rubric: MA5810 Assessment 1
Criterion High Distinction Distinction Credit Pass Fail
R code
High Distinction: Code submitted. Code works correctly, meets the specifications, produces the correct results and displays them correctly. Code is exceptionally well organised and very easy to follow. Code always very well commented so the purpose of each block of code readily understood and what question part it corresponds to. Variable names give the purpose of the variable.
Distinction: Code submitted. Code works correctly, meets the specifications, and produces correct results but may not display all of it correctly. Code is clean, understandable and well-organised, with just some minor errors. Code is well commented so that there is very little ambiguity of the code purpose. One or two places could benefit from comments, or the code is overly commented. Variable names clearly describe the purpose of the variable.
Credit: Code submitted. Code mostly works correctly, but functions incorrectly on some inputs. Minor details of the specification are violated. Code is fairly easy to read, although contains at least one major issue that detracts from clarity. The comments leave some code block ambiguous as to the purpose. One or two places could benefit from comments, or the code is overly commented. Variable names do not describe the purpose of the variable.
Pass: Code only provided in answer document but looks ... Code often exhibits incorrect behaviour. Significant details of specification are violated. Code contains more than one major issue that makes it difficult to read. The code is readable only by someone who already knows what it is supposed to be doing. Comments not sufficient to see what the code is doing. Significant lack of comments makes it difficult to understand.
Fail: Code not submitted. Code not provided in answer document. Code produces incorrect results, does not compile, or significant errors occur. Code is poorly organised and very difficult to read. Code has no comments.
High Distinction: The methodology implemented is expertly documented and justified. The methodology implemented reflects a sophisticated and nuanced understanding of relevant concepts. All assumptions validated and communicated concisely.
Distinction: The methodology implemented is clearly documented and justified. The methodology reflects a highly developed understanding of relevant concepts. Most assumptions validated and communicated clearly.
Credit: The methodology implemented is described, and may contain minor errors, or lacking clearly stated justification. The methodology is mostly appropriate, but some elements could be improved. Some assumptions validated.
Pass: The methodology implemented is stated, but too general, and/or not justified. Some elements are satisfactory, but most elements need improving. Some model assumptions ...
Fail: The methodology implemented is not clearly stated or justified. Very few elements of the methodology are ...
Interpretation is comprehensive and ...
Interpretation is accurate, comprehensive, and highly detailed. Few inferences or unjustified positions ...
Interpretation is accurate, and for the most part persuasive. Some inferences or unjustified positions presented.
Interpretation is adequate in most places, and/or more detail is required in places.
Interpretation is satisfactory in places but lacks sufficient and accurate interpretation. Many inferences or unjustified positions ...
Interpretation is lacking in multiple components. Major points may be stated but are often ...
Assessment 1: Naive Bayes classifier and Discriminant Analysis
Issued: Sunday of Week 1
Due: 11:59 PM AEST Sunday of Week 3
Weight: 30 %
Maximum score: 50 Marks
During this assessment you will insert R code and written discussions with justifications into this template file. This assessment implements and explores techniques mainly covered in Week 1 and Week 2. The assessment is divided into three tasks: (1) Comparison of classifiers; (2) Application of a classifier; and (3) Implementation of classifiers.
The purpose of the assignment is to enable you to:
1. Code and comment R scripts
2. Implement sub-setting, Bayes classifiers and Discriminant Analysis in RStudio
3. Compare classification algorithms
4. Visually present predictions of classifiers in RStudio
Learning outcomes
Related subject learning outcomes:
1. Evaluate, synthesise and apply classic supervised data mining methods for pattern classification.
2. Effectively integrate, execute and apply the studied concepts, algorithms, and techniques to real datasets using the computer language R and the software environment RStudio.
3. Communicate data concepts and methodologies of data science.
Real-world application of classifiers may require that the predictors used for classification be physically measured and, hence, the inclusion of unnecessary predictors may incur additional costs associated with sensors, instruments and computing. It should be noted that some variables may even require human intervention and/or expensive laboratory analyses in order to be measured.
It is important that analysts try to use as few predictors as possible, that is, the smallest set of predictors that are relevant for the classification task at hand and yet sufficient to provide satisfactory classification performance. Selecting predictors is an important task called feature selection in data mining.
Assessment submission:
Your submission should include:
· A PDF/HTML output file that clearly shows the assignment questions, the associated answers, any relevant R outputs, analyses and discussions.
· The R script (code) file as evidence.
· The assignment should not exceed 8 A4 pages. Appendices do not form part of the page limit. The assignment must be presented in 12-point font on A4 pages using single line spacing.
· The task cover sheet.
· Note that RMarkdown is not required for this assessment but highly recommended.
Upload all submission files in one go. You can upload the assessment up to 3 times; however, only the last submission is graded.
A word on plagiarism
Plagiarism is the act of using another's words, works or ideas from any source as one's own. Plagiarism has no place in a University. Student work containing plagiarised material will be subject to formal university processes in line with the procedure described in the subject outline.
Assessment Task 1: Comparison of classifiers
Marks − 10
In this task, compare the performance of the supervised learning algorithms Linear Discriminant Analysis, Quadratic Discriminant Analysis and the Naïve Bayes Classifier using publicly available blood pressure data. The data to be used for this task is provided in the HBblood.csv file in the Assessment 1 folder.
The HBblood.csv dataset contains values of the percent HbA1c (a measure of the amount of glucose and haemoglobin joined together in blood) and systolic blood pressure (SBP) (in mmHg) for 1,200 clinically healthy female patients aged 60 to 70 years. Additionally, the ethnicity, Ethno, of each patient was recorded and categorized into three groups, A, B or C, for analysis.
1. Discuss and justify which of the supervised learning algorithms (i.e. Linear Discriminant Analysis, Quadratic Discriminant Analysis and the Naïve Bayes Classifier) you would choose for predicting the response Ethno using HbA1c and SBP as the feature variables. Provide any plots/images needed to support your discussion.
Hint: Base your answer on the empirical statistical properties of the data in relation to model assumptions.
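The empirical checks the hint refers to can be sketched in base R as below. This is a minimal illustration, not a prescribed method: the data frame blood is simulated as a stand-in for HBblood.csv, the column names are assumed from the task description, and the Shapiro-Wilk test is only one of several ways to examine normality.

```r
# Simulated stand-in for HBblood.csv (column names are assumptions)
set.seed(1)
n_per_class <- 400
blood <- data.frame(
  HbA1c = c(rnorm(n_per_class, 5.2, 0.4),
            rnorm(n_per_class, 5.6, 0.5),
            rnorm(n_per_class, 6.0, 0.6)),
  SBP   = c(rnorm(n_per_class, 125, 10),
            rnorm(n_per_class, 132, 14),
            rnorm(n_per_class, 140, 18)),
  Ethno = factor(rep(c("A", "B", "C"), each = n_per_class))
)
predictors <- c("HbA1c", "SBP")
by_class <- split(blood[, predictors], blood$Ethno)

# LDA assumes one covariance matrix shared by all classes;
# QDA allows a separate covariance matrix per class.
class_cov <- lapply(by_class, cov)
class_cov

# Naive Bayes assumes the predictors are independent within each class,
# so within-class correlations near zero support that assumption.
class_cor <- sapply(by_class, function(d) cor(d$HbA1c, d$SBP))
class_cor

# LDA, QDA and Gaussian naive Bayes all assume normality within class.
normality_p <- sapply(by_class, function(d)
  c(HbA1c = shapiro.test(d$HbA1c)$p.value,
    SBP   = shapiro.test(d$SBP)$p.value))
normality_p

# Visual check: scatter plot of the predictors coloured by class
plot(blood$HbA1c, blood$SBP, col = as.integer(blood$Ethno),
     xlab = "HbA1c (%)", ylab = "SBP (mmHg)")
legend("topleft", legend = levels(blood$Ethno), col = 1:3, pch = 1)
```

Broadly, similar covariance matrices across classes favour LDA, clearly different ones favour QDA, and weak within-class correlation makes the naive Bayes independence assumption more plausible.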
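Data preprocessing
# Read the data (select HBblood.csv in the file dialog)
data <- read.csv(file.choose(), header = TRUE)
data$Ethno <- as.factor(data$Ethno)
# Loading package
library(caTools)  # provides sample.split()
# Splitting data into train
# and test data (split on the response so rows, not columns, are sampled)
split <- sample.split(data$Ethno, SplitRatio = 0.7)
train_cl <- subset(data, split == TRUE)
test_cl <- subset(data, split == FALSE)
# Feature Scaling (columns 2:3 are assumed to hold the two predictors;
# the test set is scaled with the training means and standard deviations)
train_scale <- scale(train_cl[, 2:3])
test_scale <- scale(test_cl[, 2:3],
                    center = attr(train_scale, "scaled:center"),
                    scale = attr(train_scale, "scaled:scale"))
train_y <- train_cl$Ethno
test_y <- test_cl$Ethno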
First, we will perform linear discriminant analysis and check model performance.
# Linear discriminant analysis - LDA
library(MASS)  # provides lda()
model <- lda(Ethno ~ ., data = train_cl)
model  # print the fitted model
lda(Ethno ~ ., data = train_cl)
Prior probabilities of groups:
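Checking model performance once a discriminant model is fitted could look like the sketch below. It is an illustration under stated assumptions only: the data are simulated rather than read from HBblood.csv, the helper make_class() and all object names are hypothetical, and QDA is fitted alongside LDA purely for comparison.

```r
library(MASS)  # lda(), qda() and their predict() methods

# Simulated stand-in for the real data (all values are made up)
set.seed(2)
make_class <- function(label, n, mu, sds) {
  data.frame(HbA1c = rnorm(n, mu[1], sds[1]),
             SBP   = rnorm(n, mu[2], sds[2]),
             Ethno = label)
}
sim <- rbind(make_class("A", 400, c(5.0, 120), c(0.4, 10)),
             make_class("B", 400, c(5.8, 135), c(0.5, 14)),
             make_class("C", 400, c(6.6, 150), c(0.6, 18)))
sim$Ethno <- factor(sim$Ethno)

# 70/30 split, mirroring the SplitRatio used in the solution code
train_rows <- sample(nrow(sim), size = round(0.7 * nrow(sim)))
train_sim <- sim[train_rows, ]
test_sim  <- sim[-train_rows, ]

# Fit both discriminant models on the training subset
lda_fit <- lda(Ethno ~ HbA1c + SBP, data = train_sim)
qda_fit <- qda(Ethno ~ HbA1c + SBP, data = train_sim)

# Predict on the held-out rows; $class holds the predicted labels
lda_pred <- predict(lda_fit, newdata = test_sim)$class
qda_pred <- predict(qda_fit, newdata = test_sim)$class

# Confusion matrix for LDA: rows are predictions, columns are truth
lda_confusion <- table(Predicted = lda_pred, Actual = test_sim$Ethno)
lda_confusion

# Test-set accuracy of each model
lda_accuracy <- mean(lda_pred == test_sim$Ethno)
qda_accuracy <- mean(qda_pred == test_sim$Ethno)
c(LDA = lda_accuracy, QDA = qda_accuracy)
```

The diagonal of the confusion matrix counts correct predictions, so the off-diagonal cells show which pairs of groups the model confuses most often.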

