BME 530 Statistics and Machine Learning
Final project
This is a group project for teams of 2-3 people. Each team chooses one dataset to work on for
the final project. Details on each dataset can be found at the URLs listed below and in the
references given there. You may also use your own research data or another dataset from a
public repository.
1) Diabetic Retinopathy Debrecen Data Set
https://archive.ics.uci.edu/ml/datasets/Diabetic+Retinopathy+Debrecen+Data+Set
2) Quality Assessment of Digital Colposcopies Data Set
http://archive.ics.uci.edu/ml/datasets/Quality+Assessment+of+Digital+Colposcopies
3) LSVT Voice Rehabilitation Data Set
http://archive.ics.uci.edu/ml/datasets/LSVT+Voice+Rehabilitation
4) Smartphone-Based Recognition of Human Activities and Postural Transitions Data Set
http://archive.ics.uci.edu/ml/datasets/Smartphone-Based+Recognition+of+Human+Activities+and+Postural+Transitions
5) Student Performance Data Set
https://archive.ics.uci.edu/ml/datasets/Student+Performance
6) Cervical cancer (Risk Factors) Data Set
https://archive.ics.uci.edu/ml/datasets/Cervical+cancer+%28Risk+Factors%29
7) Parkinson Dataset with replicated acoustic features Data Set
https://archive.ics.uci.edu/ml/datasets/Parkinson+Dataset+with+replicated+acoustic+features+
8) HCC Survival Data Set
https://archive.ics.uci.edu/ml/datasets/HCC+Survival
9) Drug consumption (quantified) Data Set
https://archive.ics.uci.edu/ml/datasets/Drug+consumption+%28quantified%29
10) SCADI Data Set
http://archive.ics.uci.edu/ml/datasets/SCADI
11) Gene expression cancer RNA-Seq Data Set
http://archive.ics.uci.edu/ml/datasets/gene+expression+cancer+RNA-Seq
Tasks to perform:
Task 1: Perform exploratory data analysis to reveal the relationships among features and any
potential hidden structure (include only the relevant analyses in the report). Use
dimension reduction methods to visualize the dataset in 2D or 3D.
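As a rough illustration of the dimension-reduction step in Task 1, the sketch below projects a feature matrix onto its first two principal components so that the samples can be drawn as a 2D scatter plot. It is written in Python with NumPy for illustration only (the submitted program is expected to be in R, where `prcomp` plays a similar role). The helper name `pca_project` and the toy matrix are assumptions and do not come from any of the datasets above.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project rows of X onto the top principal components (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    # Center and scale each feature so that differing units do not dominate.
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # constant features stay at zero instead of dividing by 0
    Z = (X - mean) / std
    # SVD of the standardized data: the rows of Vt are the principal directions.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[:n_components].T
    # Fraction of total variance carried by each retained component.
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, explained[:n_components]

# Toy data (made up): 6 samples, 3 features.
X_toy = [
    [2.0, 1.9, 0.5],
    [1.0, 1.1, 0.4],
    [3.0, 3.2, 0.6],
    [4.0, 3.9, 0.5],
    [5.0, 5.1, 0.3],
    [6.0, 5.8, 0.7],
]
scores, explained = pca_project(X_toy)
```

Each row of `scores` gives the 2D coordinates of one sample, and `explained` indicates how much of the total variance the plot retains. Coloring the points by the true label is one way to judge visually whether the classes separate.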
Task 2: Use at least two unsupervised machine learning methods to reveal the group
structures in the dataset. Describe the rationale for determining the number of clusters in your
approach. Discuss how the parameters of the clustering methods are determined. Report the
clustering accuracy against the true labels.
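Cluster IDs returned by a clustering method are arbitrary, so computing an accuracy against the true labels first requires matching clusters to classes. The sketch below shows one possible way to do this, by trying every one-to-one assignment and keeping the best. It is written in Python for illustration only (the submitted program is expected to be in R). The function name `clustering_accuracy` and the toy labels are assumptions.

```python
from itertools import permutations

def clustering_accuracy(true_labels, cluster_ids):
    """Best accuracy over all one-to-one cluster-to-class mappings (hypothetical helper)."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    if len(clusters) > len(classes):
        raise ValueError("this sketch assumes no more clusters than classes")
    best_hits = 0
    # Brute force: assign each cluster a distinct class and count the matches.
    for assigned in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, assigned))
        hits = sum(mapping[c] == t for t, c in zip(true_labels, cluster_ids))
        best_hits = max(best_hits, hits)
    return best_hits / len(true_labels)

# Toy example (made up): 8 samples, 3 classes, one sample in the wrong cluster.
truth = ["a", "a", "a", "b", "b", "c", "c", "c"]
found = [2, 2, 0, 0, 0, 1, 1, 1]
acc = clustering_accuracy(truth, found)  # 7 of 8 samples match under the best mapping
```

The brute-force search grows factorially with the number of clusters, so it is only practical for a handful of groups. For larger problems an assignment solver such as the Hungarian algorithm is the usual substitute.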
Task 3: Use supervised machine learning methods to construct classification models (each
group member should work on one model). You will also need to incorporate at least one feature
selection technique in your models.
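As one example of a feature selection technique, the sketch below implements a simple filter method: it ranks features by the absolute Pearson correlation between each feature and a binary class label and keeps the top k. This is only one of many possible choices (wrapper and embedded methods are alternatives), and it is written in Python for illustration only (the submitted program is expected to be in R). The names `pearson` and `top_k_features` and the toy data are assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def top_k_features(rows, labels, k):
    """Return indices of the k features most correlated with the 0/1 labels."""
    n_features = len(rows[0])
    scored = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        scored.append((abs(pearson(column, labels)), j))
    # Highest score first; ties are broken by the lower feature index.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scored[:k]]

# Toy data (made up): 6 samples, 3 features, binary labels.
rows = [
    [1.0, 5.0, 0.2],
    [2.0, 3.0, 0.1],
    [3.0, 4.0, 0.2],
    [4.0, 5.5, 0.1],
    [5.0, 3.5, 0.2],
    [6.0, 4.5, 0.1],
]
labels = [0, 0, 0, 1, 1, 1]
selected = top_k_features(rows, labels, k=2)
```

To avoid an optimistic bias in the reported performance, the selection should be computed on the training portion of each split only and then applied unchanged to the held-out data.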
Task 4: Evaluate and compare the classification models using appropriate criteria, and draw
your conclusions based on the computational evaluation of your models.
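Which criteria are appropriate depends on the dataset, but for a binary problem accuracy alone can hide important differences between models. The sketch below computes accuracy, sensitivity, and specificity from a confusion matrix for two hypothetical models evaluated on the same held-out labels. It is written in Python for illustration only (the submitted program is expected to be in R). The function name `binary_metrics` and all labels and predictions are made up.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity, and specificity for a binary problem (hypothetical helper)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1

    def ratio(num, den):
        # Undefined ratios (no samples of that class) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),  # true positive rate
        "specificity": ratio(tn, tn + fp),  # true negative rate
    }

# Made-up held-out labels and predictions from two hypothetical models.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
model_a = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
model_b = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
metrics_a = binary_metrics(y_true, model_a)
metrics_b = binary_metrics(y_true, model_b)
```

In this toy case both models reach the same accuracy, yet model A detects more of the positive cases while model B produces no false positives. Differences of this kind are worth discussing when drawing conclusions.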
Presentation requirements (10 minutes). The time limit is strict. Please structure
your presentation to include:
• Intro to the problem
• Methods
• Experimental protocols
• Results
• Conclusion
What to submit
1. Report (typed, 12 pt font, double spaced). The report should include the sections
Introduction, Methods, Results, Conclusion, and References. The contribution of each
group member to the project should be stated in the final paragraph of the report.
The report should be no more than 12 pages in length. However, if you wish to
include additional figures or tables, please organize them into an appendix.
2. Your group presentation file.
3. The program used to generate your results (R) and the cleaned dataset.
Submit all files to Blackboard in a single zip file (BME530 FinalProject.Names). Only one
submission per group is needed. All members of a group will receive the same score
unless there is a special reason to do otherwise.