Question

COMP 5070 Exam SP5 2018


COMP 5070 Statistical Programming for Data Science
Take Home Exam
DUE: by 11:55 PM (CST), Friday 23rd November

• The take-home exam is worth 30% of your overall grade. The exam is out of 100 marks.

• The exam is to be submitted online as a compressed file (e.g. .zip, .tar.gz, .gz). This compressed file should include ALL code needed to run your program and any other files you created yourself. You do NOT need to include any data files provided to you, as it will be assumed I too have them :)

• To obtain the maximum available marks you should aim to:

1. Code all requested components (30%).

2. Use a clear style of code presentation (10%). Code clarity is an important part of your submission. Thus you should choose meaningful variable names and adopt the use of comments. You don't need to comment every single line, as this will affect readability; however, you should aim to comment at least each section of code.

3. Have the code run successfully (5%).

4. Output the information in a presentable manner and present your written analysis of the output (55%).

• Plagiarism is a specific form of academic misconduct. Although the University encourages discussing work with others and the Social Forum will support this, ultimately this submission is to represent your individual work. If plagiarism is found, all parties will be penalised. You should retain copies of all assignment computer files used during development. These files must remain unchanged after submission, for the purpose of checking if required.

• For the purpose of this exam, a "paragraph" is considered to consist of approximately 6-8 lines. You are welcome to exceed this amount :)

• This exam appears longer than it actually is: explanations are given to help you understand the requested analyses and I have also provided hints.

• You do not need to write specialised code as you did for the assignments. You should be able to find nearly all the code you need from the R files provided throughout the course, via case studies and other examples. If you copy/paste code from the R code I have provided, this should give you nearly 100% of the code needed for this exam, with a few alterations on your behalf (e.g. filenames, variable names etc).

Question 1 (60 Marks)
It's All in the Taste
Experts vs Amateurs

Who is better at discerning the tastes of supermarket chocolate? Do you really need training to know if you like it? Or does it all just taste really good?

The Experts battle it out against a group of dedicated chocolate-eating Amateurs!

I would really like to have that job :)

The data for this question are the responses to the sensometric qualities of chocolate that can be purchased in supermarkets. Two groups were asked to rate the qualities of the chocolates: the first group contained a panel of sensometric experts with responses recorded over 9 different tasting sessions. The accompanying data is in chocolate_experts.csv.

The second group contained a panel of volunteers chosen to represent 'regular shoppers' who underwent a three-hour sensometric training session before rating the qualities of the chocolate over 2 different tasting sessions. The accompanying data is in chocolate_amateurs.csv.

The responses were recorded over a continuous scale from 0 to 10 with 0 indicating the absence of the sensometric quality and 10 indicating fully present. It is of interest to determine if experts perceive supermarket chocolate differently to non-experts (the amateurs) using 14 sensometric variables (Chocolate Aroma through to Granular Texture in the data files).

For this question you need to randomly obtain two session ids for the expert responses only by making a call to sample as shown below. The two numbers that are returned are your session ids that you need to extract for your analysis.

sample(9,2)

For the expert data you will only need to analyse the responses corresponding to the two randomly selected session ids. Amateur data needs to be used in full.
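
As a rough illustration of the step above, the following base R sketch draws two session ids and keeps only the matching expert rows. The column name `Session`, the toy data frame and the seed are assumptions made for this example; the real column names in chocolate_experts.csv may differ.

```r
# Sketch only: `Session` and the toy data are assumptions standing in for
# chocolate_experts.csv, which is not reproduced here.
set.seed(5070)               # assumed seed, used only to make the draw repeatable
session_ids <- sample(9, 2)  # two distinct session ids drawn from 1..9

# Toy expert data: 9 sessions, 3 ratings per session, on the 0-10 scale
experts <- data.frame(
  Session   = rep(1:9, each = 3),
  Sweetness = runif(27, min = 0, max = 10)
)

# Keep only the rows that belong to the two sampled sessions
experts_subset <- experts[experts$Session %in% session_ids, ]
print(table(experts_subset$Session))
```

Because sample() draws without replacement by default, the two ids are always different.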

You are asked to compare the responses between the two groups as requested in each part below. A partially written R script is available as part of the exam package. You must use this script for your analysis and follow the instructions therein. Any line marked with

# ### !!! EXAM TIP !!!

requires you to change that line of code to suit your purposes. Further details are provided in the code comments around that line.

For the purposes of this exam a paragraph is 8-12 lines of text. Specifically, your analysis should include:

i) Initial Data Discussion: Write a short explanation (approximately 1 paragraph) of the analysis to be performed and an explanation of the data. Include your session IDs for the expert responses, and any data manipulation performed prior to analysis should you do so.

ii) Exploratory Factor Analysis: conduct two separate exploratory factor analyses: the first for your selected id sessions for the expert responses, the other for the full set of amateur responses. You may present the analyses side-by-side or in sequence; however you believe is best. For each Exploratory Factor Analysis you only need to include the following:

   • If appropriate, Cronbach Alpha output and a short discussion (2-3 lines) of whether the data is trustworthy and why.

   • Correlation output of your choosing (graphical and/or numerical) with an accompanying discussion (3-4 lines). If numerical, round the correlations to 2 digits.

   • A single paragraph explaining the outcome of the determinant test, Bartlett's test of sphericity and the KMO statistic for both data sets. Do not include R output.

   • Your decision regarding the number of factors to estimate (scree plot may be shown, do not show the R console output).

   • The FINAL factor solution. You do not need to discuss results of any of the other solutions, however you should justify your final factor solution, including loadings, and name the factors in each analysis. You should also include up to two sentences indicating whether the test of residuals was passed and whether the factors are correlated.

   • All factors should be named and an explanation as to how you came up with these names should be included.

   • Based on the factor analysis results and your chosen factor names, discuss the factors that have emerged from the study. What types of differences (if any) exist between the expert and amateur sensometric ratings?


iii) Conclusions: write 2 paragraphs of conclusions based on your analysis.
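
The correlation, determinant and Bartlett checks requested in part ii) can be sketched in base R as follows. This is only an illustration on made-up data: the data frame `df` and its columns are assumptions, the Bartlett statistic is computed by hand from its textbook formula rather than with the course script, and the 0.00001 determinant threshold is a commonly quoted rule of thumb rather than something stated in the exam.

```r
# Sketch only: `df` is toy data standing in for the 14 sensometric ratings.
set.seed(1)
n <- 100
shared <- rnorm(n)
df <- data.frame(
  a = shared + rnorm(n, sd = 0.5),  # a and b share a common component
  b = shared + rnorm(n, sd = 0.5),
  c = rnorm(n)                      # c is unrelated noise
)

# Correlation matrix, rounded to two decimal places for reporting
cor_mat <- cor(df)
print(round(cor_mat, 2))

# Determinant check: values extremely close to 0 suggest multicollinearity
# (0.00001 is a commonly quoted rule-of-thumb threshold, assumed here)
cor_det <- det(cor_mat)
print(cor_det > 0.00001)

# Bartlett's test of sphericity, computed from its definition:
# chi-square = -(n - 1 - (2p + 5) / 6) * ln|R|, with p(p - 1) / 2 degrees of freedom
p <- ncol(df)
chi_sq <- -(n - 1 - (2 * p + 5) / 6) * log(cor_det)
dof <- p * (p - 1) / 2
p_value <- pchisq(chi_sq, df = dof, lower.tail = FALSE)
print(c(chi_sq = chi_sq, dof = dof, p_value = p_value))
```

A small p-value rejects the hypothesis that the correlation matrix is an identity matrix, which is the precondition for factor analysis being worthwhile.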

Hints:
• To make the correlation matrix more readable, use the round() command in R, e.g.
round(cor(df), 2)
will compute the correlation matrix of the data in the matrix df, to two decimal places. You can use this tip for any other matrices too.

• The best solution may or may not be the rotated solution, based on your randomly selected sessions. Choose your solution based on the principles of a good Exploratory Factor Analysis (EFA).

• If items are not loading on to a factor, one reason could be that you have not extracted enough factors from the data. Reconsider your analysis if necessary; however, this may not solve the problem. Use the principles of EFA to make your final decision.

• While no split loadings are desirable in EFA, a small number may be unavoidable. Again you should ultimately choose your final solution based on the principles of what constitutes a good Exploratory Factor Analysis.

• If the correlations between factors suggest an oblique rotation is required, simply note this in your discussion. Do not re-run the analysis.
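
To make the hints about the number of factors, rotation and loadings more concrete, here is a minimal sketch on simulated data. It uses `factanal()` from base R's stats package as a stand-in; the course script may well use a different function, and the variable names, loadings and sample size below are all invented for the example.

```r
# Sketch only: simulated ratings driven by two made-up latent factors.
set.seed(2018)
n <- 300
f1 <- rnorm(n)
f2 <- rnorm(n)
noise <- function() rnorm(n, sd = 0.6)
toy <- data.frame(
  x1 = 0.8 * f1 + noise(), x2 = 0.8 * f1 + noise(), x3 = 0.8 * f1 + noise(),
  x4 = 0.8 * f2 + noise(), x5 = 0.8 * f2 + noise(), x6 = 0.8 * f2 + noise()
)

# Eigenvalues of the correlation matrix: the basis of a scree plot
eigenvalues <- eigen(cor(toy))$values
plot(eigenvalues, type = "b", xlab = "Factor number", ylab = "Eigenvalue",
     main = "Scree plot (toy data)")
n_factors <- sum(eigenvalues > 1)  # Kaiser criterion, one of several heuristics

# Maximum-likelihood factor analysis with an orthogonal (varimax) rotation
fit <- factanal(toy, factors = n_factors, rotation = "varimax")

# Hide small loadings so split loadings and non-loading items stand out
print(fit$loadings, cutoff = 0.3)
```

Items with no loading above the cutoff on any factor, or with loadings above it on more than one factor, are the cases the hints ask you to weigh when choosing the final solution.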



Question 2 (40 Marks)
Are We There Yet?
Clustering Cities Around the World

The data for this question are distances between cities in different regions of the world.

You will need to use the data set individually assigned to you.
The file cities.xlsx on the Assignments page indicates the continent assigned to each student.

Each data set contains a distance matrix and can be found on the assignments page, in a file of the form RegionCitiesClustering.dat. For example, for the European data the file will be called EuropeanCitiesClustering.dat. For this question, you are asked to conduct clustering analysis using both hierarchical and partitional clustering techniques.

For the purposes of this exam a paragraph is 8-12 lines of text. Specifically, your analysis should include:

i) Initial Data Discussion: Write a short explanation (approximately 1 paragraph) of the analysis to be performed and an explanation of the data including any data manipulation performed prior to clustering.

ii) Hierarchical clustering: conduct hierarchical clustering on the data, choosing an appropriate AGNES-based method based on either single, complete, average-linkage or Ward's method. Ensure you justify your choice in your write-up and include the resulting dendrogram, as well as a discussion of the outcomes of hierarchical clustering on your data.

iii) Partitional clustering: conduct a partitional clustering of your data using K-means. Ensure you explain and include any relevant R output (including graphics) supporting your choice of k, the number of clusters.

iv) Discussion: (1-2 paragraphs) of your results.

v) Validation: as a form of cluster validation, consider the following:

If there are obvious outliers or distances that should be removed, identify these in your write-up and re-run your chosen Partitional Clustering algorithm, adjusting k if necessary. Include justification of your choice of the new value for k.

If there are no obvious outliers/distances that should be removed, then explain this conclusion with justification. In this case re-run your chosen Partitional Clustering algorithm for a different value of k to that used in Step 3 above. Include justification of your choice for the new value for k.

vi) Conclusions: write 2 paragraphs of conclusions based on your analysis including a statement regarding which clustering solution is the better one and why.


Hint:

• For hierarchical clustering, ensure you define the height of the dendrogram according to the size of the values in the output.
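
The two clustering steps can be sketched in base R on a toy distance matrix. Everything specific here is an assumption for illustration: the six "cities" and their coordinates are invented, average linkage is just one of the allowed choices, and because `kmeans()` works on coordinates rather than distances, the sketch first embeds the distance matrix with classical multidimensional scaling (`cmdscale()`), which is one possible approach and not necessarily the one used in the course material.

```r
# Sketch only: six made-up cities in two well-separated groups.
xy <- matrix(c(0, 0,   3, 4,   6, 1,
               100, 50, 104, 53, 98, 57),
             ncol = 2, byrow = TRUE,
             dimnames = list(paste0("City", 1:6), c("x", "y")))
city_dist <- dist(xy)  # from here on only the distance matrix is used

# Agglomerative clustering with average linkage
hc <- hclust(city_dist, method = "average")
plot(hc, main = "Toy cities, average linkage")

# Cut the tree at a height chosen from the scale of the merge heights
print(hc$height)
hc_groups <- cutree(hc, h = 50)

# K-means needs coordinates, so embed the distances with classical MDS
coords <- cmdscale(city_dist, k = 2)

# Total within-cluster sum of squares for k = 1..4, to support the choice of k
set.seed(123)
wss <- sapply(1:4, function(k) kmeans(coords, centers = k, nstart = 10)$tot.withinss)
plot(1:4, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")

km <- kmeans(coords, centers = 2, nstart = 10)

# Cross-tabulate the two solutions to see whether they agree
print(table(hierarchical = hc_groups, kmeans = km$cluster))
```

Printing `hc$height` before cutting is what ties the cut height to the size of the values in the output, as the hint above suggests.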

Aakarsh · Accepted Answer

Ques1/Chocolate.pdf
Sensometric qualities of Chocolates
Two groups were asked to rate the qualities of the chocolates. The responses were recorded on a continuous scale from 0 to 10, with 0 indicating the absence of the sensometric quality and 10 indicating that it is fully present.
The first group contained a panel of sensometric experts, with responses recorded over 9 different tasting sessions.
The second group contained a panel of volunteers chosen to represent 'regular shoppers', who underwent a three-hour sensometric training session before rating the qualities of the chocolate over 2 different tasting sessions.
Let's determine whether experts perceive supermarket chocolate differently from non-experts (the amateurs) using 14 sensometric variables.
Initial Data Discussion
The following sensometric variables of chocolate quality were rated by the groups of experts and amateurs on a scale of 0 to 10.
##  [1] "Chocolate.Aroma"   "Milk.Aroma"        "Sweetness"        
##  [4] "Acidity"           "Bitterness"        "Chocolate.Flavour"
##  [7] "Milk.Flavour"      "Caramel.Flavour"   "Vanilla.Flavour"  
## [10] "Astringency"       "Crispy.Texture"    "Melting.Texture"  
## [13] "Sticky.Texture"    "Granular.Texture"
Let's carry out an Exploratory Factor Analysis for each group separately and see how the two relate to each other. We will try to identify the most useful sensometric variables and compare the results for experts and amateurs. We will check whether the data are trustworthy and how the variables are correlated using various statistical methods and tests, and visualise them with plots for better analysis. Finally, we will draw conclusions based on the analysis performed.
Exploratory Factor Analysis
For experts data
Cronbach Alpha output
Cronbach's alpha is a measure of the reliability and internal consistency of the sampling instrument; it examines whether all the items measure the same underlying construct.
## 
## Reliability analysis   
## Call: alpha(x = choc_e_sess)
## 
##   raw_alpha std.alpha G6(smc) average_r  S/N   ase mean   sd median_r
##       0.46      0.48    0.73     0.061 0.91 0.046  3.7 0.78    0.054
## 
##  lower alpha upper     95% confidence boundaries
## 0.37 0.46 0.55 
## 
##  Reliability if an item is dropped:
##                   raw_alpha std.alpha G6(smc) average_r  S/N alpha se
## Chocolate.Aroma        0.41      0.44    0.69     0.056 0.78    0.051
## Milk.Aroma             0.50      0.51    0.73     0.074 1.04    0.042
## Sweetness              0.49      0.49    0.73     0.068 0.96    0.043
## Acidity                0.42      0.45    0.72     0.059 0.81    0.051
## Bitterness             0.47      0.49    0.72     0.068 0.95    0.046
## Chocolate.Flavour      0.44      0.46    0.71     0.062 0.87    0.049
## Milk.Flavour           0.49      0.48    0.70     0.066 0.91    0.043
## Caramel.Flavour        0.44      0.43    0.70     0.056 0.77    0.048
## Vanilla.Flavour        0.43      0.42    0.71     0.054 0.74    0.049
## Astringency            0.36      0.41    0.69     0.050 0.68    0.056
## Crispy.Texture         0.44      0.47    0.72     0.063 0.88    0.048
## Melting.Texture        0.46      0.47    0.73     0.064 0.89    0.046
## Sticky.Texture         0.41      0.42    0.72     0.053 0.73    0.050
## Granular.Texture       0.45      0.48    0.74     0.065 0.91    0.048
##                   var.r   med.r
## Chocolate.Aroma   0.102  0.0388
## Milk.Aroma        0.097  0.0604
## Sweetness         0.110  0.0604
## Acidity           0.118  0.0604
## Bitterness        0.098  0.0604
## Chocolate.Flavour 0.098  0.0604
## Milk.Flavour      0.091  0.0604
## Caramel.Flavour   0.106  0.0604
## Vanilla.Flavour   0.116  0.0604
## Astringency       0.114  0.0309
## Crispy.Texture    0.108  0.0322
## Melting.Texture   0.117  0.0604
## Sticky.Texture    0.123 -0.0033
## Granular.Texture  0.117  0.0322
## 
##  Item statistics 
##                     n raw.r std.r r.cor  r.drop mean  sd
## Chocolate.Aroma   318  0.47  0.44  0.43  0.2980  6.1 2.2
## Milk.Aroma        318  0.11  0.16  0.12 -0.0857  2.1 2.1
## Sweetness         318  0.21  0.25  0.16 -0.0033  4.3 2.3
## Acidity           318  0.45  0.40  0.33  0.2594  3.1 2.3
## Bitterness        318  0.34  0.26  0.22  0.0958  4.2 2.7
## Chocolate.Flavour 318  0.38  0.34  0.32  0.2012  6.2 2.1
## Milk.Flavour      318  0.22  0.29  0.29  0.0110  1.9 2.3
## Caramel.Flavour   318  0.37  0.45  0.43  0.2067  1.6 1.8
## Vanilla.Flavour   318  0.39  0.48  0.42  0.2648  1.3 1.4
## Astringency       318  0.60  0.53  0.51  0.4162  3.6 2.6
## Crispy.Texture    318  0.37  0.33  0.26  0.1788  5.9 2.2
## Melting.Texture   318  0.29  0.31  0.22  0.0940  4.8 2.2
## Sticky.Texture    318  0.46  0.48  0.38  0.2764  3.7 2.2
## Granular.Texture  318  0.33  0.30  0.18  0.1393  2.9 2.1
The raw alpha is 0.46 (95% confidence interval 0.37 to 0.55), which is weak, and dropping any single variable changes it only slightly (between 0.36 and 0.50), so all variables are kept. This indicates that the data are not very reliable.
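
For reference, the raw alpha reported above can be reproduced from its definition, alpha = k / (k - 1) * (1 - sum of item variances / variance of the total score). The sketch below is a base R illustration on simulated items, not the `alpha()` call used in this report; the helper name `cronbach_alpha` and the toy data are assumptions.

```r
# Sketch only: hand-rolled raw Cronbach's alpha on simulated items.
cronbach_alpha <- function(items) {
  items <- as.matrix(items)
  k <- ncol(items)                   # number of items
  item_vars <- apply(items, 2, var)  # variance of each item
  total_var <- var(rowSums(items))   # variance of the summed score
  (k / (k - 1)) * (1 - sum(item_vars) / total_var)
}

# Three toy items that share one underlying trait plus independent noise
set.seed(42)
trait <- rnorm(200)
toy <- data.frame(
  q1 = trait + rnorm(200),
  q2 = trait + rnorm(200),
  q3 = trait + rnorm(200)
)
toy_alpha <- cronbach_alpha(toy)
print(round(toy_alpha, 2))
```

Items that correlate negatively with the rest reduce the variance of the total score and therefore pull alpha down.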
Correlation Matrix
Here correlation is represented using color intensity.
##                   Chocolate.Aroma Milk.Aroma Sweetness Acidity Bitterness
## Chocolate.Aroma              1.00      -0.56     -0.26    0.28       0.48
## Milk.Aroma                  -0.56       1.00      0.30   -0.05      -0.41
## Sweetness                   -0.26       0.30      1.00   -0.22      -0.51
## Acidity                      0.28      -0.05     -0.22    1.00       0.42
## Bitterness                   0.48      -0.41     -0.51    0.42       1.00
## Chocolate.Flavour            0.72      -0.49     -0.43    0.24       0.61
## Milk.Flavour                -0.42       0.77      0.42   -0.13      -0.50
## Caramel.Flavour             -0.20       0.48      0.30   -0.03      -0.31
## Vanilla.Flavour             -0.01       0.28      0.21   -0.07      -0.21
## Astringency                  0.34      -0.21     -0.15    0.49       0.59
## Crispy.Texture               0.60      -0.47     -0.06    0.11       0.33
## Melting.Texture             -0.11       0.28      0.38   -0.24      -0.19
## Sticky.Texture               0.05       0.13      0.31    0.01      -0.21
## Granular.Texture             0.27      -0.24     -0.07    0.21       0.19
##                   Chocolate.Flavour Milk.Flavour Caramel.Flavour
## Chocolate.Aroma                0.72        -0.42           -0.20
## Milk.Aroma                    -0.49         0.77            0.48
## Sweetness                     -0.43         0.42            0.30
## Acidity                        0.24        -0.13           -0.03
## Bitterness                     0.61        -0.50           -0.31
## Chocolate.Flavour              1.00        -0.47           -0.24
## Milk.Flavour                  -0.47         1.00            0.70
## Caramel.Flavour               -0.24         0.70            1.00
## Vanilla.Flavour               -0.06         0.45            0.61
## Astringency                    0.33        -0.26           -0.09
## Crispy.Texture                 0.48        -0.46           -0.32
## Melting.Texture               -0.30         0.38            0.27
## Sticky.Texture                -0.03         0.27            0.25
## Granular.Texture               0.35        -0.29           -0.19
##                   Vanilla.Flavour Astringency Crispy.Texture
## Chocolate.Aroma             -0.01        0.34           0.60
## Milk.Aroma                   0.28       -0.21          -0.47
## Sweetness                    0.21       -0.15          -0.06
## Acidity                     -0.07        0.49           0.11
## Bitterness                  -0.21        0.59           0.33
## Chocolate.Flavour           -0.06        0.33           0.48
## Milk.Flavour                 0.45       -0.26          -0.46
## Caramel.Flavour              0.61       -0.09          -0.32
## Vanilla.Flavour              1.00        0.01          -0.16
## Astringency                  0.01        1.00           0.26
## Crispy.Texture              -0.16        0.26           1.00
## Melting.Texture              0.20        0.00           0.00
## Sticky.Texture               0.26        0.08           0.07
## Granular.Texture            -0.13        0.30           0.26
##                   Melting.Texture Sticky.Texture Granular.Texture
## Chocolate.Aroma             -0.11           0.05             0.27
## Milk.Aroma                   0.28           0.13            -0.24
## Sweetness                    0.38           0.31            -0.07
## Acidity                     -0.24           0.01             0.21
## Bitterness                  -0.19          -0.21             0.19
## Chocolate.Flavour           -0.30          -0.03             0.35
## Milk.Flavour                 0.38           0.27            -0.29
## Caramel.Flavour              0.27           0.25            -0.
