(1) Understanding the Dataset: UNSW-NB15
The raw network packets of the UNSW-NB15 dataset were created by the IXIA PerfectStorm
tool in the Cyber Range Lab of the Australian Centre for Cyber Security (ACCS) to generate
a hybrid of real modern normal activities and synthetic contemporary attack behaviours.
The tcpdump tool was used to capture 100 GB of the raw traffic (e.g., pcap files). This dataset has
nine types of attacks, namely Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic,
Reconnaissance, Shellcode and Worms. The Argus and Bro-IDS tools were used, and twelve
algorithms were developed, to generate a total of 49 features with the class label.
a) The features are described here.
b) The number of attack records and their sub-categories is described here.
c) In this coursework, we use a total of 10 million records stored in the CSV file (download).
The total size is about 600 MB, which is big enough to employ big data methodologies
for analytics. As a big data specialist, we would first like to read and understand the
features, then apply modelling techniques. If you want to see a few records of this
dataset, you can import it into Hadoop HDFS, then make a Hive query that prints the
first 5-10 records for your understanding (see the sketch after this list).
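A minimal sketch of this preview step is given below. It assumes the CSV has already been copied into HDFS (for example with hdfs dfs -put UNSW-NB15.csv /user/student/unsw_nb15/) and shows only a handful of illustrative columns from the UNSW-NB15 feature list (srcip, sport, dstip, dsport, proto, state, dur, attack_cat, label); the HDFS path, table name and full 49-column schema are assumptions you will need to adapt:

    -- Assumed HDFS location; upload the CSV first, e.g. hdfs dfs -put UNSW-NB15.csv /user/student/unsw_nb15/
    CREATE EXTERNAL TABLE IF NOT EXISTS unsw_nb15_raw (
      srcip   STRING,
      sport   STRING,   -- kept as STRING to be safe, since some recorded port values are not plain integers
      dstip   STRING,
      dsport  STRING,
      proto   STRING,
      state   STRING,
      dur     DOUBLE
      -- add the remaining feature columns here so the schema covers all 49 features,
      -- ending with attack_cat STRING and label INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/user/student/unsw_nb15/';

    -- Print the first 10 records for a quick look at the data
    SELECT * FROM unsw_nb15_raw LIMIT 10;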
(2) Big Data Query & Analysis by Apache Hive [30 marks]
This task is about using Apache Hive to convert big raw data into useful information for the
end users. To do so, first understand the dataset carefully. Then, make at least 4 Hive
queries (refer to the marking scheme). Apply appropriate visualization tools to present
your findings numerically and graphically, and interpret your findings briefly.
Finally, include screenshots of your outcomes (e.g., tables and plots) together with the
scripts/queries in the report.
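As an illustration only (not one of the required queries), a simple HiveQL sketch that counts records per attack category is shown below; it assumes the unsw_nb15_raw table sketched in section (1), including its attack_cat column:

    -- Number of records per attack category (normal traffic typically has an empty attack_cat)
    SELECT attack_cat, COUNT(*) AS record_count
    FROM unsw_nb15_raw
    GROUP BY attack_cat
    ORDER BY record_count DESC;

The query output can then be exported (for example by redirecting the output of hive -f to a file) and plotted with a visualization tool of your choice.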