Assignment Instructions For The Data Analytics Report
You are required to access a large data set and apply the CRISP-DM methodology to
meaningfully clean, transform, analyse and evaluate it. As part of this process, you are
required to subsequently apply one or more machine learning technique(s) of your choice to
perform classification, association, numerical prediction and/or clustering tasks (or
combinations thereof).
You will present the outcome of the above tasks in the form of a technical report containing
the five sections listed in Table 1.
As shown in Table 1, a page limit of 10 pages is recommended. The report, however,
must not exceed 13 pages in total (excluding the title page, contents page, references,
bibliography and appendices), with a minimum font size of 10 point. A penalty of a
single grade will be incurred if you exceed the 13-page limit. Further information
(supporting experimental results) can be added as appendices.
You are free to select the style of the report (i.e., section headings and format, etc.) although
it must obviously address the content listed in Table 1.
You must upload the following with your submission:
o Training, validation and test sets (before and after pre-processing). Note that if
cross-validation is used, only the training and test sets are required;
o Report (MUST be in MS Word format).
The remainder of this section provides you with detailed requirements for each area of
content – you should READ IT VERY CAREFULLY.
Section 1: Data Set Selection and Description
Select a data set consisting of at least 2,000 observations/records, and preferably
more than 10,000. You are strongly encouraged to identify an anonymised data set
relevant to either your role at work or, more broadly, the strategic objectives of the
business. However, if this is not possible, then you are advised to select a data set
from one of the following sources:
o https://www.kaggle.com/datasets
o http://www.cs.waikato.ac.nz/ml/weka/datasets.html
o http://yann.lecun.com/exdb/mnist/
o https://www.springboard.com/blog/free-public-data-sets-data-science-project/
Briefly describe your data set and reference its origin.
If you have 15 or fewer attributes, table your attributes with attribute name,
description and data type, and then show the minimum/average/maximum and
standard deviation (stdev) values for the training set and test set. For nominal
variables, show the most and least frequently occurring nominal value(s). If you have
more than 15 attributes, group the attributes into themes (e.g., customer, orders,
employees) and describe the type of information and the data types in each theme,
including the number of each variable type (e.g., nominal, interval, ratio, etc.). You
may want to highlight significant variables identified by an attribute selection
algorithm.
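If you work in Python rather than Weka, the tabulated statistics above can be generated with a few lines of pandas. The sketch below is illustrative only: the file name my_dataset.csv is hypothetical, and you should adapt the column handling to your own data types.

import pandas as pd

df = pd.read_csv("my_dataset.csv")   # hypothetical file name

# Numeric attributes: min / mean / max / standard deviation,
# one row per attribute -- ready to paste into a report table
numeric_summary = df.select_dtypes(include="number").agg(
    ["min", "mean", "max", "std"]).T
print(numeric_summary)

# Nominal attributes: most and least frequently occurring value
for col in df.select_dtypes(include="object"):
    counts = df[col].value_counts()
    print(col, "| most frequent:", counts.idxmax(),
          "| least frequent:", counts.idxmin())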
Briefly table the following characteristics of the entire data set: number of instances,
patterns per target class (if classification), limitations such as possible conflicting
patterns, missing values, outliers/erroneous values.
Explain how you have sampled your data to create the ‘in sample’ and ‘out of sample’
data sets. If you have used instance weightings to balance your data set(s), explain
how the weightings were determined.
Provide a statistical summary in tabular form for the resulting ‘in sample’
(training/validation set) and ‘out of sample’ (test set). Also, state whether or not there
was any overlap in training and test set instances and if so, justify why your test set is
not compromised.
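As an illustration, a stratified hold-out split with an overlap check might look as follows in Python (scikit-learn). The 'target' column name and the 80/20 ratio are assumptions for the sketch, not requirements.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("my_dataset.csv")                    # hypothetical file name
X, y = df.drop(columns="target"), df["target"]        # 'target' is illustrative

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42  # stratified 80/20 split
)

# Sanity check: no instance appears in both the training and test sets
assert X_train.index.intersection(X_test.index).empty

# Class distributions of the 'in sample' and 'out of sample' sets
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))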
Section 2: Data Pre-processing and Transformation
What pre-processing and transformation was performed on the variables and why?
(e.g., standardising numerical variables and/or using scaling, taking logs to reduce
skewness, or log differences to reduce non-stationarity; converting numerical
variables to discrete ones; converting numerical or symbolic patterns into bit patterns;
removing patterns with missing or outlier values; adding noise or jitter to patterns to
expand the data set; adding instance weightings or replicating certain pattern classes
to improve class distributions; transforming time-series data into static training/test
patterns).
How did you ensure that your pre-processing did not compromise your test set (e.g.,
when standardising, were the scaling statistics estimated from the training set alone)?
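One common way to avoid such leakage is to estimate all scaling statistics on the training set only and then re-apply them unchanged to the test set. A minimal scikit-learn sketch, assuming the split from the earlier example and all-numeric attributes:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Optional transforms applied identically to both sets, e.g. a log to
# reduce skewness ('income' is an illustrative column name):
# X_train["income"] = np.log1p(X_train["income"])
# X_test["income"] = np.log1p(X_test["income"])

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)   # statistics estimated here only
X_test_std = scaler.transform(X_test)         # test set reuses training mean/std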
For those seeking a higher Distinction, you must clearly show how you have
addressed the ‘curse of dimensionality’, i.e., if you reduced the number of
dimensions (e.g., from 30 attributes to 10), how did you do this? An autoencoder?
PCA? A filter using the InfoGain measure? A clusterer? How do these methods
work, and what are their advantages/disadvantages? Also, if you increased the
number of training instances, how did you do this?
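For example, a PCA-based reduction fitted on the training set only might be sketched as follows (scikit-learn; the choice of 10 components and the variables from the earlier sketches are illustrative):

from sklearn.decomposition import PCA

pca = PCA(n_components=10)                     # e.g. 30 attributes down to 10
X_train_pca = pca.fit_transform(X_train_std)   # fitted on training data only
X_test_pca = pca.transform(X_test_std)

# Proportion of the original variance retained by the 10 components
print(pca.explained_variance_ratio_.sum())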
Section 3: Machine Learning Methods
Clearly state the machine learning methods you will be using and the function(s) you
will be expecting them to perform (e.g., classification, association, regression,
clustering or combinations thereof for self-supervised learning). You must describe
the expected ‘input to’ and ‘output from’ each model.
Explain and justify the machine learning method(s) chosen for the task. You must also
use a simple benchmark model with which to compare your chosen machine learning
model(s) (e.g., benchmark a neural network trained with back-propagation against a
simple OneR or Naive Bayes approach).
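As an illustration of such a benchmark comparison in Python (scikit-learn), with Gaussian Naive Bayes as the simple benchmark and a random forest standing in for the chosen model (both choices are assumptions for the sketch, reusing the earlier variables):

from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

models = [
    ("Naive Bayes (benchmark)", GaussianNB()),
    ("Random forest (chosen model)", RandomForestClassifier(random_state=42)),
]
for name, model in models:
    model.fit(X_train_std, y_train)
    print(name, "- test accuracy:", model.score(X_test_std, y_test))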
Briefly highlight the strengths and weaknesses of the chosen learning method(s).
Describe your ‘model fitting’ and ‘model selection’ process (e.g., leave-one-out
cross-validation, k-fold cross-validation, bagging and boosting, etc.). You must state
and justify the hyper-parameters used for model fitting and explain how
‘over-training’ will be minimised.
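For instance, cross-validated hyper-parameter selection might be sketched as follows (scikit-learn; the model and parameter grid are illustrative, and the search deliberately touches the training set only):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [5, 10, None]},
    cv=5,                  # 5-fold cross-validation on the training set only
    scoring="accuracy",
)
grid.fit(X_train_std, y_train)
print(grid.best_params_, grid.best_score_)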
Describe the tool you used to implement the machine learning method(s) (e.g.,
Weka/Java).
Use advanced features of the chosen analytics tool, including (though not limited to)
clear evidence of meaningful programming/scripting activity to use machine learning
and/or pre-processing tools in a bespoke way (e.g., install and use advanced Weka
packages via the Package Manager – examples might be simple recurrent networks,
convolutional neural networks, self-organising maps, or time-series processing with
ARIMA models). If you are not using Weka, then a clear explanation of what you
have developed is required, with all source code and build files uploaded.
Section 4: Results and Evaluation
Table the resulting ‘in sample’ (training) and ‘out of sample’ (test) performance of
your model for the different model configurations and trial runs (e.g., a neural net
with different numbers of hidden nodes, different random starting weights and/or
different learning rates). You should use at least one of the following performance
metrics, as appropriate (a sketch illustrating several of them follows the list):
o Percent correct/incorrect
o Confusion matrix
o Recall and precision
o Evaluating numeric prediction (e.g., mean squared error (MSE), root mean
squared error (RMSE), correlation coefficients)
o ROC curve
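A minimal Python sketch of several of these metrics for a fitted classifier, reusing the fitted grid search and variables from the earlier sketches:

from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

y_pred = grid.predict(X_test_std)                  # fitted model from earlier
print("Percent correct:", 100 * accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))       # per-class recall/precision

# For numeric prediction, substitute e.g. sklearn.metrics.mean_squared_error;
# for a ROC curve/AUC, use predict_proba with sklearn.metrics.roc_auc_score.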
Critically review the performance of the different models. Which type of pre-
processing appeared to be most advantageous, and why? For each model, which
hyper-parameter settings (e.g., learning rate, tree pruning, momentum term) were
most effective?
Critically compare models – was there a model or model class whose performance on
the test set was statistically significantly better than that of the other models/model
classes (with a p-value < 0.05)? You may, for example, use the Experimenter in Weka.
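A simplified Python sketch of such a comparison, using a plain paired t-test over per-fold cross-validation scores of the two illustrative models from the earlier sketches; note that Weka's Experimenter applies a corrected resampled t-test, which is more conservative than the version shown here.

from scipy.stats import ttest_rel
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

scores_a = cross_val_score(RandomForestClassifier(random_state=42),
                           X_train_std, y_train, cv=10)
scores_b = cross_val_score(GaussianNB(), X_train_std, y_train, cv=10)

t_stat, p_value = ttest_rel(scores_a, scores_b)    # paired across the 10 folds
print("p-value:", p_value)                         # significant if p < 0.05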
Section 5: Discussion
Briefly summarise your task and your findings (i.e., whether the model learnt the
problem).
How do your findings relate to similar tasks found in the relevant industry or
academic literature?
Did you gain the insight you intended to? If not, what else could you do to enhance
the usefulness of your analytics?
How did you decide on the most appropriate machine learning method, and what do
you understand by ‘appropriateness’?
Finally, briefly state how you are going to use the knowledge and skills you have
developed in the module to further your professional ambitions and/or the strategic
objectives of your organisation.