Data Processing For Big Data Report

A proposed big data solution for an investment bank that would enable it to detect fraudulent transactions in the banking sector and to improve business profitability by investing in airports through data analytics. The report comprises the technical design and costs for the business solution.


FIT 5202

Technical Design

LAB 05 - Group 02

Name StudentID Authcate Contribution

Faik Canberk 28525450 fayd0001 0.2

Jaideep Singh Dang 30068347 jdan0014 0.2

Xiangtan Lin 29436559 xlin0021 0.2

Varun Mathur 28954114 vmat0002 0.2

Aayush Kapoor 28980875 akap0002 0.2

Word Count (excluding tables and references): 2631


Due Date: October 29, 2018


LAB 05 - GROUP 02 FINAL PROJECT 1


Index
Heading Page #

Business Requirement Briefs 2

Business Case 2

Design Analyses 3

Costing 6

References 10

Minutes of Meeting 11

Reflective Diaries 16

Business Requirement Briefs

In this proposal we suggest a suitable framework that can manage and run the required
operations of the given Investment Banking Enterprise (IBE) at minimal cost and with efficient results.

The proposal describes the recommended technologies for the three main areas of concern:

- Using streaming data to analyse and monitor active credit cards in real time to combat
possible fraud.
- Setting up suitable storage to analyse any relevant publicly available datasets and extract
insights from them to assist investment decisions.
- Upgrading the IBE's computer hardware to provide the capacity required to store data and
execute advanced analytics that further aid business decisions.

Business Case

The proposed solutions are designed to achieve the defined business goals efficiently and at
minimal cost. The aim of the project is to understand the three points of interest carefully and
design a strategic solution that addresses all of them with minimal architecture, fast integration,
room for scalability, and well-built security.

The proposed project keeps the data ingestion and processing for fraud detection (a real-time
requirement) separate from the analysis of publicly available datasets (a non-real-time, static
workload). This separation allows either system to be scaled or altered without disturbing the
other, giving the IBE room to make flexible changes to either of the two separate systems in the
future.

The customers of the IBE are understood to be among its most important stakeholders. With that
in mind, the risks associated with the foundations of the project have been considered.
Integrating the aforementioned split architecture (real-time and static) with the technologies
already in place at the IBE is one area of concern: care must be taken that no data or operations
are lost while the integration is carried out. The risks to data privacy (whether company or
customer data) are addressed by proposing that the system be owned by the IBE rather than
rented. When it comes to investment decisions, we therefore do not foresee misfortunes or
undesired circumstances arising from the loss of data or loss of analytics.

Carried out well, the whole project will bring beneficial change to the IBE. It will improve fraud
detection by running suitable algorithms on appropriate hardware. The static side of the
architecture will also deploy state-of-the-art advanced data analytics on open-source datasets,
utilising them to the fullest to gain insight and guide the IBE towards correct investment
decisions via predictions and inferences.

Design Analyses
- IDENTIFICATION OF DATA SOURCES

As the organisation in question is a large investment banking enterprise, it is reasonably assumed
that it holds a large database for each service it offers. All credit card transactions and details are
generally stored, and it is advisable to use this data to train a model to recognise potential
threats or fraud. Once the model has been trained and tested on previous cases of fraud and
legitimate transactions, it can be used to monitor the streaming data for fraud in real time.

For the airport data there are multiple publicly available sources to draw on. The project
proposes the US Bureau of Transportation Statistics' data to analyse the history and statistics of
the airports and the revenue models they follow. The data holds information such as the number
of arrivals and departures, number of scheduled flights, number of carriers, and top destinations
for each US airport for every month since December 2006 [1].

This static data can be utilised to interpret information required to make investment decisions in
the sector.

Additional data sources that may influence business decisions include the historical record of the
US census (https://www.census.gov/), whose very detailed demographic and social information is
open source. Linking these to US shape files could give significant geospatial insight.

- PROCESSING FRAMEWORK AND DATA INGESTION

For the purpose of this project we propose a Hadoop (with Spark) environment as the underlying
structure for running the required computations. As the organisation has to ingest both batch
and streaming data, we consider this process separately for each. As discussed, the proposal
keeps the ingestion and processing of the two data types separate by using two Hadoop
clusters (HDFS-A and HDFS-B).



(Figure shows the functioning at a macro level)
HDFS-A is designed to focus on compute-heavy and streaming data analysis, while HDFS-B is
designed to be more storage-centric. The idea is not to depend on rented cloud technologies, as
the organisation must deal not only with upscaling in the near future but also with data and
operational security. As discussed earlier, the customers are important stakeholders, and their
financial and personal information carries significant risk. It is therefore important that the data
is hosted in a controlled environment fully under the bank's own control.

HDFS-A is therefore used to process and monitor the streaming credit card data. Real-time
fraud detection can generally be divided into two steps:
- The first step uses the historical data to analyse and build the machine learning model.
- The second step uses the model in production to make predictions or decisions on live
events.

(Figure: flow of the two steps [2].)



In the case of credit card fraud, logistic regression is a suitable approach to measure the
relationship between the probability of fraud and deciding features such as transaction amount,
type of second party, location variability, and time gap between transactions. Since the foundation
in use is Spark, logistic regression is readily available through its MLlib module.
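As a sketch of how such a model scores a transaction, the logistic function maps a weighted sum of the features named above to a fraud probability. The feature names and coefficients below are purely illustrative, not fitted values; in production the weights would be learned from labelled historical transactions (e.g. via Spark MLlib's logistic regression):

```python
import math

# Hypothetical, illustrative coefficients -- a real model would learn these
# from labelled historical transactions (e.g. with Spark MLlib).
WEIGHTS = {
    "amount_zscore": 1.2,         # unusually large transaction amounts
    "location_variability": 0.8,  # rapid changes in transaction location
    "minutes_since_last": -0.05,  # short gaps between transactions raise risk
}
BIAS = -4.0

def fraud_probability(features: dict) -> float:
    """Logistic regression: P(fraud) = 1 / (1 + exp(-(w.x + b)))."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# A routine transaction scores low; an anomalous one scores high.
routine = fraud_probability({"amount_zscore": 0.1, "location_variability": 0.0,
                             "minutes_since_last": 120})
anomalous = fraud_probability({"amount_zscore": 3.5, "location_variability": 4.0,
                               "minutes_since_last": 1})
```

The same decision rule, with fitted coefficients, is what the deployed model applies to each incoming event.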

The proposed architecture thus checks events for fraud with Spark Streaming, applying the
deployed Spark machine learning model.
The project also suggests Kafka as a high-throughput distributed messaging system to deliver
the information to the streaming layer. It is suitable because it integrates with Spark through
external modules and is fully capable of streaming large volumes of data in a fast-paced
environment.
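The shape of this pipeline (a message queue feeding events to a scoring step that flags or clears each transaction) can be sketched in plain Python; the queue stands in for Kafka, and `score` stands in for the deployed model, so all names and the scoring rule here are illustrative assumptions:

```python
from collections import deque

def score(txn):
    """Stand-in for the deployed model's fraud score (illustrative rule only)."""
    return 0.95 if txn["amount"] > 5000 and txn["foreign"] else 0.05

def process_stream(events, threshold=0.5):
    """Consume a queue of transaction events (as Kafka would deliver them)
    and split them into flagged and cleared transaction IDs."""
    flagged, cleared = [], []
    queue = deque(events)  # stand-in for the Kafka topic
    while queue:
        txn = queue.popleft()
        (flagged if score(txn) >= threshold else cleared).append(txn["id"])
    return flagged, cleared

events = [
    {"id": "t1", "amount": 40, "foreign": False},
    {"id": "t2", "amount": 9000, "foreign": True},
    {"id": "t3", "amount": 120, "foreign": False},
]
flagged, cleared = process_stream(events)
```

In the actual system, Spark Streaming would consume the Kafka topic in micro-batches and apply the trained MLlib model in place of `score`.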
For the batch processing of the static data, HDFS-B is used to store and process the information.

- DATA STORAGE DESIGN CONSIDERATIONS

For this project, HDFS-A and HDFS-B have different data storage requirements. Because HDFS-A
deals with high-throughput data, it requires storage physically compatible with that load; for
high-speed data streaming, SSDs are therefore suggested. As the enterprise is large, it likely
faces millions of transactions per day, which demands suitably fast and compatible disk space.
FlashBlades are not suggested because of their exceptionally high cost compared to SSD
storage, though they could certainly be integrated in the future if the need arises.

HDFS-B, on the other hand, does not require high-speed data storage or processing and
can run on ordinary disks for analysing the suggested datasets.

Logically, the clusters will work as described, carrying out their assigned duties on the specified
hardware. The data is to be converted to Parquet format so that Spark can use it more efficiently
for modelling and processing.

- DATA PROCESSING CONSIDERATIONS & OUTPUT PROVISIONS

Expanding into the area of processing, we first define the processing requirements, which fall
under two categories:

a) Fraud detection on continuous data

b) Analytics on static data

Fraud detection is a classic supervised classification problem: given attribute vectors X of
previous transactions (mined from internal data sources), labelled with a response variable Y
(fraud or normal), one can train an SVM to perform the classification. The streamed data points
can then be assessed in a binary fashion by the trained SVM.
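The binary assessment step amounts to applying the trained SVM's linear decision function to each streamed attribute vector. The weight vector and threshold below are illustrative placeholders, not fitted values; a real model (e.g. Spark MLlib's linear SVM) would supply them:

```python
def svm_decision(x, w, b):
    """Linear SVM decision rule: sign(w . x - b); +1 = fraud, -1 = normal."""
    margin = sum(wi * xi for wi, xi in zip(w, x)) - b
    return 1 if margin >= 0 else -1

# Hypothetical weights over [amount_zscore, location_variability];
# a fitted model would supply the real values.
w, b = [1.5, 1.0], 4.0

# A routine point falls on the "normal" side, an anomalous one on the "fraud" side.
labels = [svm_decision(x, w, b) for x in ([0.2, 0.1], [4.0, 2.5])]
```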

Static data analytics for our business case can be thought of as classic BI, but processed with
parallelism. Given the vast number of publicly available data points on which we would like to run
clustering, regression, and classification, parallelism needs to be integrated into the processing
environment.

Given that HDFS-A is more powerful, streaming the data into the SVM classifier will be done on
that cluster. The data for static operations will reside in HDFS-B. That cluster handles the less
computationally heavy tasks, but intense operations (e.g. k-fold validation runs made for some
prediction) can be transferred to the HDFS-A cluster for execution.

Given these requirements, GraphX is needed for graph-related computations, Spark Streaming
for the fraud detection task, and, as a major module, MLlib for machine learning and advanced
analytics. Spark also integrates with Python and R, which are state of the art in data science
methodologies and deep learning (e.g. TensorFlow).

Costing
As the organisation intends to use the data for future decisions and investments, it is safely
assumed that the data is to remain private. Given that fact and the likely upscaling in the near
future, it is better to set up the organisation's own computational and storage environment than
to rely on commercial rented clouds; the data then remains entirely within safe parameters under
the enterprise's own control. For the foundation of the processing framework, Hadoop with Spark
is a comfortable choice, as it accommodates both static and streaming data and provides the
ability to integrate further technologies, both proprietary and publicly licensed software.

Hadoop's storage costs are also significantly lower than those of other solutions, and its storage
is more failure-tolerant. A Hadoop cluster is constructed from racks of Intel servers, each with
internal hard-disk storage. It is assumed that the enterprise already has sufficient space to
accommodate more hardware: about 30U (units of rack space) in one cage, or 30U in total across
multiple cages. It is also assumed that the enterprise already uses ordinary storage for retaining
long-term data, which may amount to more than 350-400 TB of disk. Electricity costs are
calculated from the current figures maintained by the US Department of Energy, as the enterprise
is rooted in, and will operate in, the US. As of October 2018 the cost of energy is around
$0.0135/kWh, and rates are only going to increase on current trends [4]. As the enterprise is
large, the services on HDFS-A are expected to run 24 hours a day, 7 days a week, while HDFS-B
runs for around 18 hours a day at 95% TDP (Thermal Design Power).
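Per-node running costs of this kind follow from simple arithmetic on wattage, duty cycle, and unit rate. A sketch (the $0.07/kWh rate here is an assumed illustration, not the exact rate behind the per-node estimates in the component tables):

```python
def annual_energy_cost(watts, hours_per_day, usd_per_kwh, days=365):
    """Annual electricity cost: kWh drawn per year times the unit rate."""
    kwh_per_year = watts / 1000.0 * hours_per_day * days
    return kwh_per_year * usd_per_kwh

# A 400 W node running 24/7 at an assumed $0.07/kWh comes to roughly $245/yr,
# of the same order as the ~$250/yr per-node figures used later in the report.
always_on = annual_energy_cost(400, 24, 0.07)
```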

To maintain the whole architecture, a Data Scientist and a Data Engineer are proposed; their
salaries add around US$132,699 and US$103,788 respectively to annual costs [5].
To satisfy the connection requirements, dual 10 Gigabit Ethernet is preferred, as the enterprise
will deal with large input and output operations; each link can complete a sub-10-minute import/
duplication job on a dataset of 300 GB or more [6].



The option of rented cloud services is not pursued, in view of several factors: security and related
issues, cost comparison, and networking constraints. Renting leaves a client dependent on the
host providing the cloud service for sustained networking speed, with little recourse over low
performance. Should that occur, it is a major risk for the streaming data analyses, leading to low
computational performance and ultimately more revenue loss for the enterprise. The data to be
handled is also very sensitive and should not be hosted outside the enterprise's premises. Most
importantly, on-site clusters cost about the same as, or even less than, the existing major players
in the rented cloud sector: Microsoft Azure's HDInsight, Alibaba Cloud E-MapReduce, Amazon
Web Services' EC2, and Google's Cloud Compute platform.

CLUSTER COST COMPARISON (1 Year)

Cluster Provider               | Instance(s)          | Number of Instances | USD/month | Annual Cost (USD)
Azure HDInsights [7]           | A7 (Worker Node)     | 6                   | $1,091.50 |
                               | D13 v2 (Head Node)   | 2                   | $4,432.56 |
                               | Premium Surcharge    | 80 cores            | $748.87   |
                               | Professional Support |                     | $1,000.00 |
                               | Total Cost           |                     |           | $75,274.40
AWS EC2 [8]                    | R4.8xlarge           | 5                   | $7,660.80 | $91,929.60
Google Cloud Compute [9]       | N1-highmem-32        | 5                   | $6,819.84 | $81,838.00
Alibaba Cloud E-MapReduce [10] | ecs.n2.7xlarge       | 10                  | $4,881.60 | $58,579.20

By contrast, the proposed clusters are expected to cost only around US$70,000 for the purchase
of the necessary equipment.

Looking at these figures, it is also important to note that they cover direct usage only and
exclude data transfer to and from the cluster, which would likely increase the costs somewhat.

The costs of setting up the on-site clusters are incurred only once, until any upgrade or
upscaling, at which point only the supporting factors (power, networking, technical support)
will rise.

HDFS-B, which handles the static data, is designed for low-complexity calculations; when heavy
computation is required, it can stream data to HDFS-A. Cluster HDFS-B is thus more storage-
centric than compute-centric, but can work either way if the need arises. To cover these
situations, HDFS-B is provisioned for up to 100 TB of data, sufficient to handle all the datasets.



Also, as Hadoop replicates data three times to maintain failure tolerance, the datasets will
require only around 40 TB of the available 300 TB. Each data node will run on a Xeon E5-2630 v3
with access to 32 GB of DDR4-2133 RAM, which is enough to process a full 10 TB of datasets in
24 hours.
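The replication arithmetic can be checked directly: with HDFS's default replication factor of 3, raw capacity needed is three times the logical dataset size (a minimal sketch; the 13 TB logical figure is an illustrative input consistent with the ~40 TB estimate above):

```python
def raw_storage_tb(logical_tb, replication=3):
    """Raw HDFS capacity needed to hold `logical_tb` of data at the given
    replication factor (HDFS defaults to 3 copies of each block)."""
    return logical_tb * replication

# ~13 TB of logical data, triple-replicated, needs ~40 TB of raw disk --
# comfortably within the ~300 TB available on HDFS-B.
raw = raw_storage_tb(13)
```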
Moreover, HDFS-B will use 10GbE to set up a high-speed connection between the two clusters
for cases where computation must take place on HDFS-A, such as time-sensitive or critical
operations.

HDFS-A, on the other hand, is designed to handle high-complexity computations, including
operations on high-speed incoming data in the form of streams. Each HDFS-A data node runs
the same Xeon E5-2630 v3 as HDFS-B's data nodes, along with 64 GB of RAM and 10 TB of SSD
storage. SSDs are chosen over HDDs because their structural design provides much better
performance.

As the streaming workload can involve processing 1,000 datasets per minute or more, each
around hundreds of KB in size, the cluster must be suitably fast; around 100 GB of RAM across
the cluster is assumed sufficient for performance. HDFS-A will be able to hold around 18 hours
of streaming data with access to 5 TB of storage. The 8-core Xeon CPUs are suited to high-
computation tasks and adjustable to upscaling, making sub-5-second processing achievable.

The proposed model can be upscaled, upgraded, and used as needed without additional cost
coverage. These are fully capable clusters designed to run the specified operations, but they can
be adjusted to support other functionality.

Detailed Composition of Nodes (1 Year)

HDFS-A Name Node (power usage @400W; running cost per node ~$250/yr; number of units: 2)
  CPU    | Xeon E5-2640 V4                       | 2 | $975.99   | $1,951.98
  Mobo   | SuperMicro MBD-x10DRI-T-O             | 1 | $329.99   | $329.99
  RAM    | Kingston ValueRAM 4x32GB DDR4-ECC     | 2 | $1,541.18 | $3,082.36
  Rack   | SuperMicro CSE-825TQ-R700LPB 2U       | 1 | $569.99   | $569.99
  Drives | Kingston KC400 1TB SSD                | 2 | $397.99   | $795.98
  Per-node total: $6,730.30

HDFS-B Name Node (power usage @400W; running cost per node ~$250/yr; number of units: 2)
  CPU    | Xeon E5-2640 V4                       | 2 | $975.99   | $1,951.98
  Mobo   | SuperMicro MBD-x10DRI-T-O             | 1 | $329.99   | $329.99
  RAM    | Kingston ValueRAM 4x16GB DDR4-ECC     | 2 | $877.00   | $1,754.00
  Rack   | SuperMicro CSE-825TQ-R700LPB 2U       | 1 | $569.99   | $569.99
  Drives | WD Gold 10TB HDD                      | 4 | $414.99   | $1,659.96
  Per-node total: $6,265.92

HDFS-A Data Node (power usage @220W; running cost per node ~$150/yr; number of units: 2)
  CPU    | Xeon E5-2630 V4                       | 1 | $765.01   | $765.01
  Mobo   | SuperMicro MBD-x10DRI-T-O             | 1 | $329.99   | $329.99
  RAM    | Crucial 4x16GB DDR4                   | 1 | $651.99   | $651.99
  Rack   | SuperMicro CSE-825TQ-R700LPB 2U       | 1 | $569.99   | $569.99
  Drives | Kingston KC400 1TB SSD                | 4 | $397.99   | $1,591.96
  Per-node total: $3,908.94

HDFS-B Data Node (power usage @300W; running cost per node ~$200/yr; number of units: 3)
  CPU    | Xeon E5-2620 V4                       | 1 | $419.99   | $419.99
  Mobo   | SuperMicro MBD-x10DRI-T-O             | 1 | $329.99   | $329.99
  RAM    | Crucial 2x16GB DDR4                   | 1 | $309.23   | $309.23
  Rack   | SuperMicro CSE-825TQ-R700LPB 2U       | 1 | $569.99   | $569.99
  Drives | WD Gold 10TB HDD                      | 8 | $414.99   | $3,319.92
  Per-node total: $3,908.94

NAS/RAID (power usage @300W; running cost per node ~$200/yr; number of units: 2)
  Unit      | Buffalo TS51210RH1604 16TB Rack inc 4x4TB HDD | 1 | $2,999.00 | $2,999.00
  Extra HDD | WD Red 4TB NAS HDD                            | 4 | $174.99   | $699.96
  Per-node total: $3,698.96

Set-up Costs (power usage @50W; running cost per node ~$25/yr)
  SSD to HDD | ICY DOCK MB882SP 2.5" to 3.5" Converter     | 10 | $18.99    | $189.90
  UPS        | APC Smart-UPS X SMX2000LV 4U rack           | 4  | $1,060.93 | $4,243.72
  Switch     | NETGEAR ProSafe 16-Port 10Gb Managed Switch | 2  | $1,715.91 | $3,431.82
  Subtotal: $7,854.44

GRAND TOTAL: $70,764.40

References
[1] U.S. Department of Transportation, Bureau of Transportation Statistics (BTS).
Retrieved from: https://www.bts.gov/
[2] Carol McDonald (MapR), 2016. Real Time Credit Card Fraud Detection with Apache Spark and Event Streaming.
Retrieved from: https://mapr.com/blog/real-time-credit-card-fraud-detection-apache-spark-and-event-streaming/
[4] Electric Choice, 2018. Electricity Rates by State.
Retrieved from: https://www.electricchoice.com/electricity-prices-by-state/
[5] Indeed, 2018. Big Data Salaries in the United States.
Retrieved from: https://www.indeed.com/salaries/Big-Data-Salaries
[6] Arista, 2012. Hadoop* Clusters Built on 10 Gigabit Ethernet.
Retrieved from: https://www.arista.com/assets/data/pdf/Whitepapers/Hadoop_WP_final.pdf
[7] Microsoft Azure.
Retrieved from: https://azure.microsoft.com/en-au/pricing/calculator/?service=hdinsight
[8] AWS.
Retrieved from: https://aws.amazon.com/ec2/pricing/on-demand/
[9] Google Cloud.
Retrieved from: https://cloud.google.com/compute/pricing
[10] Alibaba Cloud.
Retrieved from: https://www.alibabacloud.com/product/e-mapreduce?spm=a3c0i.7911826.1160486.dproducth1.2d85737bARb2gy#pricing



Minutes of Meeting

Meeting No: 1
Meeting Date: 04/10/2018
Location: MPA Lounge, Building H, Monash University, Caulfield
Attending: Aayush Kapoor, Jaideep Singh Dang, Xiangtan Lin, Varun Mathur, Faik Canberk
Apologies: None
Meeting start time: 2:00 PM
Matters arising from previous minutes: YES NO ☑
Confirmation of minutes from last meeting: YES NO ☑

Outcome of Meeting:

ISSUE | DISCUSSION IN BRIEF | OUTCOME | ACTION: NAME & TIMELINE
Meeting Schedule | Set up a meeting schedule for the entire project | Meeting days will be Thursday and Friday, as decided | Team
Team Leader | Selecting a team leader | Jaideep Singh Dang will lead the project | Project leader will remain the same throughout the assessment
Programmer | Selecting a team for the Proofs of Concept | Xiangtan and Varun will explore the coding domain | Programmers will remain the same throughout the assessment and will coordinate with the other members on designing the proofs of concept
Coordinator and Author | Selecting a team coordinator and report author | Aayush will coordinate team deliverables; Ezra and Jaideep will take care of the report | Coordinator and authors will remain the same throughout the assessment

Actions in brief: Team leader and various roles finalised; meeting days selected when all 5 members are available. Assessment discussed in detail to identify the areas that need more focus and time.

Meeting closed at: 04:00 PM

Next meeting date, time, and location: 11/10/2018, 2:00 PM at MPA Lounge, Building H, Monash
University, Caulfield



Meeting No: 2
Meeting Date: 11/10/2018
Location: MPA Lounge, Building H, Monash University, Caulfield
Attending: Aayush Kapoor, Jaideep Singh Dang, Xiangtan Lin, Varun Mathur, Faik Canberk
Apologies: None
Meeting start time: 2:00 PM
Matters arising from previous minutes: YES NO ☑
Confirmation of minutes from last meeting: YES ☑ NO

Outcome of Meeting:

ISSUE | DISCUSSION IN BRIEF | OUTCOME | ACTION: NAME & TIMELINE
Business Requirements | Various aspects of the requirements were analysed to ensure the project scope is taken into consideration | Scope management techniques applied to ensure proper requirements are gathered | Completed in meeting
Business Case | Ensure the business case is streamlined with the bank's strategic vision/direction | All the IBE stakeholders must get the best services/results available | In progress, to be finalised by 18th October
PoC01 | Task and structure to be followed for its successful completion were discussed | Team further scoped into a K-means model | In progress, to be finalised by 18th October tentatively
PoC02 | Task and structure discussed for PoC02 | Team reviewed the dataset and worked on the code | In progress, to be finalised by 18th October tentatively

Actions in brief: Majority of time spent on understanding the requirements and proposing an optimum solution for the given situation with the investment bank. The PoC02 task was divided among teammates to ensure everyone participated and the deliverables are completed by the next meeting.

Meeting closed at: 04:00 PM

Next meeting date, time, and location: 18/10/2018, 2:00 PM at MPA Lounge, Building H, Monash
University, Caulfield



Meeting No: 3
Meeting Date: 18/10/2018
Location: MPA Lounge, Building H, Monash University, Caulfield
Attending: Aayush Kapoor, Jaideep Singh Dang, Xiangtan Lin, Varun Mathur, Faik Canberk
Apologies: None
Meeting start time: 2:00 PM
Matters arising from previous minutes: YES NO ☑
Confirmation of minutes from last meeting: YES ☑ NO

Outcome of Meeting:

ISSUE | DISCUSSION IN BRIEF | OUTCOME | ACTION: NAME & TIMELINE
PoC01 | The structure and content of reporting finalised; a dry pre-run was performed to check the code implementation | PoC01 almost completed; Xiangtan to prepare the finalised version and report | In progress, to be completed by 22nd October 2018 tentatively
PoC02 | Worked on content and structure of the report to ensure the tutor/faculty can properly understand the functionality of the code | Varun to collate the divided work for this task and produce the finalised version | In progress, to be completed by 22nd October 2018 tentatively
Design Analysis | Identification of data sources to be completed by Ezra, framework and data ingestion by Jaideep, and data storage by Aayush | Team will work on individual tasks and collate and review the final work | In progress, to be completed by 22nd October 2018 tentatively
Business Case | Finalised the business proposal for the IBE | Added to the final report | Completed

Actions in brief: Minor work left on PoC01 and PoC02; Xiangtan and Varun will work on the remaining part and complete it by the next meeting. Business requirements and business case completed.

Meeting closed at: 04:00 PM

Next meeting date, time, and location: 22/10/2018, 2:00 PM at MPA Lounge, Building H, Monash
University, Caulfield



Meeting No: 4
Meeting Date: 22/10/2018
Location: MPA Lounge, Building H, Monash University, Caulfield
Attending: Aayush Kapoor, Jaideep Singh Dang, Xiangtan Lin, Varun Mathur, Faik Canberk
Apologies: None
Meeting start time: 2:00 PM
Matters arising from previous minutes: YES NO ☑
Confirmation of minutes from last meeting: YES ☑ NO

Outcome of Meeting:

ISSUE | DISCUSSION IN BRIEF | OUTCOME | ACTION: NAME & TIMELINE
PoC01 | Completed and draft report created for the same | Report created; formatting needs to be updated | In progress, to be completed by 24th October
PoC02 | Completed, but images and process still need to be updated in the report | Needs minor changes | In progress, to be completed by 24th October
Business Requirements | Completed and finalised | Added to final report | Completed
Business Case | Completed and finalised | Added to final report | Completed
Design Analysis | Completed and finalised | Added to final report | Completed
Cost Management | Jaideep and Ezra to validate all the researched cost options available for the bank | Proposed components shown in tabular format along with prices | In progress, to be completed by 24th October

Actions in brief: Majority of time spent completing cost management; the whole team worked on content creation and researched the different viable options available in the current market that would best suit our client, the investment bank, in the given scenario. The team decided to meet again on 24th October 2018 to finalise the content and make final changes to the report format.

Meeting closed at: 06:00 PM

Next meeting date, time, and location: 24/10/2018, 2:00 PM at MPA Lounge, Building H, Monash
University, Caulfield



Meeting No: 5
Meeting Date: 24/10/2018
Location: MPA Lounge, Building H, Monash University, Caulfield
Attending: Aayush Kapoor, Jaideep Singh Dang, Xiangtan Lin, Varun Mathur, Faik Canberk
Apologies: None
Meeting start time: 2:00 PM
Matters arising from previous minutes: YES NO ☑
Confirmation of minutes from last meeting: YES ☑ NO

Outcome of Meeting:

ISSUE | DISCUSSION IN BRIEF | OUTCOME | ACTION: NAME & TIMELINE
PoC01 | Finalised and formatted in the report | Completed | Completed
PoC02 | Images and process chart added | Completed | Completed
Cost Management | Finalised tabular format; optimum cost options shown | Added to the main report | Completed
Reflective Diaries | Reflective diaries handed over by each member to the leader | Added to the main report | Completed

Actions in brief: Finalised costing with the optimum cost option available in the market and formatted the report for delivery.

Meeting closed at: 06:00 PM

Next meeting date, time, and location: Only electronic modes of communication if any further changes are
required in the report or content.



Reflective Diaries
Diary One
• Full name (as it appears in Moodle)
o Jaideep Singh Dang

• Your role within the group (e.g. coordinator, programmer, designer, author)
o Leader, Coordinator, Author, Designer

• The activities/responsibilities associated with that role


o Reviewing and Understanding the Project requirements
o Deciding the distribution of project work based on inputs.
o Managing the team meetings and coordinating with each member on their
allocated work.
o Reviewing each member’s work and confirming its suitability with the project
requirements.
o Suggesting design approach and planning build up.

• Your contribution to the group


o Providing a basic design approach
o Finalising the suggested frameworks and design algorithms
o Researching all the possible hardware components for the proposed framework
and analysing the costs and expenses
o Testing the Proofs of Concept and helping with the design, storage, and ingestion proposals

• What you feel that you have learnt from this group work?
o The first thing I learnt was how to manage a team and bring everyone together on
the final decisions. It also helped me learn task allocation and working as a team.
Every member thinks in a different, out-of-the-box manner, and a leader is
responsible for getting the most out of these thoughts and turning them into
something dependable.

o Secondly, I learnt a lot about hardware when I began researching the possible
component options for the proposed framework. The project not only helped me
solidify my understanding of the basic concepts taught in our lectures and lab
assessments but also taught me something outside the curriculum: how an on-site
storage and processing framework is generally built and maintained.

• How have you learnt? (i.e. learning techniques)


o Researching the different possible components that can be used to build an
on-site Hadoop framework for a company with big data at its core.
o Thinking in line with every team member, as individuals and as a whole team,
helped me learn better ways to solve the problem statement.
o Learning different possible ML algorithms and how they can be implemented for a
particular problem in Spark.

• What went well about this project?


o Coming up with a framework with minimised costs and maximised results and
performance.
o Learning and physically applying the concepts made it easier to understand them.

• What went wrong?


o It was made sure that everybody attended the meetings and provided useful input.
It was difficult at moments, but nothing went wrong as far as managing the team is
concerned, and everybody took equal responsibility for taking this project forward.

• Issues? Steps taken to resolve these?


o The team faced some issues in understanding the project at a micro level, and a lot
of discussion went into clarifying its real purpose and requirements. Thinking
together, brainstorming in a similar direction, and input from our tutors finally let us
decide on the solutions. Making sure that every member could give time to the
project was also difficult given their jobs and other assessments, but working
together and settling on a common meeting time comfortable for all helped resolve
it.

• Your overall conclusion about the project. How would you do it, if asked to
do it again?
o In my view, every team member gave 100% to carry this project forward
and complete it. If the project were to be done again, I would not change anything, as
imperfections become strengths when we work as a team, and this is
what happened. Helping each other and reaching a conclusion that everyone
can agree on makes any project good, and we made sure to use every
member's input when deciding on our solution. I feel the project had a perfect blend of business
case and programming, making for a wholesome learning experience.

Diary Two

• Full name (as it appears in Moodle)


o Faik Canberk AYDIN

• Your role within the group (e.g. coordinator, programmer, designer, author)
o Designer, Author



• The activities/responsibilities associated with that role
o Identifying data sources
o Understanding and applying past knowledge of ML on the tasks at hand
o Considering and identifying hardware and cluster requirements
o Styling, editing and formatting the report

• Your contribution to the group


o Providing ML/Advanced Analytics Methodologies, Software and design
o Report Formatting / Editing
o Data source research

• What you feel that you have learnt from this group work?
o Getting a chance to apply the concepts learned in class and labs (especially labs) to
the project really made me understand what these concepts practically do. It is one thing to
understand and memorise concepts, but this work, especially in a group setting
that invoked discussion, cemented what these things are and how they harmonise
with each other to solve real-world problems.

• How have you learnt? (i.e. learning techniques)


o Discussing with the group, after initially looking over the problems on my own,
sparked constructive changes to my decisions. Thinking about concepts
collaboratively gave me angles I had not seen before.
o Researching data processing in a parallel fashion (ML models I hadn’t used in
parallel before) gave me a better understanding of how complex models run on
Spark.

• What went well about this project?


o The workflow was very smooth and largely problem-free.
o The group was friendly, constructive and easy to communicate with.
o As mentioned earlier, applying concepts to a potential real-life situation
created a bridge between the concepts and how they work in practice.

• What went wrong?


o Not much went wrong. Personally, I had trouble finding time alongside other commitments.

• Issues? Steps taken to resolve these?


o As mentioned, I had issues with time. I do research and work part-time, so I had
trouble balancing things, but after I communicated this to my team, they were flexible
with meeting times, which helped me greatly.

• Your overall conclusion about the project. How would you do it, if asked to
do it again?
o I think it was very useful and a pivotal part of the unit in terms of connecting
theory and application. My only feedback would be that it could have been spread between
weeks 6 and 12. The work felt very concentrated, and I think a project like this
deserves a larger share of the unit. As I learned a lot from the project even at
this scale, a more comprehensive version could have taught me even more.
o I think we did our absolute best. If we could do it again, I would personally
invest more time in focusing the research on the core of the business cases,
because I spent time looking over irrelevant details that were not the real
reason the assessment was set (i.e. getting stuck on minute details of the
project when I was personally low on time).

Diary Three

• Full name (as it appears in Moodle)


o Varun Mathur

• Your role within the group (e.g. coordinator, programmer, designer, author)
o Programmer

• The activities/responsibilities associated with that role
o Identifying data sources
o Identifying the coding requirements.
o Creation of the framework to get the required output.
o Identifying, understanding and using the necessary APIs for the relevant
programming task.
o Making code reusable by following industry best practices.

• Your contribution to the group


o Providing a good coding framework.
o Discussing the best approach with the team and then explaining the different
programming techniques used for the task.

• What you feel that you have learnt from this group work?
o I feel it is important to work in a group when time and resources are
limited. Good teamwork is the key to succeeding in design and programming
exercises. Many different ideas were generated when everyone sat together, and
this really helped and motivated me to contribute well to the project.
o Solving the tutorial exercises over the weeks really helped me
develop my programming skills. Every design has its own pros and cons, and
working together as a group really helps to discover them.
o All the group members were able to challenge each other’s opinions and
preconceptions about what can and cannot work for the project.

• How have you learnt? (i.e. learning techniques)



o Various requirements of the proposed framework were jotted down by
researching the variety and velocity of data in different sectors.
o Regular discussion within the team gave better insight into the problem
statement.
o The tutors guided us throughout the project and pointed our research in the
right direction.

• What went well about this project?


o The flow of the work was comfortable and smooth.
o Communication within the group was streamlined and easy.
o Coming up with different ideas and opinions as a team gave us a better
understanding of the project requirements.
o Understanding the concepts learnt in the tutorials and applying that practical
knowledge to the project really helped all of us improve our coding
skills and creative thinking abilities.

• What went wrong?


o Not many issues arose. Time was restricted and members had different units and
jobs, so bringing all group members together for a meeting was sometimes
difficult. Everything else went well.

• Issues? Steps taken to resolve these?


o Since time was the only issue, all group members were understanding, and
everyone kept each other informed about meetings and other developments.

• Your overall conclusion about the project. How would you do it, if asked to
do it again?
o Our team put in a good effort to produce the deliverables as the
project progressed.
o Personally, I feel that if the project had been given a bit earlier, more research could have
been done to come up with better solutions and designs. Other than that,
our team gave 100% for the success of this project.

Diary Four
• Full name (as it appears in Moodle)
o Aayush Kapoor

• Your role within the group (e.g. coordinator, programmer, designer, author)
o Coordinator and author

• The activities/responsibilities associated with that role
o Identifying data sources



o Maintaining central calendars to promote effective use of time and keep
everyone informed about what is going on, both daily and in the future
o Reviewing what needed to be done and assigning tasks to team members
o Offering routine instructions to team members on their responsibilities
o Understanding market trends in terms of costing, vendors and the technology in use
o Understanding the business requirements and creating the business plan

• Your contribution to the group


o Providing a professional environment, ensuring everyone was aware of their
roles and responsibilities
o Being happy to engage in every team task and keeping a healthy environment
within the group
o Content writing, formatting the report and sharing a new point of view to ensure we
covered every situation while offering services to the investment bank

• What you feel that you have learnt from this group work?
o The most important thing I found was the benefit of working in a team. I
learned that if you are aiming for success, teamwork is the
key when time and resources are restricted. As everyone had their own view,
different ideas could be exchanged, and we were able to work on those ideas.
Working in a team made me more involved in the assessment, as every team member
was very helpful.
o I explored what I had learned in labs and lectures while working on the assessment;
it went beyond the classroom experience to see how data engineers would
have to work and research when planning to implement big data technologies
for a client.

• How have you learnt? (i.e. learning techniques)


o Brainstorming within the team gave better insight into the problem
statement
o I came to understand big data technologies such as Spark and Hadoop in a whole new way
while working on the report, and learned about the major vendors in the market and the
costing of hardware and software.

• What went well about this project?


o Every teammate was more than happy to help and worked tirelessly to complete
the project
o The work went very smoothly, as everyone participated from the very early stages
after the assignment was released

• What went wrong?


o Nothing went wrong; the team cooperated very well, respecting each
member's time.



• Issues? Steps taken to resolve these?
o A conflict of opinions can either turn into a debate or help us understand two
sides of the same coin. The team had many different opinions on how the task
should be performed, but with healthy debate and a lot of brainstorming we
came up with a common proposal, which could not have happened without
that conflict of opinions.

• Your overall conclusion about the project. How would you do it, if asked to
do it again?
o I believe we as a team put a lot of effort into producing the deliverables, and we
took care of all the small factors to get the best output. I explored the role of
coordinator in a deeper sense, i.e. laying out information ahead of time to
ensure the team performed well
o I think we really did great as a team. If I could do it again, I would be more than
happy to invest time and energy in the project, and I would explore the technical
domain more rather than the business domain

Diary Five



Lessons learnt:
I have gained further insight into big data technologies such as Spark,
Hadoop and MLlib in a holistic manner while working on the technical analysis of the
business specification, and have built a good understanding of the current big data
market, its players and its costs by researching the market players and hardware and
software costings.
By doing PoC01, I further sharpened my Spark/Scala programming skills,
which enhances my capability to implement similar solutions at work. The team formed a
common understanding of what we should achieve by the end of the project. When
implementing the solution, I was clear that it should align with
the business requirements, such as the data storage and data retention policies, so I
implemented the prototype to cluster streaming data as well as to save the data into HDFS for
further analysis.
The project went well: each team member contributed proactively and
shared progress and knowledge with the others. No real issues were encountered during the
course of the project. The team collaborated very well and worked out the solutions.
Conclusion:
In conclusion, I strongly believe that this project is a success. The team has done well in
terms of communication and technical capability. The outcome of the project is valuable
and has the potential to expand into a commercial solution. The project would benefit from
interviews with subject-matter experts, should such opportunities present themselves.

