Data Processing For Big Data Report
Technical Design
LAB 05 - Group 02
Business Case
Design Analyses
Costing
References
Minutes of Meeting
Reflective Diaries
In this proposal we suggest a suitable framework that can manage and run the required
operations of the given Investment Banking Enterprise (IBE) at minimal cost with efficient results.
The proposal describes the advised technologies to address the three main areas of concern,
which are:
- Use streaming data to analyse and monitor active credit cards in real time to combat
possible fraud.
- Set up a suitable environment to store and analyse relevant publicly available datasets in order
to extract insight from them and assist in investment decisions.
- Upgrade the IBE's computer hardware to accommodate the capacity required to store data
and execute advanced analytics that further aid business decisions.
Business Case
The proposed solutions are constructed to achieve the defined business goals in the most
efficient form at the lowest possible cost. The aim of the project is to understand the
three points of interest carefully and design a strategic solution that addresses all the problems with
the minimum required architecture, faster integration, room for scalability, and well-built security.
The proposed project keeps the data ingestion and processing for fraud detection (i.e. real-time
action requirements) separate from the analysis of publicly available datasets (i.e. non-real-time,
static action). This allows either of the two to be scaled or altered without disturbing the other,
giving the IBE enough room to make flexible changes to either of the two separate systems in
the future.
It is well understood that the customers of the IBE are a very important part of its stakeholders.
With that in mind, the risks associated with the foundations of the project have been considered.
Once the aforementioned split system architecture (real-time action and static action) is integrated
and carried out well, the whole project will bring a beneficial change to the IBE. It will improve
fraud detection by using suitable algorithms on appropriate hardware. This system
architecture will also allow the static-action design to deploy state-of-the-art advanced
data analytics on open-source datasets, utilising them to the fullest in order to gain insight that
guides the IBE towards the correct investment decisions via predictions and inferences.
Design Analyses
- IDENTIFICATION OF DATA SOURCES
For the airport data we have multiple publicly available sources to refer to. The project
proposes using the US Bureau of Transportation Statistics' data to analyse the history and statistics
of US airports and the possible revenue models these airports follow. The data holds information
such as the number of arrivals and departures, number of scheduled flights, number of carriers,
top destination, etc., for each US airport for every month since December 2006 [1].
This static data can be used to derive the information required to make investment decisions in
the sector.
Additional data sources that may influence business decisions include the historical records of the
US census (https://www.census.gov/), which make very detailed demographic and social information
openly available. Linking these to US shapefiles could give significant insight into the geospatial
aspect.
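As an illustration, a minimal sketch of such a linkage is shown below. The file names, the GEOID join key, and the column names are assumptions for illustration, not details of the actual datasets:

```python
# Hedged sketch: joining census demographics to US shapefiles.
# Paths and column names (e.g. "GEOID", "median_income") are hypothetical.
import pandas as pd
import geopandas as gpd

# Demographic indicators exported from census.gov (hypothetical extract)
census = pd.read_csv("census_county_demographics.csv", dtype={"GEOID": str})

# County boundary shapefile (hypothetical path)
counties = gpd.read_file("us_counties.shp")

# Link the tabular demographics to geometries on a shared FIPS/GEOID key
enriched = counties.merge(census, on="GEOID", how="left")

# The enriched GeoDataFrame can now drive geospatial analysis, e.g. mapping
# income or population density around candidate airports.
enriched.plot(column="median_income", legend=True)
```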
For the purpose of this project we propose to use a Hadoop (with Spark) environment as the
underlying structure to run the required computations. As the organisation has to deal with
importing both batch and streaming data, we consider this process separately for each. As
discussed, the proposal suggests keeping the ingestion and processing of the two data types
separate by using two Hadoop clusters (HDFS-A and HDFS-B).
HDFS-A is therefore used to process and monitor the streaming credit card data. Real-time
fraud detection can generally be categorised into two steps:
The first step is to use the historical data to analyse transactions and build the machine
learning model.
The second step is to use the model in production to make predictions or decisions on live
events.
Thus, the proposed architecture checks events for fraud with Spark Streaming, using the
deployed Spark machine learning model [2].
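A minimal sketch of the first step follows, training and persisting such a model with Spark MLlib. The transaction columns, the 0/1 fraud label, and the HDFS paths are illustrative assumptions:

```python
# Hedged sketch of step one: fitting a fraud classifier on historical
# transactions with Spark MLlib. Column names and paths are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LinearSVC

spark = SparkSession.builder.appName("fraud-model-training").getOrCreate()

# Historical, labelled transactions stored on HDFS-A (hypothetical path)
history = spark.read.parquet("hdfs:///fraud/history.parquet")

# Assemble numeric transaction attributes into a single feature vector
assembler = VectorAssembler(
    inputCols=["amount", "merchant_risk", "hour_of_day"],
    outputCol="features",
)
train = assembler.transform(history)

# Linear support vector machine, matching the SVM classifier proposed below;
# "is_fraud" is assumed to be a 0/1 double label
svm = LinearSVC(featuresCol="features", labelCol="is_fraud")
model = svm.fit(train)

# Persist the fitted model so the streaming job can load it in production
model.write().overwrite().save("hdfs:///fraud/models/svm")
```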
The project also suggests the use of Kafka as a high-throughput distributed messaging system to
deliver events to the streaming layer. It is suitable because it integrates with Spark through
external modules and is fully functional for streaming large volumes of data in a fast-paced
environment.
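A minimal sketch of the second step is shown below, using Spark Structured Streaming as one way to realise this pipeline. The broker address, topic name, and event schema are illustrative assumptions:

```python
# Hedged sketch of step two: scoring live card transactions from Kafka with
# the saved model. Requires the spark-sql-kafka connector on the classpath.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, DoubleType, StringType
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LinearSVCModel

spark = SparkSession.builder.appName("fraud-scoring").getOrCreate()

# Hypothetical JSON layout of a transaction event
schema = StructType([
    StructField("card_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("merchant_risk", DoubleType()),
    StructField("hour_of_day", DoubleType()),
])

# Subscribe to the transaction topic (hypothetical broker and topic names)
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "card-transactions")
       .load())

events = (raw.select(from_json(col("value").cast("string"), schema).alias("t"))
             .select("t.*"))

features = VectorAssembler(
    inputCols=["amount", "merchant_risk", "hour_of_day"],
    outputCol="features",
).transform(events)

# Load the model trained on historical data and flag suspect transactions
model = LinearSVCModel.load("hdfs:///fraud/models/svm")
alerts = model.transform(features).filter(col("prediction") == 1.0)

alerts.writeStream.format("console").start().awaitTermination()
```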
For the batch processing of the static data, HDFS-B is used to store and process the information.
For this project, HDFS-A and HDFS-B have different requirements when it comes to data storage.
As HDFS-A has to deal with high-throughput data, it requires storage that is physically compatible
with its needs. For high-speed data streaming, SSDs are therefore suggested in order to avoid
I/O bottlenecks. As the enterprise is large, it must handle millions of transactions per day, which
requires suitably fast and compatible disk storage. FlashBlades are not suggested due to their
exceptionally high price compared to SSD storage, though FlashBlades can certainly be
integrated in the future if the need arises.
On the other hand, HDFS-B is not required to have high-speed data storage or processing and
can run on ordinary disks to handle the analysis of the suggested datasets.
Logically, the clusters will work as described and take care of their assigned duties while running on
the specified hardware. The data is to be converted into Parquet format so that Spark can
use it more efficiently for modelling and processing.
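A minimal conversion sketch, assuming the BTS extract lands as CSV at a hypothetical HDFS path:

```python
# Hedged sketch: converting an ingested CSV extract to Parquet so Spark can
# scan it column-wise. The input path and options are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-conversion").getOrCreate()

# Raw BTS airport statistics landed on HDFS-B as CSV (hypothetical path)
airports = spark.read.csv("hdfs:///bts/airport_stats.csv",
                          header=True, inferSchema=True)

# Columnar, compressed Parquet gives Spark faster scans and smaller storage
airports.write.mode("overwrite").parquet("hdfs:///bts/airport_stats.parquet")
```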
Expanding into the area of processing, we first define the processing requirements, which fall
under two categories: static analytics and streaming (real-time) analytics.
Static data analytics for our business case can be thought of as classic BI, but with parallel
processing. Given the vast number of publicly available data points on which we would like to
run clustering, regression, and classification, we need parallelism integrated into the
processing environment.
Given that HDFS-A is more powerful, streaming the data into the SVM classifier will be done on this
cluster. The data for static operations will reside in HDFS-B. This cluster can indeed handle not
only storage but also the lighter computations required for the static analytics, streaming heavier
jobs to HDFS-A when needed (see Costing).
Given these requirements, the parallelism requires GraphX for graph-related computations. Spark
Streaming is also needed for the fraud detection task, and one major module needed is MLlib for
machine learning and advanced analytics. Spark also integrates with Python and R, which are
state of the art in data science methodologies and deep learning (TensorFlow).
Costing
As the organisation intends to use the data for future decisions and investments, it is safely
assumed that the data is to remain private. Considering this, and the possibility of upscaling in
the near future, it is a better option to set up the enterprise's own computational and storage
environment rather than rely on commercial rental clouds. It is also assumed that this data will be
entirely restricted within safe parameters under the control of the enterprise itself. For the
foundation of the processing framework, Hadoop with Spark is a comfortable approach, as it
can accommodate both static and streaming data and provides the ability to integrate
further technologies, both proprietary and publicly licensed software.
Hadoop's storage costs are also significantly lower than any other solution, and it provides more
failure-tolerant storage. A Hadoop cluster is constructed from racks of Intel servers, each with
internal hard disk storage. It is assumed that the enterprise already has sufficient floor space to
accommodate more hardware: about 30U (units of rack space) in one cage, or a total of 30U across
multiple cages. It is also assumed that the enterprise already uses ordinary storage for retaining
long-term data, amounting to more than 350-400 TB of disk storage. Electricity costs are
calculated from the current figures maintained by the US Department of Energy, since the
enterprise has its roots in the US and is going to operate there. As of October 2018 the cost of
energy is around $0.135/kWh, and rates are only expected to increase given current trends [4].
As the enterprise is said to be large, the services for HDFS-A are expected to run 24 hours a day,
7 days a week, while HDFS-B is expected to run for around 18 hours a day at 95% TDP (Thermal
Design Power).
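As a back-of-envelope illustration of how the energy bill scales, the sketch below assumes a hypothetical 5 kW draw per cluster; only the $0.135/kWh rate and the duty cycles above come from the proposal:

```python
# Hedged sketch of the annual electricity estimate. The per-cluster power
# draw (5 kW) is an illustrative assumption, not a specified figure.
RATE_USD_PER_KWH = 0.135

def annual_cost(draw_kw, hours_per_day, utilisation=1.0):
    """Yearly energy bill for a cluster running `hours_per_day` at the
    given fraction of its rated power draw."""
    return draw_kw * utilisation * hours_per_day * 365 * RATE_USD_PER_KWH

# HDFS-A: 24/7 streaming duty; HDFS-B: 18 h/day at 95% TDP
print(f"HDFS-A: ${annual_cost(5.0, 24):,.0f}/year")        # ~$5,900
print(f"HDFS-B: ${annual_cost(5.0, 18, 0.95):,.0f}/year")  # ~$4,200
```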
To maintain the whole architecture, a Data Scientist and a Data Engineer are proposed. Their
salaries add approximately US$132,699 and US$103,788 respectively to the costs [5].
To satisfy the connection requirements, dual 10 Gigabit Ethernet is preferred, as the enterprise
will deal with large input and output operations where each link can complete a sub-10-minute
import/duplication job on a dataset of 300 GB or more [6].
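A quick check of this requirement, under the stated figures:

```python
# Hedged back-of-envelope check that dual 10 GbE covers the stated job:
# a 300 GB import in under 10 minutes.
dataset_gb = 300
seconds = 10 * 60

required_gbit_s = dataset_gb * 8 / seconds      # 4.0 Gbit/s sustained
link_gbit_s = 2 * 10                            # dual 10 GbE, nominal

print(f"required: {required_gbit_s:.1f} Gbit/s "
      f"of {link_gbit_s} Gbit/s available")
```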
Professional support: $1,000.00.
On the other hand, the proposed clusters are expected to cost around US$70,000 for the
purchase of the necessary equipment.
Finally, when looking at these costs, it is important to note that these figures only take into
account direct usage and do not include data transfer to and from the cluster, which would
likely increase the costs somewhat.
The costs of setting up the on-site clusters are incurred only once; until any upgrade or
upscaling, only the supporting costs (such as power, networking, and technical support) will
rise.
HDFS-B, which handles the static data, is designed for low-complexity calculations; in case of
heavy computational requirements it can stream data to HDFS-A. Cluster HDFS-B is thus more
storage-centric than compute-centric, but can work either way if the need arises. To cover these
situations, HDFS-B is provisioned with up to 100 TB of storage, making it sufficient to handle all
the data.
As the streaming workload can involve processing 1000 or more datasets per minute, each of
around hundreds of KB in size, the cluster is required to be suitably fast. It is therefore safe to
assume that around 100 GB of RAM is suitable for the cluster's performance. HDFS-A will be
able to hold around 18 hours of streaming data with access to 5 TB of storage. The 8-core Xeon
CPUs are suitable to withstand high-computation tasks and are adjustable for upscaling, making
the cluster capable of sub-5-second processing.
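A rough sizing check, assuming an average record size of 200 KB within the stated "hundreds of KB" range:

```python
# Hedged sizing check for HDFS-A. The 200 KB average record size is an
# assumption within the proposal's stated range.
records_per_min = 1000
avg_record_mb = 0.2                     # 200 KB

ingest_mb_per_min = records_per_min * avg_record_mb        # 200 MB/min
window_gb = ingest_mb_per_min * 60 * 18 / 1024             # ~211 GB per 18 h

print(f"{ingest_mb_per_min:.0f} MB/min ingest, "
      f"{window_gb:.0f} GB per 18 h window")
# ~211 GB sits comfortably inside the 5 TB allocated to HDFS-A.
```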
The said model can be upscaled, upgraded, and utilised as needed and does not require any
additional cost coverage. These are fully capable clusters designed to run the specified
operations, but they can be adjusted to support other functionality.
Itemised cost figures from the costing table: $6,730.30, $6,265.92, $3,908.94, $3,908.94, $3,698.96, $7,854.44.
References
[1] The U.S. Department of Transportation's Bureau of Transportation Statistics (BTS).
Retrieved from: https://www.bts.gov/
[2] Carol McDonald (MapR), 2016. Real Time Credit Card Fraud Detection with Apache Spark and Event Streaming.
Retrieved from: https://mapr.com/blog/real-time-credit-card-fraud-detection-apache-spark-and-event-streaming/
[4] ElectricChoice, 2018. Electricity Rates by State.
Retrieved from: https://www.electricchoice.com/electricity-prices-by-state/
[5] Indeed, 2018. Big Data Salaries in the United States.
Retrieved from: https://www.indeed.com/salaries/Big-Data-Salaries
[6] Arista, 2012. Hadoop* Clusters Built on 10 Gigabit Ethernet.
Retrieved from: https://www.arista.com/assets/data/pdf/Whitepapers/Hadoop_WP_final.pdf
[7] Microsoft Azure.
Retrieved from: https://azure.microsoft.com/en-au/pricing/calculator/?service=hdinsight
[8] AWS.
Retrieved from: https://aws.amazon.com/ec2/pricing/on-demand/
[9] Google Cloud.
Retrieved from: https://cloud.google.com/compute/pricing
[10] Alibaba Cloud.
Retrieved from: https://www.alibabacloud.com/product/e-mapreduce?spm=a3c0i.7911826.1160486.dproducth1.2d85737bARb2gy#pricing
Minutes of Meeting
Meeting No: 1
Meeting Date: 04/10/2018
Location: MPA Lounge, Building H, Monash University, Caulfield
Attending: Aayush Kapoor, Jaideep Singh Dang, Xiangtan Lin, Varun Mathur, Faik Canberk
Apologies: None
Meeting start time: 2:00 PM
Matters arising from previous minutes: YES  NO ☑
Confirmation of minutes from last meeting: YES  NO ☑
Outcome of Meeting:
ISSUE | DISCUSSION IN BRIEF | OUTCOME | ACTION: NAME & TIMELINE
Meeting Schedule | Set up a meeting schedule for the entire project | Meeting days will be Thursday and Friday, as decided | Team
Team Leader | Selecting a team leader | Jaideep Singh Dang will lead the project | Project leader will be the same throughout the assessment
Programmer | Selecting a team for Proofs of Concept | Xiangtan and Varun will explore the coding domain | Programmers will be the same throughout the assessment and will coordinate with the other members on designing proofs of concept
Actions in brief: Team leader and various roles finalised; meeting days selected when all 5 members are available. The assessment was discussed in detail to identify the areas that need more focus and time.
Next meeting date, time, and location: 11/10/2018, 2:00 PM at MPA Lounge, Building H, Monash University, Caulfield
Meeting No: 2
Meeting Date: 11/10/2018
Outcome of Meeting:
ISSUE | DISCUSSION IN BRIEF | OUTCOME | ACTION: NAME & TIMELINE
Business Case | To ensure that the business case is streamlined with the bank's strategic vision/direction | All the IBE stakeholders must get the best services/results available | In progress, to be finalised by 18th October
Actions in brief: The majority of time was spent on understanding the requirements and proposing an optimum solution for the given investment-bank situation. The PoC02 task was divided among teammates to ensure everyone participated and the deliverables are completed by the next meeting.
Next meeting date, time, and location: 18/10/2018, 2:00 PM at MPA Lounge, Building H, Monash University, Caulfield
Meeting No: 3
Meeting Date: 18/10/2018
Outcome of Meeting:
ISSUE | DISCUSSION IN BRIEF | OUTCOME | ACTION: NAME & TIMELINE
Actions in brief: Minor work on PoC01 and PoC02 is left; Xiangtan and Varun will work on the remaining part and complete it by the next meeting. Business requirements and the business case are completed.
Next meeting date, time, and location: 22/10/2018, 2:00 PM at MPA Lounge, Building H, Monash University, Caulfield
Meeting No: 4
Meeting Date: 22/10/2018
Outcome of Meeting:
ISSUE | DISCUSSION IN BRIEF | OUTCOME | ACTION: NAME & TIMELINE
Cost Management | Jaideep and Ezra validate all the researched cost options available for the bank | Proposed components shown in a tabular format along with the prices | In progress, to be completed by 24th October
Actions in brief: The majority of time was spent completing cost management; the whole team worked on content creation and on researching the different viable options available in the current market that would best suit our client (the investment bank) in the given scenario. The team decided to meet again on 24th October 2018 to finalise the content and make final changes to the format of the report.
Next meeting date, time, and location: 24/10/2018, 2:00 PM at MPA Lounge, Building H, Monash University, Caulfield
Meeting No: 5
Meeting Date: 24/10/2018
Outcome of Meeting:
ISSUE | DISCUSSION IN BRIEF | OUTCOME | ACTION: NAME & TIMELINE
Cost Management | Finalised tabular format and optimum cost options shown | Added to the main report | Completed
Actions in brief: Finalised the optimum cost option available in the market and formatted the report for delivery.
Next meeting date, time, and location: Only electronic communication if any further changes are required to the report or content.
Reflective Diaries
Diary One
• What you feel that you have learnt from this group work?
o The first thing I learnt was managing a team and bringing everyone together on
the final decisions. It also helped me learn task allocation and working as a team.
Every member thinks in a different, out-of-the-box manner, and a leader is
responsible for getting the most out of these thoughts and turning them into
something dependable.
o Secondly, I learnt a lot about hardware when I began researching the possible
component options for the proposed hardware framework. The project not
only helped me solidify my understanding of the basic concepts taught in our
lectures and lab assessments but also taught me something outside the
curriculum: how an on-site storage and processing framework is generally built
and maintained.
• Your overall conclusion about the project. How would you do it, if asked to
do it again?
o In my view, every team member gave 100% to carry this project forward
and complete it. If the project were to be done again, I would not change anything,
as imperfections become perfections when we work as a team, and this is
what happened. Helping each other and coming to a common conclusion that all
can agree on makes any good project, and we made sure to use every
member's input to decide on our solution. I feel it had a perfect blend of business
case and programming, building a wholesome learning experience.
Diary Two
• What you feel that you have learnt from this group work?
o Getting a chance to apply the concepts learned in class and labs (especially labs) to
the project really made me understand what these concepts do in practice. It is one
thing to understand and memorise concepts, but this work, especially in a group
setting and the discussion it invoked, cemented what these things are and how they
harmonise with each other to solve potential real-world problems.
• Your overall conclusion about the project. How would you do it, if asked to
do it again?
o I think it was very useful and a pivotal part of the unit in terms of connecting
theory and application. My only feedback would be that it could have been spread
between weeks 6 and 12; this felt much more concentrated.
Diary Three
• What you feel that you have learnt from this group work?
o I feel it is important to work in a group when time and resources are
limited. Good teamwork is the key to succeeding in design/programming
exercises. Many different ideas were generated when everyone sat together, and
this really helped and motivated me to contribute well to the project.
o Solving the tutorial exercises over the different weeks has really helped me
develop my programming skills. Every design has its own pros and cons, and
working together as a group can really help to discover them.
o All the group members were able to challenge each other's opinions and
preconceptions about what can and cannot work for the project.
• Your overall conclusion about the project. How would you do it, if asked to
do it again?
o Our team put in a good effort to come up with the deliverables as the
project progressed.
o Personally, I feel that if the project had been given a bit earlier, more research
could have been done to come up with better solutions and designs. Other than
that, our team gave 100% for the success of this project.
Diary Four
• Full name (as it appears in Moodle)
o Aayush Kapoor
• What you feel that you have learnt from this group work?
o The most important thing I found was the benefit of working in a team. I
learned that if you are aiming for success, teamwork is the key when time and
resources are restricted. As everyone had their own view, different ideas could be
exchanged, and we were able to work on those ideas. Working in a team made me
more involved in the assessment, as every team member was very helpful.
o I applied what I learned in labs and lectures while working on the assessment; it
went beyond the classroom experience to see how much data engineers have to
work and research when planning to implement big data technologies for a client.
• Your overall conclusion about the project. How would you do it, if asked to
do it again?
o I believe we as a team put a lot of effort into producing the deliverables, and we
took care of all the small factors to get the best output. I explored the role of
coordinator in a deeper sense, i.e. laying out the information ahead of time to
ensure the team performed well.
o I think we really did great as a team. If I could do it again, I would be more than
happy to invest time and energy in the project, and I would explore the technical
domain more, rather than the business domain.
Diary Five