Banking Data Analysis On Hadoop
Presentation on Data Processing in Data Warehousing
Aim: To develop a software framework that allows us to analyse the bank's data through data warehousing.
We use Hadoop as the data warehouse tool to fulfil our aim by performing the following tasks:
1. Extract the bank's data, which is provided in different formats, and store it in a single-node cluster (a minimal sketch follows this list).
2. Load the data into a high-level platform, merge it into a datalake, and apply different queries to analyse it.
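As a rough illustration of task 1, the sketch below copies a few hypothetical raw extracts (the file names and HDFS paths are assumptions, not values from the project) into a landing directory on the single-node cluster using the standard hdfs dfs commands:

import subprocess

# Hypothetical raw extracts received from the bank in different formats.
local_files = ["customers.json", "accounts.xml", "transactions.psv"]

# Assumed HDFS landing directory on the single-node cluster.
landing_dir = "/datalake/raw"

# Create the landing directory (no error if it already exists).
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", landing_dir], check=True)

# Copy each local extract into HDFS, overwriting any previous copy.
for path in local_files:
    subprocess.run(["hdfs", "dfs", "-put", "-f", path, landing_dir], check=True)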
Hardware requirements: A desktop or laptop with a basic configuration, i.e. 4 GB RAM and 500 GB of hard disk space.
Software requirements:
1. Core framework: Apache Hadoop (HDFS, MapReduce).
2. Utility software: Apache Pig, Apache Hive, Apache Flume, Sqoop, Oozie.
DATA WAREHOUSE
[Diagram: operational systems feed the data warehouse, which supports analysis, reporting and mining.]
• As we have seen earlier, we are using Hadoop as the data warehouse tool.
• Sqoop automates most of this process, relying on the database to describe the
schema for the data to be imported.
• Sqoop uses MapReduce to import and export the data, which provides parallel
operation as well as fault tolerance.
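As a hedged illustration of such an import, the sketch below shells out to the sqoop import command from Python; the JDBC URL, credentials, table name and HDFS target directory are placeholders rather than values from the project:

import subprocess

# Illustrative Sqoop import: pull a relational table into HDFS in parallel.
# All connection details and names below are assumed placeholders.
sqoop_import = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://localhost:3306/bankdb",   # assumed source database
    "--username", "bankuser",
    "--password-file", "/user/hadoop/.sqoop.pwd",        # keeps the password off the command line
    "--table", "transactions",                           # assumed table to import
    "--target-dir", "/datalake/raw/transactions",        # assumed HDFS target directory
    "--num-mappers", "4",                                # four parallel map tasks
]

subprocess.run(sqoop_import, check=True)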
Data Flow
[Diagram: input files (JSON, XML, pipe-delimited) are ingested into the datalake through Flume, while data from MySQL and Oracle is imported through Sqoop; Pig and Hive process the merged data, and Tableau provides the graphical data visualization.]
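To make the file-ingestion leg of this flow concrete, below is a minimal sketch of a Flume agent that watches a local spooling directory and delivers incoming files to an HDFS landing path; the agent name, directories and NameNode address are assumptions:

# Minimal Flume agent definition (spooling-directory source -> memory channel -> HDFS sink).
# Agent name, directories and the NameNode URL are assumed placeholders.
flume_conf = """
bank.sources  = files
bank.channels = mem
bank.sinks    = lake

bank.sources.files.type     = spooldir
bank.sources.files.spoolDir = /data/incoming/bank
bank.sources.files.channels = mem

bank.channels.mem.type     = memory
bank.channels.mem.capacity = 10000

bank.sinks.lake.type          = hdfs
bank.sinks.lake.hdfs.path     = hdfs://localhost:9000/datalake/raw/files
bank.sinks.lake.hdfs.fileType = DataStream
bank.sinks.lake.channel       = mem
"""

# Write the configuration, then start the agent with something like:
#   flume-ng agent --name bank --conf-file bank-agent.conf --conf $FLUME_HOME/conf
with open("bank-agent.conf", "w") as f:
    f.write(flume_conf)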
Apache Oozie is a workflow scheduler for Hadoop. It is a system that runs workflows of dependent jobs. Users can define Directed Acyclic Graphs (DAGs) of workflows, which can be run in parallel or sequentially in Hadoop.
It consists of two parts:
Workflow engine: responsible for storing and running workflows composed of Hadoop jobs, e.g. MapReduce, Pig and Hive jobs.
Coordinator engine: runs workflow jobs based on predefined schedules and the availability of data.
Life Cycle Of Oozie
Example of Time Scheduler in Oozie
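A minimal sketch of what such a time-based coordinator definition might look like; the application name, dates, frequency and workflow path are assumptions, and the definition would normally be uploaded to HDFS and submitted with the oozie job command:

# Illustrative Oozie coordinator: run a workflow once a day on a fixed schedule.
# Name, dates and the workflow application path are assumed placeholders.
coordinator_xml = """<coordinator-app name="daily-bank-etl"
                 frequency="${coord:days(1)}"
                 start="2020-01-01T00:00Z" end="2020-12-31T00:00Z"
                 timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
    <action>
        <workflow>
            <app-path>hdfs://localhost:9000/user/hadoop/apps/bank-etl</app-path>
        </workflow>
    </action>
</coordinator-app>
"""

# Save locally, copy to HDFS, then submit with something like:
#   oozie job -oozie http://localhost:11000/oozie -config job.properties -run
with open("coordinator.xml", "w") as f:
    f.write(coordinator_xml)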
Resultant Graphical Representation on Tableau After Applying Hive Queries on Merged Data
1. Find out how many customers purchased at least 10 products (a sketch of this query follows the list).
2. Find which product is in the most demand overall.
3. List the products that were sold in the highest quantity in Q1.
4. Find the customers who have not been active in the last 3 months.
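A minimal sketch of query 1, assuming the merged data is exposed as a Hive table named sales with customer_id and quantity columns and that HiveServer2 is reachable through the PyHive client; all of these names are assumptions:

from pyhive import hive

# Assumed HiveServer2 endpoint and database on the single-node cluster.
conn = hive.Connection(host="localhost", port=10000, username="hadoop", database="bankdb")
cursor = conn.cursor()

# Query 1: how many customers purchased at least 10 products overall.
# Table and column names (sales, customer_id, quantity) are assumed placeholders.
cursor.execute("""
    SELECT COUNT(*)
    FROM (
        SELECT customer_id
        FROM sales
        GROUP BY customer_id
        HAVING SUM(quantity) >= 10
    ) buyers
""")

print("Customers with at least 10 products:", cursor.fetchone()[0])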
Summary
This project was undertaken as a Live Project from Technogeeks. Its main objective was to collect data from different sources, transform semi-structured data into structured data, merge and store the data, and analyse it to retrieve information in accordance with the requirements of the bank.
THANK YOU