Banking Data Analysis On Hadoop
Presentation on Data Processing in Data Warehousing
Aim: To develop a software framework that allows us to analyse the bank's data through data warehousing.
We use Hadoop as the data warehouse tool to fulfil our aim by performing the following tasks:
1. Extract the bank's data, which is provided in different formats, and store it in a single-node cluster (a minimal sketch follows this list).
2. Load the data into a high-level platform, merge it into a datalake, and apply different queries to analyse it.
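As a rough illustration of task 1, the sketch below copies a few hypothetical raw extracts (the file names and HDFS paths are assumptions, not values from the project) into a landing directory on the single-node cluster using the standard hdfs dfs commands:

import subprocess

# Hypothetical raw extracts received from the bank in different formats.
local_files = ["customers.json", "accounts.xml", "transactions.psv"]

# Assumed HDFS landing directory on the single-node cluster.
landing_dir = "/datalake/raw"

# Create the landing directory (no error if it already exists).
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", landing_dir], check=True)

# Copy each local extract into HDFS, overwriting any previous copy.
for path in local_files:
    subprocess.run(["hdfs", "dfs", "-put", "-f", path, landing_dir], check=True)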
Hardware requirements: A desktop or laptop with a basic configuration, i.e. 4 GB RAM and 500 GB of hard disk space.
Software requirements:
1. Core framework: Apache Hadoop (HDFS, MapReduce).
2. Utility software: Apache Pig, Apache Hive, Apache Flume, Sqoop, Oozie.
DATA WAREHOUSE
[Diagram: operational systems feed the data warehouse, which supports analysis, reporting and mining.]
• As we have seen earlier, we are using Hadoop as the data warehouse tool.
• Sqoop automates most of this process, relying on the database to describe the
schema for the data to be imported.
• Sqoop uses MapReduce to import and export the data, which provides parallel
operation as well as fault tolerance.
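As a hedged illustration of such an import, the sketch below shells out to the sqoop import command from Python; the JDBC URL, credentials, table name and HDFS target directory are placeholders rather than values from the project:

import subprocess

# Illustrative Sqoop import: pull a relational table into HDFS in parallel.
# All connection details and names below are assumed placeholders.
sqoop_import = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://localhost:3306/bankdb",   # assumed source database
    "--username", "bankuser",
    "--password-file", "/user/hadoop/.sqoop.pwd",        # keeps the password off the command line
    "--table", "transactions",                           # assumed table to import
    "--target-dir", "/datalake/raw/transactions",        # assumed HDFS target directory
    "--num-mappers", "4",                                # four parallel map tasks
]

subprocess.run(sqoop_import, check=True)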
Data Flow
[Diagram: input files (JSON, XML, pipe-delimited) are ingested into the datalake through Flume, while data from MySQL and Oracle is imported through Sqoop; Pig and Hive process the merged data, and Tableau provides the graphical data visualization.]
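To make the file-ingestion leg of this flow concrete, below is a minimal sketch of a Flume agent that watches a local spooling directory and delivers incoming files to an HDFS landing path; the agent name, directories and NameNode address are assumptions:

# Minimal Flume agent definition (spooling-directory source -> memory channel -> HDFS sink).
# Agent name, directories and the NameNode URL are assumed placeholders.
flume_conf = """
bank.sources  = files
bank.channels = mem
bank.sinks    = lake

bank.sources.files.type     = spooldir
bank.sources.files.spoolDir = /data/incoming/bank
bank.sources.files.channels = mem

bank.channels.mem.type     = memory
bank.channels.mem.capacity = 10000

bank.sinks.lake.type          = hdfs
bank.sinks.lake.hdfs.path     = hdfs://localhost:9000/datalake/raw/files
bank.sinks.lake.hdfs.fileType = DataStream
bank.sinks.lake.channel       = mem
"""

# Write the configuration, then start the agent with something like:
#   flume-ng agent --name bank --conf-file bank-agent.conf --conf $FLUME_HOME/conf
with open("bank-agent.conf", "w") as f:
    f.write(flume_conf)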
Apache Oozie is a workflow scheduler for Hadoop. It is a system that runs workflows of dependent jobs. Users can define Directed Acyclic Graphs (DAGs) of workflows, which can be run in parallel or sequentially in Hadoop.
It consists of two parts:
Workflow engine: responsible for storing and running workflows composed of Hadoop jobs, e.g. MapReduce, Pig and Hive jobs.
Coordinator engine: runs workflow jobs based on predefined schedules and the availability of data.
Life Cycle Of Oozie
Example of Time Scheduler in Oozie
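A minimal sketch of what such a time-based coordinator definition might look like; the application name, dates, frequency and workflow path are assumptions, and the definition would normally be uploaded to HDFS and submitted with the oozie job command:

# Illustrative Oozie coordinator: run a workflow once a day on a fixed schedule.
# Name, dates and the workflow application path are assumed placeholders.
coordinator_xml = """<coordinator-app name="daily-bank-etl"
                 frequency="${coord:days(1)}"
                 start="2020-01-01T00:00Z" end="2020-12-31T00:00Z"
                 timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
    <action>
        <workflow>
            <app-path>hdfs://localhost:9000/user/hadoop/apps/bank-etl</app-path>
        </workflow>
    </action>
</coordinator-app>
"""

# Save locally, copy to HDFS, then submit with something like:
#   oozie job -oozie http://localhost:11000/oozie -config job.properties -run
with open("coordinator.xml", "w") as f:
    f.write(coordinator_xml)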
Resultant Graphical Representation on Tableau After Applying Hive Queries on Merged Data
1. Find out how many customers purchased at least 10 products (a sketch of this query follows the list).
2. Find which product is in the most demand overall.
3. List the products that were sold in the highest quantity in Q1.
4. Find the customers who have not been active in the last 3 months.
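A minimal sketch of query 1, assuming the merged data is exposed as a Hive table named sales with customer_id and quantity columns and that HiveServer2 is reachable through the PyHive client; all of these names are assumptions:

from pyhive import hive

# Assumed HiveServer2 endpoint and database on the single-node cluster.
conn = hive.Connection(host="localhost", port=10000, username="hadoop", database="bankdb")
cursor = conn.cursor()

# Query 1: how many customers purchased at least 10 products overall.
# Table and column names (sales, customer_id, quantity) are assumed placeholders.
cursor.execute("""
    SELECT COUNT(*)
    FROM (
        SELECT customer_id
        FROM sales
        GROUP BY customer_id
        HAVING SUM(quantity) >= 10
    ) buyers
""")

print("Customers with at least 10 products:", cursor.fetchone()[0])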
Summary
This project was undertaken as a Live Project from Technogeeks. Its main objective was to collect data from different sources, transform semi-structured data into structured data, merge and store the data, and analyse it to retrieve information in accordance with the requirements of the bank.
THANK YOU