

DATA WAREHOUSE & BUSINESS INTELLIGENCE

AN OVERVIEW
CONTENTS
1 Database and Data Warehouse: An Introduction
2 History of Data Warehouse
3 Evolution in Organizational Use of Data Warehouse
4 Benefits of Data Warehouse
5 Data Warehouse Architecture
6 Data Quality and ETL Process
7 Dimensional Modeling of Data Warehouse
8 Online Analytical Processing (OLAP)
9 Business Intelligence
Data, Data everywhere…
• I can’t find the data I need
– data is scattered over the network
– many versions, subtle differences
• I can’t get the data I need
– need an expert to get the data
• I can’t understand the data I found
– available data is poorly documented
• I can’t use the data I found
– results are unexpected
– data needs to be transformed from one form to another
What is a Data Warehouse?

A single, complete and consistent store of data, obtained from a variety of different sources, made available to end users for use in a business context.
Data warehouse is….
• Subject Oriented: Data that gives information about a particular subject instead of about a company's ongoing operations.
• Integrated: Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.
• Time-variant: All data in the data warehouse is identified with a particular time period.
• Non-volatile: Data is stable in a data warehouse. More data is added, but data is never removed.
Transactional Data Hub vs Data Warehouse
Transactional Data Hub           | Data Warehouse
Application oriented             | Subject oriented
Used to capture business data    | Used to analyse business data
Detailed data                    | Summarized and refined data
Current, up-to-date data         | Time-variant data in consolidated form
Operational user                 | Knowledge user
Few records accessed at a time   | Large volumes accessed at a time
Read/update access               | Read-only access
Database size: 100 MB - 100 GB   | Database size: petabytes to zettabytes
Evolution: Organizational use of Data Warehouses
• Offline Operational Database
Data warehouses in this initial stage are developed by simply copying
the data off an operational system to another server where the
processing load of reporting against the copied data does not impact
the operational system's performance.
• Offline Data Warehouse
Data warehouses at this stage are updated from data in the
operational systems on a regular basis and the data warehouse data is
stored in a data structure designed to facilitate reporting.
• Real-Time Data Warehouse
Data warehouses at this stage are updated every time an operational
system performs a transaction (e.g. an order or a delivery or a
booking.)
• Integrated Data Warehouse
Data warehouses at this stage are updated every time an operational
system performs a transaction. The data warehouses then generate
transactions that are passed back into the operational systems.
History of Data Warehousing

• The concept of data warehousing dates back to the late 1980s, when IBM researchers Barry Devlin and Paul Murphy developed the "business data warehouse".
• 1960s: General Mills and Dartmouth College, in a joint
research project, develop the terms dimensions and facts.
• 1970s: ACNielsen and IRI provide dimensional data marts
for retail sales.
• 1970s: Bill Inmon begins to define and discuss the term:
Data Warehouse.
• 1983: Teradata introduces a database management
system specifically designed for decision support.
History of Data Warehousing (Contd…)
• 1988: Barry Devlin and Paul Murphy publish the article "An architecture for a business and information system" in the IBM Systems Journal, introducing the term "business data warehouse".
• 1996: Ralph Kimball publishes the book The Data Warehouse Toolkit.
• 2012: Bill Inmon develops and makes public the technology known as "textual disambiguation".
Data Marts
Data marts are analytical data stores designed to focus on
specific business functions for a specific community within an
organization.
Reasons for creating a Data Mart
• Easy access to frequently needed data
• Creates a collective view for a group of users
• Improves end-user response time
• Ease of creation in less time
• Lower cost than implementing a full Data warehouse
• Potential users are more clearly defined than in a full Data
warehouse
Inmon’s top-down approach

• Inmon defines the data warehouse as a centralized repository for the entire enterprise.
• The data warehouse stores the 'atomic' data at the lowest level of detail.
• Dimensional data marts are created only after the complete data warehouse has been created.
Kimball’s bottom-up approach

• Kimball defines the data warehouse as "a copy of transaction data specifically structured for query and analysis." Also known as the BUS architecture.
• Dimensional modelling focuses on ease of end-user accessibility and provides a high level of performance to the data warehouse.
Inmon’s & Kimball’s Data Warehouse Model

(Diagram: Ralph Kimball data warehouse model)
Inmon vs Kimball
OLTP to OLAP
DW: Business Advantages and Disadvantages
Advantages:
• Provides a customer-centric view of data.
• Consolidates data.
• Removes barriers among functional areas.
• Reports on trends across multidivisional, multinational operating units.

Disadvantages:
• Not the optimal environment for unstructured data.
• An element of latency is present in data warehouse data.
• Maintenance costs are high.
• Data warehouses can get outdated relatively quickly.
• Duplicate, expensive functionality may be developed.
Cloud based Data Warehouse

Highly scalable & provides true business agility
Gartner 2016 Magic Quadrant for Data Warehouse
DATA WAREHOUSE ARCHITECTURE
Outline

• What is a Data Warehouse Architecture?
• Five Main Data Warehouse Architectures
• Factors That Affect Choosing a Data Warehouse Architecture
Data Warehouse Architecture
Data Warehouse Architecture…?
• Primarily based on the business processes of a business enterprise.
• A conceptualization of how the data warehouse is built.
Five Main Data Warehouse Architectures

• Independent Data Marts
• Data Mart Bus Architecture
• Hub-and-Spoke Architecture
• Centralized Data Warehouse
• Federated Architecture
Independent Data Marts
(a) Independent Data Marts Architecture
(Diagram: Source Systems → [ETL] → Staging Area → Independent data marts (atomic/summarized data) → End-user access and applications)

• Data marts that are independent of each other.
• Often created by organizational units.
• Inconsistent data definitions and different dimensions & measures.
Data Mart Bus Architecture

(b) Data Mart Bus Architecture with Linked Dimensional Data Marts
(Diagram: Source Systems → [ETL] → Staging Area → Dimensionalized data marts linked by conformed dimensions (atomic/summarized data) → End-user access and applications)

• Creation starts with a business requirements analysis.
• One mart is created for a single business process.
• Additional marts are developed using the conformed dimensions of the first mart.
Hub-and-Spoke
• Developed after an enterprise-level analysis of data requirements.
• Developed in an iterative manner.
• Dependent data marts obtain the data from the warehouse
• Consist of a centralized hub that accepts requests from multiple
applications that are connected through spokes.
(c) Hub-and-Spoke Architecture (Corporate Information Factory)
(Diagram: Source Systems → [ETL] → Staging Area → Normalized relational warehouse (atomic data) → Dependent data marts (summarized/some atomic data) → End-user access and applications)
Centralized Data Warehouse
• Similar to the hub-and-spoke architecture except there are
no dependent data marts.
• Contains atomic-level data, some summarized data, and
logical dimensional view of the data.
• Queries and applications access the warehouse data directly.

(d) Centralized Data Warehouse Architecture
(Diagram: Source Systems → [ETL] → Staging Area → Normalized relational warehouse (atomic/some summarized data) → End-user access and applications)
Federated Architecture
• Leaves existing decision-support structures in place
• Shares information among a number of different systems.
• Data is either logically or physically integrated
– Shared keys
– Global metadata
– Distributed queries

(e) Federated Architecture
(Diagram: Existing data warehouses, data marts, and legacy systems → Data mapping/metadata → Logical/physical integration of common data elements → End-user access and applications)
Factors that affect choosing a DW Architecture

• Information interdependence between organizational units
• Upper management’s information needs
• Urgency of need for a data warehouse
• Nature of end-user tasks
• Constraints on resources
• Strategic view of the data warehouse prior to implementation
• Compatibility with existing systems
• Perceived ability of the in-house IT staff
• Technical issues
• Social/political factors
DATA QUALITY MANAGEMENT
Data Quality
• Good data is your most valuable asset, and bad data can seriously
harm your business and credibility.
• Data quality is a perception or an assessment of data’s fitness to
serve its purpose in a given context.
DIMENSIONS OF DATA QUALITY
Correctness / Accuracy : Accuracy of data is the degree to which the
captured data correctly describes the real world entity.
Consistency: This is about a single version of truth. Consistency means data throughout the enterprise should be in sync.
Completeness: It is the extent to which the expected attributes of
data are provided.
Timeliness: Right data to the right person at the right time is
important for business.
How do we define Data Quality
Measurable dimensions:
• Accuracy: Does the data accurately represent reality or a verifiable source?
• Integrity: Do broken links exist between data that should be related?
• Consistency: Is there a single representation of data?
• Completeness: Is any key information missing?
• Uniqueness: Is the data value unique, i.e. no duplicate values or records?
• Accessibility: Is the data easily accessible, understandable, and used consistently?
• Precision: Is data stored with the precision required by the business?
• Timeliness: Is the information update frequency adequate to meet the business requirements?

Intangible dimensions:
• Relevance: Every piece of information stored is important in order to get a true business representation.
• Usability: The stored information is usable by the organization with ease.
• Usefulness: The stored information is applicable for the organization.
• Believability: The level to which the data is regarded as true and credible.
• Unambiguous: Each piece of data has a unique meaning and can be easily comprehended.
• Objectivity: Data is objective, unbiased and impartial, i.e. it does not depend on the judgment, interpretation, or evaluation of people.
Data Quality – Decomposition of Activities
Data quality results from the process of going through the data and scrubbing it, standardizing it, and de-duplicating records, as well as doing some data enrichment.

Activities: Discovery → Profiling → Standardization → Cleansing → Deduplication → Enrichment

1. Maintain complete data.
2. Clean data by standardizing it using rules.
3. Use fuzzy algorithms to detect duplicates.
4. Avoid entry of duplicates.
5. Merge existing duplicate records.
6. Introduce data governance.
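Step 3, fuzzy duplicate detection, can be sketched with Python's standard library. The customer names, the `name` field, and the 0.6 similarity threshold are illustrative assumptions, not part of any particular tool:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def find_duplicates(records, threshold=0.6):
    """Return index pairs of records whose names are fuzzily similar."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similarity(records[i]["name"], records[j]["name"]) >= threshold:
                pairs.append((i, j))
    return pairs

customers = [
    {"name": "Acme Corporation"},
    {"name": "ACME Corp."},
    {"name": "Globex Ltd"},
]
print(find_duplicates(customers))  # the first two records are flagged as likely duplicates
```

Real cleansing tools use more robust measures (phonetic codes, edit distance on tokenized fields) and blocking to avoid the O(n²) comparison, but the idea is the same.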
Data Profiling
• It is the process of statistically examining and analyzing the content
in a data source, and hence collecting information about the data.
• It consists of techniques used to analyze the data we have for
accuracy and completeness.
• Data profiling helps us make a thorough assessment of data quality.
• It assists the discovery of anomalies in data.
• It helps us understand content, structure, relationships, etc. about
the data in the data source we are analyzing.
• It helps us know whether the existing data can be applied to other
areas or purposes.
• It helps us understand the various issues/challenges we may face in
a database project much before the actual work begins. This enables
us to make early decisions and act accordingly.
• It is also used to assess and validate metadata.
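A minimal profiling pass can be sketched in plain Python: for each column, count rows, nulls, and distinct values. The sample rows and column names are illustrative assumptions:

```python
from collections import defaultdict

def profile(rows):
    """Collect simple per-column statistics: row count, null count, distinct values."""
    stats = defaultdict(lambda: {"count": 0, "nulls": 0, "distinct": set()})
    for row in rows:
        for col, val in row.items():
            s = stats[col]
            s["count"] += 1
            if val is None or val == "":
                s["nulls"] += 1          # treat None and empty string as missing
            else:
                s["distinct"].add(val)
    return {col: {"count": s["count"], "nulls": s["nulls"],
                  "distinct": len(s["distinct"])} for col, s in stats.items()}

rows = [
    {"customer_id": 1, "country": "DE"},
    {"customer_id": 2, "country": ""},
    {"customer_id": 3, "country": "DE"},
]
print(profile(rows))  # e.g. 'country' has 3 rows, 1 null, 1 distinct value
```

Production profilers add value distributions, pattern analysis, and cross-column dependency checks on top of exactly these basic counts.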
ETL (Extract-Transform-Load)
• ETL comes from Data Warehousing and stands for Extract-Transform-
Load
• Extract covers the data extraction from the source system and makes it accessible for further processing. The main objective of the extract step is to retrieve all the required data from the source system with as few resources as possible.
• Clean: The cleaning step is one of the most important, as it ensures the quality of the data in the data warehouse.
• Transform: The transform step applies a set of rules to transform the data from the source to the target. This includes converting any measured data to the same dimension (i.e. conformed dimension) using the same units so that they can later be joined.
• Load: During the load step, it is necessary to ensure that the load is performed correctly and with as few resources as possible.
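The steps above can be sketched as a minimal Python pipeline loading into SQLite. The sample orders, the fixed EUR/USD rate, and the `fact_orders` table name are assumptions made for the sketch:

```python
import sqlite3

def extract():
    # Extract: rows as they might arrive from a source system (assumed sample data)
    return [
        {"order_id": 1, "amount": "19.99", "currency": "EUR"},
        {"order_id": 2, "amount": "5.00",  "currency": "USD"},
    ]

EUR_PER_USD = 0.9  # assumed fixed rate, for illustration only

def transform(rows):
    # Transform: cast types and convert every amount to a conformed unit (EUR)
    out = []
    for r in rows:
        amount = float(r["amount"])
        if r["currency"] == "USD":
            amount *= EUR_PER_USD
        out.append((r["order_id"], round(amount, 2)))
    return out

def load(rows, conn):
    # Load: write the conformed rows into the warehouse table
    conn.execute("CREATE TABLE IF NOT EXISTS fact_orders "
                 "(order_id INTEGER, amount_eur REAL)")
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
print(conn.execute("SELECT COUNT(*), SUM(amount_eur) FROM fact_orders").fetchone())
```

A real pipeline would add the cleansing step between extract and transform, incremental extraction, and error handling, but the extract → transform → load shape is the same.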
ETL (Extract-Transform-Load)

Extraction, Transformation, and Load (ETL) Process
(Diagram: packaged applications, legacy systems, and other internal applications feed Extract → Transform → Cleanse → Load, via a transient data source, into the data warehouse and data marts)
DATA WAREHOUSING - SCHEMAS
Schemas used in Data Warehouse
• A schema is a logical description of the entire database. It includes the name and description of records of all record types, including all associated data items and aggregates. Much like a database, a data warehouse also requires a schema. A database uses the relational model, while a data warehouse uses:
– Star schema
– Snowflake schema
– Fact constellation schema
Star Schema
• Only one fact table, with measures.
• Multiple dimension tables contain the sets of attributes.
Snowflake Schema
A refinement of the star schema, where some dimensional hierarchies are normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.

• Unlike the star schema, the dimension tables in a snowflake schema are normalized.
Fact constellations

• Multiple fact tables share dimension tables. Viewed as a collection of stars, it is therefore called a galaxy schema or fact constellation.
Online Analytical Processing (OLAP)
Online Analytical Processing (OLAP)

OLAP is an acronym for Online Analytical Processing. OLAP performs multidimensional analysis of business data and provides the capability for complex calculations, trend analysis, and sophisticated data modeling.

Purpose of OLAP:
• To derive summarized information from large-volume databases
• To generate automated reports for human view
Choosing a Reporting Architecture

• Business needs
• Potential for growth
• Interface
• Enterprise architecture
• Network architecture
• Speed of access
• Openness

(Chart: query performance (OK to Good) vs. analysis complexity (Simple to Complex); ROLAP offers OK performance for simpler analysis, MOLAP good performance for complex analysis)
Multidimensional Data

• Sales volume as a function of product, month, and region

Dimensions: Product, Location, Time

Hierarchical summarization paths:
– Product: Industry → Category → Product
– Location (Region): Region → Country → City → Office
– Time: Year → Quarter → Month / Week → Day
Data Cube

(Figure: a 3-D data cube with dimensions Product (TV, PC, VCR), Date (1Qtr-4Qtr), and Country (Portugal, Spain, Germany), each axis carrying a "sum" margin)
Slicing and Dicing
Typical OLAP Operations
Roll up (drill-up): summarize data
– by climbing up hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up
– from higher level summary to lower level summary or
detailed data, or introducing new dimensions
Slice and dice: project and select
Pivot (rotate):
– reorient the cube, visualization, 3D to series of 2D planes.
Other operations
– drill across: involving (across) more than one fact table
– drill through: through the bottom level of the cube to its
back-end relational tables (using SQL)
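Roll-up and slice can be sketched on a toy cube held as a Python dict; the cube cells (product, quarter, country → sales) are illustrative sample data:

```python
from collections import defaultdict

# Base cells of a small cube: (product, quarter, country) -> sales
cube = {
    ("TV", "Q1", "Spain"):   5, ("TV", "Q1", "Germany"): 7,
    ("TV", "Q2", "Spain"):   4, ("PC", "Q1", "Spain"):   9,
    ("PC", "Q2", "Germany"): 6,
}

def roll_up(cube, keep):
    """Roll up: aggregate away the dimensions not listed in `keep` (by position)."""
    out = defaultdict(int)
    for cell, value in cube.items():
        out[tuple(cell[i] for i in keep)] += value
    return dict(out)

def slice_(cube, dim, value):
    """Slice: fix one dimension to a single value, keeping the matching cells."""
    return {cell: v for cell, v in cube.items() if cell[dim] == value}

# Roll up to (product, quarter): the country dimension is summed out
print(roll_up(cube, keep=(0, 1)))
# Slice the cube at country = 'Spain'
print(slice_(cube, dim=2, value="Spain"))
```

Drill-down is simply the reverse of `roll_up` (returning to the more detailed cells), and dicing is a slice over several dimensions at once.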
Implementation

Nowadays systems can be divided into three categories:
– ROLAP (Relational OLAP)
  • OLAP supported on top of a relational database
– MOLAP (Multi-Dimensional OLAP)
  • Use of special multi-dimensional data structures
– HOLAP (Hybrid): combination of the previous two
MOLAP

The database is stored in a special, usually proprietary, structure that is optimized for multidimensional analysis.
• Advantages: Very fast query response time, because data is mostly pre-calculated.
• Disadvantages: Practical limit on size, because of the time taken to calculate the database and the space required to hold the pre-calculated values.
ROLAP

The database is a standard relational database, and the database model is a multidimensional model, often referred to as a star or snowflake model or schema.
• Advantages: More scalable solution.
• Disadvantages: Query performance is largely governed by the complexity of the SQL and the number and size of the tables being joined in the query.
HOLAP

HOLAP (hybrid online analytical processing) is a combination of ROLAP (relational OLAP) and MOLAP (multidimensional OLAP), the other possible implementations of OLAP. HOLAP allows storing part of the data in a MOLAP store and another part in a ROLAP store, allowing a tradeoff between the advantages of each.
Developing an OLAP Cube by Using BIDS

SQL Server Business Intelligence Development Studio (BIDS)
Developing an OLAP Cube by Using MS BIDS
Developing an OLAP Cube by Using SSIS
BUSINESS INTELLIGENCE
What is Business Intelligence

Originally a term coined by the Gartner Group in 1993, Business Intelligence (BI) is a broad range of software and solutions aimed at collecting, consolidating, analyzing, and providing access to information that allows users across the business to make better decisions.

The technology includes software for database query and analysis, multidimensional databases or OLAP tools, data warehousing and data mining, and web-enabled reporting capabilities.
Why do companies need BI?

Data Access & Reporting (Operational):
• What happened? (standard reports)
• How many, how often, where? (ad hoc reports)
• Where exactly is the problem? (query/drill down)
• What actions are needed? (alerts)

Analytics (Tactical & Strategic), driving competitive advantage:
• Why is this happening? (statistical analysis)
• What if these trends continue? (forecasting/extrapolation)
• What will happen next? (predictive modeling)
• What’s the best that can happen? (optimization)

(Sophistication of intelligence increases from reporting to analytics.)
Stages in Business Intelligence
The characteristics of a business
intelligence solution
• Single point of access to information
• Timely answers to business questions
• Using BI in all departments of an organization
• “BI today is like reading the newspaper”
– a BI reporting tool on top of a data warehouse that loads nightly and produces historical reporting
• BI tomorrow will focus more on real-time events and predicting tomorrow’s headlines
Common Pain Points...
• Data everywhere, information nowhere
• Different users have different needs
– Excel versus PDF
– On demand – on schedule
– Your format – my format
• Takes too long – wasted resources/efforts
• Security
• Technical “mumbo jumbo” … Why I just can’t get it to you
when you want it.
Improving organizations by providing
business insights to all employees leading to
better, faster, more relevant decisions

Advanced Analytics
Self Service Reporting
End-User Analysis
Business Performance
Management
Operational Applications
Embedded analytics
MODULES
• Graphical OLAP
• Key Performance Indicators
• Dash boards
• Forecasting
• Graphical Reporting
Retail Analytics
• Market Basket Analytics
• Text Analytics
• Customer Segmentation/Clustering
• Tailored Product Assortments
• Inventory Forecasting
BI Reporting types
• KPI(Key Performance Indicator)
• Dashboards
• Ad-hoc reports
• Multidimensional OLAP reports
Key Performance Indicator (KPI)
• Is a measurable value that demonstrates how effectively a
company is achieving key business objectives
• Organizations use KPIs to evaluate their success at reaching
targets
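A KPI is typically reported as attainment against a target; a minimal sketch with assumed revenue figures:

```python
def kpi_attainment(actual, target):
    """KPI attainment as a percentage of target (target assumed non-zero)."""
    return round(100.0 * actual / target, 1)

# Example: a monthly revenue KPI with a target of 120k (assumed figures)
print(kpi_attainment(actual=93_000, target=120_000))  # 77.5 (% of target reached)
```

Dashboards usually color-code this value against thresholds (e.g. red below 80%, green above 95%) so attainment can be read at a glance.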
Dashboards
• Provides at-a-glance views of KPIs
• 4 key elements to a good dashboard
– Simple, communicates easily
– Minimum distractions
– Presents organized business data that is meaningful and useful
– Applies human visual perception to the visual presentation of information
Ad-hoc Reporting
• Simple reports created by the end users on demand
• Designed from scratch or using a standard report as a template
• Sample applications: Cognos Analysis Studio
Multidimensional OLAP reports
• Usually provide more general information - using dynamic
drill-down, slicing, dicing and filtering users can get the
information they need
• Reports with fixed design defined by a report designer
• Usually made available on a web server or a shared drive
• Sample applications: Cognos PowerPlay, Business Objects,
Pentaho Mondrian
QUESTIONS?
