DATA WAREHOUSE - Imp
AN OVERVIEW
CONTENTS
1 Database and Data Warehouse: An Introduction
9 Business intelligence
Data, Data everywhere…
• I can’t find the data I need
– data is scattered over the network
– many versions, subtle differences
Highly scalable and provides true business agility
Gartner 2016 Magic Quadrant for Data Warehouse
DATA WAREHOUSE ARCHITECTURE
Outline
• Federated Architecture
Independent Data Marts
(a) Independent Data Marts Architecture
Figure (ETL flow): Source systems -> Staging area -> Independent data marts (atomic/summarized data) -> End-user access and applications
Figure (ETL flow): Source systems -> Staging area -> Dimensionalized data marts linked by conformed dimensions (atomic/summarized data) -> End-user access and applications
Figure (ETL flow): Source systems -> Staging area -> Normalized relational warehouse (atomic data) -> End-user access and applications
Figure (ETL flow): Source systems -> Staging area -> Normalized relational warehouse (atomic/some summarized data) -> End-user access and applications
Federated Architecture
• Leaves existing decision-support structures in place
• Shares information among a number of different systems.
• Data is either logically or physically integrated
– Shared keys
– Global metadata
– Distributed queries
Figure: Profiling -> Standardization -> Cleansing -> Deduplication -> Enrichment
1. Maintain complete data.
2. Clean data by standardizing it using rules.
3. Use fuzzy algorithms to detect duplicates.
4. Avoid entry of duplicates.
5. Merge existing duplicate records.
6. Introduce data governance.
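The standardization and duplicate-detection steps listed above can be sketched in a few lines. This is only an illustration: the standardization rule, the 0.85 similarity threshold, and the sample customer names are assumptions, and Python's difflib stands in for whatever fuzzy algorithm a real tool would use.

```python
from difflib import SequenceMatcher

def standardize(name: str) -> str:
    """Rule-based standardization: lowercase, trim, collapse whitespace."""
    return " ".join(name.lower().split())

def is_duplicate(a: str, b: str, threshold: float = 0.85) -> bool:
    """Fuzzy match: similarity ratio of the standardized strings.

    The 0.85 threshold is an assumed value, not a recommended one.
    """
    ratio = SequenceMatcher(None, standardize(a), standardize(b)).ratio()
    return ratio >= threshold

def deduplicate(records, threshold: float = 0.85):
    """Keep the first record of each group of fuzzy duplicates."""
    kept = []
    for rec in records:
        if not any(is_duplicate(rec, k, threshold) for k in kept):
            kept.append(rec)
    return kept

# Hypothetical sample data with a casing variant and a typo.
customers = [
    "Acme Corporation",
    "  acme corporation ",
    "Acme Corporatoin",
    "Globex Ltd",
]
unique_customers = deduplicate(customers)
print(unique_customers)
```

Keeping the first record is the simplest merge policy; a real merge step would usually combine fields from all duplicates.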
Data Profiling
• It is the process of statistically examining and analyzing the content
in a data source, and hence collecting information about the data.
• It consists of techniques used to analyze the data we have for
accuracy and completeness.
• Data profiling helps us make a thorough assessment of data quality.
• It assists the discovery of anomalies in data.
• It helps us understand the content, structure, and relationships of
the data in the data source we are analyzing.
• It helps us know whether the existing data can be applied to other
areas or purposes.
• It helps us understand the issues and challenges we may face in
a database project well before the actual work begins. This enables
us to make early decisions and act accordingly.
• It is also used to assess and validate metadata.
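A minimal sketch of what profiling a single column can look like, assuming rows arrive as Python dictionaries. The function name profile_column and the sample rows are hypothetical; real profiling tools report many more statistics.

```python
from collections import Counter

def profile_column(rows, column):
    """Collect simple completeness and content statistics for one column."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v not in (None, "")]
    counts = Counter(present)
    return {
        "rows": len(values),
        "missing": len(values) - len(present),   # completeness check
        "distinct": len(counts),                 # hints at inconsistent spellings
        "most_common": counts.most_common(1)[0] if counts else None,
    }

# Hypothetical sample: one missing value and one casing anomaly ("spain").
rows = [
    {"id": 1, "country": "Spain"},
    {"id": 2, "country": "spain"},
    {"id": 3, "country": None},
    {"id": 4, "country": "Germany"},
    {"id": 5, "country": "Spain"},
]
country_profile = profile_column(rows, "country")
print(country_profile)
```

Here the distinct count of 3 for what should be two countries is the kind of anomaly profiling is meant to surface.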
ETL (Extract-Transform-Load)
• ETL comes from data warehousing and stands for
Extract-Transform-Load.
• Extract: covers extracting data from the source system and
making it accessible for further processing. The main objective of the
extract step is to retrieve all the required data from the source
system using as few resources as possible.
• Clean: the cleaning step is one of the most important, as it ensures
the quality of the data in the data warehouse.
• Transform: applies a set of rules to transform the data from the
source to the target. This includes converting measured data to the
same dimension (i.e. a conformed dimension) using the same units
so that the data can later be joined.
• Load: during the load step, it is necessary to ensure that the load is
performed correctly and using as few resources as possible.
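The four steps above can be sketched as a tiny pipeline. Everything here is an assumption made for illustration: the in-memory source rows, the product_weight table, and the kilogram target unit. An in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical source rows; weights arrive as text in mixed units.
SOURCE_ROWS = [
    {"product": "TV", "weight": "12", "unit": "kg"},
    {"product": "PC", "weight": "8500", "unit": "g"},
    {"product": "VCR", "weight": "", "unit": "kg"},  # incomplete record
]

def extract():
    """Extract: read all required rows from the source."""
    return list(SOURCE_ROWS)

def clean(rows):
    """Clean: drop rows whose weight is missing."""
    return [r for r in rows if r["weight"].strip()]

def transform(rows):
    """Transform: convert every weight to the same unit (kilograms)."""
    divisor = {"kg": 1, "g": 1000}
    return [(r["product"], float(r["weight"]) / divisor[r["unit"]]) for r in rows]

def load(conn, rows):
    """Load: insert all rows in one transaction so the load is all-or-nothing."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS product_weight (product TEXT, weight_kg REAL)"
    )
    with conn:
        conn.executemany("INSERT INTO product_weight VALUES (?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(conn, transform(clean(extract())))
loaded = conn.execute(
    "SELECT product, weight_kg FROM product_weight ORDER BY product"
).fetchall()
print(loaded)
```

Loading inside a single transaction is one simple way to check that the load either completes correctly or leaves the target unchanged.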
Figure (ETL flow): packaged application, legacy system, and other internal applications feed the Extract, Transform, Cleanse, and Load steps, which populate the data warehouse and data mart; the diagram also shows a transient data source.
DATA WAREHOUSING - SCHEMAS
Schemas used in Data Warehouse
• A schema is a logical description of the entire database. It
includes the name and description of records of all record
types, including all associated data items and aggregates.
Much like a database, a data warehouse also needs to
maintain a schema. A database uses the relational model,
while a data warehouse uses:
– Star schema
– Snowflake schema
– Fact Constellation schema.
Star Schema
• A single fact table holds the measures.
• Multiple dimension tables hold the descriptive attributes; each joins
directly to the fact table.
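A minimal star schema sketch in SQLite, assuming hypothetical tables fact_sales, dim_product, and dim_country with made-up rows. The fact table carries one measure (units_sold) and a key into each dimension table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_country (country_id INTEGER PRIMARY KEY, name TEXT);

-- The single fact table: foreign keys plus the measure.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    country_id INTEGER REFERENCES dim_country(country_id),
    units_sold INTEGER
);

INSERT INTO dim_product VALUES (1, 'TV'), (2, 'PC');
INSERT INTO dim_country VALUES (1, 'Spain'), (2, 'Germany');
INSERT INTO fact_sales VALUES (1, 1, 10), (1, 2, 5), (2, 1, 7);
""")

# A typical star query: join the fact table to one dimension and aggregate.
units_by_product = conn.execute("""
    SELECT p.name, SUM(f.units_sold)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.name
    ORDER BY p.name
""").fetchall()
print(units_by_product)
```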
Snowflake Schema
A refinement of the star schema in which some dimensional hierarchies
are normalized into sets of smaller dimension tables, forming a shape
similar to a snowflake
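To illustrate the normalization, the sketch below splits a product hierarchy into two dimension tables, so reaching the category takes one extra join. The tables dim_category, dim_product, and fact_sales and all rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The product hierarchy is normalized: category lives in its own table.
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    name TEXT,
    category_id INTEGER REFERENCES dim_category(category_id)
);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    units_sold INTEGER
);

INSERT INTO dim_category VALUES (1, 'Video'), (2, 'Computing');
INSERT INTO dim_product VALUES (1, 'TV', 1), (2, 'VCR', 1), (3, 'PC', 2);
INSERT INTO fact_sales VALUES (1, 10), (2, 4), (3, 7);
""")

# Two joins are needed to reach the category level: fact -> product -> category.
units_by_category = conn.execute("""
    SELECT c.name, SUM(f.units_sold)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_category AS c ON c.category_id = p.category_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(units_by_category)
```

The trade-off is less redundancy in the dimension data at the cost of more joins per query.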
• Business needs
• Potential for growth
• Interface
• Enterprise architecture
• Network architecture
• Speed of access
• Openness
Figure: query performance (OK to Good) versus analysis (Simple to Complex), positioning MOLAP and ROLAP.
Multidimensional Data
Figure: dimension labels Office, Day, Month.
Data Cube
Figure: a data cube with three dimensions: Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), and Country (Portugal, Spain, Germany, sum).
Slicing and Dicing
Typical OLAP Operations
Roll up (drill-up): summarize data
– by climbing up hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up
– from higher level summary to lower level summary or
detailed data, or introducing new dimensions
Slice and dice: project and select
Pivot (rotate):
– reorient the cube, visualization, 3D to series of 2D planes.
Other operations
– drill across: involving (across) more than one fact table
– drill through: through the bottom level of the cube to its
back-end relational tables (using SQL)
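The roll-up, drill-down, slice, and dice operations above can be sketched on a small list of facts. The data and the function names (roll_up, slice_cube, dice) are made up for illustration; an OLAP engine would run these against a cube rather than a Python list.

```python
from collections import defaultdict

# Hypothetical facts: (quarter, product, country, units_sold).
FACTS = [
    ("1Qtr", "TV", "Spain", 10),
    ("1Qtr", "PC", "Spain", 4),
    ("2Qtr", "TV", "Spain", 6),
    ("1Qtr", "TV", "Germany", 8),
    ("2Qtr", "PC", "Germany", 5),
]

def roll_up(facts, keep):
    """Roll up by dimension reduction: sum units over dimensions not in keep."""
    totals = defaultdict(int)
    for quarter, product, country, units in facts:
        dims = {"quarter": quarter, "product": product, "country": country}
        totals[tuple(dims[d] for d in keep)] += units
    return dict(totals)

def slice_cube(facts, country):
    """Slice: fix a single value on one dimension."""
    return [f for f in facts if f[2] == country]

def dice(facts, quarters, products):
    """Dice: select a sub-cube by restricting several dimensions."""
    return [f for f in facts if f[0] in quarters and f[1] in products]

by_country = roll_up(FACTS, ["country"])                      # roll up
by_country_product = roll_up(FACTS, ["country", "product"])   # drill down
spain_only = slice_cube(FACTS, "Spain")
tv_first_quarter = dice(FACTS, {"1Qtr"}, {"TV"})
print(by_country, by_country_product)
```

Drill-down here is simply the same aggregation with one more dimension kept, which mirrors its description as the reverse of roll-up.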
Stages in Business Intelligence
Figure: axes labeled Implementation and Sophistication of Intelligence.
The characteristics of a business intelligence solution
• Single point of access to information
• Timely answers to business questions
• Using BI in all departments of an organization.
• “BI today is like reading the newspaper”
Advanced Analytics
Self Service Reporting
End-User Analysis
Business Performance Management
Operational Applications
Embedded analytics
MODULES
• Graphical OLAP
• Key Performance Indicators
• Dashboards
• Forecasting
• Graphical Reporting
Retail Analytics
• Market Basket Analytics
• Text Analytics
• Customer Segmentation/Clustering
• Tailored Product Assortments
• Inventory Forecasting
BI Reporting types
• KPI (Key Performance Indicator)
• Dashboards
• Ad-hoc reports
• Multidimensional OLAP reports
Key Performance Indicator (KPI)
• Is a measurable value that demonstrates how effectively a
company is achieving key business objectives
• Organizations use KPIs to evaluate their success at reaching
targets
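As a small illustration of evaluating success against a target, the sketch below computes attainment as actual divided by target and maps it to a status. The function names, the status labels, and the 0.9 warning level are assumptions; organizations define their own thresholds.

```python
def kpi_attainment(actual: float, target: float) -> float:
    """Fraction of the target achieved (1.0 means the target was met)."""
    if target <= 0:
        raise ValueError("target must be positive")
    return actual / target

def kpi_status(actual: float, target: float, warn_at: float = 0.9) -> str:
    """Classify a KPI; warn_at is an assumed threshold, not a standard."""
    ratio = kpi_attainment(actual, target)
    if ratio >= 1.0:
        return "on target"
    if ratio >= warn_at:
        return "at risk"
    return "off target"

# Hypothetical monthly sales figures against a target of 100 units.
for actual in (120, 95, 50):
    print(actual, kpi_status(actual, 100))
```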
Dashboards
• Provide at-a-glance views of KPIs
• 4 key elements of a good dashboard:
– Simple, communicates easily
– Minimum distractions
– Supports organized business with meaningful and useful data
– Applies human visual perception to the visual presentation of
information
Ad-hoc Reporting
• Simple reports created by the end users on demand
• Designed from scratch or using a standard report as a
template
• Sample applications: Cognos Analysis Studio
Multidimensional OLAP reports
• Usually provide more general information; using dynamic
drill-down, slicing, dicing, and filtering, users can get the
information they need
• Reports with fixed design defined by a report designer
• Usually are made available on the web server or a shared
drive
• Sample applications: Cognos PowerPlay, Business Objects,
Pentaho Mondrian
QUESTIONS?