Unit 1

Uploaded by

Astha Shukla
Copyright
© All Rights Reserved
UNIT-1

DATA WAREHOUSING

Overview: The term "Data Warehouse" was first coined by Bill Inmon in 1990. According to Inmon, a data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data. This data helps analysts to take informed decisions in an organization.

An operational database undergoes frequent changes on a daily basis on account of the transactions that take place. Suppose a business executive wants to analyze previous feedback on some data, such as a product, a supplier, or a consumer. In an operational database the executive will have no data available to analyze, because the previous data has been overwritten by later transactions.

A data warehouse provides us with generalized and consolidated data in a multidimensional view. Along with a generalized and consolidated view of data, a data warehouse also provides us with Online Analytical Processing (OLAP) tools. These tools help us in interactive and effective analysis of data in a multidimensional space. This analysis results in data generalization and data mining.

Data mining functions such as association, clustering, classification, and prediction can be integrated with OLAP operations to enhance the interactive mining of knowledge at multiple levels of abstraction. That is why the data warehouse has now become an important platform for data analysis and online analytical processing.

Understanding a Data Warehouse:


● A data warehouse is a database, which is kept separate from the
organization's operational database.

● There is no frequent updating done in a data warehouse.


● It possesses consolidated historical data, which helps the
organization to analyze its business.

● A data warehouse helps executives to organize, understand, and use their data to take strategic decisions.

● Data warehouse systems help in the integration of a diversity of application systems.

● A data warehouse system helps in consolidated historical data analysis.

Why a Data Warehouse is Separated from Operational Databases

A data warehouse is kept separate from operational databases due to the following reasons-

● An operational database is constructed for well-known tasks and workloads such as searching particular records, indexing, etc. In contrast, data warehouse queries are often complex and they present a general form of data.

● Operational databases support concurrent processing of multiple transactions. Concurrency control and recovery mechanisms are required for operational databases to ensure robustness and consistency of the database.

● An operational database query allows both read and modify operations, while an OLAP query needs only read-only access to stored data.

● An operational database maintains current data. On the other hand, a data warehouse maintains historical data.
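
The contrast between the two kinds of access can be sketched as follows. This is only an illustration using Python's built-in sqlite3 module; the orders table and its columns are assumptions made for the example, and a real warehouse would live in a separate database from the operational one.

```python
import sqlite3

# Hypothetical schema for illustration only; names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, product TEXT, qty INTEGER, year INTEGER)"
)
conn.executemany(
    "INSERT INTO orders (product, qty, year) VALUES (?, ?, ?)",
    [("pen", 10, 2022), ("pen", 5, 2023), ("ink", 2, 2023)],
)

# Operational-style access: find one particular record and modify it.
conn.execute("UPDATE orders SET qty = qty + 1 WHERE id = 1")
single = conn.execute("SELECT qty FROM orders WHERE id = 1").fetchone()[0]

# Warehouse-style access: a read-only aggregate over all historical rows.
totals = dict(
    conn.execute("SELECT product, SUM(qty) FROM orders GROUP BY product").fetchall()
)
print(single, totals)
```

The first query touches a single row and changes it, while the second only reads and summarizes many rows, which is the typical shape of an OLAP query.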

Data Warehouse Features:

The key features of a data warehouse are discussed below-


● Subject Oriented: A data warehouse is subject oriented
because it provides information around a subject rather than the
organization's ongoing operations. These subjects can be product,
customers, suppliers, sales, revenue, etc. A data warehouse does
not focus on the ongoing operations, rather it focuses on modelling
and analysis of data for decision making.

● Integrated: A data warehouse is constructed by integrating data from heterogeneous sources such as relational databases, flat files, etc. This integration enhances the effective analysis of data.

● Time Variant: The data collected in a data warehouse is identified with a particular time period. The data in a data warehouse provides information from the historical point of view.

● Non-volatile: Non-volatile means the previous data is not erased when new data is added to it. A data warehouse is kept separate from the operational database, and therefore frequent changes in the operational database are not reflected in the data warehouse.

Note- A data warehouse does not require transaction processing, recovery, and concurrency controls, because it is physically stored separately from the operational database.

Data Warehouse Applications:

As discussed before, a data warehouse helps business executives to organize, analyze, and use their data for decision making. A data warehouse serves as an integral part of a plan-execute-assess "closed-loop" feedback system for enterprise management. Data warehouses are widely used in the following fields -

● Financial services

● Banking services
● Consumer goods

● Retail sectors

● Controlled manufacturing

Types of Data Warehouse

Information processing, analytical processing, and data mining are the


three types of data warehouse applications that are discussed below -

● Information Processing: A data warehouse allows us to process the data stored in it. The data can be processed by means of querying, basic statistical analysis, and reporting using crosstabs, tables, charts, or graphs.

● Analytical Processing: A data warehouse supports analytical processing of the information stored in it. The data can be analyzed by means of basic OLAP operations, including slice-and-dice, drill down, drill up, and pivoting.

● Data Mining: Data mining supports knowledge discovery by finding hidden patterns and associations, constructing analytical models, and performing classification and prediction. These mining results can be presented using visualization tools.
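
The basic OLAP operations named above can be sketched on a tiny in-memory data set. This is a minimal illustration in plain Python, not how an OLAP server implements them; the fact rows and the function names are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, quarter, city, item, sales)
facts = [
    (2023, "Q1", "Delhi", "pen", 100),
    (2023, "Q1", "Pune", "pen", 80),
    (2023, "Q2", "Delhi", "ink", 50),
    (2024, "Q1", "Delhi", "pen", 120),
]

def slice_(rows, year):
    """Slice: fix a single value on one dimension (here, year)."""
    return [r for r in rows if r[0] == year]

def dice(rows, years, cities):
    """Dice: pick a sub-cube by restricting two dimensions at once."""
    return [r for r in rows if r[0] in years and r[2] in cities]

def roll_up(rows, key_index):
    """Drill up: total the sales at a coarser level given by one column."""
    out = defaultdict(int)
    for r in rows:
        out[r[key_index]] += r[4]
    return dict(out)

def drill_down(rows, key_indexes):
    """Drill down: total the sales at a finer level, e.g. (year, quarter)."""
    out = defaultdict(int)
    for r in rows:
        out[tuple(r[i] for i in key_indexes)] += r[4]
    return dict(out)

def pivot(rows, row_index, col_index):
    """Pivot: lay one dimension out as rows and another as columns."""
    out = defaultdict(lambda: defaultdict(int))
    for r in rows:
        out[r[row_index]][r[col_index]] += r[4]
    return {k: dict(v) for k, v in out.items()}

print(roll_up(facts, 0))
print(drill_down(facts, (0, 1)))
print(pivot(facts, 2, 3))
```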

Data Warehousing and Architecture:

Data warehousing is the process of constructing and using a data warehouse. A data warehouse is constructed by integrating data from multiple heterogeneous sources that support analytical reporting, structured and/or ad hoc queries, and decision making. Data warehousing involves data cleaning, data integration, and data consolidation.

Using Data Warehouse Information:


There are decision support technologies that help utilize the data
available in a data warehouse. These technologies help executives to
use the warehouse quickly and effectively. They can gather data,
analyze it, and take decisions based on the information present in the
warehouse. The information gathered in a warehouse can be used in
any of the following domains -

● Tuning Production Strategies: The product strategies can be well tuned by repositioning the products and managing the product portfolios by comparing the sales quarterly or yearly.

● Customer Analysis: Customer analysis is done by analyzing the customer's buying preferences, buying time, budget cycles, etc.

● Operations Analysis: Data warehousing also helps in customer relationship management and in making environmental corrections. The information also allows us to analyze business operations.

Integrating Heterogeneous Databases

To integrate heterogeneous databases, we have two approaches -

● Query-driven Approach

● Update-driven Approach

Query-Driven Approach:
This is the traditional approach to integrate heterogeneous databases.
This approach was used to build wrappers and integrators on top of
multiple heterogeneous databases. These integrators are also known as
mediators.

Process of Query-Driven Approach


● When a query is issued at the client side, a metadata dictionary translates the query into an appropriate form for the individual heterogeneous sites involved.

● These queries are then mapped and sent to the local query processors.

● The results from the heterogeneous sites are integrated into a global answer set.
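
The three steps above can be sketched with two toy "sites" that store the same kind of data in different shapes. This is a minimal sketch, assuming hypothetical wrapper and mediator functions; it is not the design of any particular product.

```python
# Two hypothetical sites with different local schemas.
site_a = [{"cust": "Asha", "amt": 120}, {"cust": "Ravi", "amt": 80}]
site_b = [("Asha", 40.0), ("Meena", 75.0)]

def wrapper_a(min_amount):
    """Translate the global query into site A's form and normalize its rows."""
    return [(r["cust"], float(r["amt"])) for r in site_a if r["amt"] >= min_amount]

def wrapper_b(min_amount):
    """Translate the global query into site B's form."""
    return [(name, amt) for name, amt in site_b if amt >= min_amount]

def mediator(min_amount):
    """Send the query to every site at query time and merge one global answer set."""
    answer = []
    for wrapper in (wrapper_a, wrapper_b):
        answer.extend(wrapper(min_amount))
    return sorted(answer)

print(mediator(50))
```

Note that every call to mediator contacts every site again, which is why this approach becomes expensive for frequent queries.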

Disadvantages

● Query-driven approach needs complex integration and filtering processes.

● This approach is very inefficient.

● It is very expensive for frequent queries.

● This approach is also very expensive for queries that require aggregations.

Update-Driven Approach:
This is an alternative to the traditional approach. Today's data warehouse systems follow the update-driven approach rather than the traditional approach discussed earlier. In the update-driven approach, the information from multiple heterogeneous sources is integrated in advance and stored in a warehouse. This information is available for direct querying and analysis.

Advantages:

This approach has the following advantages -

● This approach provides high performance.

● The data is copied, processed, integrated, annotated, summarized, and restructured in a semantic data store in advance.

● Query processing does not require an interface to process data at local sources.

Functions of Data Warehouse Tools and Utilities

The following are the functions of data warehouse tools and utilities-

● Data Extraction- Involves gathering data from multiple heterogeneous sources.

● Data Cleaning- Involves finding and correcting the errors in data.

● Data Transformation- Involves converting the data from legacy format to warehouse format.

● Data Loading- Involves sorting, summarizing, consolidating, checking integrity, and building indices and partitions.

● Refreshing- Involves updating from data sources to warehouse.

Note- Data cleaning and data transformation are important steps in improving the quality of data and data mining results.
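
The five functions above can be strung together in a small pipeline. The sketch below uses Python with the built-in sqlite3 module; the source records, the sales table, and the function names are all assumptions made for illustration. Cleaning here simply drops invalid rows, whereas a real tool might also correct them.

```python
import sqlite3

# Hypothetical source records in two legacy formats.
source_csv = ["1,pen, 10 ", "2,ink,abc", "3,pad,4"]
source_api = [{"id": 4, "item": "Pen", "qty": "7"}]

def extract():
    """Gather rows from both heterogeneous sources."""
    rows = [dict(zip(("id", "item", "qty"), line.split(","))) for line in source_csv]
    return rows + source_api

def clean(rows):
    """Drop rows whose quantity is not a valid non-negative integer."""
    good = []
    for r in rows:
        qty = str(r["qty"]).strip()
        if qty.isdigit():
            good.append({**r, "qty": qty})
    return good

def transform(rows):
    """Convert to the warehouse format: typed values, lower-case item names."""
    return [(int(r["id"]), r["item"].strip().lower(), int(r["qty"])) for r in rows]

def load(conn, rows):
    """Load the rows and build an index; re-running it acts as a refresh."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (id INTEGER PRIMARY KEY, item TEXT, qty INTEGER)"
    )
    conn.executemany("INSERT OR REPLACE INTO sales VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_sales_item ON sales (item)")

conn = sqlite3.connect(":memory:")
load(conn, transform(clean(extract())))
summary = dict(conn.execute("SELECT item, SUM(qty) FROM sales GROUP BY item"))
print(summary)
```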

Business Analysis Framework:

Business analysts get information from the data warehouse to measure performance and make critical adjustments in order to win over other business holders in the market. Having a data warehouse offers the following advantages-

● Since a data warehouse can gather information quickly and efficiently, it can enhance business productivity.

● A data warehouse provides us with a consistent view of customers and items; hence, it helps us manage customer relationships.

● A data warehouse also helps in bringing down costs by tracking trends and patterns over a long period in a consistent and reliable manner.

To design an effective and efficient data warehouse, we need to understand and analyze the business needs and construct a business analysis framework. Each person has different views regarding the design of a data warehouse. These views are as follows -

● The top-down view- This view allows the selection of relevant information needed for a data warehouse.

● The data source view- This view presents the information being
captured, stored, and managed by the operational system.

● The data warehouse view- This view includes the fact tables and
dimension tables. It represents the information stored inside the
data warehouse.

● The business query view- It is the view of the data from the
viewpoint of the end-user.

Three-Tier Data Warehouse Architecture

Generally, a data warehouse adopts a three-tier architecture. Following are the three tiers of the data warehouse architecture.

● Bottom Tier- The bottom tier of the architecture is the data warehouse database server. It is the relational database system. We use the back-end tools and utilities to feed data into the bottom tier. These back-end tools and utilities perform the Extract, Clean, Load, and Refresh functions.

● Middle Tier- In the middle tier, we have the OLAP server, which can be implemented in either of the following ways.

❖ By Relational OLAP (ROLAP), which is an extended relational database management system. ROLAP maps the operations on multidimensional data to standard relational operations.
❖ By Multidimensional OLAP (MOLAP) model, which directly implements the multidimensional data and operations.

● Top Tier- This tier is the front-end client layer. This layer holds the query tools, reporting tools, analysis tools, and data mining tools.
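
The idea that ROLAP maps multidimensional operations to standard relational operations can be illustrated by turning a roll-up request into a GROUP BY query. This is only a sketch using Python's sqlite3 module; the sales table and the rollup_sql helper are assumptions made for the example and do not reflect how any specific OLAP server generates SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (year INTEGER, city TEXT, item TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (2023, "Delhi", "pen", 100.0),
        (2023, "Pune", "pen", 80.0),
        (2024, "Delhi", "ink", 50.0),
    ],
)

# Only these column names may be interpolated into the generated SQL.
ALLOWED_DIMENSIONS = {"year", "city", "item"}

def rollup_sql(dimensions):
    """Map a roll-up over the given dimensions to a standard GROUP BY query."""
    unknown = set(dimensions) - ALLOWED_DIMENSIONS
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    cols = ", ".join(dimensions)
    return f"SELECT {cols}, SUM(amount) FROM sales GROUP BY {cols} ORDER BY {cols}"

by_year = conn.execute(rollup_sql(["year"])).fetchall()
by_year_city = conn.execute(rollup_sql(["year", "city"])).fetchall()
print(by_year)
print(by_year_city)
```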

Data Warehouse Models:

From the perspective of data warehouse architecture, we have the following data warehouse models -

● Virtual Warehouse

● Data mart

● Enterprise Warehouse

Virtual Warehouse

A set of views over operational databases is known as a virtual warehouse. It is easy to build a virtual warehouse. Building a virtual warehouse requires excess capacity on operational database servers.
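
A virtual warehouse can be sketched as an ordinary SQL view defined on top of an operational table. The example below uses Python's sqlite3 module; the orders table and the sales_by_region view are hypothetical names chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (region, amount) VALUES (?, ?)",
    [("north", 10.0), ("north", 5.0), ("south", 7.0)],
)

# The "warehouse" is only a view; no data is copied out of the operational table.
conn.execute(
    "CREATE VIEW sales_by_region AS "
    "SELECT region, SUM(amount) AS total FROM orders GROUP BY region"
)

before = dict(conn.execute("SELECT region, total FROM sales_by_region"))
conn.execute("INSERT INTO orders (region, amount) VALUES ('south', 3.0)")
after = dict(conn.execute("SELECT region, total FROM sales_by_region"))
print(before, after)
```

Because the view is recomputed from the operational table on every query, the analytical load falls on the operational server, which is why excess capacity is needed there.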

Data Mart

A data mart contains a subset of organization-wide data. This subset of data is valuable to specific groups of an organization.

In other words, we can say that data marts contain data specific to a particular group. For example, the marketing data mart may contain data related to items, customers, and sales. Data marts are confined to subjects.

Points to remember about data marts -

● Windows-based or Unix/Linux-based servers are used to implement data marts. They are implemented on low-cost servers.
● The implementation cycle of a data mart is measured in short periods of time, i.e., in weeks rather than months or years.
● The life cycle of a data mart may be complex in the long run, if its planning and design are not organization-wide.
● Data marts are small in size.
● The source of a data mart is a departmentally structured data warehouse.
● Data marts are customized by department.
● Data marts are flexible.

Enterprise Warehouse
● An enterprise warehouse collects all the information and the subjects spanning an entire organization.
● It provides us with enterprise-wide data integration.
● The data is integrated from operational systems and external information providers.
● This information can vary from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.

Load Manager Architecture

This component performs the operations required for the extract and load process.

● Extract the data from the source system.
● Fast-load the extracted data into a temporary data store.
● Perform simple transformations into a structure similar to the one in the data warehouse.

Extract Data from Source

The data is extracted from the operational databases or the external information providers. Gateways are the application programs that are used to extract data. A gateway is supported by the underlying DBMS and allows a client program to generate SQL to be executed at a server. Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) are examples of gateways.

Fast Load
● In order to minimize the total load window, the data needs to be loaded into the warehouse in the fastest possible time.
● The transformations affect the speed of data processing.
● It is more effective to load the data into a relational database prior to applying transformations and checks.
● Gateway technology proves to be unsuitable, since gateways tend not to perform well when large data volumes are involved.

Simple Transformations

While loading, it may be required to perform simple transformations. After this has been completed, we are in a position to do the complex checks. Suppose we are loading the EPOS sales transactions; we need to perform the following checks:
● Strip out all the columns that are not required within the warehouse.
● Convert all the values to the required data types.
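
The two simple transformations listed above can be sketched in a few lines of Python. The column names and target types below are assumptions chosen for illustration; an actual EPOS feed would define its own.

```python
from datetime import date

# Hypothetical list of required warehouse columns and their target types.
REQUIRED = {
    "store_id": int,
    "sku": str,
    "qty": int,
    "price": float,
    "sold_on": date.fromisoformat,
}

def simple_transform(raw):
    """Keep only the required columns and convert each value to its target type."""
    return {col: convert(raw[col]) for col, convert in REQUIRED.items()}

# A raw EPOS record where every value arrives as text, with two unwanted columns.
raw_row = {
    "store_id": "12",
    "sku": "A-100",
    "qty": "3",
    "price": "19.50",
    "sold_on": "2024-01-15",
    "till_operator": "x91",
    "receipt_footer": "Thank you",
}
row = simple_transform(raw_row)
print(row)
```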

Warehouse Manager

A warehouse manager is responsible for the warehouse management process. It consists of third-party system software, C programs, and shell scripts.

The size and complexity of warehouse managers vary between specific solutions.

Warehouse Manager Architecture

A warehouse manager includes the following -


● The controlling process
● Stored procedures or C with SQL
● Backup/Recovery tool
● SQL Scripts

Operations Performed by Warehouse Manager


● Analyzes the data to perform consistency and referential integrity checks.
● Creates indexes, business views, and partition views against the base data.
● Generates new aggregations and updates existing aggregations. Generates normalizations.
● Transforms and merges the source data into the published data warehouse.
● Backs up the data in the data warehouse.
● Archives the data that has reached the end of its captured life.

Note- A warehouse manager also analyzes query profiles to determine which indexes and aggregations are appropriate.

Query Manager:
● Query manager is responsible for directing the queries to the
suitable tables.
● By directing the queries to appropriate tables, the speed of
querying and response generation can be increased.
● Query manager is responsible for scheduling the execution of the
queries posed by the user.
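
Directing a query to the most suitable table can be sketched as a lookup from the requested grouping to a pre-built summary table, falling back to the detail table otherwise. This is a minimal sketch using Python's sqlite3 module; the table names, the SUMMARIES catalogue, and both functions are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_detail (day TEXT, item TEXT, amount REAL);
INSERT INTO sales_detail VALUES
    ('2024-01-01', 'pen', 10), ('2024-01-01', 'pen', 5), ('2024-01-02', 'ink', 7);
CREATE TABLE sales_by_item AS
    SELECT item, SUM(amount) AS amount FROM sales_detail GROUP BY item;
""")

# Hypothetical catalogue: which grouping each summary table can answer.
SUMMARIES = {frozenset({"item"}): "sales_by_item"}

def choose_table(group_by):
    """Pick a summary table when one matches, else use the detail table."""
    return SUMMARIES.get(frozenset(group_by), "sales_detail")

def run_total(group_by):
    """Run a total-by-group query against the chosen table.

    group_by must come from trusted code, since the column names are
    placed directly into the SQL text.
    """
    table = choose_table(group_by)
    cols = ", ".join(group_by)
    sql = f"SELECT {cols}, SUM(amount) FROM {table} GROUP BY {cols} ORDER BY {cols}"
    return table, conn.execute(sql).fetchall()

print(run_total(["item"]))
print(run_total(["day"]))
```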

Query Manager Architecture:

The following screenshot shows the architecture of a query manager. It


includes the following:
● Query redirection via C tool or RDBMS
● Stored procedures
● Query management tool
● Query scheduling via C tool or RDBMS
● Query scheduling via third-party software

Detailed Information:

Detailed information is not kept online; rather, it is aggregated to the next level of detail and then archived to tape. The detailed information part of the data warehouse keeps the detailed information in the starflake schema. Detailed information is loaded into the data warehouse to supplement the aggregated data.

The following diagram shows a pictorial impression of where detailed


information is stored and how it is used.

Note- If detailed information is held offline to minimize disk storage, we


should make sure that the data has been extracted, cleaned up, and
transformed into starflake schema before it is archived.

Summary Information

Summary Information is a part of data warehouse that stores predefined


aggregations. These aggregations are generated by the warehouse
manager. Summary Information must be treated as transient. It changes
on-the-go in order to respond to the changing query profiles.

The points to note about summary information are as follows -


● Summary information speeds up the performance of common
queries.
● It increases the operational cost.
● It needs to be updated whenever new data is loaded into the data
warehouse.
● It may not need to be backed up, since it can be generated fresh from the detailed information.

Difference between Database System and Data Warehouse

Database System:
A database system is used in the traditional way of storing and retrieving data. The major task of a database system is to perform query processing. These systems are generally referred to as online transaction processing (OLTP) systems. They are used for the day-to-day operations of an organization.

Characteristics of Database
● Offers security and removes redundancy
● Allows multiple views of the data
● Follows ACID compliance (Atomicity, Consistency, Isolation, and Durability)
● Allows insulation between programs and data
● Supports sharing of data and multiuser transaction processing
● Relational databases support a multi-user environment

Data Warehouse:
A data warehouse is the place where a huge amount of data is stored. It is meant for users or knowledge workers in the role of data analysis and decision making. These systems are supposed to organize and present data in different formats and forms in order to serve the needs of a specific user for a specific purpose. These systems are referred to as online analytical processing (OLAP) systems.

Characteristics of Data Warehouse


● A data warehouse is subject oriented, as it offers information related to a theme instead of the company's ongoing operations.
● The data also needs to be stored in the data warehouse in a common and universally acceptable manner.
● The time horizon for the data warehouse is relatively extensive compared with operational systems.
● A data warehouse is non-volatile, which means the previous data is not erased when new information is entered in it.

Data Warehousing Advantages:

The successful implementation of a data warehouse can bring major benefits to an organization, including:

● Potential high returns on investment

Implementation of data warehousing by an organization requires a huge investment, typically from Rs 10 lakh to Rs 50 lakh. However, a study by the International Data Corporation (IDC) in 1996 reported that average three-year returns on investment (ROI) in data warehousing reached 401%.

● Competitive advantage

The huge returns on investment for those companies that have successfully implemented a data warehouse are evidence of the enormous competitive advantage that accompanies this technology. The competitive advantage is gained by allowing decision-makers access to data that can reveal previously unavailable, unknown, and untapped information on, for example, customers, trends, and demands.

● Increased productivity of corporate decision-makers


Data warehousing improves the productivity of corporate
decision-makers by creating an integrated database of consistent,
subject-oriented, historical data. It integrates data from multiple
incompatible systems into a form that provides one consistent view of
the organization. By transforming data into meaningful information, a
data warehouse allows business managers to perform more substantive,
accurate, and consistent analysis.

● More cost-effective decision-making

Data warehousing helps to reduce the overall cost of the product by reducing the number of channels.

● Better enterprise intelligence

It helps to provide better enterprise intelligence.

● Enhanced customer service

It is used to enhance customer service.

Metadata: Concepts and Classifications

Metadata is simply defined as data about data. The data that is used to represent other data is known as metadata. For example, the index of a book serves as metadata for the contents of the book. In other words, we can say that metadata is the summarized data that leads us to detailed data. In terms of a data warehouse, we can define metadata as follows.
● Metadata is the road-map to a data warehouse.
● Metadata in a data warehouse defines the warehouse objects.
● Metadata acts as a directory. This directory helps the decision support system to locate the contents of a data warehouse.

Categories of Metadata:

● Business Metadata: It has the data ownership information, business definition, and changing policies.

● Technical Metadata: It includes database system names, table and column names and sizes, data types, and allowed values. Technical metadata also includes structural information such as primary and foreign key attributes and indices.

● Operational Metadata: It includes currency of data and data lineage. Currency of data means whether the data is active, archived, or purged. Lineage of data means the history of the data migrated and the transformations applied on it.
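
One way to picture the three categories is as fields of a single record describing a warehouse table. The sketch below is a hypothetical Python dataclass made up for illustration; the class name, field names, and sample values are assumptions, and real metadata repositories are far richer.

```python
from dataclasses import dataclass, field

@dataclass
class TableMetadata:
    # Business metadata
    owner: str
    business_definition: str
    # Technical metadata
    table_name: str
    columns: dict        # column name -> data type
    primary_key: str
    # Operational metadata
    currency: str = "active"   # one of: active, archived, purged
    lineage: list = field(default_factory=list)

    def record_step(self, step):
        """Append one migration or transformation step to the data lineage."""
        self.lineage.append(step)

meta = TableMetadata(
    owner="Sales department",
    business_definition="One row per item sold",
    table_name="fact_sales",
    columns={"item_id": "INTEGER", "amount": "REAL"},
    primary_key="item_id",
)
meta.record_step("extracted from EPOS system")
meta.record_step("amount converted to REAL")
print(meta.currency, meta.lineage)
```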

Role of Metadata:

Metadata has a very important role in a data warehouse. Its role is different from that of the warehouse data itself. The various roles of metadata are explained below.
● Metadata acts as a directory.
● This directory helps the decision support system to locate the
contents of the data warehouse.
● Metadata helps in decision support system for mapping of data
when data is transformed from operational environment to data
warehouse environment.
● Metadata helps in summarization between current detailed data
and highly summarized data.
● Metadata also helps in summarization between lightly detailed
data and highly summarized data.
● Metadata is used for query tools.
● Metadata is used in extraction and cleansing tools.
● Metadata is used in reporting tools.
● Metadata is used in transformation tools.
● Metadata plays an important role in loading functions.

Metadata Repository:

Metadata repository is an integral part of a data warehouse system. It


has the following metadata:

● Definition of data warehouse: It includes the description of the structure of the data warehouse. The description is defined by schema, views, hierarchies, derived data definitions, and data mart locations and contents.
● Business metadata: It contains the data ownership information, business definition, and changing policies.
● Operational metadata: It includes currency of data and data lineage. Currency of data means whether the data is active, archived, or purged. Lineage of data means the history of the data migrated and the transformations applied on it.
● Data for mapping from operational environment to data warehouse: It includes the source databases and their contents, data extraction, data partitioning, cleaning, transformation rules, and data refresh and purging rules.
● Algorithms for summarization: It includes dimension algorithms, data on granularity, aggregation, summarizing, etc.

Challenges in Metadata Management:

The importance of metadata cannot be overstated. Metadata helps in driving the accuracy of reports, validates data transformation, and ensures the accuracy of calculations. Metadata also enforces the definition of business terms for business end-users. With all these uses, metadata also has its challenges. Some of the challenges are discussed below.

● Metadata in a big organization is scattered across the organization. This metadata is spread across spreadsheets, databases, and applications.
● Metadata could be present in text files or multimedia files. To use
this data for information management solutions, it has to be
correctly defined.
● There are no industry-wide accepted standards. Data
management solution vendors have narrow focus.
● There are no easy and accepted methods of passing metadata.

Multi-Dimensional Data Model, Data Cubes, Stars, Snow Flakes, Fact Constellations

Multi-Dimensional Data Model

A multidimensional model views data in the form of a data cube. A data cube enables data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts.

The dimensions are the perspectives or entities concerning which an organization keeps records. For example, a shop may create a sales data warehouse to keep records of the store's sales for the dimensions time, item, and location. These dimensions allow the store to keep track of things, for example, monthly sales of items and the locations at which the items were sold. Each dimension has a table related to it, called a dimensional table, which describes the dimension further. For example, a dimensional table for an item may contain the attributes item_name, brand, and type.

A multidimensional data model is organized around a central theme, for example, sales. This theme is represented by a fact table. Facts are numerical measures. The fact table contains the names of the facts, or measures, as well as keys to each of the related dimensional tables.

Working on a Multidimensional Data Model

A Multidimensional Data Model is built by following a set of pre-decided steps.

The following stages should be followed by every project for building a Multi Dimensional Data Model:
● Stage 1: Assembling data from the client. In the first stage, correct data is collected from the client. Mostly, software professionals make clear to the client the range of data that can be obtained with the selected technology, and collect the complete data in detail.
● Stage 2: Grouping different segments of the system. In the second stage, all the data is recognized and classified into the respective sections it belongs to, which makes the model easier to apply step by step.
● Stage 3: Noticing the different proportions. The third stage is the basis on which the design of the system rests. In this stage, the main factors are recognized according to the user's point of view. These factors are also known as "dimensions".
● Stage 4: Preparing the actual-time factors and their respective qualities. In the fourth stage, the factors recognized in the previous stage are used to identify their related qualities. These qualities are also known as "attributes" in the database.
● Stage 5: Finding the actuality of the factors listed previously and their qualities. In the fifth stage, the actuality (the facts) is separated and differentiated from the factors collected so far. These facts play a significant role in the arrangement of a Multi Dimensional Data Model.
● Stage 6: Building the schema to place the data, with respect to the information collected from the steps above. In the sixth stage, a schema is built on the basis of the data collected previously.

Data Cubes

In computer programming contexts, a data cube (or datacube) is a multi-dimensional ("n-D") array of values. Typically, the term datacube is applied in contexts where these arrays are massively larger than the hosting computer's main memory; examples include multi-terabyte/petabyte data warehouses and time series of image data.

The data cube is used to represent data (sometimes called facts) along some dimensions of interest. For example, in OLAP such dimensions could be the subsidiaries a company has, the products the company offers, and time; in this setup, a fact would be a sales event where a particular product has been sold in a particular subsidiary at a particular time. In satellite image time series, the dimensions would be latitude and longitude coordinates and time; a fact (sometimes called a measure) would be a pixel at a given space and time as taken by the satellite (following some processing that is not of concern here).

Even though it is called a cube (and the examples provided above happen to be 3-dimensional for brevity), a data cube generally is a multi-dimensional concept which can be 1-dimensional, 2-dimensional, 3-dimensional, or higher-dimensional. In any case, every dimension divides data into groups of cells, and each cell in the cube represents a single measure of interest. Sometimes cubes hold only a few values with the rest being empty, i.e. undefined; sometimes most or all cube coordinates hold a cell value. In the first case such data are called sparse, in the second case they are called dense, although there is no hard delineation between the two.
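
The difference between sparse and dense cubes can be made concrete with a small sketch. Here the cube is stored as a Python dictionary from coordinates to a measure, so empty cells simply have no entry; the dimension values and the density helper are assumptions made for illustration.

```python
# Hypothetical 3-dimensional cube: subsidiary x product x year.
dims = {
    "subsidiary": ["north", "south"],
    "product": ["pen", "ink", "pad"],
    "year": [2023, 2024],
}

# Sparse storage: only cells that hold a value are stored.
cube = {
    ("north", "pen", 2023): 100,
    ("south", "ink", 2024): 40,
    ("north", "pad", 2024): 15,
}

def density(cube, dims):
    """Fraction of all possible cells that actually hold a value."""
    total_cells = 1
    for values in dims.values():
        total_cells *= len(values)
    return len(cube) / total_cells

def cell(cube, *coords):
    """Return the measure at the given coordinates, or None if the cell is empty."""
    return cube.get(coords)

print(density(cube, dims))
print(cell(cube, "north", "pen", 2023), cell(cube, "south", "pen", 2023))
```

With 3 filled cells out of 2 x 3 x 2 = 12 possible cells, this cube would usually be described as sparse.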
Applications:

Multi-dimensional arrays can meaningfully represent spatio-temporal


sensor, image, and simulation data, but also statistics data where the
semantics of dimensions is not necessarily of spatial or temporal nature.
Generally, any kind of axis can be combined with any other into a
datacube.

Stars:

The star schema is the fundamental schema among the data mart schemas, and it is the simplest. This schema is widely used to develop or build a data warehouse and dimensional data marts. It includes one or more fact tables referencing any number of dimensional tables. The star schema is an important special case of the snowflake schema. It is also efficient for handling basic queries.

It is called a star schema because its physical model resembles a star shape, with a fact table at its center and the dimension tables at its periphery representing the star's points.
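
A star schema for the sales example used earlier (dimensions time, item, and location) might look like the sketch below. It uses Python's sqlite3 module; the table names, key columns, and sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables sit at the points of the star.
CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, month TEXT, year INTEGER);
CREATE TABLE dim_item (item_id INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
CREATE TABLE dim_location (location_id INTEGER PRIMARY KEY, city TEXT);

-- The fact table sits at the center: measures plus a key to each dimension.
CREATE TABLE fact_sales (
    time_id INTEGER REFERENCES dim_time(time_id),
    item_id INTEGER REFERENCES dim_item(item_id),
    location_id INTEGER REFERENCES dim_location(location_id),
    units_sold INTEGER,
    amount REAL
);

INSERT INTO dim_time VALUES (1, 'Jan', 2024), (2, 'Feb', 2024);
INSERT INTO dim_item VALUES (1, 'pen', 'Acme', 'stationery'), (2, 'ink', 'Acme', 'stationery');
INSERT INTO dim_location VALUES (1, 'Delhi'), (2, 'Pune');
INSERT INTO fact_sales VALUES (1, 1, 1, 10, 50.0), (1, 1, 2, 5, 25.0), (2, 2, 1, 3, 30.0);
""")

# Monthly units sold per item: one join per dimension used, all directly to the fact table.
monthly = conn.execute("""
SELECT t.month, i.item_name, SUM(f.units_sold)
FROM fact_sales f
JOIN dim_time t ON t.time_id = f.time_id
JOIN dim_item i ON i.item_id = f.item_id
GROUP BY t.month, i.item_name
ORDER BY t.month, i.item_name
""").fetchall()
print(monthly)
```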

Advantages of Star Schema:

Simpler Queries

The join logic of a star schema is much simpler than the join logic needed to fetch data from a highly normalized transactional schema.

Simplified Business Reporting Logic

In comparison to a transactional schema that is highly normalized, the star schema simplifies common business reporting logic, such as as-of reporting and period-over-period reporting.

Feeding Cubes

The star schema is widely used by OLAP systems to design OLAP cubes efficiently. In fact, major OLAP systems provide a ROLAP mode of operation which can use a star schema as a source without designing a cube structure.

Disadvantages of Star Schema:


● Data integrity is not well enforced, since the schema is highly denormalized.
● It is not as flexible in terms of analytical needs as a normalized data model.
● Star schemas do not readily support many-to-many relationships between business entities.

Snow Flakes

A snowflake schema in a data warehouse is a logical arrangement of
tables in a multidimensional database such that the ER diagram resembles
a snowflake shape. A snowflake schema is an extension of a star schema,
and it adds additional dimension tables. The dimension tables are
normalized, which splits data into additional tables.
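
The effect of normalizing a dimension can be sketched as follows. This is a minimal illustration using sqlite3, assuming a hypothetical product dimension whose category attribute has been moved into its own lookup table (dim_category); all table names and rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The category attribute is split out of the product dimension
-- into its own lookup table (this is the "snowflaking" step).
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product (
    product_id  INTEGER PRIMARY KEY,
    name        TEXT,
    category_id INTEGER REFERENCES dim_category(category_id)
);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    amount     REAL
);
INSERT INTO dim_category VALUES (1, 'Electronics'), (2, 'Furniture');
INSERT INTO dim_product VALUES (1, 'Laptop', 1), (2, 'Phone', 1), (3, 'Desk', 2);
INSERT INTO fact_sales VALUES (1, 1000.0), (2, 600.0), (3, 300.0);
""")

# Reaching the category now takes two joins instead of one:
# fact -> product -> category.
snowflake_rows = conn.execute("""
    SELECT c.category_name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_category c ON c.category_id = p.category_id
    GROUP BY c.category_name
    ORDER BY c.category_name
""").fetchall()
print(snowflake_rows)
```

The category name is stored once per category rather than once per product, at the cost of an extra join in every query that needs it.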

Characteristics of Snowflake:

● The main benefit of the snowflake schema is that it uses less disk
space.
● It is easier to implement when a dimension is added to the schema.
● Query performance is reduced because of the multiple tables.
● The primary challenge with the snowflake schema is that it needs
more maintenance effort because of the larger number of lookup
tables.

Fact Constellations
A fact constellation is a schema used in online analytical processing.
It is a collection of multiple fact tables sharing dimension tables,
viewed as a collection of stars. It can be seen as an extension of the
star schema.

A fact constellation schema has multiple fact tables. It is also known
as a galaxy schema. It is a widely used schema and is more complex than
the star schema and the snowflake schema. A fact constellation schema
can be created by splitting an original star schema into several star
schemas. It has many fact tables and some common dimension tables.
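
A small sketch of two fact tables sharing one dimension table is given below. The tables (fact_sales, fact_shipping, dim_product) and their rows are hypothetical and chosen only to illustrate the idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One dimension table shared by two fact tables.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    amount     REAL
);
CREATE TABLE fact_shipping (
    product_id    INTEGER REFERENCES dim_product(product_id),
    units_shipped INTEGER
);
INSERT INTO dim_product VALUES (1, 'Laptop'), (2, 'Desk');
INSERT INTO fact_sales VALUES (1, 1000.0), (1, 500.0), (2, 300.0);
INSERT INTO fact_shipping VALUES (1, 3), (2, 1), (2, 4);
""")

# Each fact table is aggregated on its own and the results are lined up
# through the shared dimension. Joining both fact tables in one step
# would multiply rows and inflate the sums.
constellation_rows = conn.execute("""
    SELECT p.name,
           (SELECT SUM(s.amount) FROM fact_sales s
             WHERE s.product_id = p.product_id),
           (SELECT SUM(h.units_shipped) FROM fact_shipping h
             WHERE h.product_id = p.product_id)
    FROM dim_product p
    ORDER BY p.name
""").fetchall()
print(constellation_rows)
```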

Advantage:

Provides a flexible schema.

Disadvantage:

It is much more complex and hence, hard to implement and maintain.

Concept hierarchy, 3 Tier Architecture, ETL, Data Marting

Concept Hierarchy

A concept hierarchy represents a series of mappings from a set of
low-level concepts to higher-level, more general concepts. A concept
hierarchy organizes information or concepts in a hierarchical structure
or a specific partial order. It is used for expressing knowledge in
concise, high-level terms and makes it possible to mine knowledge at
several levels of abstraction.

A concept hierarchy includes a set of nodes organized in a tree, where
the nodes define values of an attribute known as concepts. A special
node, "ANY", is reserved for the root of the tree. Each node in a
concept hierarchy is assigned a level number. The level of the root
node is one. The level of a non-root node is one more than the level of
its parent.

Because values are defined by nodes, the levels of nodes can also be
used to describe the levels of values. A concept hierarchy enables raw
information to be managed at a higher and more generalized level of
abstraction.
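
The level numbering rule can be sketched in a few lines of Python. The location hierarchy used here (its place names and the parent dictionary) is a made-up example, not something taken from the notes.

```python
# Hypothetical location hierarchy stored as child -> parent links.
# "ANY" is the special root node.
parent = {
    "India": "ANY",
    "USA": "ANY",
    "Uttar Pradesh": "India",
    "California": "USA",
    "Lucknow": "Uttar Pradesh",
}


def level(node):
    """The root is level 1; any other node is one more than its parent."""
    return 1 if node == "ANY" else 1 + level(parent[node])


def generalize(node, target_level):
    """Climb parent links until the node sits at target_level."""
    while level(node) > target_level:
        node = parent[node]
    return node


print(level("ANY"), level("India"), level("Lucknow"))
print(generalize("Lucknow", 2))
```

Generalizing a raw value such as a city to level 2 replaces it with its country, which is how data can be viewed at a higher level of abstraction.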

There are several types of concept hierarchies, as follows:

Set-Grouping Hierarchy: A set-grouping hierarchy organizes values for a
given attribute or dimension into groups or ranges of constant values.
It is also known as an instance hierarchy because the partial order of
the hierarchy is defined on the set of instances or values of an
attribute. These hierarchies carry more functional meaning and are
therefore often preferred over other hierarchies.
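
As a rough illustration of set grouping, the sketch below maps raw ages to named ranges. The group names and the cut-off ages (20 and 60) are assumptions chosen for the example.

```python
from collections import Counter


def age_group(age):
    """Map a raw age to a higher-level group (hypothetical cut-offs)."""
    if age < 20:
        return "youth"
    if age < 60:
        return "middle_aged"
    return "senior"


ages = [15, 23, 37, 64, 19, 58]
group_counts = Counter(age_group(a) for a in ages)
print(group_counts)
```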

Schema Hierarchy: A schema hierarchy represents the total or partial
order between attributes in the database. It can define existing
semantic relationships between attributes. In a database, more than one
schema hierarchy can be generated by using multiple sequences and
groupings of attributes.

Operation-Derived Hierarchy: An operation-derived hierarchy is defined
by a set of operations on the data. These operations are specified by
users, experts, or the data mining system. These hierarchies are
usually defined for numerical attributes. Such operations can be as
simple as a range-value comparison or as complex as a data clustering
and data distribution analysis algorithm.
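
One simple operation of this kind is equal-width binning of a numerical attribute. The sketch below is only one possible operation, picked for illustration; the function name and sample values are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Assign each numeric value a bin index from 0 to n_bins - 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so everything falls in one bin.
        return [0] * len(values)
    # The maximum value would land exactly on the upper edge, so it is
    # clamped into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


prices = [10, 20, 35, 50, 90, 100]
price_bins = equal_width_bins(prices, 3)
print(price_bins)
```

The bin boundaries are derived from the data itself rather than being listed by hand.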

Rule-based Hierarchy: In a rule-based hierarchy, either a whole concept
hierarchy or a portion of it is defined by a set of rules and is
computed dynamically based on the current data and rule definitions. A
lattice-like structure is used for graphically representing this type
of hierarchy, in which each child-parent path is associated with a
generalization rule.

3 Tier Architecture

Data warehouses usually have a three-level (tier) architecture that
includes:
● Bottom Tier (Data Warehouse Server)
● Middle Tier (OLAP Server)
● Top Tier (Front-end Tools)

The bottom tier consists of the data warehouse server, which is almost
always an RDBMS. It may include several specialized data marts and a
metadata repository.

Data from operational databases and external sources (such as user
profile data provided by external consultants) is extracted using
application program interfaces called gateways. A gateway is provided
by the underlying DBMS and allows client programs to generate SQL code
to be executed at a server.

The middle tier consists of an OLAP server for fast querying of the
data warehouse.

The OLAP server is implemented using either:

(1) A Relational OLAP (ROLAP) model, i.e., an extended relational DBMS
that maps functions on multidimensional data to standard relational
operations.

(2) A Multidimensional OLAP (MOLAP) model, i.e., a special-purpose
server that directly implements multidimensional data and operations.
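
To make the ROLAP idea concrete, the sketch below maps a roll-up over chosen dimensions to a standard SQL GROUP BY. It is a toy illustration and not how any particular OLAP server is built; the sales table, its columns, and the roll_up helper are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (city TEXT, country TEXT, year INTEGER, amount REAL);
INSERT INTO sales VALUES
    ('Lucknow', 'India', 2022, 100.0),
    ('Pune',    'India', 2022,  70.0),
    ('Pune',    'India', 2023,  30.0),
    ('Boston',  'USA',   2023, 200.0);
""")

ALLOWED_DIMENSIONS = {"city", "country", "year"}


def roll_up(dimensions):
    """Translate a multidimensional roll-up into a relational GROUP BY."""
    # Column names cannot be bound as SQL parameters, so they are
    # checked against a fixed whitelist before being put in the query.
    if not dimensions or not set(dimensions) <= ALLOWED_DIMENSIONS:
        raise ValueError("unknown dimension")
    cols = ", ".join(dimensions)
    query = (
        f"SELECT {cols}, SUM(amount) FROM sales "
        f"GROUP BY {cols} ORDER BY {cols}"
    )
    return conn.execute(query).fetchall()


rolap_by_country = roll_up(["country"])
rolap_by_country_year = roll_up(["country", "year"])
print(rolap_by_country)
print(rolap_by_country_year)
```

Dropping a dimension from the list gives a coarser summary, and adding one gives a finer summary, using nothing beyond ordinary relational operations.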

The top tier contains front-end tools for displaying results provided
by OLAP, as well as additional tools for data mining of the
OLAP-generated data.

ETL:

ETL is a process in data warehousing, and it stands for Extract,
Transform and Load. It is a process in which an ETL tool extracts the
data from various data source systems, transforms it in the staging
area, and then finally loads it into the data warehouse system.

Extraction:

The first step of the ETL process is extraction. In this step, data is
extracted from various source systems, which can be in various formats
such as relational databases, NoSQL, XML, and flat files, into the
staging area. It is important to store the extracted data in the
staging area first and not directly in the data warehouse, because the
extracted data is in various formats and may also be corrupted. Loading
it directly into the data warehouse may damage the warehouse, and
rollback will be much more difficult. This makes extraction one of the
most important steps of the ETL process.

Transformation:

The second step of the ETL process is transformation. In this step, a
set of rules or functions is applied to the extracted data to convert
it into a single standard format. It may involve the following
processes/tasks:
● Filtering: Loading only certain attributes into the data warehouse.
● Cleaning: Filling up the NULL values with some default values,
mapping U.S.A, United States, and America into USA, etc.
● Joining: Joining multiple attributes into one.
● Splitting: Splitting a single attribute into multiple attributes.
● Sorting: Sorting tuples on the basis of some attribute (generally
the key attribute).
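
The five tasks listed above can be sketched in plain Python as follows. The staged records, their field names, and the "UNKNOWN" default are assumptions made up for this illustration.

```python
# Hypothetical staged rows, as they might look after extraction.
staged = [
    {"id": 2, "first": "Asha", "last": "Rao", "country": "United States",
     "dob": "1990-05-17", "note": "imported from CRM"},
    {"id": 1, "first": "Ben", "last": "Lee", "country": None,
     "dob": "1985-11-02", "note": "imported from CSV"},
]

COUNTRY_ALIASES = {"U.S.A": "USA", "United States": "USA", "America": "USA"}


def transform(rows):
    out = []
    for r in rows:
        # Cleaning: replace NULL with a default, then map spelling
        # variants to one canonical value.
        country = r["country"] or "UNKNOWN"
        country = COUNTRY_ALIASES.get(country, country)
        # Splitting: one date attribute becomes separate attributes.
        year, month, _day = r["dob"].split("-")
        out.append({
            "id": r["id"],
            # Joining: two name attributes become one.
            "full_name": f'{r["first"]} {r["last"]}',
            "country": country,
            "birth_year": int(year),
            "birth_month": int(month),
            # Filtering: the "note" attribute is not carried forward.
        })
    # Sorting: order tuples by the key attribute.
    return sorted(out, key=lambda row: row["id"])


transformed = transform(staged)
print(transformed)
```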

Loading:

The third and final step of the ETL process is loading. In this step,
the transformed data is finally loaded into the data warehouse.
Sometimes the data is loaded into the data warehouse very frequently,
and sometimes it is loaded at longer but regular intervals. The rate
and period of loading depend solely on the requirements and vary from
system to system.
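
One way to keep a failed load from leaving partial data in the warehouse is to load each batch inside a transaction. The sketch below uses sqlite3 as a stand-in for the warehouse; the warehouse_sales table and the load_batch helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE warehouse_sales (id INTEGER PRIMARY KEY, amount REAL NOT NULL)"
)
conn.commit()


def load_batch(conn, rows):
    """Load one batch atomically. Returns True if the batch was stored."""
    try:
        # Used as a context manager, the connection commits when the
        # block succeeds and rolls back when it raises.
        with conn:
            conn.executemany("INSERT INTO warehouse_sales VALUES (?, ?)", rows)
        return True
    except sqlite3.IntegrityError:
        return False


ok_first = load_batch(conn, [(1, 10.0), (2, 20.0)])
# The second row breaks the NOT NULL constraint, so the whole batch,
# including the valid row with id 3, is rolled back.
ok_second = load_batch(conn, [(3, 30.0), (4, None)])
loaded_count = conn.execute("SELECT COUNT(*) FROM warehouse_sales").fetchone()[0]
print(ok_first, ok_second, loaded_count)
```

Whether batches are loaded every few minutes or once a night, the same all-or-nothing rule can be applied per batch.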

Data Marting

A data mart is focused on a single functional area of an organization
and contains a subset of the data stored in a data warehouse. A data
mart is a condensed version of a data warehouse and is designed for use
by a specific department, unit, or set of users in an organization,
e.g., Marketing, Sales, HR, or Finance. It is often controlled by a
single department in an organization.

A data mart usually draws data from only a few sources compared to a
data warehouse. Data marts are small in size and are more flexible than
a data warehouse.
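
The idea of a mart as a department-specific subset can be sketched with a database view over a warehouse table. This is a simplification, since real data marts are often separate physical stores; the warehouse_facts table and the sales_mart view are hypothetical names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE warehouse_facts (department TEXT, metric TEXT, value REAL);
INSERT INTO warehouse_facts VALUES
    ('Sales',   'revenue',   1200.0),
    ('Sales',   'orders',      40.0),
    ('HR',      'headcount',   25.0),
    ('Finance', 'budget',    5000.0);

-- The mart exposes only the rows one department needs.
CREATE VIEW sales_mart AS
    SELECT metric, value FROM warehouse_facts WHERE department = 'Sales';
""")

mart_rows = conn.execute(
    "SELECT metric, value FROM sales_mart ORDER BY metric"
).fetchall()
print(mart_rows)
```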

There are three main types of data mart:

Dependent: Dependent data marts are created by drawing data from an
existing central data warehouse.

Independent: An independent data mart is created without the use of a
central data warehouse. It draws data directly from operational
sources, external sources, or both.

Hybrid: This type of data mart can take data from data warehouses or
operational systems.

Use of Data warehousing in Current Industry Scenario

Finance:

The application of data warehousing in the financial industry is much
the same as in the banking sector. The right solution helps financial
firms analyze customer expenses, enabling them to outline better
strategies to maximize profits at both ends.

Banking:

With the perfect Data Warehousing solution, bankers can manage all
their available resources more effectively. They can better analyze their
consumer data, government regulations, and market trends to facilitate
better decision-making.

Education:

The educational sector requires data warehousing to have a
comprehensive view of its student and faculty data. It gives
educational institutions access to real-time data feeds so they can
make sound, informed decisions.

Manufacturing & Distribution:

With an effective data warehousing solution, organizations in the
manufacturing & distribution sector can organize all their data under
one roof, predict market changes, analyze the latest trends, view
development areas, and finally make result-driven decisions.

Healthcare:

Another critical use of data warehouses is in the healthcare sector.
All the clinical, financial, and employee data is stored in the
warehouse, and analysis is run to derive valuable insights and allocate
resources in the best way possible.

Insurance:

In the insurance sector, data warehousing is used to maintain existing
customers' records and analyze them to spot client trends and bring
more customers to the business.

Services:

In the services sector, data warehousing is used for maintaining
customer details, financial records, and resources to analyze patterns
and boost decision-making for positive outcomes.

Retailing:

Retailers are the mediators between wholesalers and end customers, and
that is why they need to maintain the records of both parties. Data
warehousing helps them store this data in an organized manner.
