Data Mining Unit - 1 Notes

The document discusses data mining and data warehouses. It begins by defining a data warehouse as a central storage place for data from multiple sources used for reporting and decision making. It then discusses how data is loaded from operational systems into the data warehouse and used for analytical reporting and decision making. The rest of the document outlines the delivery process for a data warehouse, including identifying requirements, creating a technical blueprint, building an initial version, loading historical data, setting up query tools, expanding scope, and evolving requirements. It concludes by comparing key differences between database systems and data warehouses.

Uploaded by

Ashwathy MN
Copyright
© All Rights Reserved

DATA MINING

A Data Warehouse consists of data from multiple heterogeneous data sources and is used for analytical
reporting and decision making. A Data Warehouse is a central place where data from different data
sources and applications is stored. The term Data Warehouse was first coined by Bill Inmon in 1990. A
Data Warehouse is always kept separate from an Operational Database.

The data in a DW system is loaded from operational transaction systems like −

Sales, Marketing, HR, SCM, etc.

It may pass through an operational data store or other transformations before it is loaded into the DW
system for information processing.

A Data Warehouse is used for reporting and analysis of information and stores both historical and
current data. The data in a DW system is used for analytical reporting, which is then used by Business
Analysts, Sales Managers or Knowledge Workers for decision-making.
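The extract-transform-load flow described above can be sketched in code. This is a minimal illustrative sketch, not a real ETL tool's API: the record layout, the field names (order_id, amount, region), and the in-memory "warehouse" list are all hypothetical.

```python
# Hypothetical sketch of the extract-transform-load (ETL) flow into a
# data warehouse; the source records and field names are made up.

def extract(operational_rows):
    """Pull raw records from an operational system (e.g. Sales)."""
    return list(operational_rows)

def transform(rows):
    """Clean and reshape records into the warehouse format."""
    out = []
    for row in rows:
        out.append({
            "order_id": row["id"],
            "amount": round(float(row["amount"]), 2),
            "region": row["region"].strip().upper(),
        })
    return out

def load(warehouse, rows):
    """Append the transformed records to the warehouse store."""
    warehouse.extend(rows)

warehouse = []  # stand-in for the DW system's storage
sales_source = [
    {"id": 1, "amount": "19.99", "region": " north "},
    {"id": 2, "amount": "5.50", "region": "south"},
]
load(warehouse, transform(extract(sales_source)))
print(warehouse[0]["region"])  # NORTH
```

In a real project the transform step would also pass through an operational data store and handle deduplication and error records, which this sketch omits.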
Data warehousing delivery process

A data warehouse is never static; it evolves as the business expands. As the business evolves, its
requirements keep changing, and therefore a data warehouse must be designed to adapt to these
changes. Hence a data warehouse system needs to be flexible.

Ideally there should be a delivery process to deliver a data warehouse. However, data warehouse
projects normally suffer from various issues that make it difficult to complete tasks and deliverables in
the strict, ordered fashion demanded by the waterfall method. Most of the time, the requirements
are not understood completely. The architectures, designs, and build components can be completed
only after gathering and studying all the requirements.

Delivery Method

The delivery method is a variant of the joint application development approach adopted for the delivery
of a data warehouse. We have staged the data warehouse delivery process to minimize risks. The
approach that we will discuss here does not reduce the overall delivery time-scales but ensures the
business benefits are delivered incrementally through the development process.

Note: The delivery process is broken into phases to reduce the project and delivery risk.

IT Strategy

Data warehouses are strategic investments that require a business process to generate benefits. An IT
strategy is required to procure and retain funding for the project.

Business Case
The objective of the business case is to estimate the business benefits that should be derived from using
a data warehouse. These benefits may not be quantifiable, but the projected benefits need to be clearly
stated. If a data warehouse does not have a clear business case, then the business tends to suffer from
credibility problems at some stage during the delivery process. Therefore, in data warehouse projects,
we need to understand the business case for investment.

Education and Prototyping

Organizations experiment with the concept of data analysis and educate themselves on the value of
having a data warehouse before settling on a solution. This is addressed by prototyping, which helps in
understanding the feasibility and benefits of a data warehouse. A small-scale prototyping activity
can promote the educational process as long as:

The prototype addresses a defined technical objective.

The prototype can be thrown away after the feasibility concept has been shown.

The activity addresses a small subset of eventual data content of the data warehouse.

The activity timescale is non-critical.

The following points are to be kept in mind to produce an early release and deliver business benefits.

Identify the architecture that is capable of evolving.

Focus on business requirements and technical blueprint phases.

Limit the scope of the first build phase to the minimum that delivers business benefits.

Understand the short-term and medium-term requirements of the data warehouse.

Business Requirements

To provide quality deliverables, we should make sure the overall requirements are understood. If we
understand the business requirements for both short-term and medium-term, then we can design a
solution to fulfil short-term requirements. The short-term solution can then be grown to a full solution.

The following aspects are determined in this stage:

The business rules to be applied to the data.
The logical model for information within the data warehouse.
The query profiles for the immediate requirement.
The source systems that provide this data.
Technical Blueprint

This phase needs to deliver an overall architecture satisfying the long-term requirements. It also
delivers the components that must be implemented in the short term to derive any business benefit. The
blueprint needs to identify the following:

The overall system architecture.

The data retention policy.

The backup and recovery strategy.

The server and data mart architecture.

The capacity plan for hardware and infrastructure.

The components of database design.

Building the version

In this stage, the first production deliverable is produced. This production deliverable is the smallest
component of a data warehouse that adds business benefit.

History Load

This is the phase where the remainder of the required history is loaded into the data warehouse. In this
phase, we do not add new entities, but additional physical tables would probably be created to store
increased data volumes.

Let us take an example. Suppose the build version phase has delivered a retail sales analysis data
warehouse with 2 months' worth of history. This information allows the user to analyze only the
recent trends and address the short-term issues. The user in this case cannot identify annual and
seasonal trends. To enable this, the last 2 years' sales history could be loaded from the archive. The
40 GB of data is then extended to 400 GB.

Note: The backup and recovery procedures may become complex, therefore it is recommended to
perform this activity within a separate phase.

Ad hoc Query

In this phase, we configure an ad hoc query tool that is used to query the data warehouse. These tools
can generate the database queries.
Note: It is recommended not to use these access tools when the database is being substantially
modified.

Automation

In this phase, operational management processes are fully automated. These would include:

Transforming the data into a form suitable for analysis.

Monitoring query profiles and determining appropriate aggregations to maintain system performance.

Extracting and loading data from different source systems.

Generating aggregations from predefined definitions within the data warehouse.

Backing up, restoring, and archiving the data.
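One of the automated processes listed above, generating aggregations from predefined definitions, can be sketched as follows. The definitions, the grouping columns, and the sales figures are all hypothetical, used only to illustrate the idea:

```python
# Hypothetical sketch of generating aggregations from predefined
# definitions, one of the automated warehouse management processes.
from collections import defaultdict

rows = [
    {"region": "north", "month": "Jan", "revenue": 120.0},
    {"region": "north", "month": "Feb", "revenue": 80.0},
    {"region": "south", "month": "Jan", "revenue": 200.0},
]

# Predefined aggregation definitions: name -> grouping column.
definitions = {"revenue_by_region": "region", "revenue_by_month": "month"}

def run_aggregations(rows, definitions):
    """Build each predefined aggregate by summing revenue per group."""
    results = {}
    for name, column in definitions.items():
        totals = defaultdict(float)
        for row in rows:
            totals[row[column]] += row["revenue"]
        results[name] = dict(totals)
    return results

aggs = run_aggregations(rows, definitions)
print(aggs["revenue_by_region"])  # {'north': 200.0, 'south': 200.0}
```

In a real system such a job would be scheduled, and the definitions would be chosen by monitoring query profiles, as the text describes.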

Extending Scope

In this phase, the data warehouse is extended to address a new set of business requirements. The scope
can be extended in two ways:

By loading additional data into the data warehouse.

By introducing new data marts using the existing information.

Note: This phase should be performed separately, since it involves substantial efforts and complexity.

Requirements Evolution

From the perspective of the delivery process, the requirements are always changeable; they are not
static. The delivery process must support this and allow these changes to be reflected within the system.

This issue is addressed by designing the data warehouse around the use of data within business
processes, as opposed to the data requirements of existing queries.

The architecture is designed to change and grow to match the business needs. The process operates as a
pseudo-application development process, where new requirements are continually fed into the
development activities and partial deliverables are produced. These partial deliverables are fed back
to the users and then reworked, ensuring that the overall system is continually updated to meet the
business needs.

Difference between Database System and Data Warehouse:

Database System                              Data Warehouse
-------------------------------------------  ------------------------------------------
Supports operational processes.              Supports analysis and performance reporting.
Captures and maintains the data.             Explores the data.
Current data.                                Multiple years of history.
Data is balanced within the scope of         Data must be integrated and balanced
this one system.                             from multiple systems.
Data is updated when a transaction occurs.   Data is updated by scheduled processes.
Data verification occurs when entry          Data verification occurs after the fact.
is done.
100 MB to GB.                                100 GB to TB.
ER based.                                    Star/Snowflake.
Application oriented.                        Subject oriented.
Primitive and highly detailed.               Summarized and consolidated.
Flat relational.                             Multidimensional.

Multidimensional Data Model of Data Warehouse

The multidimensional data model holds data in the shape of a data cube. Data warehouses are often
served by two- or three-dimensional cubes. A data cube requires various measures of data to be
interpreted. Dimensions are the entities with respect to which an organization needs to keep records. For
example, dimensions allow a store to keep track of things such as monthly sales of items, and the
branches and locations at which the items were sold.

A multidimensional database makes it possible to rapidly and reliably provide data-related answers to
complicated business questions. The multidimensional data model can be defined as a way to arrange the
data in the database, to help structure and organize the contents of the database. The multidimensional
data model can include two or three dimensions of objects from the database structure, versus a
one-dimensional system such as a list.

In organisations, it is usually used for objective findings and report production, which can serve as the
primary source for important decision-making processes. Usually, this model is extended to applications
working with OLAP (Online Analytical Processing) techniques.

How does the Multidimensional Data Model work?

The Multidimensional Data Model, like every other system, often operates based on preset steps, to
preserve the same pattern across the industry and to allow the database structures already built to be
reused. Any project should go all the way through the steps below to construct a multidimensional
data model.

Congregating the requirements from the client

Categorizing the various modules of the system

Spotting the various dimensions based on which the system needs to be designed

Drafting the real-time dimensions and the corresponding properties

Discovering the facts from the already listed dimensions and their properties

Constructing the Schema to place the data, for the information gathered from the above steps
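The later steps above can be sketched in code. In this hypothetical example for a retail system, the dimensions, their properties, and the facts (all names are assumptions) are listed, and a simple star-style schema layout is derived from them:

```python
# Hypothetical sketch of the later steps: spotting dimensions, drafting
# their properties, discovering facts, and constructing a schema layout.

dimensions = {                     # step: spotting the dimensions
    "time":   ["day", "month", "year"],
    "item":   ["item_name", "brand", "type"],
    "branch": ["branch_name", "branch_code"],
}
facts = ["units_sold", "revenue"]  # step: discovering the facts

def build_schema(dimensions, facts):
    """Construct fact/dimension table layouts from the gathered lists."""
    schema = {f"dim_{name}": ["id"] + props
              for name, props in dimensions.items()}
    # The fact table holds one foreign key per dimension plus the measures.
    schema["fact_sales"] = [f"{name}_id" for name in dimensions] + facts
    return schema

schema = build_schema(dimensions, facts)
print(schema["fact_sales"])
# ['time_id', 'item_id', 'branch_id', 'units_sold', 'revenue']
```

The output of this step would then feed the physical design, where each entry becomes an actual table.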

Advantages of Multidimensional Data Model

The following are the key advantages of the Multidimensional Data Model:

Unlike basic one-dimensional database structures, Multi-Dimensional Data Models are workable on
complex systems and applications.

The modularity of this type of database makes maintenance tasks easier for teams with lower
bandwidth.

Overall, the Multi-Dimensional Data Model's operational capability and structural description help to
maintain clearer and more accurate data in the database.

Its clearly specified data positioning keeps it uncomplicated in circumstances such as one team
constructing the database, another team working on it, and some other team working on maintenance.
If and when necessary, it serves as a method of self-learning.

Disadvantages of Multidimensional Data Model

The following are the key disadvantages of the Multidimensional Data Model:

These types of databases are usually dynamic in design, since the Multi-Dimensional Data Model
manages complicated structures.

Being a dynamic structure means the content of the database is often immense in quantity. This makes
the system particularly risky where there is a lack of confidentiality.

The performance of the system can be significantly impaired by caching during operations on the
Multi-Dimensional Data Model.

While the final result of a Multi-Dimensional Data Model is advantageous, much of the time the road to
achieving it is complicated.

Data Cube
A data cube in a data warehouse is a multidimensional structure used to store data. The data cube was
initially planned for OLAP tools that could easily access multidimensional data, but it can also be used
for data mining.

A data cube represents the data in terms of dimensions and facts and is used to represent
aggregated data. Data cubes fall into two main kinds: multidimensional data cubes and relational data
cubes.

Let us take an example: suppose we have data about AllElectronics sales. We can store the sales
data along many perspectives or dimensions, such as sales across all time periods, all branches, all
locations, and all items. The figure below shows the data cube for AllElectronics sales.
Each dimension has a dimension table which contains a further description of that dimension. For
example, a branch dimension may have branch_name, branch_code, branch_address, etc.

A multidimensional data model like the data cube is always based on a theme, which is termed a fact. In
the above example of the AllElectronics data set, we have stored data based on the sales of electronic
items; so here the fact is sales. A fact has a fact table associated with it.
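As an illustrative sketch (the sales figures, quarters, and branch names are made up, not from the AllElectronics figure), a tiny cube over the time and branch dimensions can be aggregated in plain Python, one cube cell per combination of dimension values:

```python
# Hypothetical sketch of data cube cells: each cell holds the sales
# measure summed over one combination of dimension values.
from collections import defaultdict

rows = [
    {"quarter": "Q1", "branch": "B1", "item": "phone", "sales": 100},
    {"quarter": "Q1", "branch": "B2", "item": "phone", "sales": 150},
    {"quarter": "Q2", "branch": "B1", "item": "tv",    "sales": 200},
    {"quarter": "Q2", "branch": "B2", "item": "tv",    "sales": 250},
]

def aggregate(rows, dims):
    """Sum the sales measure over the chosen subset of dimensions,
    producing one cell of the data cube per dimension combination."""
    cells = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dims)
        cells[key] += row["sales"]
    return dict(cells)

# Cells of the (quarter, branch) cuboid:
print(aggregate(rows, ["quarter", "branch"]))
# Rolling up to the quarter-only cuboid:
print(aggregate(rows, ["quarter"]))  # {('Q1',): 250, ('Q2',): 450}
```

Dropping a dimension from the grouping, as in the second call, is the roll-up operation that OLAP tools perform on the cube.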

What is a Star Schema?

A Star Schema in a data warehouse has one fact table at the center of the star and a number of
associated dimension tables. It is known as a star schema because its structure resembles a star. The
Star Schema data model is the simplest type of Data Warehouse schema. It is also known as the Star
Join Schema and is optimized for querying large data sets.

In the following Star Schema example, the fact table is at the center, containing keys to every
dimension table, like Dealer_ID, Model_ID, Date_ID, Product_ID and Branch_ID, and other attributes
like units sold and revenue.
Characteristics of Star Schema:

• Every dimension in a star schema is represented with only one dimension table.
• The dimension table contains a set of attributes.
• Each dimension table is joined to the fact table using a foreign key.
• The dimension tables are not joined to each other.
• The fact table contains keys and measures.
• The star schema is easy to understand and provides optimal disk usage.
• The dimension tables are not normalized. For instance, in the above figure, Country_ID
does not have a Country lookup table as an OLTP design would have.
• The schema is widely supported by BI tools.
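A minimal star-schema sketch can be built with Python's built-in sqlite3 module. The table and column names here (dim_product, fact_sales) are assumptions for illustration, not from the figure above; the point is the fact table joined to a dimension table by a foreign key:

```python
import sqlite3

# Hypothetical star schema: one fact table joined to one dimension
# table by a foreign key, as the characteristics above describe.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        product_name TEXT
    );
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        units_sold INTEGER,
        revenue REAL
    );
""")
con.execute("INSERT INTO dim_product VALUES (1, 'Laptop')")
con.execute("INSERT INTO fact_sales VALUES (1, 3, 2999.97)")

# A typical star join: the fact table joined to its dimension table.
row = con.execute("""
    SELECT p.product_name, SUM(f.units_sold)
    FROM fact_sales f JOIN dim_product p USING (product_id)
    GROUP BY p.product_name
""").fetchone()
print(row)  # ('Laptop', 3)
```

A full star schema would add the remaining dimension tables (date, branch, and so on), each joined to the fact table the same way.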

What is a Snowflake Schema?


Snowflake Schema in data warehouse is a logical arrangement of tables in a multidimensional
database such that the ER diagram resembles a snowflake shape. A Snowflake Schema is an
extension of a Star Schema, and it adds additional dimensions. The dimension tables are
normalized which splits data into additional tables.

In the following Snowflake Schema example, Country is further normalized into an individual
table.

Example of Snowflake Schema


Characteristics of Snowflake Schema:

• The main benefit of the snowflake schema is that it uses smaller disk space.
• It is easier to implement when a dimension is added to the schema.
• Query performance is reduced due to the multiple tables.
• The primary challenge that you will face while using the snowflake schema is that you
need to perform more maintenance because of the larger number of lookup tables.
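As a hedged sketch of the normalization the snowflake schema performs (table and column names are hypothetical), a store dimension can be split so that country details live in their own lookup table, using sqlite3:

```python
import sqlite3

# Hypothetical snowflake sketch: the store dimension is normalized so
# that country details are moved into their own lookup table.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_country (
        country_id INTEGER PRIMARY KEY,
        country_name TEXT
    );
    CREATE TABLE dim_store (
        store_id INTEGER PRIMARY KEY,
        store_name TEXT,
        country_id INTEGER REFERENCES dim_country(country_id)
    );
""")
con.execute("INSERT INTO dim_country VALUES (44, 'UK')")
con.execute("INSERT INTO dim_store VALUES (1, 'London Store', 44)")

# Resolving a store's country now needs an extra join -- the trade-off
# noted above: smaller disk space, but reduced query performance.
row = con.execute("""
    SELECT s.store_name, c.country_name
    FROM dim_store s JOIN dim_country c USING (country_id)
""").fetchone()
print(row)  # ('London Store', 'UK')
```

In the star schema version, country_name would simply be a column of dim_store, avoiding the join at the cost of repeated values.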

Fact Constellation in Data Warehouse modelling


Fact Constellation is a schema for representing a multidimensional model. It is a collection of
multiple fact tables that share some common dimension tables. It can be viewed as a collection of
several star schemas and hence is also known as a Galaxy Schema. It is one of the widely used
schemas for data warehouse design and is much more complex than the star and snowflake
schemas. Complex systems require fact constellations.

Process Architecture
The process architecture defines an architecture in which the data from the data warehouse is
processed for a particular computation.
Following are the two fundamental process architectures:
Centralized Process Architecture
In this architecture, the data is collected into a single centralized store and processed upon
completion by a single machine with a huge structure in terms of memory, processor, and
storage.
Centralized process architecture evolved with transaction processing and is well suited for small
organizations with one location of service. It requires minimal resources from both people and
system perspectives.
It is very successful when the collection and consumption of data occur at the same location.
Distributed Process Architecture
In this architecture, information and its processing are allocated across data centers; processing is
localized within each data center, and the results are grouped into centralized storage. Distributed
architectures are used to overcome the limitations of centralized process architectures, where all the
information needs to be collected at one central location and results are available in one central
location.

There are several architectures of the distributed process:


Client-Server
In this architecture, the client does all the information collection and presentation, while the server
does the processing and management of data.
Three-tier Architecture
With client-server architecture, the client machines need to be connected to a server machine,
thus mandating finite states and introducing latencies and overhead in terms of the records to be
carried between clients and servers.
Three-Tier Data Warehouse Architecture

Data Warehouses usually have a three-level (tier) architecture that includes:


Bottom Tier (Data Warehouse Server)
Middle Tier (OLAP Server)
Top Tier (Front end Tools).
A bottom tier that consists of the Data Warehouse server, which is almost always an RDBMS. It
may include several specialized data marts and a metadata repository.
Data from operational databases and external sources (such as user profile data provided by
external consultants) is extracted using application program interfaces called gateways. A
gateway is provided by the underlying DBMS and allows client programs to generate SQL
code to be executed at a server.
Examples of gateways include ODBC (Open Database Connectivity) and OLE-DB (Object Linking
and Embedding for Databases), by Microsoft, and JDBC (Java Database Connectivity).
A Data Mart is a subset of an organizational information store, generally oriented to a specific purpose or primary data
subject, which may be distributed to support business needs. Data Marts are analytical record stores designed to focus
on particular business functions for a specific community within an organization. Data marts are derived from subsets
of data in a data warehouse, though in the bottom-up data warehouse design methodology, the data warehouse is
created from the union of organizational data marts.

The fundamental use of a data mart is Business Intelligence (BI) applications. BI is used to gather, store, access, and
analyze records. Data marts can be used by smaller businesses to utilize the data they have accumulated, since a data
mart is less expensive than implementing a data warehouse.

Reasons for creating a data mart

o Creates collective data for a group of users

o Easy access to frequently needed data

o Ease of creation

o Improves end-user response time

o Lower cost than implementing a complete data warehouse

o Potential users are more clearly defined than in a comprehensive data warehouse

o Contains only essential business data and is less cluttered

Types of Data Marts

There are mainly two approaches to designing data marts. These approaches are

o Dependent Data Marts

o Independent Data Marts

Dependent Data Marts

A dependent data mart is a logical subset of a physical subset of a larger data warehouse. According to this technique,
the data marts are treated as subsets of a data warehouse. In this technique, firstly a data warehouse is created, from
which various data marts can then be created. These data marts are dependent on the data warehouse and extract the
essential records from it. In this technique, as the data warehouse creates the data marts, there is no need for
data mart integration. It is also known as the top-down approach.

Independent Data Marts

The second approach is Independent Data Marts (IDM). Here, firstly independent data marts are created, and then a
data warehouse is designed using these multiple independent data marts. In this approach, as all the data marts are
designed independently, the integration of data marts is required. It is also termed the bottom-up
approach, as the data marts are integrated to develop a data warehouse.
Other than these two categories, one more type exists that is called "Hybrid Data Marts."

Hybrid Data Marts

It allows us to combine input from sources other than a data warehouse. This can be helpful in many situations,
especially when ad hoc integrations are needed, such as after a new group or product is added to the organization.

Steps in Implementing a Data Mart

The significant steps in implementing a data mart are to design the schema, construct the physical storage, populate
the data mart with data from source systems, access it to make informed decisions, and manage it over time. So, the
steps are:

Designing

The design step is the first in the data mart process. This phase covers all of the functions from initiating the request
for a data mart through gathering data about the requirements and developing the logical and physical design of the
data mart.

It involves the following tasks:

1. Gathering the business and technical requirements

2. Identifying data sources

3. Selecting the appropriate subset of data

4. Designing the logical and physical architecture of the data mart.

Constructing

This step contains creating the physical database and logical structures associated with the data mart to provide fast
and efficient access to the data.

It involves the following tasks:

1. Creating the physical database and logical structures such as tablespaces associated with the data mart.

2. Creating the schema objects, such as tables and indexes, described in the design step.

3. Determining how best to set up the tables and access structures.

Populating

This step includes all of the tasks related to the getting data from the source, cleaning it up, modifying it to the right
format and level of detail, and moving it into the data mart.

It involves the following tasks:


1. Mapping data sources to target data structures

2. Extracting data

3. Cleansing and transforming the information.

4. Loading data into the data mart

5. Creating and storing metadata

Accessing

This step involves putting the data to use: querying the data, analyzing it, creating reports, charts and graphs and
publishing them.

It involves the following tasks:

1. Set up an intermediate layer (meta layer) for the front-end tool to use. This layer translates database
operations and object names into business terms so that end users can interact with the data mart
using words which relate to the business functions.

2. Set up and manage database structures, like summarized tables, which help queries submitted through the front-
end tools execute rapidly and efficiently.

Managing

This step involves managing the data mart over its lifetime. In this step, the management functions performed include:

1. Providing secure access to the data.

2. Managing the growth of the data.

3. Optimizing the system for better performance.

4. Ensuring the availability of data even in the event of system failures.
