Data Mining Unit - 1 Notes
A Data Warehouse consists of data from multiple heterogeneous data sources and is used for analytical
reporting and decision making. A Data Warehouse is a central place where data from different data
sources and applications is stored. The term Data Warehouse was coined by Bill Inmon in 1990. A
Data Warehouse is always kept separate from an Operational Database.
Data may pass through an operational data store or other transformations before it is loaded into the
DW system for information processing.
A Data Warehouse is used for reporting and analysis of information and stores both historical and
current data. The data in a DW system is used for analytical reporting, which is later used by Business
Analysts, Sales Managers, or Knowledge Workers for decision-making.
Data Warehousing Delivery Process
A data warehouse is never static; it evolves as the business expands. As the business evolves, its
requirements keep changing, and a data warehouse must therefore be designed to keep pace with these
changes. Hence a data warehouse system needs to be flexible.
Ideally, there should be a delivery process for delivering a data warehouse. However, data warehouse
projects normally suffer from various issues that make it difficult to complete tasks and deliverables in
the strict, ordered fashion demanded by the waterfall method. Most of the time, the requirements
are not understood completely. The architectures, designs, and build components can be completed
only after gathering and studying all the requirements.
Delivery Method
The delivery method is a variant of the joint application development approach adopted for the delivery
of a data warehouse. We have staged the data warehouse delivery process to minimize risks. The
approach that we will discuss here does not reduce the overall delivery time-scales but ensures the
business benefits are delivered incrementally through the development process.
Note: The delivery process is broken into phases to reduce the project and delivery risk.
IT Strategy
Data warehouses are strategic investments that require a business process to generate benefits. An IT
strategy is required to procure and retain funding for the project.
Business Case
The objective of the business case is to estimate the business benefits that should be derived from using
a data warehouse. These benefits may not be quantifiable, but the projected benefits need to be clearly
stated. If a data warehouse does not have a clear business case, then the business tends to suffer from
credibility problems at some stage during the delivery process. Therefore, in data warehouse projects,
we need to understand the business case for investment.
Education and Prototyping
Organizations experiment with the concept of data analysis and educate themselves on the value of
having a data warehouse before settling on a solution. This is addressed by prototyping, which helps in
understanding the feasibility and benefits of a data warehouse. Prototyping on a small scale can
promote the educational process as long as:
• The prototype can be thrown away after the feasibility concept has been shown.
• The activity addresses a small subset of the eventual data content of the data warehouse.
The following point should be kept in mind to produce an early release and deliver business benefits:
• Limit the scope of the first build phase to the minimum that delivers business benefits.
Business Requirements
To provide quality deliverables, we should make sure the overall requirements are understood. If we
understand the business requirements for both the short term and the medium term, then we can
design a solution that fulfils the short-term requirements and can be grown into a full solution.
Technical Blueprint
This phase needs to deliver an overall architecture blueprint satisfying the long-term requirements. It
also needs to identify the components that must be implemented in the short term to derive business
benefit.
Build the Version
In this stage, the first production deliverable is produced. This production deliverable is the smallest
component of the data warehouse that adds business benefit.
History Load
This is the phase where the remainder of the required history is loaded into the data warehouse. In this
phase, we do not add new entities, but additional physical tables would probably be created to store
increased data volumes.
Let us take an example. Suppose the build version phase has delivered a retail sales analysis data
warehouse with 2 months' worth of history. This information allows the user to analyze only recent
trends and address short-term issues. The user in this case cannot identify annual and seasonal trends.
To help the user do so, the last 2 years' sales history could be loaded from the archive. The 40GB of
data is then extended to 400GB.
Note: The backup and recovery procedures may become complex; therefore, it is recommended to
perform this activity within a separate phase.
Ad hoc Query
In this phase, we configure an ad hoc query tool that is used to operate a data warehouse. These tools
can generate the database query.
Note: It is recommended not to use these access tools when the database is being substantially
modified.
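As a rough sketch of what "generating the database query" means here, the snippet below builds a simple aggregate SQL statement from a user's dimension and measure selections. The table and column names (sales_fact, units_sold, and so on) are illustrative assumptions, not part of these notes.

```python
# Hypothetical sketch of how an ad hoc query tool might generate SQL
# from a user's dimension/measure selections.

def build_query(measures, dimensions, fact_table="sales_fact"):
    """Generate a simple aggregate query from user selections."""
    select_cols = dimensions + [f"SUM({m}) AS total_{m}" for m in measures]
    return (
        f"SELECT {', '.join(select_cols)} "
        f"FROM {fact_table} "
        f"GROUP BY {', '.join(dimensions)}"
    )

print(build_query(["units_sold"], ["branch_id", "item_id"]))
# SELECT branch_id, item_id, SUM(units_sold) AS total_units_sold FROM sales_fact GROUP BY branch_id, item_id
```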
Automation
In this phase, operational management processes are fully automated. These would include:
• Monitoring query profiles and determining appropriate aggregations to maintain system performance.
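As a minimal sketch of such monitoring, the snippet below counts how often each dimension combination appears in a query log and flags the frequent ones as candidates for pre-computed aggregations. The log format and the threshold are assumptions made for this sketch.

```python
# Scan a (made-up) query log for frequently grouped dimension
# combinations; frequent combinations are candidates for aggregation.
from collections import Counter

query_log = [
    ("branch", "item"),
    ("branch", "item"),
    ("location", "time"),
    ("branch", "item"),
]

profile = Counter(query_log)
for dims, hits in profile.most_common():
    if hits >= 3:  # arbitrary threshold for this sketch
        print(f"Candidate aggregate table over {dims}: {hits} queries")
```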
Extending Scope
In this phase, the data warehouse is extended to address a new set of business requirements. The scope
can be extended in two ways: by loading additional data into the data warehouse, or by introducing new
data marts using the existing information.
Note: This phase should be performed separately, since it involves substantial effort and complexity.
Requirements Evolution
From the perspective of the delivery process, the requirements are always changeable; they are not
static. The delivery process must support this and allow these changes to be reflected within the system.
This issue is addressed by designing the data warehouse around the use of data within business
processes, as opposed to the data requirements of existing queries.
The architecture is designed to change and grow to match the business needs. The process operates as a
pseudo-application development process, in which new requirements are continually fed into the
development activities and partial deliverables are produced. These partial deliverables are fed back
to the users and then reworked, ensuring that the overall system is continually updated to meet the
business needs.
Multidimensional Data Model
A warehouse schema may be ER-based or dimensional (star/snowflake); the following sections focus on
the dimensional approach.
The multidimensional data model holds data in the shape of a data cube; data warehousing often serves
two- or three-dimensional cubes. A data cube requires various measures of data to be interpreted.
Dimensions are the entities about which an organization needs to hold information. For example,
dimensions allow a store to keep track of things such as monthly item purchases, and the branches and
locations recorded in the store's sales records.
A multidimensional database makes it possible to provide rapid and reliable data-driven answers to
complicated business questions. The Multidimensional Data Model can be defined as a way to arrange
the data in the database that helps structure and organize its contents. The Multidimensional Data
Model can include two or three dimensions of objects from the database structure, as opposed to a
one-dimensional system such as a list.
In organizations, it is usually used for analytical findings and report production, which can serve as the
primary input for important decision-making processes. Usually, this model is applied in systems
working with OLAP (Online Analytical Processing) techniques.
The Multidimensional Data Model, like every other system, operates based on predefined steps, both to
preserve a consistent pattern across the industry and to allow database structures already built or
developed to be reused. Any project should go through the steps below to construct a multidimensional
data model.
1. Spotting the various dimensions based on which the system needs to be designed.
2. Discovering the facts from the already listed dimensions and their properties.
3. Constructing the schema to place the data gathered from the above steps.
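A minimal sketch of these three steps for a retail example is shown below. All dimension, fact, and column names are assumptions, chosen to match the AllElectronics-style example discussed later in these notes.

```python
# Step 1: spot the dimensions the analysis needs.
dimensions = ["time", "branch", "location", "item"]

# Step 2: discover the facts (numeric measures) tied to those dimensions.
facts = ["units_sold", "revenue"]

# Step 3: construct a simple schema description from steps 1 and 2.
schema = {
    "fact_table": {
        "measures": facts,
        "foreign_keys": [f"{d}_id" for d in dimensions],
    },
    "dimension_tables": {d: [f"{d}_id", f"{d}_name"] for d in dimensions},
}

from pprint import pprint
pprint(schema)
```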
Advantages of the Multidimensional Data Model
• Unlike basic one-dimensional database structures, multidimensional data models are workable for
complex systems and applications.
• The modularity of this type of database makes maintenance easier, even for teams with limited
bandwidth for such tasks.
• Overall, the model's operational capability and structural description help to maintain clearer and
more accurate data in the database.
• Because data positioning is explicitly specified, work is uncomplicated in circumstances such as one
team constructing the database, another team working on it, and some other team handling
maintenance.
• If and when necessary, it serves as a method of self-learning.
Disadvantages of the Multidimensional Data Model
• These databases are usually dynamic in design, since the Multidimensional Data Model manages
complicated structures.
• Being dynamic, the content of the database is often immense in quantity, which makes the system
particularly risky where confidentiality is not well protected.
• System performance can be significantly impaired by the caching required for operations on the
Multidimensional Data Model.
• While the final result of a Multidimensional Data Model is advantageous, much of the time the road
to achieving it is complicated.
Data Cube
A data cube in a data warehouse is a multidimensional structure used to store data. The data cube was
originally designed for OLAP tools that could easily access multidimensional data, but it can also be
used for data mining.
A data cube represents the data in terms of dimensions and facts and is used to represent aggregated
data. Data cubes are basically categorized into two main kinds: the multidimensional data cube and the
relational data cube.
Let us take an example: suppose we have data about AllElectronics sales. Here we can store the sales
data along many perspectives or dimensions, such as sales across all time periods, sales at all branches,
sales at all locations, and sales of all items.
Each dimension has a dimension table containing a further description of that dimension. For example,
a branch dimension may have branch_name, branch_code, branch_address, and so on.
A multidimensional data model like the data cube is always based on a theme, which is termed the fact.
In the AllElectronics data set above, the data is stored based on sales of electronic items, so the fact
here is sales. A fact has a fact table associated with it.
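To make the idea of aggregated data concrete, the sketch below sums the sales fact over every subset of three dimensions, which is essentially what the cuboids of a data cube store. The records and dimension names are invented for illustration.

```python
# Cube-style aggregation: each subset of dimensions is one cuboid,
# from the 0-D apex (grand total) up to the 3-D base cuboid.
from itertools import combinations
from collections import defaultdict

records = [
    {"time": "Q1", "branch": "B1", "item": "phone", "sales": 100},
    {"time": "Q1", "branch": "B2", "item": "tv",    "sales": 250},
    {"time": "Q2", "branch": "B1", "item": "phone", "sales": 150},
]
dims = ("time", "branch", "item")

for k in range(len(dims) + 1):
    for cuboid in combinations(dims, k):
        totals = defaultdict(int)
        for r in records:
            key = tuple(r[d] for d in cuboid)
            totals[key] += r["sales"]
        print(cuboid, dict(totals))
```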
Star Schema
In a Star Schema, the center of the star holds one fact table with a number of associated dimension
tables. It is known as a star schema because its structure resembles a star. The Star Schema data model
is the simplest type of Data Warehouse schema. It is also known as the Star Join Schema and is
optimized for querying large data sets.
In a typical Star Schema example, the fact table is at the center and contains keys to every dimension
table, such as Dealer_ID, Model ID, Date_ID, Product_ID, and Branch_ID, along with attributes like units
sold and revenue.
Characteristics of Star Schema:
• Every dimension in a star schema is represented by only one dimension table.
• The dimension table contains a set of attributes.
• The dimension table is joined to the fact table using a foreign key.
• The dimension tables are not joined to each other.
• The fact table contains keys and measures.
• The star schema is easy to understand and provides optimal disk usage.
• The dimension tables are not normalized. For instance, Country_ID would not have a separate
Country lookup table, as an OLTP design would.
• The schema is widely supported by BI tools.
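As an illustrative (not prescriptive) sketch of these characteristics, the Python snippet below builds a tiny star schema in SQLite, with a fact table joined to each dimension table by a foreign key. Apart from Dealer_ID and Product_ID, which are named above, the table and column names are assumptions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_dealer  (Dealer_ID INTEGER PRIMARY KEY, dealer_name TEXT);
CREATE TABLE dim_product (Product_ID INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE fact_sales (
    Dealer_ID  INTEGER REFERENCES dim_dealer(Dealer_ID),
    Product_ID INTEGER REFERENCES dim_product(Product_ID),
    units_sold INTEGER,
    revenue    REAL
);
INSERT INTO dim_dealer  VALUES (1, 'North'), (2, 'South');
INSERT INTO dim_product VALUES (10, 'TV'), (11, 'Phone');
INSERT INTO fact_sales  VALUES (1, 10, 5, 2500.0), (2, 11, 3, 900.0);
""")

# Star join: the fact table joins directly to each dimension table.
for row in con.execute("""
    SELECT d.dealer_name, p.product_name, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_dealer  d ON d.Dealer_ID  = f.Dealer_ID
    JOIN dim_product p ON p.Product_ID = f.Product_ID
    GROUP BY d.dealer_name, p.product_name
"""):
    print(row)
```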
Snowflake Schema
In a Snowflake Schema, dimensions are further normalized into individual tables; for example, Country
is split out of a location dimension into its own lookup table.
• The main benefit of the snowflake schema is that it uses smaller disk space.
• It is easier to implement when a dimension is added to the schema.
• Query performance is reduced due to the joins across multiple tables.
• The primary challenge you will face while using the snowflake schema is that more maintenance
effort is needed because of the larger number of lookup tables.
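The hypothetical sketch below shows the snowflake variation of the star example above: the Country attribute is normalized out of the store dimension into its own lookup table, which saves space but adds a join. All names are illustrative.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE lookup_country (Country_ID INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE dim_store (
    Store_ID   INTEGER PRIMARY KEY,
    store_name TEXT,
    Country_ID INTEGER REFERENCES lookup_country(Country_ID)
);
CREATE TABLE fact_sales (Store_ID INTEGER, revenue REAL);
INSERT INTO lookup_country VALUES (1, 'India');
INSERT INTO dim_store VALUES (100, 'Store A', 1);
INSERT INTO fact_sales VALUES (100, 500.0);
""")

# The extra hop fact -> dim_store -> lookup_country is what slows queries,
# but it avoids repeating the country text on every store row.
print(con.execute("""
    SELECT c.country, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_store s      ON s.Store_ID  = f.Store_ID
    JOIN lookup_country c ON c.Country_ID = s.Country_ID
    GROUP BY c.country
""").fetchall())
```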
Process Architecture
The process architecture defines how the data in the data warehouse is processed for a particular
computation.
Following are the two fundamental process architectures:
Centralized Process Architecture
In this architecture, the data is collected into a single centralized store and processed upon
completion by a single machine with huge capacity in terms of memory, processor, and
storage.
Centralized process architecture evolved with transaction processing and is well suited for small
organizations with one location of service. It requires minimal resources from both people and
system perspectives.
It is very successful when the collection and consumption of data occur at the same location.
Distributed Process Architecture
In this architecture, information and its processing are distributed across data centers: processing of
the data is localized, and the results are grouped into centralized storage. Distributed architectures are
used to overcome the limitations of centralized process architectures, where all the information needs
to be collected at one central location and results are available in one central location.
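A toy sketch of this distributed pattern, under the assumption that each data center can aggregate its own records locally before shipping only the partial results to central storage, is given below. The data and names are made up.

```python
from collections import Counter

data_centers = {
    "dc_east": [("phone", 3), ("tv", 1), ("phone", 2)],
    "dc_west": [("tv", 4), ("phone", 1)],
}

def local_aggregate(records):
    """Localized processing: aggregate within one data center."""
    totals = Counter()
    for item, qty in records:
        totals[item] += qty
    return totals

central = Counter()                       # centralized storage of results
for name, records in data_centers.items():
    central.update(local_aggregate(records))

print(dict(central))                      # {'phone': 6, 'tv': 5}
```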
Data Mart
The fundamental use of a data mart is for Business Intelligence (BI) applications. BI is used to gather,
store, access, and analyze records. A data mart can be used by smaller businesses to utilize the data they
have accumulated, since it is less expensive than implementing a data warehouse and is comparatively
easy to create.
There are mainly two approaches to designing data marts: dependent and independent data marts.
A dependent data mart is a logical subset or a physical subset of a larger data warehouse. According to this technique,
the data marts are treated as subsets of a data warehouse: a data warehouse is created first, from which
various data marts can then be created. These data marts depend on the data warehouse and extract the
essential records from it. In this technique, because the data warehouse creates the data marts, there is no need for
data mart integration. It is also known as a top-down approach.
The second approach uses Independent data marts (IDM). Here, independent data marts are created first, and then a
data warehouse is designed using these multiple independent data marts. In this approach, as all the data marts are
designed independently, the integration of data marts is required. It is also termed a bottom-up
approach, since the data marts are integrated to develop a data warehouse.
Other than these two categories, one more type exists, called "Hybrid Data Marts."
A hybrid data mart allows us to combine input from sources other than a data warehouse. This can be helpful in many
situations, especially when ad hoc integrations are needed, such as after a new group or product is added to the
organization.
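As a minimal sketch of the dependent (top-down) approach described above, the snippet below carves a departmental data mart out of an existing warehouse table in SQLite. The table and column names are assumptions for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE warehouse_sales (dept TEXT, region TEXT, revenue REAL);
INSERT INTO warehouse_sales VALUES
    ('electronics', 'north', 100.0),
    ('clothing',    'north',  40.0),
    ('electronics', 'south',  70.0);
""")

# The mart is just the subset one team needs; no integration step is
# required because the warehouse is the single source.
con.execute("""
    CREATE TABLE mart_electronics AS
    SELECT region, revenue FROM warehouse_sales
    WHERE dept = 'electronics'
""")
print(con.execute("SELECT * FROM mart_electronics").fetchall())
```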
The significant steps in implementing a data mart are to design the schema, construct the physical storage, populate
the data mart with data from source systems, access it to make informed decisions, and manage it over time. So, the
steps are:
Designing
The design step is the first in the data mart process. This phase covers all of the functions from initiating the request
for a data mart through gathering data about the requirements and developing the logical and physical design of the
data mart.
Constructing
This step involves creating the physical database and the logical structures associated with the data mart to provide
fast and efficient access to the data.
1. Creating the physical database and logical structures, such as tablespaces, associated with the data mart.
2. Creating the schema objects, such as tables and indexes, described in the design step.
Populating
This step includes all of the tasks related to getting data from the source, cleaning it up, modifying it to the right
format and level of detail, and moving it into the data mart.
Accessing
This step involves putting the data to use: querying it, analyzing it, creating reports, charts, and graphs, and
publishing them.
1. Set up an intermediate layer (meta layer) for the front-end tool to use. This layer translates database
operations and object names into business terms so that end users can interact with the data mart
using words that relate to the business functions.
2. Set up and manage database structures, like summarized tables, which help queries submitted through the front-
end tools execute rapidly and efficiently.
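A small sketch of the summarized-table idea in point 2 is shown below: the aggregation is computed once, so front-end queries read a small summary table instead of scanning the fact table each time. All names are illustrative.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fact_sales (branch TEXT, item TEXT, revenue REAL);
INSERT INTO fact_sales VALUES
    ('B1', 'phone', 100.0), ('B1', 'tv', 250.0), ('B2', 'phone', 80.0);
-- Pre-aggregate once; front-end queries then hit the summary table.
CREATE TABLE summary_branch AS
    SELECT branch, SUM(revenue) AS total_revenue
    FROM fact_sales GROUP BY branch;
""")

print(con.execute("SELECT * FROM summary_branch").fetchall())
```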
Managing
This step involves managing the data mart over its lifetime. Typical management functions include providing secure
access to the data, managing the growth of the data, optimizing the system for better performance, and ensuring the
availability of data.