
Subject: Data Mining and Warehousing

Chapter 3: Data Warehouse Data Modelling

Presented to: Ma'am Umber Saba

Presented by: Ruqiya Yousaf

Introduction:

Data warehouse (DW) data modeling is the process of designing and creating
a conceptual representation of data for analytical and reporting purposes. It involves
structuring data in a way that optimizes query performance, data integrity, and user
understanding.

Goals of Data Warehouse Data Modeling:


1. Support business intelligence and analytics.
2. Provide a single, unified view of data.
3. Improve data quality and consistency.
4. Enhance query performance.
5. Simplify data navigation.

Key Components of Data Warehouse Data Modeling:


1. Facts: Measurable events or transactions.
2. Dimensions: Contextual information (e.g., date, location).
3. Measures: Quantifiable values (e.g., sales, revenue).
4. Grain: Level of detail (e.g., daily, monthly).
5. Schema: Overall structure (e.g., star, snowflake).

Data Warehouse Data Modeling Techniques:


1. Star Schema: Central fact table surrounded by dimension tables.
2. Snowflake Schema: Star schema in which dimension tables are normalized into sub-dimension tables.
3. Fact-Constellation Schema: Multiple fact tables sharing dimensions.
4. Dimensional Modeling: Focus on business processes and metrics.

Benefits of Effective Data Warehouse Data Modeling:

1. Faster query performance.
2. Improved data quality.
3. Enhanced business insights.
4. Simplified data maintenance.
5. Better decision-making.

What is data modeling?


Data modeling is the process of creating a visual representation of either a whole
information system or parts of it to communicate connections between data points and
structures.
Types of data models:
Like any design process, database and information system design begins at a high level
of abstraction and becomes increasingly concrete and specific. Data models can
generally be divided into three categories, which vary according to their degree of
abstraction. The process will start with a conceptual model, progress to a logical model
and conclude with a physical model. Each type of data model is discussed in more
detail in subsequent sections:
1:-Conceptual data models
They are also referred to as domain models and offer a big-picture view of what the
system will contain, how it will be organized, and which business rules are involved.
Conceptual models are usually created as part of the process of gathering initial project
requirements. Typically, they include entity classes (defining the types of things that are
important for the business to represent in the data model), their characteristics and
constraints, the relationships between them, and relevant security and data integrity
requirements. The notation used is typically simple.
2:-Logical data models
They are less abstract and provide greater detail about the concepts and relationships
in the domain under consideration. One of several formal data modeling notation
systems is followed. These indicate data attributes, such as data types and their
corresponding lengths, and show the relationships among entities. Logical data models
don’t specify any technical system requirements. This stage is frequently omitted in
agile or DevOps practices. Logical data models can be useful in highly procedural
implementation environments, or for projects that are data-oriented by nature, such as
data warehouse design or reporting system development.
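As a small illustration, a logical model's entities, typed attributes, and relationships might be written down as in the sketch below. The Customer and Order entities are illustrative assumptions, not part of the text above:

```python
from dataclasses import dataclass
from datetime import date

# Logical-model sketch: entities with typed attributes and a
# relationship between them, but no DBMS-specific storage details yet.
@dataclass
class Customer:
    customer_id: int
    name: str          # might become VARCHAR(100) at the physical stage
    email: str

@dataclass
class Order:
    order_id: int
    customer_id: int   # relationship: each order belongs to a customer
    order_date: date
    total: float
```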

3:-Physical data models


They provide a schema for how the data will be physically stored within a database. As
such, they’re the least abstract of all. They offer a finalized design that can be
implemented as a relational database, including associative tables that illustrate the
relationships among entities as well as the primary keys and foreign keys that will be
used to maintain those relationships. Physical data models can include database
management system (DBMS)-specific properties, including performance tuning.

Depending on the complexity of the database and the problem to be solved, either the
star schema or the snowflake schema can be developed.
Before we go further, let us understand two concepts— fact table and dimensional table.
The fact table is the central table in a star schema that stores information for analysis. It
is often surrounded by a number of tables. These tables are known as dimensional
tables.
Star Schema
The star schema is made up of a fact table and dimensional tables. The dimensional
tables are linked to the fact table. The dimensional tables are built based on problems to
be solved and they represent different aspects or perspectives of the data (e.g., time,
product, location). Consider the following star schema:

The Sales table is the fact table, and the Products, Location, and Time tables are the
dimensional tables. This schema was designed because sales made at various
locations, at various times, and for different products need to be ascertained. With it,
queries can be carried out and relevant findings can be made.
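To make this concrete, here is a minimal sketch of such a star schema using Python's built-in sqlite3 module. The table and column names are illustrative assumptions, not a prescribed design:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
-- Dimension tables: each describes one perspective of the data.
CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT, price REAL);
CREATE TABLE location (location_id INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);

-- Fact table: one row per sale, linked to every dimension by foreign key.
CREATE TABLE sales (
    sale_id     INTEGER PRIMARY KEY,
    product_id  INTEGER REFERENCES products(product_id),
    location_id INTEGER REFERENCES location(location_id),
    time_id     INTEGER REFERENCES time_dim(time_id),
    quantity    INTEGER,
    revenue     REAL
);
""")

# A typical analytical query: total revenue per city and month.
query = """
SELECT l.city, t.month, SUM(s.revenue) AS total_revenue
FROM sales s
JOIN location l ON s.location_id = l.location_id
JOIN time_dim t ON s.time_id = t.time_id
GROUP BY l.city, t.month;
"""
for row in cur.execute(query):
    print(row)
```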

Snowflake Schema
Consider a case where we need to know not only the products sold but also the
categories the products belong to, and subsequently the subcategories. We would then
need to extend the Product table further: tables for category names and subcategory
names would be created, aiding in the classification of the products. In this case, the
schema would be restructured accordingly.

Simply put, the snowflake schema is an extension of the star schema. In this case, the
dimension tables are further restructured or normalized into sub-dimensions in order to
achieve desired goals.
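As a sketch of this normalization, the Product dimension from the earlier example might be split out as follows, again using Python's sqlite3 module; table and column names are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
CREATE TABLE categories (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE subcategories (
    subcategory_id   INTEGER PRIMARY KEY,
    category_id      INTEGER REFERENCES categories(category_id),
    subcategory_name TEXT
);
-- The Product dimension now references a subcategory rather than
-- storing category information in its own columns (normalization).
CREATE TABLE products (
    product_id     INTEGER PRIMARY KEY,
    subcategory_id INTEGER REFERENCES subcategories(subcategory_id),
    name  TEXT,
    price REAL
);
""")

# Reaching a category name from a sale now takes extra joins:
# sales -> products -> subcategories -> categories.
```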

Key differences between Star Schema and Snowflake Schema:


1:-The star schema has dimensional tables directly connected to the fact table while in
the snowflake schema, the dimensional tables have further extensions, and not all the
tables are directly connected to the fact table.
2:-The star schema is simpler to understand than the snowflake schema.
3:-The star schema is generally denormalized, meaning all the attributes of a dimension
are kept in a single table. The snowflake schema, on the other hand, is a normalized
structure.

Multidimensional data model:


The Multi Dimensional Data Model allows users to pose analytical questions
associated with market or business trends, unlike relational databases, which give
users access to data in the form of record-level queries. Because the data is organized
along several dimensions, users receive answers to their requests comparatively fast.

OLAP (online analytical processing) and data warehousing use multidimensional
databases, which present multiple dimensions of the data to users.
The model represents data in the form of data cubes. A data cube allows the data to be
modeled and viewed from many dimensions and perspectives. It is defined by
dimensions and facts and is represented by a fact table. Facts are numerical measures,
and fact tables contain the measures of the related dimensional tables or the names of
the facts.
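As a small illustration, a slice of a data cube can be emulated with the pandas library (an assumption here, not something the text prescribes): each pivot aggregates the sales measure along chosen dimensions.

```python
import pandas as pd

# Toy fact data: each row is a sale with three dimensions and one measure.
sales = pd.DataFrame({
    "product":  ["pen", "pen", "book", "book"],
    "location": ["Lahore", "Karachi", "Lahore", "Karachi"],
    "month":    ["Jan", "Jan", "Feb", "Feb"],
    "revenue":  [100, 150, 300, 250],
})

# One "slice" of the cube: revenue by product and location.
cube_slice = sales.pivot_table(
    values="revenue", index="product", columns="location", aggfunc="sum"
)
print(cube_slice)

# Rolling up a dimension: total revenue per month.
print(sales.groupby("month")["revenue"].sum())
```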

[Figure: multidimensional data represented as a data cube]

Working on a Multidimensional Data Model:-

The following stages should be followed by every project when building a Multi
Dimensional Data Model:
Stage 1 : Gathering data requirements from the client : In the first stage, the correct and
complete data requirements are collected from the client. Typically, software
professionals explain to the client the range of data that can be obtained with the
selected technology and gather the requirements in detail.
Stage 2 : Grouping different segments of the system : In the second stage, all the data
is recognized and classified into the respective sections it belongs to, which makes the
model straightforward to apply step by step.
Stage 3 : Identifying the dimensions : The third stage forms the basis on which the
design of the system rests. Here the main factors are recognized from the user's point
of view. These factors are known as "dimensions".
Stage 4 : Preparing the factors and their respective qualities : In the fourth stage, the
factors recognized in the previous step are used to identify their related qualities. These
qualities are known as "attributes" in the database.
Stage 5 : Identifying the facts among the listed factors and their qualities : In the fifth
stage, the measurable facts are separated from the factors collected so far. These facts
play a significant role in the arrangement of a Multi Dimensional Data Model.
Stage 6 : Building the schema to hold the data, based on the information collected in
the steps above : In the sixth stage, a schema is built on the basis of the data collected
previously.

Ralph Kimball's Four-Step Process for Data Warehouse Design:

Step 1: Select the Organizational Process:


1. Identify business processes (e.g., sales, inventory, customer management).
2. Determine key performance indicators (KPIs).
3. Understand business rules and requirements.
4. Choose processes with measurable outcomes.

Step 2: Declare the Grain:


1. Define the level of detail (e.g., daily, weekly, monthly).
2. Determine the fact table's granularity.
3. Identify the smallest unit of analysis.
4. Balance detail and aggregation.

Step 3: Identify Dimensions:


1. Determine relevant dimensions (e.g., date, customer, location).
2. Define dimension attributes (e.g., date: year, month, day).
3. Establish dimension hierarchies (e.g., date → month → quarter).
4. Consider slowly changing dimensions.

Step 4: Identify Facts:


1. Identify measurable events (e.g., sales transactions).
2. Define fact tables and measures (e.g., revenue, quantity).
3. Determine fact table granularity (matches grain).
4. Consider aggregated facts.
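Taken together, the outcome of the four steps can be recorded as a small design summary. Below is a minimal sketch for a hypothetical retail sales process; all names are illustrative assumptions, not part of Kimball's text:

```python
# Hypothetical outcome of Kimball's four steps for a retail sales process.
design = {
    # Step 1: the organizational process being modeled.
    "process": "retail sales",
    # Step 2: the grain -- what one fact row represents.
    "grain": "one row per product per sales transaction (line item)",
    # Step 3: the dimensions that give each fact its context.
    "dimensions": ["date", "product", "store", "customer"],
    # Step 4: the measurable facts captured at that grain.
    "facts": ["quantity_sold", "revenue", "discount_amount"],
}

for step, decision in design.items():
    print(f"{step}: {decision}")
```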

Additional Tips:
1. Involve business stakeholders.
2. Document requirements.
3. Use data profiling.
4. Iterate and refine.
Kimball's Four-Step Process ensures data warehouses meet business needs.

Slowly Changing Dimensions


What is a slowly changing dimension?

A Slowly Changing Dimension (SCD) is a dimension that stores and manages both
current and historical data over time in a data warehouse. It is considered and
implemented as one of the most critical ETL tasks in tracking the history of dimension
records.

There are three types of SCDs and you can use Warehouse Builder to define, deploy,
and load all three types of SCDs.

What are three types of SCDs?

The three types of SCDs are:

Type 1 SCDs (Overwriting): In a Type 1 SCD, the new data overwrites the existing data.
Thus the existing data is lost, as it is not stored anywhere else. This is the default type of
dimension you create; you do not need to specify any additional information to create a
Type 1 SCD.

Type 2 SCDs (Historical tracking): A Type 2 SCD retains the full history of values.
When the value of a chosen attribute changes, the current record is closed. A new
record is created with the changed data values and this new record becomes the
current record. Each record contains the effective time and expiration time to identify the
time period between which the record was active.

Type 3 SCDs (Current and previous values): A Type 3 SCD stores two versions of
values for certain selected level attributes. Each record stores the previous value and
the current value of the selected attribute. When the value of any of the selected
attributes changes, the current value is stored as the old value and the new value
becomes the current value.

Differentiate between Type I, II, and III:-
Type 1 – This model involves overwriting the old current value with the new current
value. No history is maintained.
Type 2 – The current and the historical records are kept and maintained in the same file
or table.
Type 3 – The current data and historical data are kept in the same record.

Example:
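Since the diagram is not reproduced here, the following plain-Python sketch illustrates the Type 2 behaviour instead (field names are illustrative assumptions): when a tracked attribute changes, the current record is closed and a new current record is opened.

```python
from datetime import date

# A Type 2 dimension: history kept as multiple rows per customer,
# each with an effective and an expiration date.
dimension = [
    {"customer_id": 1, "city": "Lahore",
     "effective": date(2020, 1, 1), "expiration": None},  # current record
]

def scd_type2_update(rows, customer_id, new_city, change_date):
    """Close the current record and open a new one (Type 2 SCD)."""
    for row in rows:
        if row["customer_id"] == customer_id and row["expiration"] is None:
            if row["city"] == new_city:
                return  # nothing changed, nothing to track
            row["expiration"] = change_date  # close the old record
    rows.append({"customer_id": customer_id, "city": new_city,
                 "effective": change_date, "expiration": None})

scd_type2_update(dimension, 1, "Karachi", date(2023, 6, 1))
for row in dimension:
    print(row)
# The Lahore row is now closed; the Karachi row is current.
# (Type 1 would simply overwrite "city"; Type 3 would keep a
#  previous_city column in the same record instead.)
```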


Basics of computer storage:-
In computer storage, data organization significantly impacts performance, scalability,
and efficiency. There are two primary methods for organizing data in databases:
row-oriented and column-oriented storage.

1:-Row data store


Definition:
Stores data as complete records (rows) in a single unit.
Structure:
Each row contains all attributes of a single record.
Advantages:
Fast for operations involving entire records, making it ideal for
transaction-oriented applications.
Easier to manage and update single records.
Use Cases in Data Warehousing:
Transactional Systems: Useful for ETL processes where data is frequently updated
(e.g., sales transactions).
Operational Reporting: Queries that require complete row retrieval.
Example:
Traditional RDBMS like MySQL, PostgreSQL, and Oracle.
2:-Column data store:
Definition:
Stores data as individual columns, grouping all values of a specific attribute together.
Structure:
Each column is stored separately, allowing efficient access to specific attributes.
Advantages:
Optimized for read-heavy operations, especially for aggregations and analytical queries.
Better compression and performance for large datasets due to similar data types being
stored together.
Use Cases in Data Warehousing:
Analytics and Reporting: Suitable for OLAP scenarios where complex queries often
aggregate large volumes of data (e.g., sales analysis, trend analysis).
Data Mining: Queries that require access to specific attributes across many records.
Example:
Columnar databases like Amazon Redshift, Google BigQuery, and ClickHouse.
Categorizing Scenarios
When considering the data warehouse data model, you can categorize scenarios
based on the nature of the workload:

Transactional Workloads (OLTP)

Best Fit: Row Stores


Examples: Banking transactions, online retail, or any scenario requiring frequent
updates and complete record access.

Analytical Workloads (OLAP)

Best Fit: Column Stores


Examples: Business intelligence reporting, data analysis, and data mining tasks that
aggregate large datasets and require fast read access.

Summary
In a data warehouse context, the choice between row and column data stores
depends on the specific use case:
Row Stores are optimal for applications where complete records are frequently
accessed and updated.
Column Stores shine in environments where analytical queries dominate, allowing for
efficient data retrieval and processing across large datasets.
This distinction is crucial for designing efficient data warehouse architectures that meet
performance and scalability needs.

Why is column store faster?


A columnar database stores data grouped by columns rather than by rows, optimizing
performance for analytical queries. Each column contains data of the same type,
allowing for efficient compression. And because a query needs to access only the
relevant columns, the design enhances data retrieval.
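A toy sketch in plain Python shows why: in the columnar layout, summing one attribute touches a single contiguous list, while the row layout must walk every full record. The sample records are illustrative.

```python
# The same three records in the two layouts.
row_store = [
    ("pen",  "Lahore",  100),
    ("book", "Karachi", 300),
    ("pen",  "Karachi", 150),
]
column_store = {
    "product":  ["pen", "book", "pen"],
    "location": ["Lahore", "Karachi", "Karachi"],
    "revenue":  [100, 300, 150],
}

# Row store: an aggregate must read every full record.
total = sum(record[2] for record in row_store)

# Column store: the same aggregate reads only the revenue column,
# a contiguous run of same-typed values that also compresses well.
total = sum(column_store["revenue"])
print(total)  # 550
```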
