
Department of Computer Science and Engineering: Rajalakshmi Institute of Technology


RAJALAKSHMI INSTITUTE OF TECHNOLOGY

(An Autonomous Institution)


Kuthambakkam Post, Chennai – 600124

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

2023-24 Even Semester

Sub Code: CCS341


Subject Name: DATA WAREHOUSING
Semester/Year: VI/III

UNIT 3: Metadata, DataMart and Partition Strategy

(R2021)
RIT CCS341 - DATA WAREHOUSING 1

UNIT-III
METADATA, DATAMART AND PARTITION STRATEGY

Metadata – Categories of Metadata – Role of Metadata – Metadata Repository – Challenges for Metadata
Management – Data Mart – Need of Data Mart – Cost Effective Data Mart – Designing Data Marts – Cost
of Data Marts – Partitioning Strategy – Vertical Partition – Normalization – Row Splitting – Horizontal
Partition

Metadata Definitions


Here is a sample list of definitions:
Data about the data
Table of contents for the data
Catalog for the data
Data warehouse atlas
Data warehouse roadmap
Data warehouse directory
Glue that holds the data warehouse contents together
Tongs to handle the data
The nerve center


Role of Metadata

Challenges for Metadata Management


Although metadata is so vital in a data warehouse environment, seamlessly integrating all the parts of
metadata is a formidable task. Industry-wide standardization is far from being a reality. Metadata created by
a process at one end cannot be viewed through a tool used at another end without going through
convoluted transformations. These challenges force many data warehouse developers to abandon the
requirements for proper metadata management.

Here are the major challenges to be addressed while providing metadata:

• Each software tool has its own proprietary metadata. If you are using several tools in your
data warehouse, how can you reconcile the formats?
• No industry-wide accepted standards exist for metadata formats.
• There are conflicting claims on the advantages of a centralized metadata repository as
opposed to a collection of fragmented metadata stores.
• There are no easy and accepted methods of passing metadata along the processes as data
moves from the source systems to the staging area and thereafter to the data warehouse
storage.
• Preserving version control of metadata uniformly throughout the data warehouse is
tedious and difficult.
• In a large data warehouse with numerous source systems, unifying the metadata relating
to the data sources can be an enormous task. You have to deal with conflicting
standards, formats, data naming conventions, data definitions, attributes, values,
business rules, and units of measure. You have to resolve indiscriminate use of aliases
and compensate for inadequate data validation rules.

Metadata Repository
Think of a metadata repository as a general-purpose information directory or cataloguing device to
classify, store, and manage metadata. As we have seen earlier, business metadata and technical metadata
serve different purposes. The end-users need the business metadata; data warehouse developers and
administrators require the technical metadata.
The structures of these two categories of metadata also vary. Therefore, the metadata repository can be
thought of as two distinct information directories, one to store business metadata and the other to store
technical metadata. This division may also be logical within a single physical repository.
The following Figure shows the typical contents in a metadata repository. Notice the division between
business and technical metadata. Did you also notice another component called the information navigator?
This component is implemented in different ways in commercial offerings. The functions of the
information navigator include the following:


• Interface from query tools. This function attaches data warehouse data to third-party query tools so that
metadata definitions inside the technical metadata may be viewed from these tools.
• Drill-down for details. The user of metadata can drill down and proceed from one level of metadata to a
lower level for more information. For example, you can first get the definition of a data table, then go to
the next level to see all its attributes, and go further to get the details of individual attributes.
• Review predefined queries and reports. The user is able to review predefined queries and reports, and
launch the selected ones with proper parameters.
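
The drill-down function can be pictured as walking a nested catalog one level at a time. The following is a minimal Python sketch, assuming a hypothetical in-memory repository; the table name, attribute names, and the `drill_down` helper are invented for illustration and do not come from any repository product.

```python
# Hypothetical metadata catalog: table -> attributes -> attribute details.
catalog = {
    "sales_fact": {
        "definition": "One row per product sold per store per day",
        "attributes": {
            "qty": {"type": "INTEGER", "business_name": "Units sold"},
            "value": {"type": "DECIMAL", "business_name": "Sale amount"},
        },
    },
}

def drill_down(repo, table, attribute=None):
    """Return metadata one level deeper depending on what is asked for."""
    entry = repo[table]
    if attribute is None:
        # Table level: the definition plus the names of its attributes.
        return {"definition": entry["definition"],
                "attributes": sorted(entry["attributes"])}
    # Attribute level: the details of a single attribute.
    return entry["attributes"][attribute]
```

Calling `drill_down(catalog, "sales_fact")` gives the table definition and its attribute names; adding an attribute name moves one level lower.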

A centralized metadata repository accessible from all parts of the data warehouse for your end users,
developers, and administrators appears to be an ideal solution for metadata management. But for a
centralized metadata repository to be the best solution, the repository must meet some basic requirements.
Let us quickly review these requirements. It is not easy to find a repository tool that satisfies every one of
the requirements listed below.


• Flexible organization. Allow the data administrator to classify and organize metadata into logical
categories and subcategories, and assign specific components of metadata to the classifications.
• Historical. Use versioning to maintain the historical perspective of the metadata.
• Integrated. Store business and technical metadata in formats meaningful to all types of users.
• Good compartmentalization. Able to separate and store logical and physical database models.
• Analysis and look-up capabilities. Capable of browsing all parts of metadata and also navigating
through the relationships.
• Customizable. Able to create customized views of metadata for individual groups of users and to
include new metadata objects as necessary.
• Maintain descriptions and definitions. View metadata in both business and technical terms.
• Standardization of naming conventions. Flexibility to adopt any type of naming convention and
standardize throughout the metadata repository.
• Synchronization. Keep metadata synchronized within all parts of the data warehouse environment
and with the related external systems.
• Open. Support metadata exchange between processes via industry-standard interfaces and be
compatible with a large variety of tools.

Selection of a suitable metadata repository product is one of the key decisions the project team must
make. Use the above list of criteria as a guide while evaluating repository tools for your data warehouse.

What Is A Data Mart?

A data mart is a small portion of the data warehouse that is mainly related to a particular business
domain, such as marketing or sales.

The data stored in a DW system is huge, so data marts are designed with a subset of the data that belongs
to individual departments. A specific group of users can then easily use this data for their analysis.

Unlike a data warehouse, which serves many different groups of users, each data mart has a particular set
of end users. The smaller number of end users results in better response time.

Data marts are also accessible to business intelligence (BI) tools. Data marts do not contain duplicated
or unused data, and they are updated at regular intervals. They are subject-oriented and flexible databases.
Each team has the right to develop and maintain its own data marts without modifying the data in the data
warehouse or in other data marts.

A data mart is more suitable for small businesses because it costs much less than a data warehouse system.
The time required to build a data mart is also less than the time required to build a data warehouse.


Pictorial representation of Multiple Data Marts:

When Do We Need Data Mart?


Plan and design a data mart for your department only when it is needed, and engage the
stakeholders, because the operational cost of a data mart can sometimes be high.

Consider the following reasons to build a data mart:

• If you want to partition the data according to a user access control strategy.
• If a particular department wants to see query results much faster instead of scanning the huge DW
data.
• If a department wants its data to be built on other hardware or software platforms.
• If a department wants its data to be designed in a manner that is suitable for its tools.

Cost-Effective Data Mart:


A cost-effective data mart can be built with the following steps:

• Identify the Functional Splits: Divide the organization's data into data specific to each
(departmental) data mart so that each mart meets its requirements without any further organizational
dependency.
• Identify User Access Tool Requirements: Different user access tools in the market may need
different data structures. Data marts can support all these internal structures without disturbing
the DW data. One data mart can be associated with one tool, as per the user needs. Data marts can
also provide updated data to such tools daily.
• Identify Access Control Issues: If different data segments in a DW system need privacy and
should be accessed only by a set of authorized users, then all such data can be moved into data marts.


Cost of Data Mart:

The cost of a data mart can be estimated as follows:

• Hardware and Software Cost: Any newly added data mart may need extra hardware, software,
processing power, network, and disk storage space to work on the queries requested by the end
users. This makes data marting an expensive strategy, so the budget should be planned
precisely.
• Network Access: If the location of the data mart is different from that of the data warehouse,
then all the data must be transferred during the data mart loading process. A network that can
transfer huge volumes of data must therefore be provided, which may be expensive.
• Time Window Constraints: The time taken by the data mart loading process depends on
various factors, such as the complexity and volume of the data, network capacity, and data transfer
mechanisms.

Comparison Of Data Warehouse Vs Data Mart

S.No | Data Warehouse | Data Mart
1 | Complex and costs more to implement. | Simple and cheaper to implement.
2 | Works at the organization level for the entire business. | The scope is limited to a particular department.
3 | Querying the DW is difficult for business users because of huge data dependencies. | Querying the data mart is easy for business users because of limited data.
4 | Implementation takes longer, possibly months or years. | Implementation takes less time, possibly days, weeks, or months.
5 | Gathers data from various external source systems. | Gathers data from a few centralized DW, internal, or external source systems.
6 | Strategic decisions can be made. | Business decisions can be made.

Types Of Data Marts

Data marts are classified into three types: dependent, independent, and hybrid. This classification
is based on how they are populated, that is, either from a data warehouse or from other data
sources.

Extraction, Transformation, and Transportation (ETT) is the process used to populate a data mart
from any source system.

#1) Dependent Data Mart


In a dependent data mart, data is sourced from the existing data warehouse itself. This is a top-down
approach because the restructured portion of data that goes into the data mart is extracted from the
centralized data warehouse.


A data mart can use DW data either logically or physically, as described below:
• Logical view: In this scenario, the data mart's data is not physically separated from the DW. It
refers to the DW data logically through virtual views or tables.
• Physical subset: In this scenario, the data mart's data is physically separated from the DW. Once one
or more data marts are developed, you can allow the users to access only the data marts or to access
both the data marts and the data warehouse.

ETT is a simplified process in the case of dependent data marts because the usable data already
exists in the centralized DW. The accurate set of summarized data just needs to be moved to the
respective data marts.
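
The difference between a logical view and a physical subset can be sketched with Python's built-in `sqlite3` module. This is only an illustration under assumed names: the `dw_sales` table, its columns, and the two mart objects are hypothetical, and a real warehouse would use its own DBMS and load process.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dw_sales (dept TEXT, product TEXT, amount REAL)")
con.executemany(
    "INSERT INTO dw_sales VALUES (?, ?, ?)",
    [("marketing", "A", 10.0), ("sales", "B", 20.0), ("sales", "C", 5.0)],
)

# Logical view: no data is copied; the mart is a window onto the DW table.
con.execute("CREATE VIEW sales_mart_view AS "
            "SELECT product, amount FROM dw_sales WHERE dept = 'sales'")

# Physical subset: the rows are copied into a separate table.
con.execute("CREATE TABLE sales_mart_copy AS "
            "SELECT product, amount FROM dw_sales WHERE dept = 'sales'")

# A later change in the DW is visible through the view, but not in the
# physical copy until the mart is reloaded.
con.execute("INSERT INTO dw_sales VALUES ('sales', 'D', 7.5)")
view_rows = con.execute("SELECT COUNT(*) FROM sales_mart_view").fetchone()[0]
copy_rows = con.execute("SELECT COUNT(*) FROM sales_mart_copy").fetchone()[0]
```

After the extra insert, the view reports one more row than the copy, which is why a physical subset needs a periodic load.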

An Image of Dependent Data Mart is shown below:


#2) Independent Data Mart


An independent data mart is best suited to small departments in an organization. Here data is not
sourced from the existing data warehouse. The independent data mart depends neither on the enterprise
DW nor on other data marts.

Independent data marts are stand-alone systems where data is extracted, transformed, and loaded from
external or internal data sources. They are easy to design and maintain as long as they support simple
department-wise business needs.

With independent data marts, you have to work through each phase of the ETT process in the same way
as when data is processed into a centralized DW. However, the number of sources and the amount of data
populated into the data marts may be smaller.

Pictorial representation of an Independent Data Mart:


#3) Hybrid Data Mart


In a hybrid data mart, data is integrated from both the DW and other operational systems. Hybrid data
marts are flexible, with large storage structures. A hybrid data mart can also refer to other data marts' data.

Pictorial representation of a Hybrid Data Mart:

Implementation Steps Of A Data Mart


The implementation of a data mart, which is considered to be a bit complex, is explained in the steps below:

• Designing: Starting from the time business users request a data mart, the design phase involves
gathering requirements, identifying the appropriate data from the respective data sources, and creating
the logical and physical data structures and ER diagrams.
• Constructing: The team builds all the tables, views, indexes, etc., in the data mart system.
• Populating: Data is extracted, transformed, and loaded into the data mart along with its metadata.
• Accessing: The data mart's data is made available to the end users. They can query the data
for their analysis and reports.
• Managing: This involves various managerial tasks such as user access controls, data mart
performance fine-tuning, maintaining existing data marts, and creating data mart recovery
scenarios in case the system fails.


Structure Of A Data Mart


The structure of each data mart is created as per the requirement. Data mart structures are called star joins.
This structure will differ from one data mart to another.

Star joins are multi-dimensional structures formed with fact and dimension tables to support large
amounts of data. A star join has a fact table in the center surrounded by the dimension tables.

The fact table's data is associated with the dimension tables' data through foreign key references. A fact
table can be surrounded by 20-30 dimension tables.

As in the DW system, the fact tables in star joins contain only numerical data, and the
respective textual data is described in the dimension tables. This structure resembles a star schema in a DW.
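
A star join can be sketched with Python's built-in `sqlite3` module. The table names, the product names, and the sample rows below are assumptions made for illustration only; the point is the shape: one fact table holding numeric measures and foreign keys, joined to the dimension tables that hold the descriptive text.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_store   (store_id INTEGER PRIMARY KEY, store_name TEXT, region TEXT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);
-- Fact table: numeric measures plus a foreign key to each dimension.
CREATE TABLE fact_sales (
    store_id   INTEGER REFERENCES dim_store(store_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    qty        INTEGER,
    value      REAL
);
INSERT INTO dim_store VALUES (16, 'Sunny', 'S'), (64, 'San', 'W');
INSERT INTO dim_product VALUES (30, 'Pen'), (40, 'Ink');
INSERT INTO fact_sales VALUES (16, 30, 5, 3.67), (64, 40, 5, 2.50), (16, 40, 2, 1.00);
""")

# Star join: the fact table is joined to a dimension to label the measures.
qty_by_region = dict(con.execute("""
    SELECT s.region, SUM(f.qty)
    FROM fact_sales f JOIN dim_store s ON s.store_id = f.store_id
    GROUP BY s.region
""").fetchall())
```

Here `qty_by_region` maps each region code from the store dimension to the total quantity taken from the fact table.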

Pictorial representation of a Star Join Structure.

However, the granular data from the centralized DW is the base for any data mart's data. Many calculations
are performed on the normalized DW data to transform it into multidimensional data mart data,
which is stored in the form of cubes.

This works in much the same way as the transformation of data from legacy source systems into normalized
DW data.

When Is A Pilot Data Mart Useful?


A pilot can be deployed in a small environment with a restricted number of users to check whether the
deployment is successful before the full-fledged deployment. However, this is not always essential.
The pilot deployment is of no further use once its purpose is met.


Consider a pilot deployment in the following scenarios:
• If the end users are new to the data warehouse system.
• If the end users want to feel comfortable retrieving data and reports by themselves before going to
production.
• If the end users want hands-on experience with the latest tools or technologies.
• If the management wants to see the benefits as a proof of concept before making it a big release.
• If the team wants to ensure that all ETL components and infrastructure components work well before
the release.

Drawbacks Of Data Mart


Though data marts have some benefits over a DW, they also have some drawbacks, as explained below:

• Unwanted data marts that have been created are tough to maintain.
• Data marts are meant for small business needs. Increasing the size of a data mart will decrease its
performance.
• If you create a large number of data marts, then the management should properly take care of
their versioning, security, and performance.
• Data marts may contain historical, summarized, or detailed data. However, updates to the DW data
and the data mart data may not happen at the same time, which can lead to data inconsistency.
Partitioning Strategy
Partitioning is done to enhance performance and facilitate easy management of data. Partitioning also
helps in balancing the various requirements of the system. It optimizes the hardware performance and
simplifies the management of the data warehouse by splitting each fact table into multiple separate
partitions. In this chapter, we will discuss different partitioning strategies.
Why is it Necessary to Partition?
Partitioning is important for the following reasons −

• For easy management,
• To assist backup/recovery,
• To enhance performance.

For Easy Management


The fact table in a data warehouse can grow up to hundreds of gigabytes in size. A fact table this large
is very hard to manage as a single entity. Therefore it needs partitioning.
To Assist Backup/Recovery
If we do not partition the fact table, then we have to load the complete fact table with all the data.
Partitioning allows us to load only as much data as is required on a regular basis. It reduces the time to load
and also enhances the performance of the system.
Note − To cut down on the backup size, all partitions other than the current partition can be marked as
read-only, which puts them into a state where they cannot be modified. Each read-only partition then
needs to be backed up only once, so regular backups have to cover only the current partition.


To Enhance Performance
By partitioning the fact table into sets of data, query processing can be improved. Query performance
is enhanced because the query now scans only those partitions that are relevant. It does not have to scan
the whole data.

Horizontal Partitioning
There are various ways in which a fact table can be partitioned. In horizontal partitioning, we have to
keep in mind the requirements for manageability of the data warehouse.

Partitioning by Time into Equal Segments


In this partitioning strategy, the fact table is partitioned on the basis of time period. Here each time period
represents a significant retention period within the business. For example, if users query for
month-to-date data, then it is appropriate to partition the data into monthly segments. We can reuse the
partitioned tables by removing the data in them.
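
Monthly segmentation and the resulting partition pruning can be sketched in a few lines of Python. The sample rows and the helper names `partition_key` and `month_to_date` are hypothetical; a real warehouse would rely on the partitioning features of its DBMS rather than in-memory lists.

```python
from collections import defaultdict
from datetime import date

def partition_key(sale_date):
    """Monthly segment label, e.g. '2013-09'."""
    return f"{sale_date.year:04d}-{sale_date.month:02d}"

# Hypothetical fact rows: (sale_date, value).
rows = [(date(2013, 8, 3), 3.67), (date(2013, 9, 3), 5.33),
        (date(2013, 9, 3), 2.50), (date(2013, 9, 20), 5.66)]

# Route every row to the partition for its month.
partitions = defaultdict(list)
for sale_date, value in rows:
    partitions[partition_key(sale_date)].append((sale_date, value))

def month_to_date(as_of):
    """Scan only the partition for as_of's month (partition pruning)."""
    segment = partitions.get(partition_key(as_of), [])
    return sum(v for d, v in segment if d <= as_of)
```

A month-to-date query for a day in September reads only the September segment; the August rows are never touched.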

Partition by Time into Different-sized Segments


This kind of partitioning is used where aged data is accessed infrequently. It is implemented as a set of
small partitions for relatively current data and larger partitions for inactive data.

Points to Note
• The detailed information remains available online.
• The number of physical tables is kept relatively small, which reduces the operating
cost.
• This technique is suitable where a mix of data dipping into recent history and data mining
through the entire history is required.
• This technique is not useful where the partitioning profile changes on a regular basis,
because repartitioning will increase the operating cost of the data warehouse.
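
One way to picture different-sized segments is a rule that gives recent data a monthly segment and older data a yearly one. The sketch below is an assumption-laden illustration: the three-month cut-off and the `segment_for` helper are hypothetical choices, not something prescribed by the technique.

```python
from datetime import date

def segment_for(sale_date, today, recent_months=3):
    """Monthly segment for recent data, yearly segment for older data."""
    age_in_months = ((today.year - sale_date.year) * 12
                     + (today.month - sale_date.month))
    if age_in_months < recent_months:
        # Small partition for relatively current data.
        return f"{sale_date.year:04d}-{sale_date.month:02d}"
    # Larger partition for inactive data.
    return f"{sale_date.year:04d}"
```

Because the cut-off moves with `today`, data has to be merged into the larger segments as it ages, which is the repartitioning cost mentioned in the points above.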


Partition on a Different Dimension


The fact table can also be partitioned on the basis of dimensions other than time, such as product group,
region, supplier, or any other dimension. Let's look at an example.
Suppose a marketing function has been structured into distinct regional departments, for example on a
state-by-state basis. If each region wants to query information captured within its own region, it is more
effective to partition the fact table into regional partitions. This speeds up the queries because
they do not have to scan information that is not relevant.
Points to Note
• The query does not have to scan irrelevant data, which speeds up the query process.
• This technique is not appropriate where the dimension is likely to change in the future.
So, it is worth confirming that the dimension will not change in the future.
• If the dimension changes, then the entire fact table would have to be repartitioned.

Note − We recommend partitioning only on the basis of the time dimension, unless you are
certain that the suggested dimension grouping will not change within the life of the data warehouse.
Partition by Size of Table
When there is no clear basis for partitioning the fact table on any dimension, we should partition
the fact table on the basis of its size. We can set a predetermined size as the critical point. When the
table exceeds the predetermined size, a new table partition is created.
Points to Note
• This partitioning is complex to manage.
• It requires metadata to identify what data is stored in each partition.
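
Size-based partitioning, together with the metadata it needs, can be sketched as follows. The class name, the row-count threshold, and the use of row ids that arrive in increasing order are all assumptions made for this illustration.

```python
class SizePartitionedTable:
    """Starts a new partition whenever the current one reaches max_rows."""

    def __init__(self, max_rows):
        self.max_rows = max_rows
        self.partitions = [[]]
        # Metadata: for each partition, the first and last row id it holds.
        self.metadata = []

    def insert(self, row_id, row):
        if len(self.partitions[-1]) >= self.max_rows:
            self.partitions.append([])      # critical size reached
        current = len(self.partitions) - 1
        self.partitions[-1].append((row_id, row))
        if current == len(self.metadata):
            self.metadata.append({"first": row_id, "last": row_id})
        else:
            self.metadata[current]["last"] = row_id

    def locate(self, row_id):
        """Use the metadata to find which partition holds row_id."""
        for index, span in enumerate(self.metadata):
            if span["first"] <= row_id <= span["last"]:
                return index
        return None
```

Without the `metadata` list, a lookup would have to scan every partition, which is why this strategy is described as complex to manage.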

Partitioning Dimensions
If a dimension contains a large number of entries, then the dimension itself needs to be partitioned. Here we
have to check the size of the dimension.
Consider a large design that changes over time. If we need to store all the variations in order to make
comparisons, that dimension may become very large. This would definitely affect the response time.
Round Robin Partitions
In the round robin technique, when a new partition is needed, the oldest one is archived. Metadata is used
to allow the user access tool to refer to the correct table partition.
This technique makes it easy to automate table management facilities within the data warehouse.
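
The round robin idea can be sketched with a fixed-length queue of live partitions plus a metadata map. The class, the `sales_` naming pattern, and the period labels are hypothetical and only illustrate the rotation.

```python
from collections import deque

class RoundRobinPartitions:
    """Keeps a fixed number of live partitions; the oldest is archived."""

    def __init__(self, max_live):
        self.max_live = max_live
        self.live = deque()        # partition names, oldest first
        self.archived = []
        self.metadata = {}         # period label -> partition name

    def add_period(self, period):
        if len(self.live) == self.max_live:
            oldest = self.live.popleft()
            self.archived.append(oldest)
            # Drop the metadata entry that pointed at the archived partition.
            self.metadata = {p: n for p, n in self.metadata.items()
                             if n != oldest}
        name = f"sales_{period}"
        self.live.append(name)
        self.metadata[period] = name

    def table_for(self, period):
        """What an access tool would ask: which table holds this period?"""
        return self.metadata.get(period)
```

Since every rotation follows the same steps, the whole cycle lends itself to automation, as noted above.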
Vertical Partition
Vertical partitioning splits the data vertically, that is, by columns.

Vertical partitioning can be performed in the following two ways:

• Normalization
• Row Splitting

Normalization
Normalization is the standard relational method of database organization. In this method, repeated
rows are collapsed into a single row, which reduces space. Take a look at the following tables
that show how normalization is performed.
Table before Normalization

Product_id  Qty  Value  sales_date  Store_id  Store_name  Location   Region
30          5    3.67   3-Aug-13    16        sunny       Bangalore  S
35          4    5.33   3-Sep-13    16        sunny       Bangalore  S
40          5    2.50   3-Sep-13    64        san         Mumbai     W
45          7    5.66   3-Sep-13    16        sunny       Bangalore  S

Table after Normalization

Store_id  Store_name  Location   Region
16        sunny       Bangalore  S
64        san         Mumbai     W

Product_id  Qty  Value  sales_date  Store_id
30          5    3.67   3-Aug-13    16
35          4    5.33   3-Sep-13    16
40          5    2.50   3-Sep-13    64
45          7    5.66   3-Sep-13    16
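
The same split can be written out in Python, using the rows of the table before normalization. This is a minimal sketch; the variable names `stores` and `sales` are chosen here for illustration.

```python
# Denormalized rows from the table before normalization:
# (product_id, qty, value, sales_date, store_id, store_name, location, region)
rows = [
    (30, 5, 3.67, "3-Aug-13", 16, "sunny", "Bangalore", "S"),
    (35, 4, 5.33, "3-Sep-13", 16, "sunny", "Bangalore", "S"),
    (40, 5, 2.50, "3-Sep-13", 64, "san", "Mumbai", "W"),
    (45, 7, 5.66, "3-Sep-13", 16, "sunny", "Bangalore", "S"),
]

stores = {}   # store_id -> (store_name, location, region), one row per store
sales = []    # (product_id, qty, value, sales_date, store_id)
for product_id, qty, value, sales_date, store_id, name, location, region in rows:
    stores[store_id] = (name, location, region)   # repeated store rows collapse
    sales.append((product_id, qty, value, sales_date, store_id))
```

The three repeated descriptions of store 16 collapse into a single entry in `stores`, while `sales` keeps one row per sale and refers to the store only by `store_id`.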

Row Splitting
Row splitting tends to leave a one-to-one map between partitions. The motive of row splitting is to speed
up access to a large table by reducing its size.
Note − While using vertical partitioning, make sure that there is no requirement to perform a major join
operation between two partitions.
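
Row splitting can be sketched as cutting each row into two narrower rows that share the same key. The `split_rows` helper, the column names, and the choice of which columns are frequently accessed are assumptions made for this example.

```python
def split_rows(rows, key, hot_columns):
    """Split each row into a 'hot' and a 'cold' part that share the key."""
    hot, cold = {}, {}
    for row in rows:
        k = row[key]
        # Frequently accessed columns go to the hot partition.
        hot[k] = {c: row[c] for c in hot_columns}
        # Everything else (apart from the key) goes to the cold partition.
        cold[k] = {c: v for c, v in row.items()
                   if c != key and c not in hot_columns}
    return hot, cold

rows = [{"txn_id": 1, "value": 9.5, "notes": "long free text"},
        {"txn_id": 2, "value": 3.0, "notes": "more free text"}]
hot, cold = split_rows(rows, "txn_id", ["value"])
```

Every key appears exactly once in each part, which is the one-to-one map between partitions; a query that needs both `value` and `notes` has to join the two parts again, which is the cost the note above warns about.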
Identify Key to Partition
It is crucial to choose the right partition key. Choosing the wrong partition key will lead to reorganizing
the fact table. Let's look at an example. Suppose we want to partition the following table.
Account_Txn_Table
transaction_id
account_id
transaction_type
value
transaction_date
region
branch_name
We can choose to partition on any key. The two possible keys could be

• Region
• Transaction date
Suppose the business is organized in 30 geographical regions and each region has a different number of
branches. That will give us 30 partitions, which is reasonable. This partitioning is good enough because
our requirements capture has shown that a vast majority of queries are restricted to the user's own
business region.
If we partition by transaction date instead of region, then the latest transactions from every region will be
in one partition. Now a user who wants to look at data within their own region has to query across
multiple partitions. Hence it is worth determining the right partitioning key.
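
The effect of the key choice can be checked with a small Python sketch that counts how many partitions a single-region query has to read under each key. The sample transactions, the region names, and the monthly grouping of the transaction date are hypothetical.

```python
from collections import defaultdict

# Hypothetical transactions: (transaction_id, region, transaction month).
txns = [(1, "north", "2024-01"), (2, "south", "2024-01"),
        (3, "north", "2024-02"), (4, "north", "2024-03")]

def partitions_touched(key_index, region):
    """Count the partitions a query for one region must read."""
    parts = defaultdict(list)
    for txn in txns:
        parts[txn[key_index]].append(txn)
    return sum(1 for rows in parts.values()
               if any(t[1] == region for t in rows))

by_region = partitions_touched(1, "north")   # partitioned on region
by_date = partitions_touched(2, "north")     # partitioned on transaction date
```

With region as the key, the query for one region reads a single partition; with transaction date as the key, the same query has to read every month in which that region had a transaction.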
