Department of Computer Science and Engineering: Rajalakshmi Institute of Technology
(R2021)
RIT CCS341 - DATA WAREHOUSING
UNIT-III
METADATA, DATA MART AND PARTITION STRATEGY
Meta Data – Categories of Metadata – Role of Metadata – Metadata Repository – Challenges for Metadata
Management – Data Mart – Need of Data Mart – Cost Effective Data Mart – Designing Data Marts – Cost
of Data Marts – Partitioning Strategy – Vertical Partition – Normalization – Row Splitting – Horizontal
Partition
Challenges for Metadata Management
Each software tool has its own proprietary metadata. If you are using several tools in your
data warehouse, how can you reconcile the formats?
No industry-wide accepted standards exist for metadata formats.
There are conflicting claims on the advantages of a centralized metadata repository as
opposed to a collection of fragmented metadata stores.
There are no easy and accepted methods of passing metadata along the processes as data
moves from the source systems to the staging area and thereafter to the data warehouse
storage.
Preserving version control of metadata uniformly throughout the data warehouse is
tedious and difficult.
In a large data warehouse with numerous source systems, unifying the metadata relating
to the data sources can be an enormous task. You have to deal with conflicting
standards, formats, data naming conventions, data definitions, attributes, values,
business rules, and units of measure. You have to resolve indiscriminate use of aliases
and compensate for inadequate data validation rules.
Metadata Repository
Think of a metadata repository as a general-purpose information directory or cataloguing device to
classify, store, and manage metadata. As we have seen earlier, business metadata and technical metadata
serve different purposes. The end-users need the business metadata; data warehouse developers and
administrators require the technical metadata.
The structures of these two categories of metadata also vary. Therefore, the metadata repository can be
thought of as two distinct information directories, one to store business metadata and the other to store
technical metadata. This division may also be logical within a single physical repository.
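As a rough illustration, here is a minimal Python sketch of a repository that keeps the two directories logically separate within one physical store. The entry names and fields are made-up assumptions, not any product's actual schema.

```python
# A minimal sketch of a metadata repository keeping business and
# technical metadata in logically separate directories within one
# physical store. All names and fields are illustrative assumptions.

repository = {
    "business": {},   # end-user facing definitions
    "technical": {},  # developer/administrator facing definitions
}

def register(directory, name, entry):
    """Store a metadata entry under the business or technical directory."""
    repository[directory][name] = entry

register("business", "customer",
         {"definition": "A person or organization that buys our products",
          "owner": "Sales department"})
register("technical", "customer",
         {"table": "CUSTOMER_DIM", "source_system": "CRM",
          "refresh": "daily"})

# The same logical name resolves differently for each audience.
print(repository["business"]["customer"]["definition"])
print(repository["technical"]["customer"]["table"])
```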
The following Figure shows the typical contents in a metadata repository. Notice the division between
business and technical metadata. Did you also notice another component called the information navigator?
This component is implemented in different ways in commercial offerings. The functions of the
information navigator include the following:
Interface from query tools. This function connects third-party query tools to the data warehouse so
that the definitions held in the technical metadata can be viewed from those tools.
Drill-down for details. The user of metadata can drill down and proceed from one level of metadata to a
lower level for more information. For example, you can first get the definition of a data table, then go to
the next level to see all its attributes, and go further to get the details of individual attributes (see the
sketch after this list).
Review predefined queries and reports. The user is able to review predefined queries and reports, and
launch the selected ones with proper parameters.
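Here is a minimal Python sketch of such drill-down navigation over a toy in-memory metadata store; the table SALES_FACT and its attributes are invented for illustration.

```python
# A sketch of metadata drill-down: each level of the hierarchy opens
# into a lower, more detailed level. The table and attribute details
# shown here are made-up examples.

metadata = {
    "SALES_FACT": {
        "definition": "Daily sales transactions at line-item grain",
        "attributes": {
            "sale_amount": {"type": "DECIMAL(12,2)", "nullable": False},
            "quantity":    {"type": "INTEGER",       "nullable": False},
        },
    }
}

def drill_down(table, attribute=None):
    """Level 1: table definition; level 2: attribute names; level 3: one attribute's details."""
    entry = metadata[table]
    if attribute is None:
        return entry["definition"], list(entry["attributes"])
    return entry["attributes"][attribute]

print(drill_down("SALES_FACT"))                 # definition plus attribute names
print(drill_down("SALES_FACT", "sale_amount"))  # details of one attribute
```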
A centralized metadata repository accessible from all parts of the data warehouse for your end-users,
developers, and administrators appears to be an ideal solution for metadata management. But for a
centralized metadata repository to be the best solution, the repository must meet some basic requirements.
Let us quickly review these requirements. It is not easy to find a repository tool that satisfies every one of
the requirements listed below.
Flexible organization. Allow the data administrator to classify and organize metadata into logical
categories and subcategories, and assign specific components of metadata to the classifications.
Selection of a suitable metadata repository product is one of the key decisions the project team must
make. Use the above list of criteria as a guide while evaluating repository tools for your data warehouse.
Data Mart
A data mart is a small portion of the data warehouse that is mainly related to a particular business
domain, such as marketing or sales.
The data stored in the DW system is huge, so data marts are designed with a subset of data that belongs
to individual departments. A specific group of users can then easily utilize this data for their analysis.
Unlike a data warehouse, which has many combinations of users, each data mart has a particular set of
end-users. The smaller number of end-users results in better response time.
Data marts are also accessible to business intelligence (BI) tools. Data marts do not contain duplicated
or unused data, and they are updated at regular intervals. They are subject-oriented, flexible databases.
Each team has the right to develop and maintain its data marts without modifying the data warehouse's
or other data marts' data.
A data mart is more suitable for small businesses, as it costs far less than a data warehouse system. The
time required to build a data mart is also less than the time required for building a data warehouse.
Designing Data Marts
Identify the Functional Splits: Divide the organization's data into department-specific data so that each
data mart meets its own requirements, without any further organizational dependency (see the sketch
after this list).
Identify User Access Tool Requirements: There may be different user access tools in the
market that need different data structures. Data marts are used to support all these internal
structures without disturbing the DW data. One data mart can be associated with one tool, as per
user needs. Data marts can also provide updated data to such tools daily.
Identify Access Control Issues: If different data segments in a DW system need privacy and
should be accessed only by a set of authorized users, then all such data can be moved into data marts.
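Below is a minimal Python sketch of a functional split, routing an organization-wide table into department-specific marts; the department names and record fields are illustrative assumptions.

```python
# A sketch of a functional split: rows of an organization-wide table are
# routed into department-specific marts. Department names and fields are
# illustrative assumptions.

warehouse_rows = [
    {"dept": "sales",     "order_id": 1, "amount": 250.0},
    {"dept": "marketing", "campaign": "summer", "cost": 900.0},
    {"dept": "sales",     "order_id": 2, "amount": 120.0},
]

data_marts = {}
for row in warehouse_rows:
    # Each department's mart receives only its own rows, so it can be
    # used without any further organizational dependency.
    data_marts.setdefault(row["dept"], []).append(row)

print(data_marts["sales"])      # only sales rows
print(data_marts["marketing"])  # only marketing rows
```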
Cost of Data Marts
Hardware and Software Cost: Any newly added data mart may need extra hardware, software,
processing power, network capacity, and disk storage space to work on the queries requested by the
end-users. This makes data marting an expensive strategy, so the budget should be planned precisely.
Network Access: If the location of the data mart is different from that of the data warehouse,
then all the data must be transferred as part of the data mart loading process. Thus a network must
be provided to transfer huge volumes of data, which may be expensive.
Time Window Constraints: The time taken for the data mart loading process depends on
various factors, such as the complexity and volume of the data, network capacity, and the data transfer
mechanisms.
Data marts are classified into three types: dependent, independent, and hybrid. This classification
is based on how they are populated, i.e., either from a data warehouse or from other data sources.
Extraction, Transformation, and Transportation (ETT) is the process used to populate a data mart's
data from any source system.
Dependent Data Marts
A dependent data mart can use DW data either logically or physically, as shown below:
Logical view: In this scenario, the data mart's data is not physically separated from the DW. It
refers to DW data logically, through virtual views or tables.
Physical subset: In this scenario, the data mart's data is physically separated from the DW. Once one
or more data marts are developed, you can allow the users to access only the data marts, or to access
both the data marts and the data warehouse.
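A minimal sketch of the two options using SQLite from Python; the table dw_sales, the region filter, and all data values are made-up assumptions.

```python
# A sketch contrasting a logical data mart (a virtual view over DW data)
# with a physical one (a separately materialized table). Table and
# column names are illustrative.
import sqlite3

dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE dw_sales (region TEXT, amount REAL)")
dw.executemany("INSERT INTO dw_sales VALUES (?, ?)",
               [("north", 100.0), ("south", 250.0), ("north", 75.0)])

# Logical view: the mart's data is not physically separated from the DW.
dw.execute("CREATE VIEW north_mart AS "
           "SELECT * FROM dw_sales WHERE region = 'north'")

# Physical subset: the mart's data is copied out into its own table
# (in practice, into a separate database).
dw.execute("CREATE TABLE north_mart_physical AS "
           "SELECT * FROM dw_sales WHERE region = 'north'")

print(dw.execute("SELECT SUM(amount) FROM north_mart").fetchone())
print(dw.execute("SELECT SUM(amount) FROM north_mart_physical").fetchone())
```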
ETT is a simplified process in the case of dependent data marts, because the usable data already
exists in the centralized DW. Only the appropriate set of summarized data needs to be moved to the
respective data marts.
Independent Data Marts
Independent data marts are stand-alone systems where data is extracted, transformed, and loaded from
external or internal data sources. They are easy to design and maintain as long as they support simple,
department-wise business needs.
For independent data marts, you have to work through each phase of the ETT process in much the same
way as data is processed into the centralized DW. However, the number of sources and the amount of
data populated into the data marts may be smaller.
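As a rough sketch, the ETT phases for an independent data mart might look like the Python below, pulling straight from a source system rather than from the central DW; the source records and transformation rules are illustrative assumptions.

```python
# A sketch of the ETT phases for an independent data mart.
# The source rows and department-specific rules are illustrative.

def extract():
    # Pretend these rows came from an external operational source.
    return [{"sku": "A1", "qty": "5", "price": "2.50"},
            {"sku": "B2", "qty": "3", "price": "4.00"}]

def transform(rows):
    # Apply department-specific typing and derived values.
    return [{"sku": r["sku"],
             "qty": int(r["qty"]),
             "revenue": int(r["qty"]) * float(r["price"])} for r in rows]

def transport(rows, mart):
    # Move the transformed rows into the mart's storage.
    mart.extend(rows)

sales_mart = []
transport(transform(extract()), sales_mart)
print(sales_mart)
```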
Steps in Implementing Data Marts
Designing: The designing phase begins when business users request a data mart. It involves
requirements gathering, identifying the appropriate data in the respective data sources, and creating the
logical and physical data structures and ER diagrams.
Constructing: The team builds all the tables, views, indexes, etc., in the data mart system.
Populating: Data is extracted, transformed, and loaded into the data mart, along with metadata.
Accessing: Data Mart data is available to be accessed by the end-users. They can query the data
for their analysis and reports.
Managing: This involves various managerial tasks, such as user access control, data mart
performance fine-tuning, maintaining existing data marts, and creating data mart recovery
scenarios in case the system fails.
Star joins are multidimensional structures formed with fact and dimension tables to support large
amounts of data. A star join has a fact table in the center, surrounded by the dimension tables.
The fact table's data is associated with the dimension tables' data through foreign key references. A fact
table can be surrounded by 20-30 dimension tables.
As in the DW system, in star joins the fact tables contain only numerical data, and the
respective textual data is described in the dimension tables. This structure resembles a star schema in the DW.
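A minimal sketch of a star join using SQLite from Python: a central fact table holding numeric measures, joined to a dimension table through a foreign key. The schema and data are illustrative assumptions.

```python
# A sketch of a star join: numeric measures in the fact table,
# textual descriptions in the dimension table. Schema is illustrative.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, name TEXT)")
db.execute("""CREATE TABLE sales_fact (
                  product_id INTEGER REFERENCES product_dim(product_id),
                  quantity INTEGER, amount REAL)""")
db.executemany("INSERT INTO product_dim VALUES (?, ?)", [(1, "Pen"), (2, "Book")])
db.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)",
               [(1, 10, 5.0), (2, 2, 30.0), (1, 4, 2.0)])

# The star join: dimension supplies the text, fact supplies the numbers.
for row in db.execute("""SELECT d.name, SUM(f.quantity), SUM(f.amount)
                          FROM sales_fact f
                          JOIN product_dim d ON d.product_id = f.product_id
                          GROUP BY d.name"""):
    print(row)
```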
The granular data from the centralized DW is the base for any data mart's data. Many calculations
are performed on the normalized DW data to transform it into multidimensional data mart data,
which is stored in the form of cubes.
This is similar to how the data from legacy source systems is transformed into normalized
DW data.
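A minimal sketch of that transformation: granular DW rows are summarized into one pre-computed cell per dimension combination, which is essentially what a cube stores. The dimensions and measure here are illustrative assumptions.

```python
# A sketch of summarizing granular DW rows into the multidimensional
# (cube-like) aggregates a data mart stores. Dimensions and the measure
# are illustrative assumptions.
from collections import defaultdict

granular = [
    {"region": "north", "month": "2013-08", "amount": 100.0},
    {"region": "north", "month": "2013-08", "amount": 50.0},
    {"region": "south", "month": "2013-09", "amount": 80.0},
]

# Pre-compute one cell per (region, month) combination.
cube = defaultdict(float)
for row in granular:
    cube[(row["region"], row["month"])] += row["amount"]

print(cube[("north", "2013-08")])  # 150.0
```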
Consider a pilot deployment in the scenarios below:
If the end-users are new to the data warehouse system.
If the end-users want to get comfortable retrieving data and reports by themselves before going to
production.
If the end-users want hands-on experience with the latest tools or technologies.
If the management wants to see the benefits as a proof of concept before committing to a big release.
If the team wants to ensure that all ETL components and infrastructure components work well before
the release.
Unwanted data marts, once created, are tough to maintain.
Data marts are meant for small business needs; increasing the size of a data mart will decrease its
performance.
If you create a large number of data marts, the management should take proper care of
their versioning, security, and performance.
Data marts may contain historical, summarized, or detailed data. However, updates to the DW data
and the data mart data may not happen at the same time, which can lead to data inconsistency issues.
Partitioning Strategy
Partitioning is done to enhance performance and facilitate easy management of data. Partitioning also
helps in balancing the various requirements of the system. It optimizes the hardware performance and
simplifies the management of the data warehouse by partitioning each fact table into multiple separate
partitions. In this unit, we will discuss different partitioning strategies.
Why is it Necessary to Partition?
Partitioning is important for the following reasons −
To Enhance Performance
By partitioning the fact table into sets of data, query procedures can be enhanced. Query performance
improves because a query now scans only those partitions that are relevant; it does not have to scan
all the data.
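A minimal sketch of this pruning effect, with partitioning by year chosen purely for illustration:

```python
# A sketch of why partitioning speeds up queries: the query touches only
# the relevant partition instead of scanning all rows. Partitioning by
# year here is an illustrative choice.

partitions = {
    2012: [{"txn": 1, "value": 10.0}],
    2013: [{"txn": 2, "value": 20.0}, {"txn": 3, "value": 5.0}],
}

def total_for_year(year):
    # Only the matching partition is scanned; the rest are ignored.
    return sum(row["value"] for row in partitions[year])

print(total_for_year(2013))  # scans 2 rows, not the whole fact table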
Horizontal Partitioning
There are various ways in which a fact table can be partitioned. In horizontal partitioning, we have to
keep in mind the requirements for manageability of the data warehouse.
Points to Note
The detailed information remains available online.
The number of physical tables is kept relatively small, which reduces the operating
cost.
This technique is suitable where a mix of dipping into recent data and mining
through the entire history is required.
This technique is not useful where the partitioning profile changes on a regular basis,
because repartitioning will increase the operation cost of data warehouse.
Note − We recommend performing the partitioning only on the basis of the time dimension, unless you are
certain that the suggested dimension grouping will not change within the life of the data warehouse.
Partition by Size of Table
When there are no clear basis for partitioning the fact table on any dimension, then we should partition
the fact table on the basis of their size. We can set the predetermined size as a critical point. When the
table exceeds the predetermined size, a new table partition is created.
Points to Note
This partitioning is complex to manage.
It requires metadata to identify what data is stored in each partition.
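A minimal sketch of size-based partitioning, with metadata recording what each partition holds; the two-row limit is purely illustrative.

```python
# A sketch of size-based partitioning: when the current partition reaches
# a predetermined row limit, a new one is started, and metadata records
# what each partition holds. The limit of 2 rows is purely illustrative.

MAX_ROWS = 2
partitions = [[]]          # list of partitions, newest last
partition_metadata = [[]]  # which keys each partition contains

def insert(row):
    if len(partitions[-1]) >= MAX_ROWS:
        partitions.append([])          # size exceeded: start a new partition
        partition_metadata.append([])
    partitions[-1].append(row)
    partition_metadata[-1].append(row["id"])

for i in range(5):
    insert({"id": i, "value": i * 1.5})

print(len(partitions))     # 3 partitions of at most 2 rows each
print(partition_metadata)  # metadata identifies the data in each partition
```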
Partitioning Dimensions
If a dimension contains a large number of entries, then it is necessary to partition the dimension. Here we
have to check the size of the dimension.
Consider a large design that changes over time. If we need to store all the variations in order to apply
comparisons, the dimension may become very large. This would definitely affect the response time.
Round Robin Partitions
In the round robin technique, when a new partition is needed, the old one is archived. Metadata
is used to allow the user access tools to refer to the correct table partition.
This technique makes it easy to automate table management facilities within the data warehouse.
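A minimal sketch of round-robin partition management; keeping three periods online is an illustrative assumption.

```python
# A sketch of round-robin partition management: a fixed number of
# partitions is kept online; when a new one is needed, the oldest is
# archived, and metadata keeps pointing access tools at the right
# partition. The window of 3 periods is an illustrative assumption.
from collections import OrderedDict

ONLINE_PARTITIONS = 3
online = OrderedDict()  # period -> rows, oldest first
archive = {}

def new_partition(period):
    if len(online) >= ONLINE_PARTITIONS:
        old_period, old_rows = online.popitem(last=False)
        archive[old_period] = old_rows  # archive the oldest partition
    online[period] = []

for month in ["2013-06", "2013-07", "2013-08", "2013-09"]:
    new_partition(month)

print(list(online))   # ['2013-07', '2013-08', '2013-09']
print(list(archive))  # ['2013-06'], archived but still locatable via metadata
```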
Vertical Partition
Vertical partitioning splits the data vertically, either by normalization or by row splitting. The
following image depicts how vertical partitioning is done.
Normalization
Normalization is the standard relational method of database organization. In this method, duplicate
rows are collapsed into a single row, which reduces space. Take a look at the following tables,
which show how normalization is performed.
Table before Normalization
16 Sunny Bangalore W
64 San Mumbai S
30 5 3.67 3-Aug-13 16
35 4 5.33 3-Sep-13 16
40 5 2.50 3-Sep-13 64
45 7 5.66 3-Sep-13 16
Row Splitting
Row splitting tends to leave a one-to-one map between the partitions. The motive of row splitting is to
speed up access to a large table by reducing its size; a sketch follows the note below.
Note − While using vertical partitioning, make sure that there is no requirement to perform a major join
operation between two partitions.
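A minimal sketch of row splitting: a wide table is split into two narrower tables that keep a one-to-one mapping through a shared key, so the frequently used columns live in a smaller, faster table. The column names are illustrative assumptions.

```python
# A sketch of row splitting: split a wide table into two narrower tables
# with a one-to-one mapping through the shared key. Columns are
# illustrative assumptions.

wide_table = [
    {"product_id": 30, "quantity": 5, "value": 3.67,
     "long_description": "bulky text rarely queried"},
    {"product_id": 35, "quantity": 4, "value": 5.33,
     "long_description": "more bulky text"},
]

# Frequently accessed columns go into a small "hot" table.
hot = [{"product_id": r["product_id"], "quantity": r["quantity"],
        "value": r["value"]} for r in wide_table]
# Rarely accessed columns go into a "cold" table.
cold = [{"product_id": r["product_id"],
         "long_description": r["long_description"]} for r in wide_table]

# One-to-one map between the partitions via product_id.
print(hot[0]["product_id"] == cold[0]["product_id"])  # True
```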
Identify Key to Partition
It is crucial to choose the right partition key. Choosing a wrong partition key will lead to reorganizing
the fact table. Let's take an example. Suppose we want to partition the following table.
Account_Txn_Table
transaction_id
account_id
transaction_type
value
transaction_date
region
branch_name
We can choose to partition on any key. The two possible keys are
region
transaction_date
Suppose the business is organized into 30 geographical regions and each region has a different number of
branches. That will give us 30 partitions, which is reasonable. This partitioning is good enough because
our requirements capture has shown that the vast majority of queries are restricted to the user's own
business region.
If we partition by transaction_date instead of region, then the latest transactions from every region will be
in one partition. Now a user who wants to look at data within his own region has to query across
multiple partitions. Hence it is worth the effort to determine the right partitioning key.
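A minimal sketch comparing the two candidate keys for Account_Txn_Table: with region as the key, a regional query touches one partition; with transaction_date, the same query must visit every partition. The data values are illustrative assumptions.

```python
# A sketch comparing the two candidate partition keys. With region as
# the key, a user's regional query touches one partition; with
# transaction_date, it must visit every partition. Data is illustrative.
from collections import defaultdict

transactions = [
    {"region": "R1", "transaction_date": "2013-09-01", "value": 10.0},
    {"region": "R2", "transaction_date": "2013-09-01", "value": 20.0},
    {"region": "R1", "transaction_date": "2013-09-02", "value": 5.0},
]

def partition_by(key):
    parts = defaultdict(list)
    for t in transactions:
        parts[t[key]].append(t)
    return parts

by_region = partition_by("region")
by_date = partition_by("transaction_date")

# Query: all of region R1's transactions.
print(len(by_region["R1"]))  # all of R1's rows sit in one partition
print(sum(1 for p in by_date.values()
          if any(t["region"] == "R1" for t in p)))
# with date partitioning, 2 of 2 partitions must be scanned
```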