Data Warehouse
A data warehouse is a centrally managed, integrated database containing data from
the operational sources in an organization (such as SAP, CRM, and ERP systems). It may
also gather manual inputs from users that determine the criteria and parameters for
grouping or classifying records.
The database contains structured data for querying and analysis and can be accessed by
users. The data warehouse can be created or updated at any time with minimal
disruption to operational systems. This is ensured by a strategy implemented in an ETL
process.
The source for the data warehouse is a data extract from the operational databases. The
data is validated, cleansed, transformed, and finally aggregated, at which point it is
ready to be loaded into the data warehouse.
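The validate, cleanse, transform, and aggregate steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample rows, the field names (product, month, quantity, turnover), and the function names are hypothetical and do not come from any particular ETL tool.

```python
from collections import defaultdict

# Hypothetical extract from an operational system; field names are illustrative.
raw_rows = [
    {"product": " Widget ", "month": "2024-01", "quantity": "3", "turnover": "30.0"},
    {"product": "widget", "month": "2024-01", "quantity": "2", "turnover": "20.0"},
    {"product": "Gadget", "month": "2024-01", "quantity": "x", "turnover": "15.0"},  # bad quantity
    {"product": "Gadget", "month": "2024-02", "quantity": "1", "turnover": "12.5"},
]


def validate(row):
    """Return True only if the numeric fields can be parsed."""
    try:
        int(row["quantity"])
        float(row["turnover"])
    except ValueError:
        return False
    return True


def cleanse(row):
    """Normalize text fields: trim whitespace and unify letter case."""
    return {**row, "product": row["product"].strip().upper()}


def transform(row):
    """Convert string fields to the types assumed for the warehouse."""
    return {**row, "quantity": int(row["quantity"]), "turnover": float(row["turnover"])}


def aggregate(rows):
    """Sum quantity and turnover per (product, month)."""
    totals = defaultdict(lambda: {"quantity": 0, "turnover": 0.0})
    for row in rows:
        key = (row["product"], row["month"])
        totals[key]["quantity"] += row["quantity"]
        totals[key]["turnover"] += row["turnover"]
    return dict(totals)


def prepare_for_load(rows):
    """Run the four steps in order and return rows ready to be loaded."""
    valid = [row for row in rows if validate(row)]
    return aggregate(transform(cleanse(row)) for row in valid)


warehouse_rows = prepare_for_load(raw_rows)
```

In this sketch the row with the unparseable quantity is rejected during validation, and the two differently spelled Widget rows collapse into a single aggregated record.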
A data warehouse is a dedicated database that contains detailed, stable, non-volatile,
and consistent data that can be analyzed over time (that is, the data is time-variant).
When only a portion of the detailed data is required, it may be worth
considering a data mart. A data mart is generated from the data warehouse
and contains data focused on a given subject, as well as data that is frequently
accessed or summarized.
Filling the data warehouse with very detailed and poorly selected data
may grow the database to a huge size, making it difficult to manage and effectively
unusable. To significantly reduce the number of rows in the data warehouse, the data is
aggregated, which makes maintenance easier and browsing and analysis more
efficient.
A well-designed and well-maintained data warehouse can significantly improve the quality
and accessibility of company data and increase the amount of information delivered
to end users.
Key data warehouse systems and the most widely used database engines for storing and
serving data for enterprise business intelligence and performance management:
Teradata
Oracle
Microsoft SQL Server
IBM DB2
SAS
Data Warehouse Architecture
In Data Warehouse environments, the relational model can be transformed into the
following architectures:
Star schema
Snowflake schema
Constellation schema
Star schema architecture
Star schema architecture is the simplest data warehouse design. Its main feature is
a table at the center, called the fact table, together with the dimension
tables, which allow browsing of specific categories, summarizing, drill-downs, and
specifying criteria.
Typically, most of the fact tables in a star schema are in third normal form,
while dimension tables are de-normalized (second normal form).
Although the star schema is the simplest data warehouse architecture, it
is the most commonly used in data warehouse implementations across the world
today (about 90-95% of cases).
Fact table
The fact table is not a typical relational database table, as it is de-normalized on
purpose to enhance query response times. The fact table typically contains records
that are ready to explore, usually with ad hoc queries. Records in the fact table are
often referred to as events, due to the time-variant nature of a data warehouse
environment.
The primary key of the fact table is a composite of all the columns except the numeric
measures (such as QUANTITY, TURNOVER, or the exact invoice date and time).
Typical fact tables in a global enterprise data warehouse are (usually there may be
additional company or business specific fact tables):
Dimension table
Nearly all of the information in a typical fact table is also present in one or more
dimension tables. The main purpose of maintaining dimension tables is to allow
browsing the categories quickly and easily.
The primary keys of the dimension tables are combined to form the
composite primary key of the fact table. In a star schema design, there is only
one de-normalized table for a given dimension.
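A minimal sketch of a star schema, using Python's built-in sqlite3 module, may make the relationship between the fact table and its dimensions more concrete. The table names, column names, and sample rows are assumptions chosen for illustration only. The fact table's primary key is the composite of the dimension keys, and the measures (quantity, turnover) stay outside the key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One de-normalized table per dimension (hypothetical names and columns).
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT, brand TEXT);
CREATE TABLE dim_client  (client_id  INTEGER PRIMARY KEY, client_name TEXT, country TEXT);
CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, year INTEGER, month INTEGER);

-- Fact table: composite primary key built from the dimension keys,
-- numeric measures are not part of the key.
CREATE TABLE fact_sales (
    product_id INTEGER NOT NULL REFERENCES dim_product(product_id),
    client_id  INTEGER NOT NULL REFERENCES dim_client(client_id),
    date_id    INTEGER NOT NULL REFERENCES dim_date(date_id),
    quantity   INTEGER,
    turnover   REAL,
    PRIMARY KEY (product_id, client_id, date_id)
);

INSERT INTO dim_product VALUES (1, 'Widget', 'Acme'), (2, 'Gadget', 'Globex');
INSERT INTO dim_client  VALUES (10, 'Client A', 'PL'), (11, 'Client B', 'DE');
INSERT INTO dim_date    VALUES (202401, 2024, 1), (202402, 2024, 2);
INSERT INTO fact_sales  VALUES
    (1, 10, 202401, 3, 30.0),
    (1, 11, 202401, 2, 20.0),
    (2, 10, 202402, 1, 15.0);
""")

# Browsing by a dimension attribute needs a single join per dimension.
turnover_by_brand = conn.execute("""
    SELECT p.brand, SUM(f.turnover)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.brand
    ORDER BY p.brand
""").fetchall()
```

Because each dimension is a single table, a summary by any dimension attribute needs only one join from the fact table.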
Snowflake schema architecture
Snowflake schemas are generally used when a dimension table becomes very big
and when a star schema can’t represent the complexity of a data structure. For
example, if a PRODUCT dimension table contains millions of rows, the use of a
snowflake schema should significantly improve performance by moving some
data out to another table (BRANDS, for instance).
The problem is that the more normalized the dimension tables are, the more
complicated the SQL joins needed to query them become. To answer a
query, many tables must be joined and aggregates generated.
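The trade-off can be illustrated with a small sqlite3 sketch in which the brand attribute of the PRODUCT dimension is moved out to a separate BRANDS table. The table and column names and the sample data are hypothetical. Compared with a single de-normalized product table, summarizing turnover by brand now requires one additional join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized (snowflaked) product dimension: brand data lives in its own table.
CREATE TABLE brands (brand_id INTEGER PRIMARY KEY, brand_name TEXT);
CREATE TABLE product (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT,
    brand_id     INTEGER REFERENCES brands(brand_id)
);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES product(product_id),
    quantity   INTEGER,
    turnover   REAL
);

INSERT INTO brands     VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO product    VALUES (100, 'Widget', 1), (101, 'Widget XL', 1), (102, 'Gadget', 2);
INSERT INTO fact_sales VALUES (100, 3, 30.0), (101, 2, 20.0), (102, 1, 15.0);
""")

# Two joins are needed to reach the brand name: fact -> product -> brands.
turnover_by_brand = conn.execute("""
    SELECT b.brand_name, SUM(f.turnover)
    FROM fact_sales AS f
    JOIN product AS p ON p.product_id = f.product_id
    JOIN brands  AS b ON b.brand_id = p.brand_id
    GROUP BY b.brand_name
    ORDER BY b.brand_name
""").fetchall()
```

Each further level of normalization adds another join of this kind to every query that groups by the moved attribute.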
Fact constellation schema architecture
The main disadvantage of the fact constellation schema is its more complicated design,
because many variants of aggregation must be considered.
In a fact constellation schema, different fact tables are explicitly assigned to the
dimensions that are relevant for the given facts. This may be useful when
some facts are associated with a given dimension level and other facts with a deeper
dimension level.
This model is reasonable when, for example, there is a sales fact table
(with details down to the exact date and invoice header id) and a sales forecast
fact table that is calculated by month, client id, and product id.
In that case, two fact tables at different levels of grouping are realized
through a fact constellation model.
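The sales and forecast example above can be sketched with sqlite3 as two fact tables that share the client and product dimensions but sit at different levels of detail. All table names, column names, and figures are assumptions made for the illustration. The detailed sales rows are first rolled up to the monthly level so that they can be compared with the forecast.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Shared dimensions (hypothetical).
CREATE TABLE dim_client  (client_id  INTEGER PRIMARY KEY, client_name  TEXT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);

-- Detailed fact: one row per invoice line, down to the exact date.
CREATE TABLE fact_sales (
    sale_date         TEXT,
    invoice_header_id INTEGER,
    client_id         INTEGER REFERENCES dim_client(client_id),
    product_id        INTEGER REFERENCES dim_product(product_id),
    turnover          REAL
);

-- Coarser fact: one row per month, client and product.
CREATE TABLE fact_sales_forecast (
    month             TEXT,
    client_id         INTEGER REFERENCES dim_client(client_id),
    product_id        INTEGER REFERENCES dim_product(product_id),
    forecast_turnover REAL,
    PRIMARY KEY (month, client_id, product_id)
);

INSERT INTO dim_client  VALUES (10, 'Client A');
INSERT INTO dim_product VALUES (1, 'Widget');
INSERT INTO fact_sales VALUES
    ('2024-01-05', 1001, 10, 1, 30.0),
    ('2024-01-20', 1002, 10, 1, 20.0),
    ('2024-02-03', 1003, 10, 1, 10.0);
INSERT INTO fact_sales_forecast VALUES
    ('2024-01', 10, 1, 45.0),
    ('2024-02', 10, 1, 25.0),
    ('2024-03', 10, 1, 30.0);
""")

# Roll the detailed sales up to month level, then compare with the forecast.
forecast_vs_actual = conn.execute("""
    WITH monthly_sales AS (
        SELECT substr(sale_date, 1, 7) AS month, client_id, product_id,
               SUM(turnover) AS actual_turnover
        FROM fact_sales
        GROUP BY substr(sale_date, 1, 7), client_id, product_id
    )
    SELECT fc.month, fc.client_id, fc.product_id,
           fc.forecast_turnover, COALESCE(ms.actual_turnover, 0.0)
    FROM fact_sales_forecast AS fc
    LEFT JOIN monthly_sales AS ms
           ON ms.month = fc.month
          AND ms.client_id = fc.client_id
          AND ms.product_id = fc.product_id
    ORDER BY fc.month, fc.client_id, fc.product_id
""").fetchall()
```

The left join keeps forecast rows for months with no sales yet, which is one of the aggregation variants the design has to account for.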
Data mart
Data marts are designed to fulfill the role of strategic decision support for
managers responsible for a specific business area.
A data warehouse operates on an enterprise level and contains all data used for
reporting and analysis, while a data mart is used by a specific business department
and is focused on a specific subject (business area).
A scheduled ETL process populates data marts with subject-specific data
warehouse information.
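As a rough sketch of such a refresh job, the function below rebuilds a small data mart table from a warehouse fact table using sqlite3. The names dw_fact_sales, mart_monthly_sales, and refresh_sales_mart are hypothetical, and the scheduling itself (for example, a nightly job that calls the function) is assumed to happen outside this code.

```python
import sqlite3


def refresh_sales_mart(conn):
    """Rebuild a subject-specific, summarized mart table from the warehouse."""
    conn.execute("DROP TABLE IF EXISTS mart_monthly_sales")
    conn.execute("""
        CREATE TABLE mart_monthly_sales AS
        SELECT substr(sale_date, 1, 7) AS month,
               product_id,
               SUM(quantity) AS quantity,
               SUM(turnover) AS turnover
        FROM dw_fact_sales
        GROUP BY substr(sale_date, 1, 7), product_id
    """)
    conn.commit()


# Stand-in for the enterprise data warehouse (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dw_fact_sales (
    sale_date  TEXT,
    invoice_id INTEGER,
    product_id INTEGER,
    quantity   INTEGER,
    turnover   REAL
);
INSERT INTO dw_fact_sales VALUES
    ('2024-01-05', 1001, 1, 3, 30.0),
    ('2024-01-20', 1002, 1, 2, 20.0),
    ('2024-02-03', 1003, 2, 1, 15.0);
""")

refresh_sales_mart(conn)
mart_rows = conn.execute(
    "SELECT month, product_id, quantity, turnover "
    "FROM mart_monthly_sales ORDER BY month, product_id"
).fetchall()
```

Because the mart is derived only from the warehouse table, rerunning the function keeps it consistent with the warehouse contents at the time of the refresh.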
The typical approach for maintaining a data warehouse environment with data marts
is to have one Enterprise Data Warehouse which comprises divisional and regional
data warehouse instances together with a set of dependent data marts which derive
the information directly from the data warehouse.
It is crucial to keep data marts consistent with the enterprise-wide data warehouse
system as this will ensure that they are properly defined, constituted and managed.
Otherwise the DW environment mission of being "the single version of the truth"
becomes a myth. However, in data warehouse systems there are cases where
developing an independent data mart is the only way to get the required figures out
of the DW environment. The development of independent data marts, which are not 100%
reconciled with the data warehouse environment and in most cases include a
supplementary source of data, must be clearly understood and all the associated
risks must be identified.
Data marts are usually maintained in the same environment as the data warehouse
(such as Oracle, Teradata, MS SQL Server, or SAS) and are smaller in size. Because data
marts are organized as one dimensional models, by far the most popular ways to
deliver data marts are to make them available as relational tables, as text files
containing the data, or as OLAP cubes.
In the next step, the data from data marts is usually presented by a reporting or
analysis tool, such as Hyperion, Cognos PowerPlay, Business Objects, Pentaho BI,
Microsoft Excel, or any other.
Usually, a company maintains multiple data marts serving the needs of finance,
marketing, sales, operations, IT, and other departments as needed.
Example use of data marts in an organization: CRM reporting, customer migration
analysis, production planning, monitoring of marketing campaigns, performance
indicators, internal ratings and scoring, risk management, integration with other
systems (systems which use the processed DW data) and many more uses specific to
the business.
Reporting
The problem is that the report generation process is not particularly interesting from
the IT point of view, as it does not involve heavy data processing and manipulation
tasks. IT professionals tend not to pay much attention to this BI area, as they
consider it 'look and feel' rather than the 'real heavy stuff'.
On the other hand, business users' lack of technical exposure usually
makes the report design process too complicated for them.
The conclusion is that the key to success in reporting (and in the whole BI
environment) is collaboration between business and IT professionals.
Types of reports
Reporting is a broad BI category, and there are plenty of options and modes for report
generation, definition, design, formatting, and propagation.
Reporting platforms
The most widely used reporting platforms:
Cognos
Oracle Hyperion
Business Objects + Crystal Reports (SAP)
MicroStrategy
Microsoft BI
SAS
BIRT - open source Business Intelligence and Reporting Tools Project
Pentaho OSBI Reporting