Data Warehouse

The key takeaways are that a data warehouse contains historical data from operational systems for analysis, and is designed differently from OLTP databases with denormalized tables. Common data warehouse systems include Teradata, Oracle, SQL Server and IBM DB2.

The main components of a data warehouse are the data extract from source systems, data validation/cleansing, transformation through ETL processes, and loading into dimensional models like star schemas for analysis.

The database in a data warehouse is de-normalized into dimension and fact tables for analysis, while an OLTP database is optimized for speed of transactions. A data warehouse is refreshed periodically while an OLTP database handles real-time transactions.

Data warehouse

A data warehouse is a centrally managed, integrated database containing data from
the operational sources in an organization (such as SAP, CRM or ERP systems). It may
also gather manual inputs from users, who determine criteria and parameters for
grouping or classifying records.
The database contains structured data for query and analysis and can be accessed by
users. The data warehouse can be created or updated at any time with minimum
disruption to operational systems, which is ensured by the strategy implemented in
the ETL process.

The source for the data warehouse is a data extract from operational databases. The
data is validated, cleansed, transformed and finally aggregated, after which it is
ready to be loaded into the data warehouse.
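The sequence of steps above (validate, cleanse, transform, aggregate, load) can be sketched roughly as follows. This is only an illustration using Python's built-in sqlite3 module; the sample rows, the function names and the `monthly_sales` table are assumptions made for the example and do not come from any particular ETL tool.

```python
import sqlite3

# Hypothetical rows as they might arrive from an operational extract.
raw_rows = [
    {"invoice_date": "2024-01-15", "customer": " acme ", "amount": "100.50"},
    {"invoice_date": "2024-01-20", "customer": "ACME", "amount": "49.50"},
    {"invoice_date": "2024-02-03", "customer": "Globex", "amount": "not-a-number"},
    {"invoice_date": "2024-02-10", "customer": "Globex", "amount": "200.00"},
]

def validate(row):
    """Accept only rows whose amount parses as a number."""
    try:
        float(row["amount"])
        return True
    except ValueError:
        return False

def cleanse(row):
    """Normalize customer names: trim whitespace and upper-case."""
    return {**row, "customer": row["customer"].strip().upper()}

def transform(row):
    """Derive a YYYY-MM month key and convert the amount to a float."""
    return (row["invoice_date"][:7], row["customer"], float(row["amount"]))

def aggregate(rows):
    """Sum amounts per (month, customer) to reduce the row count."""
    totals = {}
    for month, customer, amount in rows:
        totals[(month, customer)] = totals.get((month, customer), 0.0) + amount
    return [(m, c, t) for (m, c), t in sorted(totals.items())]

def load(conn, rows):
    """Write the aggregated rows into the (hypothetical) warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS monthly_sales "
        "(month TEXT, customer TEXT, turnover REAL)"
    )
    conn.executemany("INSERT INTO monthly_sales VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
clean = [transform(cleanse(r)) for r in raw_rows if validate(r)]
load(conn, aggregate(clean))
loaded = conn.execute(
    "SELECT month, customer, turnover FROM monthly_sales ORDER BY month"
).fetchall()
print(loaded)
```

Here four extracted rows end up as two warehouse rows: the invalid row is rejected and the two ACME invoices for January are summed into one record.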
A data warehouse is a dedicated database which contains detailed, stable, non-volatile
and consistent data which can be analyzed over time (it is time-variant).
Where only a portion of the detailed data is required, it may be worth
considering a data mart. A data mart is generated from the data warehouse
and contains data focused on a given subject, typically data that is frequently
accessed or summarized.

Business Intelligence - Data Warehouse - ETL:

Keeping the data warehouse filled with very detailed and poorly selected data
may grow the database to a huge size, making it difficult to manage and practically
unusable. To significantly reduce the number of rows in the data warehouse, the data is
aggregated, which makes maintenance easier and browsing and analysis more efficient.
A well designed and maintained data warehouse can significantly improve the quality
and accessibility of company data and increase the amount of information delivered
to the end users.

Key data warehouse systems and the most widely used database engines for storing and
serving data for enterprise business intelligence and performance management:
• Teradata
• Oracle
• Microsoft SQL Server
• IBM DB2
• SAS

Data Warehouse Architecture

The main difference between the database architecture of a standard on-line
transaction processing (OLTP) system, usually an ERP or CRM system, and that of a
data warehouse is that the warehouse's relational model is usually de-normalized into
dimension and fact tables, which are typical of a data warehouse database design.
The differences between the two architectures follow from their different purposes.

In a typical OLTP system database performance is crucial, as end-user
interface responsiveness is one of the most important factors determining the
usefulness of the application. That kind of database needs to handle inserting
thousands of new records every hour. To achieve this, the database is usually
optimized for the speed of inserts, updates and deletes and for holding as few records
as possible. So from a technical point of view most of the SQL queries issued will be
INSERT, UPDATE and DELETE statements.

In contrast to OLTP systems, a data warehouse is a system that should answer
almost any question regarding company performance measures.
The information delivered from a data warehouse is usually used by people who are
in charge of making decisions. So the information should be accessible quickly and
easily, but it doesn't need to be the most recent possible or at the lowest level of
detail.
Data warehouses are usually refreshed on a daily basis (very often the ETL
processes run overnight) or once a month (data is available to the end users around
the 5th working day of a new month). Very often the two approaches are combined.

The main challenge of a data warehouse architecture is to give business end users
read-only access to historical, summarized data. Again, from a
technical standpoint most SQL queries will start with a SELECT statement.
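The contrast between the two workloads can be illustrated with a small sketch. It uses Python's built-in sqlite3 module with a single in-memory database standing in for both systems; the `orders` and `sales_history` tables and their contents are assumptions made purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# OLTP side: a hypothetical orders table that takes small, frequent writes.
conn.execute(
    "CREATE TABLE orders "
    "(order_id INTEGER PRIMARY KEY, customer TEXT, status TEXT, amount REAL)"
)
conn.execute("INSERT INTO orders VALUES (1, 'ACME', 'open', 120.0)")
conn.execute("INSERT INTO orders VALUES (2, 'Globex', 'open', 80.0)")
conn.execute("INSERT INTO orders VALUES (3, 'ACME', 'open', 40.0)")
conn.execute("UPDATE orders SET status = 'shipped' WHERE order_id = 1")
conn.execute("DELETE FROM orders WHERE order_id = 2")
open_orders = conn.execute("SELECT * FROM orders ORDER BY order_id").fetchall()

# Warehouse side: a hypothetical history table, written only by the
# periodic ETL load and otherwise queried read-only by end users.
conn.execute(
    "CREATE TABLE sales_history (year INTEGER, customer TEXT, turnover REAL)"
)
conn.executemany(
    "INSERT INTO sales_history VALUES (?, ?, ?)",
    [
        (2022, "ACME", 1000.0),
        (2023, "ACME", 1500.0),
        (2022, "Globex", 700.0),
        (2023, "Globex", 900.0),
    ],
)

# A typical analytical question: total turnover per customer over all years.
report = conn.execute(
    "SELECT customer, SUM(turnover) FROM sales_history "
    "GROUP BY customer ORDER BY customer"
).fetchall()
print(open_orders)
print(report)
```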

In data warehouse environments, the relational model can be transformed into the
following architectures:

• Star schema
• Snowflake schema
• Constellation schema
Star schema architecture
Star schema architecture is the simplest data warehouse design. The main feature of
a star schema is a table at the center, called the fact table, surrounded by dimension
tables which allow browsing of specific categories, summarizing, drill-downs and
specifying criteria.
Typically, most of the fact tables in a star schema are in third normal form,
while dimension tables are de-normalized (second normal form).
Although the star schema is the simplest data warehouse architecture, it
is the one most commonly used in data warehouse implementations across the world
today (about 90-95% of cases).

Fact table
The fact table is not a typical relational database table as it is de-normalized on
purpose - to enhance query response times. The fact table typically contains records
that are ready to explore, usually with ad hoc queries. Records in the fact table are
often referred to as events, due to the time-variant nature of a data warehouse
environment.
The primary key for the fact table is a composite of all the columns except numeric
values / scores (like QUANTITY, TURNOVER, exact invoice date and time).

Typical fact tables in a global enterprise data warehouse are listed below (there are
usually additional company or business specific fact tables):

• sales fact table - contains all details regarding sales
• orders fact table - in some cases the table can be split into open orders and historical
orders. Sometimes the values for historical orders are stored in a sales fact table.
• budget fact table - usually grouped by month and loaded once at the end of a year.
• forecast fact table - usually grouped by month and loaded daily, weekly or monthly.
• inventory fact table - reports stock levels, usually refreshed daily

Dimension table
Nearly all of the information in a typical fact table is also present in one or more
dimension tables. The main purpose of maintaining Dimension Tables is to allow
browsing the categories quickly and easily.
The primary keys of each of the dimension tables are linked together to form the
composite primary key of the fact table. In a star schema design, there is only
one de-normalized table for a given dimension.

Typical dimension tables in a data warehouse are:

• time dimension table
• customers dimension table
• products dimension table
• key account managers (KAM) dimension table
• sales office dimension table
Star schema example
An example of a star schema architecture is depicted below.
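As a minimal sketch, a star schema of the kind described above might be declared and queried as follows, using Python's built-in sqlite3 module. The table names (`fact_sales`, `dim_time`, `dim_customer`, `dim_product`), the columns and the sample rows are illustrative assumptions, not a prescribed layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One de-normalized table per dimension.
CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, country TEXT);
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY, name TEXT, brand TEXT, category TEXT
);

-- Fact table at the center: the dimension keys form the composite
-- primary key, the remaining columns are numeric measures.
CREATE TABLE fact_sales (
    time_id INTEGER REFERENCES dim_time(time_id),
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    quantity INTEGER,
    turnover REAL,
    PRIMARY KEY (time_id, customer_id, product_id)
);

INSERT INTO dim_time VALUES (1, 2024, 1), (2, 2024, 2);
INSERT INTO dim_customer VALUES (1, 'ACME', 'US'), (2, 'Globex', 'DE');
INSERT INTO dim_product VALUES
    (1, 'Widget', 'Brand A', 'Hardware'),
    (2, 'Gadget', 'Brand B', 'Hardware');
INSERT INTO fact_sales VALUES
    (1, 1, 1, 10, 100.0),
    (1, 2, 2, 5, 75.0),
    (2, 1, 2, 2, 30.0),
    (2, 2, 1, 4, 40.0);
""")

# Turnover per month and customer country: one join per dimension used.
turnover_by_month_country = conn.execute("""
    SELECT t.month, c.country, SUM(f.turnover)
    FROM fact_sales f
    JOIN dim_time t ON t.time_id = f.time_id
    JOIN dim_customer c ON c.customer_id = f.customer_id
    GROUP BY t.month, c.country
    ORDER BY t.month, c.country
""").fetchall()
print(turnover_by_month_country)
```

Each query touches the fact table plus only the dimensions it needs, which is what keeps star schema queries comparatively simple.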

Snowflake Schema architecture

Snowflake schema architecture is a more complex variation of a star schema design.
The main difference is that dimension tables in a snowflake schema are
normalized, so they have a typical relational database design.

Snowflake schemas are generally used when a dimension table becomes very big
and when a star schema can't represent the complexity of a data structure. For
example, if a PRODUCT dimension table contains millions of rows, the use of a
snowflake schema should significantly improve performance by moving some
data out to another table (BRANDS, for instance).
The problem is that the more normalized the dimension tables are, the more
complicated the SQL joins needed to query them. To answer a
query, many tables need to be joined and aggregates generated.
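The PRODUCT and BRANDS example can be sketched as below, again with Python's built-in sqlite3 module. The table and column names are assumptions for illustration; the point is only that brand attributes move into their own table and that a brand-level query then needs one more join than in a star schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Brand attributes are normalized out of the product dimension.
CREATE TABLE dim_brand (brand_id INTEGER PRIMARY KEY, brand_name TEXT);
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    product_name TEXT,
    brand_id INTEGER REFERENCES dim_brand(brand_id)
);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    turnover REAL
);

INSERT INTO dim_brand VALUES (1, 'Brand A'), (2, 'Brand B');
INSERT INTO dim_product VALUES (1, 'Widget', 1), (2, 'Gadget', 2), (3, 'Gizmo', 1);
INSERT INTO fact_sales VALUES (1, 100.0), (2, 75.0), (3, 25.0), (1, 50.0);
""")

# Fact table -> product dimension -> brand table: two joins instead of one.
turnover_by_brand = conn.execute("""
    SELECT b.brand_name, SUM(f.turnover)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_brand b ON b.brand_id = p.brand_id
    GROUP BY b.brand_name
    ORDER BY b.brand_name
""").fetchall()
print(turnover_by_brand)
```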

Fact constellation schema architecture

For each star schema or snowflake schema it is possible to construct a fact
constellation schema. This schema is more complex than the star or snowflake
architecture because it contains multiple fact tables. This allows dimension
tables to be shared amongst many fact tables.
This solution is very flexible, but it may be hard to manage and support.

The main disadvantage of the fact constellation schema is a more complicated design,
because many variants of aggregation must be considered.

In a fact constellation schema, different fact tables are explicitly assigned to the
dimensions that are relevant for the given facts. This may be useful when
some facts are associated with a given dimension level and other facts with a deeper
dimension level.
This model is reasonable when, for example, there is a sales fact table
(with details down to the exact date and invoice header id) and a fact table with
a sales forecast which is calculated by month, client id and product id.
In that case a fact constellation model lets the two fact tables exist at different
levels of grouping.
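The sales versus forecast case can be sketched as follows with Python's built-in sqlite3 module. All table names, columns and figures are illustrative assumptions: `fact_sales` is kept at invoice and date level, `fact_forecast` at month level, and both share the same customer and product dimensions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimensions shared by both fact tables.
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);

-- Detailed facts: one row per invoice line, with the exact date.
CREATE TABLE fact_sales (
    invoice_id INTEGER, invoice_date TEXT,
    customer_id INTEGER, product_id INTEGER, turnover REAL
);

-- Coarser facts: one row per month, customer and product.
CREATE TABLE fact_forecast (
    month TEXT, customer_id INTEGER, product_id INTEGER,
    forecast_turnover REAL,
    PRIMARY KEY (month, customer_id, product_id)
);

INSERT INTO dim_customer VALUES (1, 'ACME');
INSERT INTO dim_product VALUES (1, 'Widget');
INSERT INTO fact_sales VALUES
    (1001, '2024-01-05', 1, 1, 100.0),
    (1002, '2024-01-20', 1, 1, 60.0),
    (1003, '2024-02-11', 1, 1, 90.0);
INSERT INTO fact_forecast VALUES
    ('2024-01', 1, 1, 150.0),
    ('2024-02', 1, 1, 120.0);
""")

# Roll the detailed sales up to month level, then compare with the forecast.
forecast_vs_actual = conn.execute("""
    SELECT fc.month, c.name, p.name, fc.forecast_turnover, a.actual
    FROM fact_forecast fc
    JOIN (
        SELECT substr(invoice_date, 1, 7) AS month,
               customer_id, product_id, SUM(turnover) AS actual
        FROM fact_sales
        GROUP BY substr(invoice_date, 1, 7), customer_id, product_id
    ) a ON a.month = fc.month
       AND a.customer_id = fc.customer_id
       AND a.product_id = fc.product_id
    JOIN dim_customer c ON c.customer_id = fc.customer_id
    JOIN dim_product p ON p.product_id = fc.product_id
    ORDER BY fc.month
""").fetchall()
print(forecast_vs_actual)
```

The extra roll-up step in the subquery is an instance of the aggregation variants that make this design harder to maintain.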
Data mart

Data marts are designed to fulfill the role of strategic decision support for
managers responsible for a specific business area.

A data warehouse operates on an enterprise level and contains all data used for
reporting and analysis, while a data mart is used by a specific business department
and is focused on a specific subject (business area).
A scheduled ETL process populates data marts with the subject-specific data
warehouse information.
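A dependent data mart refresh of this kind could look roughly like the sketch below, written with Python's built-in sqlite3 module. The warehouse table `dw_sales`, the mart table `mart_marketing_monthly`, the `business_area` column and the function name are hypothetical; in practice the function would be invoked by whatever scheduler runs the ETL jobs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical enterprise-level warehouse table.
CREATE TABLE dw_sales (
    sale_date TEXT, business_area TEXT, region TEXT, turnover REAL
);
INSERT INTO dw_sales VALUES
    ('2024-01-05', 'marketing', 'EMEA', 100.0),
    ('2024-01-09', 'finance',   'EMEA', 300.0),
    ('2024-01-17', 'marketing', 'APAC', 50.0),
    ('2024-02-02', 'marketing', 'EMEA', 70.0);
""")

def refresh_marketing_mart(conn):
    """Rebuild the mart from the warehouse: one subject, summarized by month."""
    conn.executescript("""
    DROP TABLE IF EXISTS mart_marketing_monthly;
    CREATE TABLE mart_marketing_monthly AS
        SELECT substr(sale_date, 1, 7) AS month, region,
               SUM(turnover) AS turnover
        FROM dw_sales
        WHERE business_area = 'marketing'
        GROUP BY substr(sale_date, 1, 7), region;
    """)

refresh_marketing_mart(conn)
mart_rows = conn.execute(
    "SELECT month, region, turnover FROM mart_marketing_monthly "
    "ORDER BY month, region"
).fetchall()
print(mart_rows)
```

Because the mart is derived only from the warehouse table, its figures stay reconcilable with the warehouse.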

The typical approach for maintaining a data warehouse environment with data marts
is to have one Enterprise Data Warehouse which comprises divisional and regional
data warehouse instances together with a set of dependent data marts which derive
the information directly from the data warehouse.

It is crucial to keep data marts consistent with the enterprise-wide data warehouse
system as this will ensure that they are properly defined, constituted and managed.
Otherwise the DW environment mission of being "the single version of the truth"
becomes a myth. However, in data warehouse systems there are cases where
developing an independent data mart is the only way to get the required figures out
of the DW environment. Developing independent data marts, which are not 100%
reconciled with the data warehouse environment and in most cases include a
supplementary source of data, must be clearly understood and all the associated
risks must be identified.

Data marts are usually maintained in the same environment as the data warehouse
(like Oracle, Teradata, MS SQL Server, SAS) and are smaller in size. Because data
marts are organized as a single dimensional model, by far the most popular ways to
deliver them are to make them available as relational tables, as text files
with the data, or as OLAP cubes.

In the next step, the data from data marts is usually presented by a reporting or
analysis tool, such as Hyperion, Cognos PowerPlay, Business Objects, Pentaho BI,
Microsoft Excel or any other.

Usually, a company maintains multiple data marts serving the needs of finance,
marketing, sales, operations, IT and other departments as needed.
Example uses of data marts in an organization: CRM reporting, customer migration
analysis, production planning, monitoring of marketing campaigns, performance
indicators, internal ratings and scoring, risk management, integration with other
systems (systems which use the processed DW data) and many more uses specific to
the business.

Reporting

A successful reporting platform in a business intelligence environment requires great
attention from both the business end users and IT professionals.
The fact is that the reporting layer is what business users might consider to be the
data warehouse system, and if they do not like it, they will not use it, even though it
might be a perfectly maintained data warehouse with high-quality data, stable and
optimized ETL processes and faultless operation. It will simply be useless for them,
and thus useless for the whole organization.

The problem is that the report generation process is not particularly interesting from
the IT point of view, as it does not involve heavy data processing and manipulation
tasks. IT professionals tend not to pay great attention to this BI area, as they
consider it 'look and feel' rather than the 'real heavy stuff'.
On the other hand, the lack of technical exposure of the business users usually
makes the report design process too complicated for them.
The conclusion is that the key to success in reporting (and the whole BI
environment) is collaboration between the business and IT professionals.
Types of reports
Reporting is a broad BI category and there are plenty of options and modes for report
generation, definition, design, formatting and propagation.

Standard, static report
• Subject oriented, reported data defined precisely before creation
• Reports with fixed layout defined by a report designer when the report is created
• Very often the static reports contain subreports and perform calculations or implement
advanced functions
• Generated either on request by an end user or refreshed periodically from a scheduler
• Usually made available on a web server or a shared drive
Sample applications: Cognos Report Studio, Crystal Reports, BIRT

Ad-hoc report
• Simple reports created by the end users on demand
• Designed from scratch or using a standard report as a template
Sample applications: Cognos Analysis Studio

Interactive, multidimensional OLAP report
• Usually provide more general information - using dynamic drill-down, slicing, dicing and
filtering, users can get the information they need
• Reports with fixed design defined by a report designer
• Generated either on request by an end user or refreshed periodically from a scheduler
• Usually made available on a web server or a shared drive
Sample applications: Cognos PowerPlay, Business Objects, Pentaho Mondrian
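
Drill-down and slicing, as used in the OLAP reports above, can be approximated with plain SQL aggregation at different levels. The sketch below uses Python's built-in sqlite3 module rather than a real OLAP engine, and the `sales` table and its figures are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (year INTEGER, quarter TEXT, region TEXT, turnover REAL);
INSERT INTO sales VALUES
    (2024, 'Q1', 'EMEA', 100.0),
    (2024, 'Q1', 'APAC', 40.0),
    (2024, 'Q2', 'EMEA', 120.0),
    (2024, 'Q2', 'APAC', 60.0);
""")

# Summary level: total turnover per year.
by_year = conn.execute(
    "SELECT year, SUM(turnover) FROM sales GROUP BY year ORDER BY year"
).fetchall()

# Drill-down: the same measure one level deeper, per year and quarter.
by_quarter = conn.execute(
    "SELECT year, quarter, SUM(turnover) FROM sales "
    "GROUP BY year, quarter ORDER BY year, quarter"
).fetchall()

# Slice: fix one dimension value (region = 'EMEA') and keep the rest.
emea_slice = conn.execute(
    "SELECT year, quarter, SUM(turnover) FROM sales "
    "WHERE region = ? GROUP BY year, quarter ORDER BY year, quarter",
    ("EMEA",),
).fetchall()
print(by_year, by_quarter, emea_slice)
```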
Dashboard
• Contain high-level, aggregated company strategic data with comparisons and performance
indicators
• Include both static and interactive reports
• Lots of graphics, charts, gauges and illustrations
Sample applications: Pentaho Dashboards, Oracle Hyperion, Microsoft SharePoint Server, Cognos
Connection Portal

Write-back report
These are interactive reports directly linked to the data warehouse which allow modification
of the data warehouse data.
By far the most common uses of this kind of report are:
• Editing and customizing products and customers grouping
• Entering budget figures, forecasts, rebates
• Setting sales targets
• Refining business relevant data
Sample applications: Cognos Planning, SAP, Microsoft Access and Excel

Technical report
This group of reports is usually generated to fulfill the needs of the following areas:
• IT technical reports for monitoring the BI system, generating execution performance
statistics, data volumes, system workload, user activity etc.
• Data quality reports - which are an input for business analysts to the data cleansing process
• Metadata reports - for system analysts and data modelers
• Extracts for other systems - formatted in a specific way
Usually generated in CSV or Microsoft Excel format
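
A data quality report delivered as CSV, as mentioned above, might be produced along these lines. This is a minimal sketch using Python's built-in sqlite3 and csv modules; the `fact_sales` table, the two checks and the column headers are hypothetical examples rather than a standard set.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_sales (invoice_id INTEGER, customer_id INTEGER, turnover REAL);
INSERT INTO fact_sales VALUES
    (1, 10, 100.0),
    (2, NULL, 55.0),
    (3, 11, NULL),
    (4, 12, 20.0);
""")

# Each hypothetical check counts the rows that violate one quality rule.
checks = {
    "missing_customer_id":
        "SELECT COUNT(*) FROM fact_sales WHERE customer_id IS NULL",
    "missing_turnover":
        "SELECT COUNT(*) FROM fact_sales WHERE turnover IS NULL",
}

# Write the report into an in-memory buffer; a real job would write a file.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["check", "failed_rows"])
for name, sql in checks.items():
    writer.writerow([name, conn.execute(sql).fetchone()[0]])

report_csv = buffer.getvalue()
print(report_csv)
```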

Reporting platforms
The most widely used reporting platforms:

• Cognos
• Oracle Hyperion
• Business Objects + Crystal Reports (SAP)
• MicroStrategy
• Microsoft BI
• SAS
• BIRT - open source Business Intelligence and Reporting Tools Project
• Pentaho OSBI Reporting
