An Introduction To Data Warehousing: System Services Corporation, Chicago, Illinois
August 1997
This white paper is adapted from the forthcoming book Data Warehousing with MS SQL Server.
The skyrocketing power of hardware and software, along with the availability of affordable and easy-to-use reporting and
analysis tools, has played the most important role in the evolution of data warehouses. Figure 1 highlights the technological
revolution that has greatly impacted data warehousing.
[Figure 1. Labels: Memory power; Disk, CPU power; Desktop power and ease; Server power; Hardware prices; Software prices.]
1.3.2 Global corporation
The fall of communism and the liberalization of Asian and South American economies have changed the business climate
worldwide forever. Competition from emerging economies has forced large corporations to become lean and efficient. The
emergence of this global economy has led to the migration of manufacturing to less expensive and less restrictive countries.
Former communist and South American countries present very exciting and challenging business opportunities. Along with
these opportunities they present a very volatile business climate and economies that are nearly impossible to predict.
Businesses have not only focused on building products worldwide, but they have also changed their organization to sell
products around the globe. Trade agreements such as NAFTA and the EEC greatly impact the decisions to enter markets or
build factories. This globalization of business has increased the need not merely for more continuous analysis, but also for
managing data in a centralized location. The process of rolling up manufacturing and sales data from far-flung business units
has now started to impact a much larger number of corporations. Businesses now need to continuously make “build or
buy” decisions. Globalization of business has made the consolidation of data in a central data warehouse more complicated.
Factors such as currency fluctuations and product customization for different markets have added complexity to data
warehousing, making the analysis much more complicated. Imagine trying to assess the profitability of products built and
sold in multiple countries with volatile currencies, or attempting to hedge the risk of a downturn in economies that have
been expanding rapidly for extended periods.
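The difficulty of assessing profitability across volatile currencies can be illustrated with a small calculation. The sketch
below is illustrative only: the product, the currency codes, the exchange rates, the amounts, and the function name
`profit_in_base_currency` are all made-up assumptions, not figures from any real business.

```python
from decimal import Decimal

# Hypothetical monthly exchange rates: units of the base currency (USD) per
# one unit of local currency. All numbers are invented for illustration.
RATES_TO_USD = {
    ("MXN", "1997-01"): Decimal("0.128"),
    ("MXN", "1997-02"): Decimal("0.125"),
    ("BRL", "1997-01"): Decimal("0.960"),
    ("BRL", "1997-02"): Decimal("0.950"),
}

# Hypothetical sales facts: (product, currency, period, revenue, cost),
# with revenue and cost recorded in the local currency.
SALES = [
    ("widget", "MXN", "1997-01", Decimal("1000"), Decimal("800")),
    ("widget", "MXN", "1997-02", Decimal("1000"), Decimal("800")),
    ("widget", "BRL", "1997-01", Decimal("200"), Decimal("150")),
    ("widget", "BRL", "1997-02", Decimal("200"), Decimal("150")),
]


def profit_in_base_currency(sales, rates):
    """Sum profit per product after converting each row at its period's rate."""
    totals = {}
    for product, currency, period, revenue, cost in sales:
        rate = rates[(currency, period)]
        totals[product] = totals.get(product, Decimal("0")) + (revenue - cost) * rate
    return totals


profit = profit_in_base_currency(SALES, RATES_TO_USD)
print(profit)
```

Even though the local-currency profit is identical in both months, the converted profit differs from month to month, which
is why the warehouse needs to keep the exchange rate that applied to each period alongside the sales data.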
•Emergence of global economy
•Economic downturns in United States, Europe, Japan
•Liberalization of Asian, South American, former Communist economies
•Compelling standard business applications
•Technology savvy business analyst and technology aware management
Section 2: Data warehousing attributes and concepts
[Figure 3. Reasons for separating operational data from analysis data. Operational applications: Product Price/Inventory
(10 second response time, last 10 price changes, last 20 inventory transactions) and Marketing (30 second response time,
last 2 years programs). Data warehouse: weekly product price/inventory and weekly marketing programs (response time 2
seconds to 60 minutes, data is not modified).]
In short, the separation of operational data from the analysis data is the most fundamental data warehousing concept. Not
only is the data stored in a structured manner outside the operational system, businesses today are allocating considerable
resources to build data warehouses at the same time that the operational applications are deployed. Rather than archiving
data to a tape as an afterthought of implementing an operational system, data warehousing systems have become the primary
interface for operational systems. Figure 3 highlights the reasons for separation discussed in this section.
The data warehouse model needs to be extensible and structured such that the data from different applications can be added
as a business case can be made for the data. A data warehouse project in most cases cannot include data from all possible
applications right from the start. Many of the successful data warehousing projects have taken an incremental approach to
adding data from the operational systems and aligning it with the existing data. They start with the objective of eventually
adding most if not all business data to the data warehouse. Keeping this long-term objective in mind, they may begin with
one or two operational applications that provide the most fertile data for business analysis. Figure 4 illustrates the extensible
architecture of the data warehouse.
•Purchased Applications: The application data structure may be dictated by an application that was purchased from a
software vendor and integrated into the business. The user of the application may have very little or no control over the
data model. Some vendor applications have a very generic data model that is designed to accommodate a large number
and types of businesses.
•Legacy Application: The source application may be a very old mostly homegrown application where the data model has
evolved over the years. The database engine in this application may have been changed more than once without anyone
taking the time to fully exploit the features of the new engine. There are many legacy applications in existence today
where the data model is neither well documented nor understood by anyone currently supporting the application.
[Figure 5. Alignment of data warehouse entities with the business structure. Source applications: Order processing (Customer
orders, Product price, Available Inventory); Product Price/Inventory (Product price, Product Inventory, Product Price changes);
Marketing (Customer Profile, Product price, Marketing programs). Data warehouse entities: Customers, Products, Orders,
Product Inventory, Product Price.]
Figure 5 illustrates the alignment of data warehouse entities with the business structure. The data warehouse model breaks
away from the limitations of the source application data models and builds a flexible model that parallels the business
structure. This extensible data model is easy for business analysts and managers to understand.
[Figure 6. Transformation of the operational state information. Labels: Order; Orders (Closed); Inventory system; Inventory
snapshot 1; Inventory snapshot 2; Weekly inventory snapshot.]
•Operational state information is not carried to the data warehouse
•Data is transferred to the data warehouse after all state changes
•Or, data is transferred with periodic snapshots
2.2.4 De-normalization of data
Before we consider data model de-normalization in the context of data warehousing, let us quickly review relational
database concepts and the normalization process. E. F. Codd developed relational database theory in the late 1960s while he
was a researcher at IBM. Many prominent researchers have made significant contributions to this model since its
introduction. Today, most of the popular database platforms follow this model closely. A relational database model is a
collection of two-dimensional tables consisting of rows and columns. In the relational modeling terminology, the tables,
rows, and columns are respectively called relations, tuples, and attributes. The name of the relational database model is
derived from the term relation for a table. The model further identifies unique keys for all tables and describes the
relationship between tables.
Normalization is a relational database modeling process where the relations or tables are progressively decomposed into
smaller relations to a point where all attributes in a relation are very tightly coupled with the primary key of the relation.
Most data modelers try to achieve the “Third Normal Form” with all of the relations before they de-normalize for
performance or other reasons. The three levels of normalization are briefly described below:
• First Normal Form: A relation is said to be in First Normal Form if it describes a single entity and it contains no arrays or
repeating attributes. For example, an order table or relation with multiple line items would not be in First Normal Form
because it would have repeating sets of attributes for each line item. The relational theory would call for separate tables
for order and line items.
• Second Normal Form: A relation is said to be in Second Normal Form if, in addition to the First Normal Form properties,
all non-key attributes are fully dependent on the entire primary key of the relation, not on just a part of it.
• Third Normal Form: A relation is in Third Normal Form if, in addition to Second Normal Form, all non-key attributes are
independent of each other, that is, no non-key attribute depends on another non-key attribute.
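The order and line-item example from the First Normal Form description can be sketched as follows. This is a minimal
illustration using Python's built-in sqlite3 module; the table names, column names, and sample rows are assumptions chosen
for the example and do not come from any particular application.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Not in First Normal Form: a repeating group of attributes per line item.
# An order with a third line item would require adding more columns.
conn.execute("""
    CREATE TABLE order_flat (
        order_id INTEGER PRIMARY KEY,
        customer TEXT,
        item1_product TEXT, item1_qty INTEGER,
        item2_product TEXT, item2_qty INTEGER
    )
""")

# Normalized: separate relations for orders and for line items.
conn.execute("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer TEXT
    )
""")
conn.execute("""
    CREATE TABLE order_line (
        order_id INTEGER REFERENCES orders(order_id),
        line_no INTEGER,
        product TEXT,
        qty INTEGER,
        PRIMARY KEY (order_id, line_no)
    )
""")

conn.execute("INSERT INTO orders VALUES (1, 'Acme')")
conn.executemany(
    "INSERT INTO order_line VALUES (?, ?, ?, ?)",
    [(1, 1, "bolt", 100), (1, 2, "nut", 250), (1, 3, "washer", 75)],
)

# Any number of line items per order now fits without changing the schema.
line_count = conn.execute(
    "SELECT COUNT(*) FROM order_line WHERE order_id = 1"
).fetchone()[0]
print(line_count)
```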
The process of normalization generally breaks a table into many independent tables. While a fully normalized database can
yield a very flexible model, it generally makes the data model more complex and difficult to follow. Further, a fully
normalized data model can perform very inefficiently. A data modeler in an operational system would take a normalized
logical data model and convert it into a physical data model that is significantly de-normalized. De-normalization reduces
the need for database table joins in the queries.
Some of the reasons for de-normalizing the data warehouse model are the same as they would be for an operational system,
namely, performance and simplicity. Data normalization in relational databases provides considerable flexibility at the
cost of performance. This performance cost is sharply increased in a data warehousing system because the amount of
data involved may be much larger. A three-way join with relatively small tables of an operational system may be acceptable
in terms of performance cost, but the join may take an unacceptably long time with large tables in the data warehouse system.
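The trade-off between a three-way join and a de-normalized table can be sketched as follows. This is a minimal illustration
using Python's built-in sqlite3 module; the tables `customer`, `product`, `sale`, and `sale_flat` and their contents are
hypothetical, and tables this small show only that both forms answer the same question, not the performance difference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE product  (product_id  INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE sale (
        sale_id INTEGER PRIMARY KEY,
        customer_id INTEGER, product_id INTEGER, amount REAL
    );
    INSERT INTO customer VALUES (1, 'East'), (2, 'West');
    INSERT INTO product  VALUES (10, 'Hardware'), (20, 'Software');
    INSERT INTO sale VALUES
        (100, 1, 10, 50.0), (101, 1, 20, 30.0), (102, 2, 10, 20.0);

    -- De-normalized copy: region and category are repeated on every row
    -- so that analysis queries need no joins.
    CREATE TABLE sale_flat AS
        SELECT s.sale_id, c.region, p.category, s.amount
        FROM sale s
        JOIN customer c ON c.customer_id = s.customer_id
        JOIN product  p ON p.product_id  = s.product_id;
""")

# Normalized model: the analysis query pays for a three-way join.
three_way_join = """
    SELECT c.region, p.category, SUM(s.amount)
    FROM sale s
    JOIN customer c ON c.customer_id = s.customer_id
    JOIN product  p ON p.product_id  = s.product_id
    GROUP BY c.region, p.category
    ORDER BY c.region, p.category
"""

# De-normalized model: the same question is answered from a single table.
flat_query = """
    SELECT region, category, SUM(amount)
    FROM sale_flat
    GROUP BY region, category
    ORDER BY region, category
"""

joined_result = conn.execute(three_way_join).fetchall()
flat_result = conn.execute(flat_query).fetchall()
print(flat_result)
```

The price of the simpler query is redundancy: the region and category values are stored once per sale, so the flat table has
to be rebuilt or updated whenever the source tables change.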
[Figure 7. Logical transformation concepts. Labels: Order processing (Customer, Product); Marketing (Customer Profile,
Product price, Marketing programs); Data.]
The logical transformation of source application data described here requires considerable effort, and it is a very
important early investment toward the development of a successful data warehouse. Figure 7 highlights the logical
transformation concepts discussed in this section.
[Figure 8. Physical transformation concepts. Data from Operational System A and Operational System B is transformed on its
way into the Data Warehouse System (Detailed Data and Summarized Data). Transformations shown:
  cust, cust_id, borrower >> customer ID
  “1” >> “M”
  “2” >> “F”
  Missing >>> “……..”]
Figure 8 highlights the physical transformation concepts for data warehousing systems. Physical transformation of source
application data requires considerable effort and it can be difficult at times, but a well-considered set of physical data
transformations can make a data warehouse more user-friendly. Further, accurate and complete transformations help
maintain the integrity of the data warehouse.
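The kinds of physical transformation shown in Figure 8 (mapping source field names such as cust, cust_id, and borrower to a
single customer ID, mapping codes such as “1” and “2” to “M” and “F”, and supplying a value where data is missing) can be
sketched as follows. This is an illustration under stated assumptions rather than a prescribed implementation: the function
name `transform_record`, the name of the coded field (`gender`), and the `"UNKNOWN"` marker for missing values are all
hypothetical.

```python
# Source-specific field names mapped to one warehouse name (per Figure 8).
FIELD_NAMES = {
    "cust": "customer_id",
    "cust_id": "customer_id",
    "borrower": "customer_id",
}

# Source-specific codes mapped to warehouse codes (per Figure 8).
# Treating this as a "gender" field is an assumption made for the example.
GENDER_CODES = {"1": "M", "2": "F"}

# Hypothetical marker for missing values; the actual default is a design decision.
MISSING = "UNKNOWN"


def transform_record(record, missing=MISSING):
    """Return a copy of a source record using warehouse names and codes."""
    out = {}
    for name, value in record.items():
        out[FIELD_NAMES.get(name, name)] = value
    code = out.get("gender")
    if code is None or code == "":
        out["gender"] = missing
    else:
        # Codes that are not in the mapping pass through unchanged.
        out["gender"] = GENDER_CODES.get(code, code)
    return out


row_a = transform_record({"cust": 42, "gender": "1"})      # as from System A
row_b = transform_record({"borrower": 7, "gender": None})  # as from System B
print(row_a, row_b)
```

Keeping the mappings in plain tables like these, rather than scattered through load programs, makes it easier to check that
every source system is transformed consistently.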
[Figure 9. Summary views. Labels: Detailed Data; Perform business analysis on detail data.]
Summarization and predefined analysis of data in a data warehouse system are important tasks. It is essential to maintain the
integrity of the summary views because a very large part of the data warehouse activity is against the summary views. Figure
9 highlights the key concepts around summary views. The summary views not only need to be designed and built; they also
need to be maintained as new data comes into the data warehouse.
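One way to keep a summary view in step with incoming detail data is to update the affected totals as each batch is loaded
and to periodically compare them with a full recomputation. The sketch below assumes a simple in-memory model; the class
name `SalesSummary`, the (product, month, amount) row layout, and the sample rows are hypothetical.

```python
from collections import defaultdict


class SalesSummary:
    """Running totals per (product, month), kept in step with the detail rows."""

    def __init__(self):
        self.detail = []                  # detailed data
        self.totals = defaultdict(float)  # summary view

    def load(self, rows):
        """Append new detail rows and update only the affected summary cells."""
        for product, month, amount in rows:
            self.detail.append((product, month, amount))
            self.totals[(product, month)] += amount

    def rebuild(self):
        """Recompute the summary from the detail rows as an integrity check."""
        fresh = defaultdict(float)
        for product, month, amount in self.detail:
            fresh[(product, month)] += amount
        return fresh


summary = SalesSummary()
summary.load([("widget", "1997-07", 100.0), ("gadget", "1997-07", 40.0)])
summary.load([("widget", "1997-07", 25.0), ("widget", "1997-08", 60.0)])

# The incrementally maintained view should always match a full recomputation.
print(dict(summary.totals) == dict(summary.rebuild()))
```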
2.5 Definition
After considering the various attributes and concepts of data warehousing systems, a broad definition of a data warehouse can
be the following:
A data warehouse is a structured extensible environment designed for the analysis of non-volatile data, logically and
physically transformed from multiple source applications to align with business structure, updated and maintained
for a long time period, expressed in simple business terms, and summarized for quick analysis.
[Figure 10. Analysis processes against a data warehouse. Data Warehouse System: Summarized Data (predefined reports and
queries; queries against summary data) and Detailed Data (data mining in detail data); Other Applications.]
Figure 10 illustrates the analysis processes that run against a data warehouse. Although a majority of the activity against
today’s data warehouses is simple reporting and analysis, the sophistication of analysis at the high end continues to increase
rapidly. Of course, any analysis run against a data warehouse is simpler and cheaper to run than through the old methods.
This simplicity continues to be a main attraction of data warehousing systems.
Summary
This paper introduced the fundamental concepts of data warehousing. It is important to note that data warehousing is a
science that continues to evolve. Many of the design and development concepts introduced here greatly influence the quality
of the analysis that is possible with data in the data warehouse. If invalid or corrupt data is allowed to get into the data
warehouse, the analysis done with this data is likely to be invalid.
After the rapid acceptance of data warehousing systems during the past three years, there will continue to be many more
enhancements and adjustments to the data warehousing system model. Further evolution of hardware and software
technology will also continue to greatly influence the capabilities that are built into data warehouses.
Data warehousing systems have become a key component of information technology architecture. A flexible enterprise data
warehouse strategy can yield significant benefits for a long period.