An Introduction to Data Warehousing

Data warehousing has quickly evolved into a unique and popular business application class. Early builders of data warehouses already consider their systems to be key components of their IT strategy and architecture. Numerous examples can be cited of highly successful data warehouses developed and deployed for businesses of all sizes and all types. Hardware and software vendors have quickly developed products and services that specifically target the data warehousing market. This paper will introduce key concepts surrounding data warehousing systems.

What is a data warehouse? A simple answer could be that a data warehouse is managed data situated after and outside the operational systems. A complete definition requires discussion of many key attributes of a data warehouse system. Later, in Section 2, we will identify these key attributes and discuss the definition they provide for a data warehouse. Section 3 briefly reviews the activity against a data warehouse system. Initially, in Section 1, however, we will take a brief tour of the traditions of managing data after it passes through the operational systems and the types of analysis generated from this historical data.

Section 1:

Evolution of an application class

This section reviews the historical management of the analysis data and the factors that have led to the evolution of the data warehousing application class.

1.1 Traditional approaches to historical data

In reviewing the development of data warehousing, we need to begin with a review of what had been done with the data before the evolution of data warehouses. Let us first look at how the kind of data that ends up in today's data warehouses had been managed historically.

Throughout the history of systems development, the primary emphasis had been given to the operational systems and the data they process. It is not practical to keep data in the operational systems indefinitely; and only as an afterthought was a structure designed for archiving the data that the operational system has processed. The fundamental requirements of the operational and analysis systems are different: the operational systems need performance, whereas the analysis systems need flexibility and broad scope. It has rarely been acceptable to have business analysis interfere with and degrade performance of the operational systems.

1.1.1 Data from legacy systems

In the 1970's virtually all business system development was done on the IBM mainframe computers using tools such as Cobol, CICS, IMS, DB2, etc. The 1980's brought in the new mini-computer platforms such as AS/400 and VAX/VMS. The late eighties and early nineties made UNIX a popular server platform with the introduction of client/server architecture. Despite all the changes in the platforms, architectures, tools, and technologies, a remarkably large number of business applications continue to run in the mainframe environment of the 1970's. By some estimates, more than 70 percent of business data for large corporations still resides in the mainframe environment. There are many reasons for this. The most important reason, and one that is particularly relevant to our topic, is that over the years these systems have grown to capture the business knowledge and rules that are incredibly difficult to carry to a new platform or application. These systems, generically called legacy systems, continue to be the largest source of data for analysis systems. The data that is stored in DB2, IMS, VSAM, etc. for the transaction systems ends up in large tape libraries in remote data centers.

An institution will generate countless reports and extracts over the years, each designed to extract requisite information out of the legacy systems. In most instances, IS/IT groups assume responsibility for designing and developing programs for these reports and extracts. The time required to generate and deploy these programs frequently turns out to be longer than the end users think they can afford.

1.1.2 Extracted information on the Desktop

During the past decade, the sharply increasing popularity of the personal computer on business desktops has introduced many new options and compelling opportunities for business analysis. The gap between the programmer and end user has started to close as Business Analysts now have at their fingertips many of the tools required to gain proficiency in the use of spreadsheets for analysis and graphic representation. Advanced users will frequently use desktop database programs that allow them to store and work with the information extracted from the legacy sources. Many desktop reporting and analysis tools are increasingly targeted towards end users and have gained considerable popularity on the desktop.

The downside of this model for business analysis is that it leaves the data fragmented and oriented towards very specific needs. Each individual user has obtained only the information that he or she requires. Not being standardized, the extracts are unable to address the requirements of multiple users and uses. The time and cost involved in addressing the requirements of only one user prove prohibitive. This approach to data management assumes the end user has the time to expend on managing the data in the spreadsheets, files, and databases. While many of these users may be proficient at data management, most undertake these tasks as a necessity. And given the choice, most users would find it more efficient to focus on the actual analysis and the tools available to them.

1.1.3 Decision-Support and Executive Information Systems

Another category of popular analysis systems has been decision support systems and executive information systems. Decision support systems tend to focus more on detail and are targeted towards lower to mid-level managers. Executive information systems have generally provided a higher level of consolidation and a multi-dimensional view of the data, as high level executives need the ability to slice and dice the same data more than the ability to drill down to review the data detail. These two similar and overlapping categories are perhaps the closest precursors to the data warehousing systems. Yet the high price of their development and the coordination required for their production made them an elite product that never entered the mainstream. The following are some characteristics generally associated with decision support or executive information systems:
- These systems have data in descriptive standard business terms, rather than in cryptic computer field names. Data names and data structures in these systems are designed for use by non-technical users.
- The data is generally preprocessed with the application of standard business rules such as how to allocate revenue to products, business units, and markets.
- Consolidated views of the data such as product, customer, and market are available.
- Although these systems will at times have the ability to drill down to the detail data, rarely are they able to access all the detail data at the same time.

Today's data warehousing systems provide the analytical tools afforded by their precursors. But their design is no longer derived from the specific requirements of analysts or executives; and, as we will see later, data warehousing systems are most successful when their design aligns with the overall business structure rather than specific requirements.

1.2 Emergence of key enabling technologies

Many factors have influenced the quick evolution of the data warehousing discipline. The most significant set of factors has been the enormous forward movement in the hardware and software technologies. Sharply decreasing prices and the increasing power of computer hardware, coupled with ease of use of today's software, have made possible quick analysis of hundreds of gigabytes of information and business knowledge.

1.2.1 Hardware prices plummeting according to Moore's law

The most important factor in the evolution of data warehousing has been the sharply increasing power of computer hardware. Along with the increase in this power, hardware prices have fallen just as sharply. Gordon Moore, co-founder of Intel, predicted that the capacity of a microprocessor will double every 18 months. This has held true not only for the processor but also for other components of the computer. While desktop computers today are more powerful than the mainframes of yesterday, an inexpensive server possesses power that was difficult to imagine just a decade ago. The Pentium II and Alpha processors have brought incredible power to the commodity computer market. Sophisticated processor hardware architectures such as symmetric multi-processing have come to mainstream computing with inexpensive machines. Higher capacity memory chips, a key component influencing the performance of a data warehouse system, are now available at very low prices. Now it is possible to have a moderately priced machine with 1 or 2 gigabytes of memory. Computer buses such as PCI and controller interfaces such as Ultra SCSI have made I/O incredibly fast. Last but not least, the disk drive has shrunk to hold amazing amounts of information. Just two decades ago, it would have taken a roomful of disk drives to store information that can now be easily stored on a single one-inch high disk drive.
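
As a rough worked example of the doubling rule quoted above, the short Python sketch below computes the growth factor implied by an assumed 18-month doubling period. The function name `growth_factor` and the chosen time spans are illustrative assumptions, not figures from this paper.

```python
# Illustrative arithmetic only: assumes capacity doubles every 18 months,
# the rate attributed to Gordon Moore in the text.
DOUBLING_PERIOD_MONTHS = 18


def growth_factor(months: float, doubling_period: float = DOUBLING_PERIOD_MONTHS) -> float:
    """Return how many times capacity multiplies over `months`."""
    return 2 ** (months / doubling_period)


if __name__ == "__main__":
    for years in (3, 10, 20):
        print(f"{years:>2} years: about {growth_factor(years * 12):,.0f}x")
```

Under this assumption, capacity grows fourfold in three years and roughly a hundredfold in ten years.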

1.2.2 Desktop power increasing

Entering the market as a novelty computer in the early eighties, the personal computer has become the hotbed of innovation during the past decade. The personal computer was initially used for word processing and other minor tasks with no links to primary analytical functions. With the help of innovations such as powerful personal productivity software, easy-to-use graphical interfaces, and responsive business applications, the personal computer has become the focal point of all computing today. The powerful desktop hardware and software has allowed for development of the client/server or multi-tier computing architecture. Almost all data warehouses are accessed by personal computer based tools. These tools vary from very simple query capabilities available with most productivity packages to incredibly powerful graphical multi-dimensional analysis tools. Without the wide array of choices available for data warehouse access, data warehousing would not have evolved so quickly.

1.2.3 Ever increasing power of server software

Server operating systems such as Windows NT and Unix have brought mission-critical stability and powerful features to the distributed computing environment. The operating system software has become very feature-rich and powerful as the cost has been going down steadily. With this combination, sophisticated operating system concepts such as virtual memory, multi-tasking, and symmetric multi-processing are now available on inexpensive operating platforms. Operating systems such as Windows NT have made these powerful systems very easy to set up and operate, reducing the total cost of ownership of these powerful servers.

1.2.4 Explosion of Intranets and Web based applications

The most important development in computing since the advent of the personal computer is the explosion of Internet and Web based applications. Somewhat after the fact, the business community has quickly jumped onto the Internet bandwagon. One of the most exciting fields in the computing industry today is the development of Intranet applications. Intranets are private business networks that are based on the Internet standards, although they are designed to be used internally. The Internet/Intranet trend has very important implications for data warehousing applications. First, data warehouses can be available worldwide on public/private networks at much lower cost. This availability minimizes the need to replicate data across diverse geographical locations. Second, this standard has allowed the web server to provide a middle tier where all the heavy-duty analysis takes place before it is presented to the web-browsing client to use.

The skyrocketing power of hardware and software, along with the availability of affordable and easy-to-use reporting and analysis tools, has played the most important role in the evolution of data warehouses. Figure 1 highlights the technological revolution that has greatly impacted data warehousing.

Figure 1. Impact of technological revolution
[Figure labels: CPU, disk, and memory power; desktop power and ease; server power and ease; hardware prices; software prices. Summary points: processors many times more powerful than yesterday's mainframes; inexpensive disks can store hundreds of gigabytes; desktop very powerful for analysis tools; server software inexpensive, powerful, easy to maintain; hardware and software prices sharply lower.]

1.3 Change in the nature of the business

Another very significant influence on the evolution of data warehousing science is the fundamental change in business organization and structure during the late eighties and early nineties. The emergence of a vibrant global economy has profoundly changed the information demands made by corporations in the United States and worldwide. Corporations have found markets for their products globally while competing with other companies in vastly different cultures and economic environments. Mergers and acquisitions of businesses have crossed country boundaries.

1.3.1 Economic factors of the recent years

The economic downturn of the late eighties led many global corporations through a remarkable period of consolidation. Phenomena such as "business process reengineering" and "downsizing" forced businesses to reevaluate their business practices. Many industries went through prolonged periods of consolidation and reinvention. During this period, simple economics forced businesses to identify their core competency areas and shed businesses that were not profitable.

These economic factors have played an important role in the evolution of data warehousing. For example, when a banking unit that used different operational systems changed hands, the top management still needed to view the consolidated business and manage the associated risks accordingly. The banking industry has been the leader in the use of data warehouses. Today's data warehousing systems are extensively used for profitability and customer behavior analysis.

1.3.2 Global corporation

The fall of communism and liberalization of Asian and South American economies has changed the business climate worldwide forever. Competition from emerging economies has forced large corporations to become lean and efficient. The emergence of this global economy has led to the migration of manufacturing to less expensive and less restrictive countries. Former communist and South American countries present very exciting and challenging business opportunities. Along with these opportunities they present a very volatile business climate and economies that are nearly impossible to predict. Businesses have not only focused on building products worldwide, but they have also changed their organization to sell products around the globe. Trade agreements such as NAFTA and EEC greatly impact the decisions to enter markets or build factories.

This globalization of business has increased the need not merely for more continuous analysis, but also to manage data in a centralized location. The process of rolling up manufacturing and sales data from far-flung business units has now started to impact a much larger number of corporations. Businesses now need to continuously make the "build or buy" decisions. Globalization of business has made the consolidation of data in a central data warehouse more complicated. Factors such as currency fluctuations and product customization for different markets have added complexity to data warehousing, making the analysis much more complicated. Imagine trying to assess profitability of products built and sold in multiple countries with volatile currencies. Or, attempting to hedge the risks of downturn in economies that have been expanding rapidly for extended periods.

1.3.3 Emergence of standard business applications

Another factor that is fast becoming an important variable in data warehousing equations is the emergence of vendors with popular business application suites. Led by the wildly popular German software vendor SAP AG, flexible business software suites adapted to the particulars of a business have become a very popular way to move to a sophisticated multi-tier architecture. Other vendors such as Baan, PeopleSoft, and Oracle have likewise come out with suites of software that provide different strengths but have comparable functionality. The emergence of these application suites has a direct bearing on the increased use of data warehousing in that they are increasingly able to provide standard applications that are replacing existing custom developed legacy applications. In the near future, almost every data warehouse is likely to derive data from one of these application sources rather than the customized extraction from legacy systems. Further, there are significant initiatives at these vendors to make transaction data easily available to data warehousing systems. To the extent that these standard applications have extensive customization features, data acquisition from these applications can be much simpler than from the mainframe systems.

1.3.4 End-user more technology savvy

One of the most important results of the massive investment in technology and movement towards the powerful personal computer has been the evolution of a technology-savvy business analyst. Even though technology-savvy end users are not always beneficial to all projects, this trend certainly has produced a crop of technology-leading business analysts that are becoming essential to today's business. These technology-savvy end users have frequently played an important role in the development and deployment of data warehouses. They have become the core users that are first to demonstrate the initial benefits of data warehouses. These end users are also critical to the development of the data warehouse model: as they become experts with the data warehousing system, they play a very important role of mentoring other users.

Word processing and spreadsheets were the first applications to be effectively used on personal computers. In fact, the spreadsheet is said to be the killer application that led to widespread deployment of personal computers. The charting functions from a spreadsheet represent one of the most extensively used business analysis and presentation functions. The new pivot tables available in popular spreadsheets have allowed for simple multi-dimensional analysis. The aggressive use of inexpensive personal productivity software has led to use of more robust reporting and analysis tools along with more powerful desktop database engines. These powerful tools are now more targeted towards the end user and often require very little training for simple applications.

1.3.5 Management more information conscious

Many factors affect the heightened awareness of trends in information technology among mid and upper management levels. Unlike a decade ago, information technology is now nearly universally accepted as a key strategic business asset. Many mid and upper level managers that have risen through the ranks over the last decade have made their mark with successful technology investments. As a result, they tend not to shy away from risking resources on new and emerging technologies. The explosive use of the Internet has greatly aided the managers' awareness of technology trends. The Internet is now being used to conduct business transactions; but its greatest asset to this date has been dissemination of information. Today, executives can not only review various sources of industry trends, they can also readily find case studies and vendor information. The use of technology by mid and upper level managers has increased significantly. They have decisively moved beyond using the personal computer for email. This hands-on use of information and technology by upper management has facilitated the sponsorship of larger projects such as data warehousing.

Alongside the availability of key enabling technologies, these fundamental changes in the nature of business over the past decade have played a central role in the evolution of the data warehouse. Some might even argue that these changes in business have led the technology to its current state.

Figure 2. Changes in the nature of business
[Figure labels: emergence of global economy; economic downturns in United States, Europe, Japan; liberalization of Asian, South American, and former Communist economies; compelling standard business applications; technology-savvy business analyst and technology-aware management; technology-savvy user and manager.]

Section 2:

Data warehousing attributes and concepts

Having looked at the historical use of the analysis data and explored some of the factors influencing the evolution of data warehouses, we will now turn to identifying the key attributes of a data warehouse. It is important to recognize that data warehousing is still an evolving science. As with any evolving technology, particular care must be taken to discount some marketing claims driven by vendors attempting to differentiate themselves from the competitors. For example, the size of the data warehouse should not determine if a data warehouse is really a data warehouse. Some vendors may say that a data warehouse that is only 50 gigabytes is not a full-fledged data warehouse, and they may refer to it instead as a data mart. For a smaller company, 50 gigabytes or even much less can represent every relevant piece of information covering the last 10 years and can well represent a powerful data warehouse.

This section explores the data warehousing concepts and attributes. These concepts are grouped into four sub-sections. The first sub-section discusses the reasons for separating the data for business analysis from the operational data. The logical transformation of the data, including data warehouse modeling and de-normalization of the data, is introduced in the second sub-section. Sub-section three reviews the issues associated with physical transformation of the data. Sub-section four discusses the generation of summary views. A very simple and broad definition of a data warehouse follows the discussion of the data warehousing concepts and attributes.

2.1 Warehousing data outside the operational systems

The primary concept of data warehousing is that the data stored for business analysis can most effectively be accessed by separating it from the data in the operational systems. Many of the reasons for this separation have evolved over the years. In the past, legacy systems archived data onto tapes as it became inactive and many analysis reports ran from these tapes or mirror data sources to minimize the performance impact on the operational systems.

These reasons to separate the operational data from analysis data have not significantly changed with the evolution of the data warehousing systems, except that now they are considered more formally during the data warehouse building process. Advances in technology and changes in the nature of business have made many of the business analysis processes much more complex and sophisticated. In addition to producing standard reports, today's data warehousing systems support very sophisticated online analysis including multi-dimensional analysis.

2.1.1 Integrating data from more than one operational system

Data warehousing systems are most successful when data can be combined from more than one operational system. When the data needs to be brought together from more than one source application, it is natural that this integration be done at a place independent of the source applications. Before the evolution of structured data warehouses, analysts in many instances would combine data extracted from more than one operational system into a single spreadsheet or a database. The data warehouse may very effectively combine data from multiple source applications such as sales, marketing, finance, and production. Many large data warehouse architectures allow for the source applications to be integrated into the data warehouse incrementally.

The primary reason for combining data from multiple source applications is the ability to cross-reference data from these applications. Nearly all data in a typical data warehouse is built around the time dimension. Time is the primary filtering criterion for a very large percentage of all activity against the data warehouse. An analyst may generate queries for a given week, month, quarter, or year. Another popular query in many data warehousing applications is the review of year-on-year activity. For example, one may compare sales for the first quarter of this year with the sales for the first quarter of the prior years. The time dimension in the data warehouse also serves as a fundamental cross-referencing attribute. For example, an analyst may attempt to assess the impact of a new marketing campaign run during selected months by reviewing the sales during the same periods. The ability to establish and understand the correlation between activities of different organizational groups within a company is often cited as the single biggest advanced feature of the data warehousing systems.

The data warehouse system can serve not only as an effective platform to merge data from multiple current applications; it can also integrate multiple versions of the same application. For example, an organization may have migrated to a new standard business application that replaces an old mainframe-based, custom-developed legacy application.
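
The time-based queries described in this subsection can be sketched with a small in-memory database. The example below is only an illustration: the `sales` and `campaigns` tables, their columns, and all figures are hypothetical, and Python's built-in `sqlite3` module stands in for a real data warehouse engine.

```python
import sqlite3

# Hypothetical tables and figures, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (year INTEGER, quarter INTEGER, amount REAL);
    CREATE TABLE campaigns (year INTEGER, quarter INTEGER, name TEXT);
    INSERT INTO sales VALUES (1996, 1, 100.0), (1996, 2, 110.0),
                             (1997, 1, 125.0), (1997, 2, 115.0);
    INSERT INTO campaigns VALUES (1997, 1, 'Spring promotion');
""")


def year_on_year(conn, quarter):
    """Total sales for one quarter, per year (a year-on-year view)."""
    rows = conn.execute(
        "SELECT year, SUM(amount) FROM sales "
        "WHERE quarter = ? GROUP BY year ORDER BY year",
        (quarter,),
    )
    return rows.fetchall()


def sales_during_campaigns(conn):
    """Cross-reference campaigns with sales through the shared time period."""
    rows = conn.execute(
        "SELECT c.name, s.year, s.quarter, SUM(s.amount) "
        "FROM campaigns AS c "
        "JOIN sales AS s ON s.year = c.year AND s.quarter = c.quarter "
        "GROUP BY c.name, s.year, s.quarter"
    )
    return rows.fetchall()


print(year_on_year(conn, 1))
print(sales_during_campaigns(conn))
```

Both queries rely only on the shared year and quarter columns, which is the sense in which time acts as the cross-referencing attribute between data from otherwise unrelated source applications.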
The data warehouse system can serve as a very powerful and much needed platform to combine the data from the old and the new applications. Designed properly, the data warehouse can allow for year-on-year analysis even though the base operational application has changed.

2.1.2 Differences between transaction and analysis processes

The most important reason for separating data for business analysis from the operational data has always been the potential performance degradation on the operational system that can result from the analysis processes. High performance and quick response time are almost universally critical for operational systems. The loss of efficiency and the costs incurred with slower responses on the predefined transactions are usually easy to calculate and measure. For example, a loss of five seconds of processing time is perhaps negligible in and of itself; but it compounds out to considerably more time and high costs once all the other operations it impacts are brought into the picture. On the other hand, business analysis processes in a data warehouse are difficult to predefine and they rarely need to have rigid response time requirements.

Operational systems are designed for acceptable performance for pre-defined transactions. For an operational system, it is typically possible to identify the mix of business transaction types in a given time frame including the peak loads. It is also relatively easy to specify the maximum acceptable response time given a specific load on the system. The cost of a long response time can then be computed by considering factors such as the cost of operators, telecommunication costs, and the cost of any lost business. For example, an order processing system might specify the number of active order takers and the average number of orders for each operational hour. Even the query and reporting transactions against the operational system are most likely to be predefined with predictable volume.
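
The cost calculation described above can be illustrated with a simple estimate. This is a minimal sketch under stated assumptions: the function `delay_cost_per_hour`, its parameters, and the sample figures are hypothetical, and the model simply assumes that every extra second of response time keeps an order taker and a telephone line busy for that second.

```python
def delay_cost_per_hour(
    extra_seconds_per_order: float,
    orders_per_hour: float,
    operator_cost_per_hour: float,
    telecom_cost_per_minute: float,
    lost_business_per_hour: float = 0.0,
) -> float:
    """Estimate the hourly cost of slower responses (hypothetical model)."""
    # Total extra time spent waiting across all orders, expressed in hours.
    extra_hours = extra_seconds_per_order * orders_per_hour / 3600
    operator_cost = extra_hours * operator_cost_per_hour
    telecom_cost = extra_hours * 60 * telecom_cost_per_minute
    return operator_cost + telecom_cost + lost_business_per_hour


# Hypothetical load: 5 extra seconds on each of 720 orders per hour,
# operators costing 20 per hour, and lines costing 0.10 per minute.
print(delay_cost_per_hour(5, 720, 20.0, 0.10))
```

With these made-up inputs, a five-second delay adds up to a full hour of operator and line time every operational hour, or about 26 currency units per hour before any lost business is counted.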
Even though many of the queries and reports that are run against a data warehouse are predefined, it is nearly impossible to accurately predict the activity against a data warehouse. The process of data exploration in a data warehouse takes a business analyst through previously undefined paths. It is also common to have runaway queries in a data warehouse that are triggered by unexpected results or by users' lack of understanding of the data model. Further, many of the analysis processes tend to be all encompassing whereas the operational processes are well segmented. A user may decide to explore detail data while reviewing the results of a report from the summary tables. After finding some interesting sales activity in a particular month, the user may join the activity for this month with the marketing programs that were run during that particular month to further understand the sales. Of course, there would be instances where a user attempts to run a query that will try to build a temporary table that is a Cartesian product of two tables containing a million rows each, that is, a trillion rows! While an activity like this would unacceptably degrade an operational system's performance, it is expected and planned for in a data warehousing system.

2.1.3 Data is mostly non-volatile

Another key attribute of the data in a data warehouse system is that the data is brought to the warehouse after it has become mostly non-volatile. This means that after the data is in the data warehouse, there are no modifications to be made to this information. For example, the order status does not change, the inventory snapshot does not change, and the marketing promotion details do not change. This attribute of the data warehouse has many very important implications for the kind of data that is brought to the data warehouse and the timing of the data transfer.
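
One way to picture this attribute is a load step that copies only finished business entities and never updates them afterwards. The sketch below is an assumption-laden illustration, not a prescribed design: the `OrderRow` type, the status values, and the in-memory list standing in for a warehouse table are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a loaded row cannot be modified afterwards
class OrderRow:
    order_id: int
    status: str
    amount: float


def select_for_warehouse(operational_orders, already_loaded_ids):
    """Pick orders whose activity is finished and that are not loaded yet."""
    return [
        OrderRow(o["order_id"], o["status"], o["amount"])
        for o in operational_orders
        if o["status"] == "completed" and o["order_id"] not in already_loaded_ids
    ]


# Hypothetical operational data: order 2 is still changing, so it stays behind.
operational_orders = [
    {"order_id": 1, "status": "completed", "amount": 250.0},
    {"order_id": 2, "status": "backorder", "amount": 80.0},
    {"order_id": 3, "status": "completed", "amount": 40.0},
]

warehouse = []  # append-only list standing in for a warehouse table
warehouse.extend(
    select_for_warehouse(operational_orders, {row.order_id for row in warehouse})
)
# A second run adds nothing: the completed orders are already present.
warehouse.extend(
    select_for_warehouse(operational_orders, {row.order_id for row in warehouse})
)
print([row.order_id for row in warehouse])
```

The load only ever appends rows, so nothing in this sketch needs to synchronize later changes from the operational side.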

%et us further review what it means for the data to be non5volatile. In an operational system the data entities go through many attribute changes. Bor example( an order may go through many statuses before it is completed. ?r( a product moving through the assembly line has many processes applied to it. =enerally speaking( the data from an operational system is triggered to go to the data warehouse when most of the activity on these business entity data has been completed. This may mean completion of an order or final assembly of an accepted product. ?nce an order is completed and shipped( it is unlikely to go back to backorder status. ?r( once a product is built and accepted( it is unlikely to go back to the first assembly station. $nother important example can be the constantly changing data that is transferred to the data warehouse one snapshot at a time. The inventory module in an operational system may change with nearly every transaction, it is impossible to carry all of these changes to the data warehouse. ;ou may determine that a snapshot of inventory carried once every week to the data warehouse is adequate for all analysis. &uch snapshot data naturally is non5volatile. It is important to reali e that once data is brought to the data warehouse( it should be modified only on rare occasions. It is very difficult( if not impossible( to maintain dynamic data in the data warehouse. 2any data warehousing pro<ects have failed miserably when they attempted to synchroni e volatile data between the operational and data warehousing systems. 2.1.4 Data saved for longer periods than in transaction systems Data from most operational systems is archived after the data becomes inactive. Bor example( an order may become inactive after a set period from the fulfillment of the order, or a bank account may become inactive after it has been closed for a period of time. The primary reason for archiving the inactive data has been the performance of the operational system. 
Large amounts of inactive data mixed with operational live data can significantly degrade the performance of a transaction that is only processing the active data. Since the data warehouses are designed to be the archives for the operational data, the data here is saved for a very long period. In fact, a data warehouse project may start without any specific plan to archive the data off the warehouse. The cost of maintaining the data once it is loaded in the data warehouse is minimal. Most of the significant costs are incurred in data transfer and data scrubbing. Storing data for more than five years is very common for data warehousing systems. There are industry examples where the success of a data warehousing project has encouraged the managers to expand the time horizon of the data stored in the data warehouse. They may start with storing the data for two or three years and then expand to five or more years once the wealth of business knowledge in the data warehouse is discovered. The falling prices of hardware have also encouraged the expansion of successful data warehousing projects.

[Figure 3: Three operational systems feed the data warehouse. Order processing (2 second response time, last 6 months of orders) sends daily closed orders. Product price/inventory (10 second response time, last 10 price changes, last 20 inventory transactions) sends weekly product price and inventory. Marketing (30 second response time, last 2 years of programs) sends weekly marketing programs. Data warehouse: last 5 years of data, response time 2 seconds to 60 minutes, data is not modified. Key points: different performance requirements; combine data from multiple applications; data is mostly non-volatile; data saved for a long time period.]

Figure 3. Reasons for moving data outside the operations systems

In short, the separation of operational data from the analysis data is the most fundamental data warehousing concept. Not only is the data stored in a structured manner outside the operational system, businesses today are allocating considerable resources to build data warehouses at the same time that the operational applications are deployed. Rather than archiving data to a tape as an afterthought of implementing an operational system, data warehousing systems have become the primary interface for operational systems. Figure 3 highlights the reasons for separation discussed in this section.

2.2 Logical transformation of operational data

This sub-section explores the concepts associated with the data warehouse logical model. The data is logically transformed when it is brought to the data warehouse from the operational systems. The issues associated with the logical transformation of data brought from the operational systems to the data warehouse may require considerable analysis and design effort. The architecture of the data warehouse and the data warehouse model greatly impact the success of the project. This section reviews some of the most fundamental concepts of relational database theory that do not fully apply to data warehousing systems. Even though most data warehouses are deployed on relational database platforms, some basic relational principles are knowingly modified when developing the logical and physical model of the data warehouses.

2.2.1 Structured extensible data model

The data warehouse model outlines the logical and physical structure of the data warehouse. Unlike the archived data of the legacy systems, considerable effort needs to be devoted to the data warehouse modeling. This data modeling effort in the early phases of the data warehousing project can yield significant benefits in the form of an efficient data warehouse that is expandable to accommodate all of the business data from multiple operational applications.

The data modeling process needs to structure the data in the data warehouse independent of the relational data model that may exist in any of the operational systems. As discussed later in this paper, the data warehouse model is likely to be less normalized than an operational system model. Further, the operational systems are likely to have large amounts of overlapping business reference data. Information about current products is likely to be used in varying forms in many of the operational systems. The data warehouse system needs to consolidate all of the reference data. For example, the operational order processing system may maintain the pricing and physical attributes of products whereas the manufacturing floor application may maintain design and formula attributes for the same product. The data warehouse reference table for products would consolidate and maintain all attributes associated with products that are relevant for the analysis processes. Some attributes that are essential to the operational system are likely to be deemed unnecessary for the data warehouse and may not be loaded and maintained in the data warehouse.

[Figure 4: Enterprise data warehouse framework with a Product area and two Future areas. Key points: set up the framework for the enterprise data warehouse; start with a few of the most valuable source applications; add additional applications as a business case can be made.]

Figure 4. Extensible data warehouse

The data warehouse model needs to be extensible and structured such that the data from different applications can be added as a business case can be made for the data. A data warehouse project in most cases cannot include data from all possible applications right from the start. Many of the successful data warehousing projects have taken an incremental approach to adding data from the operational systems and aligning it with the existing data. They start with the objective of eventually adding most if not all business data to the data warehouse. Keeping this long-term objective in mind, they may begin with one or two operational applications that provide the most fertile data for business analysis. Figure 4 illustrates the extensible architecture of the data warehouse.

2.2.2 Data warehouse model aligns with the business structure

A data warehouse logical model aligns with the business structure rather than the data model of any particular application. The entities defined and maintained in the data warehouse parallel the actual business entities such as customers, products, orders, and distributors. Different parts of an organization may have a very narrow view of a business entity such as a customer. For example, a loan service group in a bank may only know about a customer in the context of one or more loans outstanding. Another group in the same bank may know about the same customer in the context of a deposit account. The data warehouse view of the customer would transcend the view from a particular part of the business. A customer in the data warehouse would represent a bank customer that has any kind of business with the bank.

The data warehouse would most likely build attributes of a business entity by collecting data from multiple source applications. Consider, for example, the demographic data associated with a bank customer. The retail operational system may provide some attributes such as social security number, address, and phone number. A mortgage system or some purchased database may provide employment, income, and net worth information.

The structure of the data in any single source application is likely to be inadequate for the data warehouse. The structure in a single application may be influenced by many factors, including:

Purchased Applications: The application data structure may be dictated by an application that was purchased from a software vendor and integrated into the business. The user of the application may have very little or no control over the data model. Some vendor applications have a very generic data model that is designed to accommodate a large number and types of businesses.

Legacy Application: The source application may be a very old mostly homegrown application where the data model has evolved over the years. The database engine in this application may have been changed more than once without anyone taking the time to fully exploit the features of the new engine. There are many legacy applications in existence today where the data model is neither well documented nor understood by anyone currently supporting the application.

Platform Limitations: The source application data model may be restricted by the limitations of the hardware/software platform or development tools and technologies. A database platform may not support certain logical relationships or there may be physical limitations on the data attributes.

[Figure 5: Source applications and the entities they supply. Order processing: customer orders, product price, available inventory. Product price/inventory: product price, product inventory, product price changes. Marketing: customer profile, product price, marketing programs. Data warehouse entities: Customers, Products, Orders, Product Inventory, Product Price. Key points: no data model restrictions of the source application; data warehouse model has business entities.]

Figure 5. Data warehouse entities align with the business structure
Figure 5 illustrates the alignment of data warehouse entities with the business structure. The data warehouse model breaks away from the limitations of the source application data models and builds a flexible model that parallels the business structure. This extensible data model is easy to understand by the business analysts as well as the managers.

2.2.3 Transformation of the operational state information

It is essential to understand the implications of not being able to maintain the state information of the operational system when the data is moved to the data warehouse. Many of the attributes of entities in the operational system are very dynamic and constantly modified. Many of these dynamic operational system attributes are not carried over to the data warehouse; others are static by the time they are moved to the data warehouse. A data warehouse generally does not contain information about entities that are dynamic and constantly going through state changes.

To understand what it means to lose the operational state information, let us consider the example of an order fulfillment system that tracks the inventory to fill orders. First let us look at the order entity in this operational system. An order may go through many different statuses or states before it is fulfilled or goes to the "closed" status. Other order statuses may indicate that the order is ready to be filled, it is being filled, back ordered, ready to be shipped, etc. This order entity may go through many states that capture the status of the order and the business processes that have been applied to it. It is nearly impossible to carry forward all of the attributes associated with these order states to the data warehousing system. The data warehousing system is most likely to have just one final snapshot of this order. Or, as the order is ready to be moved into the data warehouse, the information may be gathered from multiple operational entities such as order and shipping to build the final data warehouse order entity.

Now let us consider the more complicated example of inventory data within this system. The inventory may change with every single transaction. The quantity of a product in the inventory may be reduced by an order fulfillment transaction or this quantity may be increased with receipt of a new shipment of the product. If this order processing system executes ten thousand transactions in a given day, it is likely that the actual inventory in the database will go through just as many states or snapshots during this day. It is impossible to capture this constant change in the database and carry it forward to the data warehouse. This is still one of the most perplexing problems with the data warehousing systems. There are many approaches to solving this problem. You will most likely choose to carry periodic snapshots of the inventory data to the data warehouse. This scenario can apply to a very large portion of the data in the operational systems. The issues associated with this get much more complicated as extended time periods are considered.
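The periodic snapshot approach can be illustrated with a minimal Python sketch. The function name, the transaction layout, and the dates below are hypothetical and chosen only for illustration; a real load process would read from the operational inventory tables instead of an in-memory list.

```python
from datetime import date


def periodic_snapshots(transactions, opening_qty, snapshot_dates):
    """Collapse many inventory changes into one quantity per snapshot date.

    transactions: (date, quantity_change) pairs from the operational system.
    Returns {snapshot_date: quantity on hand at the end of that date}.
    """
    snapshots = {}
    qty = opening_qty
    pending = sorted(transactions)
    i = 0
    for snap_date in sorted(snapshot_dates):
        # Apply every change up to and including the snapshot date.
        while i < len(pending) and pending[i][0] <= snap_date:
            qty += pending[i][1]
            i += 1
        snapshots[snap_date] = qty  # only this state reaches the warehouse
    return snapshots


# Hypothetical operational activity: four changes in two weeks.
transactions = [
    (date(2024, 1, 2), -5),   # order fulfillment
    (date(2024, 1, 3), +20),  # receipt of a new shipment
    (date(2024, 1, 9), -8),
    (date(2024, 1, 12), -2),
]
snapshots = periodic_snapshots(
    transactions,
    opening_qty=100,
    snapshot_dates=[date(2024, 1, 7), date(2024, 1, 14)],
)
```

Four operational states are reduced to two warehouse rows. The intermediate states are lost by design, which is the trade-off this sub-section describes.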

Figure 6 illustrates how most of the operational state information cannot be carried over to the data warehouse system.

[Figure 6: The order processing system holds an Order entity that moves through the Open, Backorder, Shipped, and Closed states, and an Inventory entity that changes up and down. Daily closed orders and a weekly inventory snapshot are sent to the data warehouse, which holds Orders (Closed), Inventory snapshot 1, and Inventory snapshot 2. Key points: operational state information is not carried to the data warehouse; data is transferred to the data warehouse after all state changes; or, data is transferred with periodic snapshots.]

Figure 6. Transformation of the operational state information

2.2.4 De-normalization of data

Before we consider data model de-normalization in the context of data warehousing, let us quickly review relational database concepts and the normalization process. E. F. Codd developed relational database theory in the late 1960s while he was a researcher at IBM. Many prominent researchers have made significant contributions to this model since its introduction. Today, most of the popular database platforms follow this model closely. A relational database model is a collection of two-dimensional tables consisting of rows and columns. In the relational modeling terminology, the tables, rows, and columns are respectively called relations, tuples, and attributes. The name for relational database model is derived from the term relation for a table. The model further identifies unique keys for all tables and describes the relationship between tables.

Normalization is a relational database modeling process where the relations or tables are progressively decomposed into smaller relations to a point where all attributes in a relation are very tightly coupled with the primary key of the relation. Most data modelers try to achieve the "Third Normal Form" with all of the relations before they de-normalize for performance or other reasons. The three levels of normalization are briefly described below:

First Normal Form: A relation is said to be in First Normal Form if it describes a single entity and it contains no arrays or repeating attributes. For example, an order table or relation with multiple line items would not be in First Normal Form because it would have repeating sets of attributes for each line item. The relational theory would call for separate tables for order and line items.

Second Normal Form: A relation is said to be in Second Normal Form if in addition to the First Normal Form properties, all attributes are fully dependent on the primary key for the relation.

Third Normal Form: A relation is in Third Normal Form if in addition to Second Normal Form, all non-key attributes are completely independent of each other.
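The First Normal Form example above, an order with repeating line items, can be sketched in Python as follows. The field names (order_id, customer_id, items) are assumptions made for illustration and do not come from any particular system.

```python
def split_order_and_line_items(flat_orders):
    """Decompose orders with repeating line items into two relations.

    Each input row carries a list of (product_id, quantity) pairs, which
    violates First Normal Form. The output is one order relation and one
    line item relation keyed by (order_id, line_no).
    """
    orders, line_items = [], []
    for row in flat_orders:
        orders.append(
            {"order_id": row["order_id"], "customer_id": row["customer_id"]}
        )
        for line_no, (product_id, quantity) in enumerate(row["items"], start=1):
            line_items.append(
                {
                    "order_id": row["order_id"],
                    "line_no": line_no,
                    "product_id": product_id,
                    "quantity": quantity,
                }
            )
    return orders, line_items


# Hypothetical order with two repeating line items.
flat_orders = [
    {"order_id": 1, "customer_id": "C1", "items": [("P1", 2), ("P2", 1)]},
]
orders, line_items = split_order_and_line_items(flat_orders)
```

Reading the order together with its items now requires a join on order_id, which is the cost that de-normalization later trades away.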


The process of normalization generally breaks a table into many independent tables. While a fully normalized database can yield a fantastically flexible model, it generally makes the data model more complex and difficult to follow. Further, a fully normalized data model can perform very inefficiently. A data modeler in an operational system would take a normalized logical data model and convert it into a physical data model that is significantly de-normalized. De-normalization reduces the need for database table joins in the queries.

Some of the reasons for de-normalizing the data warehouse model are the same as they would be for an operational system, namely, performance and simplicity. The data normalization in relational databases provides considerable flexibility at the cost of the performance. This performance cost is sharply increased in a data warehousing system because the amount of data involved may be much larger. A three-way join with relatively small tables of an operational system may be acceptable in terms of performance cost, but the join may take an unacceptably long time with large tables in the data warehouse system.

2.2.5 Static relationships in historical data

Another reason that de-normalization is an important process in data warehousing modeling is that the relationship between many attributes does not change in this historical data. For example, in an operational system, a product may be part of the product group "A" this month and product group "B" starting next month. In a properly normalized data model, it would be inappropriate to include the product group attribute with an order entity that records an order for this product; only the product ID would be included. The relational theory would call for a join on the order table and product table to determine the product group and any other attributes of this product. This relational theory concept does not apply to a data warehousing system because in a data warehousing system you may be capturing the group that this product belonged to when the order was filled. Even though the product moves to different groups over time, the relationship between the product and the group in the context of this particular order is static.

Another important example can be the price of a product. The prices in an operational system may change constantly. Some of these price changes may be carried to the data warehouse with a periodic snapshot of the product price table. In a data warehousing system you would carry the list price of the product when the order is placed with each order regardless of the selling price for this order. The list price of the product may change many times in one year and your product price database snapshot may even manage to capture all these prices. But it is nearly impossible to determine the historical list price of the product at the time each order is generated if it is not carried to the data warehouse with the order. The relational database theory makes it easy to maintain dynamic relationships between business entities, whereas a data warehouse system captures relationships between business entities at a given time.
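The idea of freezing a relationship at the time of the order can be sketched as follows. This is a minimal illustration under assumed names (product_master, product_group_at_order, list_price_at_order); it is not a prescribed warehouse schema.

```python
def build_warehouse_order(order, product_master):
    """Copy the product group and list price in force when the order is loaded.

    The warehouse row is deliberately de-normalized: later changes to the
    product master do not alter what was recorded for this order.
    """
    product = product_master[order["product_id"]]
    return {
        "order_id": order["order_id"],
        "product_id": order["product_id"],
        "selling_price": order["selling_price"],
        "product_group_at_order": product["group"],
        "list_price_at_order": product["list_price"],
    }


# Hypothetical operational product master.
product_master = {"P1": {"group": "A", "list_price": 10.0}}
january_order = build_warehouse_order(
    {"order_id": 1, "product_id": "P1", "selling_price": 9.0}, product_master
)

# Next month the product moves to group "B" and is repriced.
product_master["P1"] = {"group": "B", "list_price": 12.0}
february_order = build_warehouse_order(
    {"order_id": 2, "product_id": "P1", "selling_price": 11.0}, product_master
)
```

A normalized design that stored only product_id on the order would report group "B" for both orders after the change. The de-normalized rows keep the January order in group "A" with its original list price.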
[Figure 7: Source applications (Order processing: available inventory; Product price/inventory: product price, product inventory, product price changes; Marketing: customer profile, product price, marketing programs) are transformed into the data warehouse entities Customers, Products, Orders, Product Inventory, and Product Price, with state transformed and data de-normalized. Key points: structured extensible data model; data warehouse model aligns with the business structure; transformation of the state information; data is de-normalized because the relationships are static.]

Figure 7. Logical transformation of application data

Logical transformation concepts of source application data described here require considerable effort and they are a very important early investment towards development of a successful data warehouse. Figure 7 highlights the logical transformation concepts discussed in this section.

2.3 Physical transformation of operational data

Physical transformation of data homogenizes and purifies the data. These data warehousing processes are typically known as "data scrubbing" or "data staging" processes. The "data scrubbing" processes are some of the most labor-intensive and tedious processes in a data warehousing project. Yet, without proper scrubbing, the analytical value of even the clean data can be greatly diminished. Physical transformation includes the use of easy-to-understand standard business terms, and standard values for the data. A complete dictionary associated with the data warehouse can be a very useful tool. During these physical transformation processes the data is sometimes "staged" before it is entered into the data warehouse. The data may be combined from multiple applications during this "staging" step or the integrity of the data may be checked during this process.

The concepts associated with the physical transformation of the data are introduced in this sub-section. Historical data and the current operational application data are likely to have some missing or invalid values. It is essential to manage missing values or incomplete transformations while moving the data to the data warehousing system. The end user of the data warehouse must have a way to learn about any missing data and the default values used by the transformation processes.

2.3.1 Operational terms transformed into uniform business terms

The terms and names used in the operational systems are transformed into uniform standard business terms by the data warehouse transformation processes. The operational application may use cryptic or difficult to understand terms for a variety of different reasons. The platform software may impose length and format restrictions on a term, or a purchased application may be using a term that is too generic for your business. The data warehouse needs to consistently use standard business terms that are self-explanatory.


A customer identifier in the operational systems may be called cust, cust_id, or cust_no. Further, different operational applications may use different terms to refer to the same attribute. For example, a customer in the loan organization in a bank may be referred to as a Borrower. You may choose a simple standard business term such as Customer Id in the data warehouse. This term would require little or no explanation even to the novice user of the data warehouse.

2.3.2 Single physical definition of an attribute

Different systems may evolve to use different lengths and data types for the same data element. One system may have the product ID to be either 12 or 14 numeric characters, whereas another system may accommodate product IDs of up to 18 alphanumeric characters. The software of an operational application may support very limited data types and it may impose severe limitations on the names. Software of another application may support a very rich set of data types, and it may be very flexible with the naming conventions.

As an attribute is defined physically for the data warehouse, it is essential to use meaningful data types and lengths. Use the standard data length and data type for each attribute everywhere it is used. A functional data dictionary can facilitate this consistent use of physical attributes.
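The two transformations above can be sketched in Python. The mapping table, the function names, and the choice of an 18-character upper-case, zero-padded product ID are assumptions made for illustration; the source only says that one standard term and one physical definition should be used everywhere.

```python
# Hypothetical data dictionary entry: operational names -> standard term.
TERM_MAP = {
    "cust": "customer_id",
    "cust_id": "customer_id",
    "cust_no": "customer_id",
    "borrower": "customer_id",
}

PRODUCT_ID_WIDTH = 18  # assumed single physical definition for the warehouse


def standardize_terms(record):
    """Rename operational field names to the uniform business terms."""
    return {
        TERM_MAP.get(name.lower(), name.lower()): value
        for name, value in record.items()
    }


def standardize_product_id(raw):
    """Store every product ID as text of one fixed width."""
    text = str(raw).strip().upper()
    if len(text) > PRODUCT_ID_WIDTH:
        raise ValueError(f"product ID longer than {PRODUCT_ID_WIDTH}: {text!r}")
    return text.zfill(PRODUCT_ID_WIDTH)  # left-pad with zeros (an assumption)


loan_record = standardize_terms({"Borrower": 42, "Loan_Amt": 1000})
retail_record = standardize_terms({"cust_no": 42})
numeric_id = standardize_product_id(123456789012)
alpha_id = standardize_product_id("ab-1234")
```

Keeping the mapping in one table mirrors the role the text gives to the data dictionary: every load process looks up the same definitions.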


2.3.3 Consistent use of entity attribute values

All attributes in the data warehouse need to be consistent in the use of predefined values. Different source applications invariably use different attribute values to represent the same meaning. These different values need to be converted into a single, most sensible value as the data is loaded into the data warehouse.

A simple example for the consistent use of entity attributes is the use of a gender flag for an individual. One source application may use flags such as "M" and "F" to store gender for an individual whereas another application may use the detail "Male" and "Female" to store gender. Other applications may use yet other values to store the same piece of information. The data warehouse may choose to consistently use "M" and "F" for gender for all individuals throughout the system.

A more complex example can be the case of dealing with complex data values in the source application. Many older applications use a single data value to represent multiple attributes. An account number, for example, may not only represent a unique account but it may also represent the account type. All accounts starting with 1 or 2 may represent one type of account whereas all other accounts may represent something else to the business. The data warehouse would consistently use the account ID to only represent a unique account. The account type may be computed and saved as a separate attribute.

2.3.4 Issues associated with default and missing values

The data brought into the data warehouse is sometimes incomplete or contains values that cannot be transformed properly. It is very important for the data warehouse transformation process to use intelligent default values for the missing or corrupt data. It is also important to devise a mechanism for users of the data warehouse to be aware of these default values.

Some data attributes can easily be defaulted to a reasonable value when the original is missing or corrupt. Other values can be obtained by referencing other current data. For example, a missing product attribute such as unit-of-measure on an order entity can be obtained by accessing the current product database. Some attributes cannot be filled by defaults for missing values. In fact, it may be dangerous to attempt to assign a default for certain types of missing values. A poor default may corrupt the data and lead to invalid analysis at a later stage. In these cases, it is safest to leave the missing values as blank. In some cases, it may make sense to pick a specific value or symbol that indicates a missing value.

The timing of the start of the period for which data is loaded into the data warehouse can be important. It is safest to load data in the data warehouse for complete years. This would prevent any misinterpretation of analysis run on this data. Imagine a data warehouse that started loading data from the month of March for the first year in the data warehouse. It is very likely that a user is going to run a query for a whole year without realizing that the data for January and February is not stored in the data warehouse. Also, missing data for part of the year prevents any meaningful year-on-year analysis.

It is important to design a good system to log and identify data that is missing from the data warehouse. When a user runs a query against the data warehouse, it is essential to understand the population against which the query is run.
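The value transformations described in 2.3.3 and 2.3.4 can be sketched as follows. All names here, including the account type labels and the shape of the log, are hypothetical; the sketch leaves unknown values blank (None) rather than guessing, and records every filled value so that users can learn which values were defaulted.

```python
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}


def standardize_gender(raw):
    """Map source spellings to "M" or "F"; leave unknown values blank."""
    if raw is None:
        return None
    return GENDER_CODES.get(str(raw).strip().lower())


def account_type_from_number(account_number):
    """Compute the account type as its own attribute from the leading digit.

    The labels are placeholders; the text only says that accounts starting
    with 1 or 2 form one type and all others mean something else.
    """
    if str(account_number)[:1] in ("1", "2"):
        return "TYPE_1_OR_2"
    return "OTHER"


def fill_unit_of_measure(order_line, product_master, default_log):
    """Fill a missing unit of measure from current product data and log it."""
    line = dict(order_line)
    if not line.get("unit_of_measure"):
        product = product_master.get(line["product_id"], {})
        line["unit_of_measure"] = product.get("unit_of_measure")  # may stay None
        default_log.append(
            (line["order_id"], "unit_of_measure", line["unit_of_measure"])
        )
    return line


default_log = []
product_master = {"P1": {"unit_of_measure": "EA"}}
filled_line = fill_unit_of_measure(
    {"order_id": 7, "product_id": "P1", "unit_of_measure": None},
    product_master,
    default_log,
)
```

The log is the simplest form of the mechanism the text asks for: a record of which attributes were filled in during transformation and with what value.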

[Figure 8: Operational System A and Operational System B pass through a transformation step into the data warehouse system, which holds summarized data and detailed data. Example transformations: cust, cust_id, borrower >> customer ID; "1" >> "M", "2" >> "F"; missing >>> a default marker value. Key points: uniform business terms; single physical definition of an attribute; consistent use of entity attributes; default and missing values.]

Figure 8. Physical transformation of application data

Figure 8 highlights the physical transformation concepts for data warehousing systems. Physical transformation of source application data requires considerable effort and it can be difficult at times, but a well-considered set of physical data transformations can make a data warehouse more user-friendly. Further, accurate and complete transformations help maintain the integrity of the data warehouse.

2.4 Business view summarization of data

Many queries and reports against most data warehouse systems are simple aggregations based on predefined parameters. Another key attribute of today's data warehouses is the predefined and automatically generated summary views.

For example, many people in an organization may need to see product sales figures. They may have a need to summarize these sales figures for a week, a month, or a quarter. It may not be practical to summarize the needed data every time an analyst requires it. A data warehouse that contains summary views of the detail data around the most common queries can sharply reduce the amount of processing needed at the time of analysis. Summary views are typically created around business entities such as customers, products, and channels. The summary views also hide the complexities of the detail data. Of course, performance gain is the most significant tangible aspect of the summary views in the data warehouse.

Most relational databases provide the ability to build views for users that hide the underlying tables. In most SQL server packages, including MS SQL Server, the view exists only as a definition and its result is produced at the time it is actually used. While the concept of summary views in data warehousing systems is similar, it is important not to confuse data warehousing summary views with the term "views" as it is used in a database system. A summary view in a data warehouse refers to an actual table that is created and maintained independent of when it is used by a user. The key concepts around the summary views are introduced in this section.

2.4.1 Initial analysis in summary views

Summary views often are generated not only by summarizing the detail data but also by applying business rules to the detail data. For example, the summary views may contain a filter that applies the exact business rules for considering an order a sale or a filter that applies the business rules for allocating a sale to a channel entity. The summary views can hide the complexities of the detail data from the end user for many, if not most, analysis tasks.

The business rules that are applied in generating summary views can be complex. These business rules may determine exactly what constitutes a sale or they may determine how a sale is allocated to a sales or channel entity. Large organizations often have complex rules to charge sales to different ledger accounts. Some sales may be allocated to warranty replacement and thus not be counted as sales. Or, some sales may be further discounted based on a master contract with the customer and thus need to be reduced when calculating product sales for a period. Often, a data warehouse will have more than one view based on business entities such as customers and products. There may be multiple physical tables or the same table may contain additional attributes that allow for easy queries.

In addition to applying the business rules while generating summary views, the data warehousing system may perform complex database operations such as multi-table joins. Product sales may be computed by joining the Sales, Invoice, and Product tables. The criteria to join these tables may be complex. While individuals mining data in the warehouse detail records need to understand all the complexities of business rules, most users can retrieve effective summary business information without fully understanding the detail data.

2.4.2 Significant performance gains

The single most important reason for building the summary views is the significant performance gains they facilitate. Not only are all the complexities of detail data interpreted for an end user, the summary views also perform the most time-consuming data analysis before it is needed. Summary views allow you to run a product sales query by merely setting up a filter based on indexed fields such as date, product codes, and other relevant criteria. Further, this query will most likely run on a sharply smaller table as the summary views would have reduced the data from multiple tables containing millions of rows to tens of thousands of rows. A query against this smaller table would be significantly faster than a query that runs against detail tables joined for the query.

In some instances a summary view table can be as large as the detail tables. This may be caused by summarization in very small units or combining multiple summary views into one data table. For example, you may not be able to summarize the product sales by week. Instead daily product sales figures may be required for some queries. Even in these large summary views, the performance is generally better because many of the table joins are eliminated and queries can generally use the indexes.

2.4.3 Many views into the same detail

The summary views in a data warehouse provide multiple views into the same detail data. These views are predefined dimensions into the detail data. These views provide an efficient method for the analyst to link with the detail data when necessary. For example, for the sales order data, four different product sales summary views could be generated summarizing weekly sales data. These views summarizing by product, customer, channel, and region all include the same detail data and they would need to be updated or regenerated as new data is brought into the data warehouse.

Even though most of the analysis is likely to be done using the summary views, there needs to be a simple and robust way for an analyst to drill down into the detail data. Many business problems require review of the detail data to fully understand a pattern or anomaly exhibited in the summarized reports or queries. Drill down from many different summary views can lead to the same detail data. A single anomaly in detail data may manifest itself differently in different summary views.
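The distinction between a summary view as a stored table and a database view as a definition can be sketched with Python's built-in sqlite3 module. The table and column names are hypothetical, and the single business rule shown (warranty replacements are not counted as sales) stands in for the more complex rules described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales_detail (
        sale_date TEXT,
        product_id TEXT,
        amount REAL,
        is_warranty_replacement INTEGER
    );
    INSERT INTO sales_detail VALUES
        ('2024-01-01', 'P1', 100.0, 0),
        ('2024-01-03', 'P1',  50.0, 0),
        ('2024-01-04', 'P1',  75.0, 1),  -- excluded by the business rule
        ('2024-01-09', 'P1',  20.0, 0),
        ('2024-01-02', 'P2',  30.0, 0);

    -- The summary view is an actual table, built ahead of query time.
    CREATE TABLE weekly_product_sales AS
    SELECT strftime('%Y-%W', sale_date) AS sales_week,
           product_id,
           SUM(amount) AS total_sales
    FROM sales_detail
    WHERE is_warranty_replacement = 0
    GROUP BY sales_week, product_id;

    CREATE INDEX idx_weekly_product_sales
        ON weekly_product_sales (product_id, sales_week);
    """
)

# An analyst's query touches only the small, indexed summary table.
p1_weekly_totals = [
    row[0]
    for row in conn.execute(
        "SELECT total_sales FROM weekly_product_sales "
        "WHERE product_id = 'P1' ORDER BY sales_week"
    )
]
summary_row_count = conn.execute(
    "SELECT COUNT(*) FROM weekly_product_sales"
).fetchone()[0]
```

Because the summary is a stored table, it has to be rebuilt or incrementally updated whenever new detail rows are loaded. A summary by customer, channel, or region would be a second table built from the same sales_detail rows.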


[Figure 9. Summary views of detail data. The diagram shows a data warehouse system in which product, customer, and geographic summaries are automatically generated and updated from the detailed data to give quick query response, while business analysis can still be performed on the detail data. Key points: initial analysis in summary views; performance gains; many views into detail data.]

Summarization and predefined analysis of data in a data warehouse system is an important task. It is essential to maintain the integrity of the summary views because a very large part of the data warehouse activity is against the summary views. Figure 9 highlights the key concepts around summary views. The summary views need to be not only designed and built, they need to be maintained as new data comes into the data warehouse.
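One simple way to keep a summary view consistent with the detail data is to regenerate it after each load and then check that the totals agree. The sketch below assumes hypothetical tables `sales_detail` and `product_summary` in an in-memory SQLite database; a production warehouse would more likely update summaries incrementally, so this is only an illustration of the maintenance step.

```python
import sqlite3


def refresh_product_summary(conn):
    """Regenerate the summary table from the detail table in one transaction."""
    with conn:  # commits on success, rolls back if an error is raised
        conn.execute("DELETE FROM product_summary")
        conn.execute(
            "INSERT INTO product_summary (product, total) "
            "SELECT product, SUM(amount) FROM sales_detail GROUP BY product"
        )


def summary_matches_detail(conn):
    """Integrity check: the summary total must equal the detail total."""
    detail_total = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM sales_detail"
    ).fetchone()[0]
    summary_total = conn.execute(
        "SELECT COALESCE(SUM(total), 0) FROM product_summary"
    ).fetchone()[0]
    return abs(detail_total - summary_total) < 1e-9


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_detail (product TEXT, amount REAL)")
conn.execute("CREATE TABLE product_summary (product TEXT PRIMARY KEY, total REAL)")
conn.executemany(
    "INSERT INTO sales_detail VALUES (?, ?)",
    [("widget", 100.0), ("gadget", 200.0)],
)
refresh_product_summary(conn)
consistent_after_first_load = summary_matches_detail(conn)

# New detail data arrives; the summary is stale until it is regenerated.
conn.execute("INSERT INTO sales_detail VALUES ('widget', 50.0)")
stale_before_refresh = not summary_matches_detail(conn)
refresh_product_summary(conn)
consistent_after_refresh = summary_matches_detail(conn)
```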

2.5 Definition

After considering the various attributes and concepts of data warehousing systems, a broad definition of a data warehouse can be the following:

A data warehouse is a structured extensible environment designed for the analysis of non-volatile data, logically and physically transformed from multiple source applications to align with business structure, updated and maintained for a long time period, expressed in simple business terms, and summarized for quick analysis.

Section 3: Business use of a data warehouse

No discussion of data warehousing systems is complete without a review of the type of activity supported by a data warehouse. Some of the activity against today's data warehouses is predefined and not much different from traditional analysis activity. Other processes, such as multi-dimensional analysis and information visualization, were not available with traditional analysis tools and methods.

An interesting phenomenon is observed in many data warehousing projects. The users of a new data warehouse initially wish only to get the information that they were able to get using the old tools and methods. They wish to replicate their queries and reports with the data warehouse and make sure that all the numbers match. Often there is as much apprehension about the new tools and the data warehouse as there is excitement. It is only after using the new data warehouse for a period of time that they start to explore and discover the new capabilities available to them. Soon after, they start to have significant input into the data warehouse enhancement process, and they happily become the mentors for new users.


3.1 Tools to be used against the data warehouse

One of the objectives of the data warehouse is to make it as flexible and as open as possible. It is not desirable to set a steep entry price in terms of software and training for using the data warehouse. The data warehouse should be accessible by as many end-user tools and platforms as possible. Yet, it is not possible to make every feature of the data warehouse available from every end-user tool. Low-end tools, such as the simple query capability built into most spreadsheets, may be adequate for a user who only needs to quickly reference the data warehouse. Other users may require the most powerful multi-dimensional analysis tools.

The data warehouse administrators need to identify the tools that are supported for access to the data warehouse and the capabilities that are available using these different tools. There can be a progression path to the higher-level tools for the data warehouse users. A user can start with a low-level tool that is already familiar to him or her. After becoming familiar with the data warehouse, he or she may be able to justify the cost and effort involved in using a more complex tool.

In most data warehousing projects, there is a need to select a preferred data warehouse access tool for the most active users. A small number of users generate most of the analysis activity against the data warehouse. The data warehouse performance can be tuned to the requirements of the tool appropriate for these active users. This tool can also be used for training and demonstration of the data warehouse.

3.2 Standard reports and queries

Many users of the data warehouse need to access a set of standard reports and queries. It is desirable to produce, automatically and periodically, a set of standard reports that are required by many different users. When these users need a particular report, they can just view the report that has already been run by the data warehouse system rather than running it themselves. This facility can be particularly useful for reports that take a long time to run.

Such a facility would require report server software. It is likely that these reports can be accessed only using the client program for that system. This facility would need to work with or be part of the preferred data warehouse access tool previously mentioned. Many end-user query and analysis tools now include server software that can be run with the data warehouse to serve reports and query results. These tools are now providing a web interface to the reports. In many data warehouse systems, this report and query server becomes an essential facility.

The data warehouse users and administrators constantly need to consider any reports that are candidates to become standard reports for the data warehouse. Frequently, individual users may develop reports that can be used by other users. In addition to standard reports and queries, it is sometimes useful to share some of the advanced work done by other users. A user may produce advanced analysis that can be parameterized or otherwise adapted by other users in different parts of the same organization or even in other organizations.
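The pattern of running standard reports once and letting many users view the stored result can be sketched as a small cache. The class `ReportCache`, its method names, and the sample report are hypothetical and chosen only for illustration; they do not describe any particular report server product, and scheduling of the periodic refresh is left out.

```python
import time


class ReportCache:
    """Holds the latest result of each registered standard report."""

    def __init__(self):
        self._builders = {}  # report name -> function that produces the report
        self._results = {}   # report name -> (generated_at, result)

    def register(self, name, build):
        """Declare a report as a standard report."""
        self._builders[name] = build

    def refresh_all(self):
        """Run every standard report once, e.g. from a nightly job."""
        for name, build in self._builders.items():
            self._results[name] = (time.time(), build())

    def view(self, name):
        """Return the stored result without rerunning the report."""
        return self._results[name][1]


run_count = {"product_sales": 0}


def product_sales_report():
    # Stand-in for a long-running warehouse query.
    run_count["product_sales"] += 1
    return {"widget": 250.0, "gadget": 150.0}


cache = ReportCache()
cache.register("product_sales", product_sales_report)
cache.refresh_all()

# Two different users view the report; the query itself ran only once.
first_view = cache.view("product_sales")
second_view = cache.view("product_sales")
```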


3.3 Queries against summary tables

As introduced earlier, the summary views in the data warehouse can be the object of a large majority of analysis in a data warehouse. Simple filtering and summation from the summary views accounts for most of the analysis activity against many data warehouses. These summary views contain predefined standard business analysis.

For example, in a typical data warehouse, the product summary view may account for a very large number of queries where different users select different products and time periods for product sales and profit margin queries. These queries provide quick response and they are very simple to build. Advanced users typically attach a pivot table in their analysis tool to data warehouse summary tables for simple multi-dimensional analysis.
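Filtering and summation over a summary table, arranged as a simple pivot, might look like the sketch below. The rows, the column order, and the `pivot` helper are assumptions made for the example; an analysis tool or spreadsheet pivot table would normally do this work.

```python
from collections import defaultdict

# Hypothetical rows of a weekly product summary: (week, product, region, sales).
summary_rows = [
    ("2024-W01", "widget", "east", 100.0),
    ("2024-W01", "widget", "west", 150.0),
    ("2024-W01", "gadget", "east", 200.0),
    ("2024-W02", "widget", "east", 120.0),
    ("2024-W02", "gadget", "west", 80.0),
]


def pivot(rows, row_key, col_key, value, keep=lambda row: True):
    """Sum `value` into a {row: {column: total}} table for rows passing `keep`."""
    table = defaultdict(lambda: defaultdict(float))
    for row in rows:
        if keep(row):
            table[row_key(row)][col_key(row)] += value(row)
    return {key: dict(cols) for key, cols in table.items()}


# Product by week, all regions.
by_product_week = pivot(
    summary_rows,
    row_key=lambda r: r[1],
    col_key=lambda r: r[0],
    value=lambda r: r[3],
)

# The same pivot filtered to one region.
east_by_product_week = pivot(
    summary_rows,
    row_key=lambda r: r[1],
    col_key=lambda r: r[0],
    value=lambda r: r[3],
    keep=lambda r: r[2] == "east",
)
```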

3.4 Data mining in the detail data

Even though data mining in the detail data may account for a very small percentage of the data warehouse activity, the most useful data analysis might be done on the detail data. The reports and queries off the summary tables are adequate to answer many "what" questions in the business. The drill down into the detail data provides answers to "why" and "how" questions.

Data mining is an evolving science. A data-mining user starts with summary data and drills down into the detail data looking for arguments to prove or disprove a hypothesis. The tools for data mining are evolving rapidly to satisfy the need to understand the behavior of business units such as customers and products.


3.5 Interface with other data warehouses

The data warehouse system is likely to be interfaced with other applications that use it as the source of operational system data. A data warehouse may feed data to other data warehouses or smaller data warehouses called data marts.

The operational system interfaces with the data warehouse often become increasingly stable and powerful. As the data warehouse becomes a reliable source of data that has been consistently moved from the operational systems, many downstream applications find that a single interface with the data warehouse is much easier and more functional than multiple interfaces with the operational applications. The data warehouse can be a better single and consistent source for many kinds of data than the operational systems. It is, however, important to remember that much of the operational state information is not carried over to the data warehouse. Thus, the data warehouse cannot be the source for all operational system interfaces.

[Figure 10. Data warehouse analysis processes. The diagram shows a data warehouse system whose summarized data and detailed data serve predefined reports and queries, queries against summary data, data mining in detail data, and other applications. Key points: standard reports and queries; queries against summary tables; data mining in the detail data; interface with other applications.]


Figure 10 illustrates the analysis processes that run against a data warehouse. Although a majority of the activity against today's data warehouses is simple reporting and analysis, the sophistication of analysis at the high end continues to increase rapidly. Of course, all analysis run against the data warehouse is simpler and cheaper to run than through the old methods. This simplicity continues to be a main attraction of data warehousing systems.
