Data Warehousing & ERP Combination
by Anne Marie Smith, Ph.D. Published: April 1, 2002. Published in TDAN.com April 2002
Since the introduction of the term data warehousing in 1990, companies have explored the ways they can capture, store and manipulate data for analysis and decision support. At the same time, many companies have been instituting enterprise resource planning (ERP) software to coordinate the common functions of an enterprise. ERP software usually has a central database as its hub, allowing applications to share and reuse data more efficiently than previously permitted by separate applications. The use of ERP has led to an explosion in source data capture, and the existence of a central ERP database has created the opportunity to develop enterprise data warehouses for manipulating that data for analysis. This paper provides an overview of the issues and challenges that the intersection of these two IS concepts is creating.
Data warehouses are one of the foundations of the decision support systems of many IS operations. They serve as the storage facility for millions of transactions, formatted to allow analysis and comparison. As defined by the father of data warehousing, William H. Inmon, a data warehouse is a collection of integrated, subject-oriented databases where each unit of data is specific to some period of time. Data warehouses can contain detailed data, lightly summarized data and highly summarized data, all formatted for analysis and decision support (Building a Data Warehouse, Inmon, W. H.; Wiley, 1996). In the Data Warehouse Toolkit, Ralph Kimball gives a more succinct definition: a copy of transaction data specifically structured for query and analysis (The Data Warehouse Toolkit, Kimball, R.; Wiley, 2000). Both definitions stress the data warehouse's analysis focus, and highlight the historical nature of the data found in a data warehouse.
Enterprise Resource Planning software is a recent addition to the manufacturing and information systems that have been designed to organize the flow of data from process start to finish. This flow of information has existed since the first manufacturers traded with the first merchants, but until the advent of ERP software and the processes that accompany it, this information was largely ignored and not captured. ERP software attempts to link all internal company processes into a common set of applications that share a common database. It is the common database that allows an ERP system to serve as a source for a robust data warehouse that can support sophisticated decision support and analysis.
ERP software is divided into functional areas of operation; each functional area consists of a variety of business processes. The main functional areas common to most companies include Marketing and Sales; Production and Operations (Materials Management, Inventory, etc.); Accounting and Finance; and Human Resources. Historically, businesses have had clear divisions among these areas, and IS development was also clearly delineated, so that systems did not share data or processes and cross-functional analysis of information was not possible. Since all functional areas are interdependent, this separation was not a valid representation of a business's activities, and the divisions among the many information systems created artificial barriers that needed to be overcome. ERP software was designed to eliminate the barriers to sharing data and processes that occur when companies design and implement information systems for a single function or activity. ERP software coordinates the entire business process and stores all the captured data in a common database, accessible to all the integrated applications of the ERP suite. As explained in Concepts in Enterprise Resource Planning (Brady, Monk, Wagner; Course Technology, 2001), companies can achieve many cost savings and related benefits from the use of ERP for transaction processing and management reporting through the ERP's common database and integrated management reporting tools.
However, much of the work performed by managers and knowledge workers in the 21st century is not based on transactions or management reporting. The main activity of knowledge and management staff is analysis, and this analysis is supported by the development and use of decision support systems (DSS). The most common application of DSS in companies today is the data warehouse.
With the use of the ERP's common database and the implementation of DSS/DW user support products, companies can design a decision support/data warehouse database that allows cross-functional analysis and comparisons for better decision-making.
Since companies usually implement an ERP in addition to their current applications, integrating data from the various sources into the data warehouse becomes an issue. In fact, the existence of multiple data sources is a problem with ERP implementation regardless of whether a company plans to develop a data warehouse; this issue must be addressed and resolved at the ERP project initiation phase to avoid serious complications from multiple sources of data for analysis and transaction activity. In data warehouse development, data is usually taken from the most reliable or stable source system and moved into the data warehouse database as needed. Identification of the correct source system is essential for any data warehouse development, and is even more critical in a data warehouse that includes an ERP along with more traditional transaction systems. Integration of data from multiple sources (ERP database and others) requires rigorous attention to the metadata and business logic that populated the source data elements, so that the right source is chosen.
Another troubling issue with ERP data is the need for historical data within the enterprise's data warehouse. Traditionally, the enterprise data warehouse needs historical data (see Inmon's definition), and traditionally ERP technology does not store historical data, at least not to the extent that is needed in the enterprise data warehouse. When a large amount of historical data starts to stack up in the ERP environment, the ERP environment is usually purged, or the data is archived to a remote storage facility. For example, suppose an enterprise data warehouse needs to be loaded with five years of historical data while the ERP holds, at most, six months' worth of detail data. As long as the corporation is satisfied with collecting a historical set of data as time passes, then there is no problem with ERP as a source for data warehouse data. But when the enterprise data warehouse needs to go back in time and bring in historical data that has not been previously collected and saved by the ERP, then using the ERP environment as a primary source for the data warehouse is not a viable option.
Metadata in the ERP is another consideration when building a data warehouse in the ERP environment. As the metadata passes from the ERP to the data warehouse environment, the metadata must be moved and transformed into the format and structure required by the data warehouse infrastructure. There is a significant difference between operational metadata and DSS/DW metadata. Operational metadata is primarily for the developer and programmer. DSS metadata is primarily for the end user. The metadata that exists in the ERP application's database must be converted; such a conversion is not always easy, and it requires experienced data administrators and users to collaborate in the effort.
Mr. Inmon suggests some guidelines for using the ERP database as a source for a data warehouse. They include the existence of a solid interface that pulls data from the ERP environment to the data warehouse environment. The ERP to enterprise data warehouse interface needs to:
- be easy to use
- enable the access of ERP data
- capture the meaning of the data that is being transported into the data warehouse
- be aware of restrictions within the ERP that might exist when it comes to the accessing of ERP data
- be aware of referential integrity
- be aware of hierarchical relationships
- be aware of logically defined (implicit) relationships
- be aware of application conventions
- be aware of any structures of data supported by the ERP
- be efficient in accessing ERP data, supporting:
    o direct data movement
    o change data capture
- be supportive of timely access of ERP data
- understand the format of data
(taken from Data Warehousing and ERP, a white paper by Wm. H. Inmon, Kiva Productions, LLC, 1999)
In summary, the development of data warehouses and the emergence of ERP as factors in the information systems explosion must be addressed and resolved by experienced information systems professionals with a clear understanding of the challenges each environment poses. Integrating ERP data into a data warehouse can lead to a superior source of data for analysis and decision-making if the data is formatted for query and reporting, and if the ERP environment is coordinated with the decision support needs of the organization. To ignore the wealth of data and information that is available from an ERP is to ignore a valuable corporate resource, one that can serve as a foundation for a superior data warehouse.
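The interface requirements quoted above mention change data capture, and the earlier discussion notes that history can be accumulated in the warehouse as time passes even when the ERP is purged. The following is a minimal sketch of that idea in Python, using the standard sqlite3 module as a stand-in for both systems. The table names (erp_orders, dw_orders_history), the last_modified column, and both functions are hypothetical; a real ERP would expose its own change-tracking mechanism.

```python
import sqlite3

def extract_changes(erp, since):
    """Return ERP rows modified after the `since` watermark (ISO timestamp text)."""
    return erp.execute(
        "SELECT order_id, amount, last_modified FROM erp_orders "
        "WHERE last_modified > ? ORDER BY last_modified",
        (since,),
    ).fetchall()

def load_increment(erp, dw, since):
    """Append changed rows to the warehouse and return the new watermark."""
    rows = extract_changes(erp, since)
    # Rows are appended, never overwritten, so history builds up in the
    # warehouse even if the ERP later purges or archives its detail data.
    dw.executemany("INSERT INTO dw_orders_history VALUES (?, ?, ?)", rows)
    return rows[-1][2] if rows else since

def demo():
    erp = sqlite3.connect(":memory:")
    dw = sqlite3.connect(":memory:")
    erp.execute(
        "CREATE TABLE erp_orders (order_id INTEGER, amount REAL, last_modified TEXT)")
    dw.execute(
        "CREATE TABLE dw_orders_history (order_id INTEGER, amount REAL, modified_at TEXT)")
    erp.executemany("INSERT INTO erp_orders VALUES (?, ?, ?)", [
        (1, 100.0, "2002-01-05T10:00:00"),   # already loaded in an earlier run
        (2, 250.0, "2002-02-01T09:30:00"),   # changed since the last watermark
    ])
    watermark = load_increment(erp, dw, "2002-01-31T00:00:00")
    loaded = dw.execute("SELECT COUNT(*) FROM dw_orders_history").fetchone()[0]
    return watermark, loaded
```

The sketch relies on ISO-formatted timestamps comparing correctly as text; it does not address deletes in the source, which a timestamp watermark alone cannot detect.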
Transaction locking considerations, and transactions themselves, play very small roles in data warehouse databases. OLTP systems specialize in large volumes of data update transactions. In contrast, data warehouses specialize in rapid retrieval of information from stable data, and data updates consist primarily of periodic additions of new data.
Backup and restore strategies also differ in a data warehouse from those necessary for an OLTP system. Much of the data in a data warehouse is unchanging history and does not need repetitive backup. Backup of new data can be accomplished at the time of update, and in some situations it is feasible to do these backups from the data preparation database to minimize performance impact on the data warehouse database. Restore policies for a data warehouse might also differ from those for an OLTP, depending on how critical it is for an organization to have uninterrupted access to data warehouse data.
There are some considerations to take into account when designing the data warehouse if you are planning to use Microsoft SQL Server 2000 Analysis Services for OLAP and data mining. For more information, see Analysis Services Overview and OLAP and Data Warehouses.
Data Mart Design
There are two approaches to creating a data warehouse system for an organization. A central data warehouse can be developed and implemented first with data marts created later, or data marts can be implemented such that they make up the data warehouse when their information is joined. In either approach, design must be centralized so that all of the organization's data warehouse information is consistent and usable. Data marts that adhere to central design specifications produce reports that are consistent even though the data resides in different places. For example, a sales data mart must use the same product table arranged in the same way as the inventory data mart or summary information will be inconsistent between the two.
schemas produced by these arrangements are called star or snowflake schemas, and have been proven effective in data warehouse design. Dimensional modeling organizes information into structures that often correspond to the way analysts want to query data warehouse data. For example, the question, "What were the sales of food items in the northwest region in the third quarter of 1999?" represents the use of three dimensions (product, geography, time) to specify the information to be summarized.
A Data Warehouse Model
A simple dimensional model of sales information might include a fact table named Sales_Fact that contains one record for each line item of each sale, capturing the quantity sold, the unit cost, and the sale value. Varieties of information about a sales record might include the customer, the store where the sale occurred, the time and date of the sale, and the product sold. Each of these categories of information is organized into its own dimension table. Customer information is placed in a Customer dimension table, store information in a Store dimension table, time and date information in a Time dimension table, and product information in a Product dimension table. In a star schema, each dimension table has a single-part primary key that links to one part of the multipart primary key in the fact table. In a snowflake schema, one or more dimension tables are decomposed into multiple tables with the subordinate dimension tables joined to a primary dimension table instead of to the fact table. In most designs, star schemas are preferable to snowflake schemas because they involve fewer joins for information retrieval and are easier to manage.
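The star schema described above can be sketched as follows. This is an illustration only: it uses SQLite through Python's standard sqlite3 module rather than SQL Server 2000, and the column names, key names, and sample rows are assumptions made up for the example. Only the table names Sales_Fact, Customer, Store, Time, and Product come from the text.

```python
import sqlite3

def build_star_schema():
    """Create a small star schema with made-up sample rows (SQLite stand-in)."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE Product  (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
        CREATE TABLE Store    (store_key INTEGER PRIMARY KEY, name TEXT, region TEXT);
        CREATE TABLE Time     (time_key INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
        CREATE TABLE Customer (customer_key INTEGER PRIMARY KEY, name TEXT);
        -- Each part of the fact table's multipart key points at one dimension.
        CREATE TABLE Sales_Fact (
            product_key  INTEGER REFERENCES Product(product_key),
            store_key    INTEGER REFERENCES Store(store_key),
            time_key     INTEGER REFERENCES Time(time_key),
            customer_key INTEGER REFERENCES Customer(customer_key),
            quantity INTEGER, unit_cost REAL, sale_value REAL,
            PRIMARY KEY (product_key, store_key, time_key, customer_key)
        );
        INSERT INTO Product  VALUES (1, 'Bread', 'Food'), (2, 'Soap', 'Nonconsumable');
        INSERT INTO Store    VALUES (1, 'Store A', 'Northwest'), (2, 'Store B', 'Northeast');
        INSERT INTO Time     VALUES (1, 1999, 3), (2, 1999, 4);
        INSERT INTO Customer VALUES (1, 'Pat'), (2, 'Lee');
        INSERT INTO Sales_Fact VALUES
            (1, 1, 1, 1, 2, 1.0, 5.0),    -- food, Northwest, Q3 1999
            (1, 1, 1, 2, 1, 1.0, 2.5),    -- food, Northwest, Q3 1999
            (2, 1, 1, 1, 1, 2.0, 3.0),    -- not food
            (1, 2, 1, 1, 1, 1.0, 2.5),    -- not Northwest
            (1, 1, 2, 1, 4, 1.0, 10.0);   -- not Q3
    """)
    return db

def food_sales_northwest_q3_1999(db):
    """Answer the example question by constraining three dimensions."""
    return db.execute("""
        SELECT SUM(f.sale_value)
        FROM Sales_Fact f
        JOIN Product p ON p.product_key = f.product_key
        JOIN Store   s ON s.store_key   = f.store_key
        JOIN Time    t ON t.time_key    = f.time_key
        WHERE p.category = 'Food' AND s.region = 'Northwest'
          AND t.year = 1999 AND t.quarter = 3
    """).fetchone()[0]
```

Each dimension is reached with a single join from the fact table, which is the property the text cites when preferring star schemas over snowflake schemas.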
Fact Tables
SQL Server 2000
Each data warehouse or data mart includes one or more fact tables. Central to a star or snowflake schema, a fact table captures the data that measures the organization's business operations. A fact table might contain business sales events such as cash register transactions or the contributions and expenditures of a nonprofit organization. Fact tables usually contain large numbers of rows, sometimes in the hundreds of millions of records when they contain one or more years of history for a large organization.
A key characteristic of a fact table is that it contains numerical data (facts) that can be summarized to provide information about the history of the operation of the organization. Each fact table also includes a multipart index that contains as foreign keys the primary keys of related dimension tables, which contain the attributes of the fact records. Fact tables should not contain descriptive information or any data other than the numerical measurement fields and the index fields that relate the facts to corresponding entries in the dimension tables.
In the FoodMart 2000 sample database provided with Microsoft SQL Server 2000 Analysis Services, one fact table, sales_fact_1998, contains the following columns.
Column        Description
product_id    Foreign key for dimension table product.
time_id       Foreign key for dimension table time_by_day.
customer_id   Foreign key for dimension table customer.
promotion_id  Foreign key for dimension table promotion.
store_id      Foreign key for dimension table store.
store_sales   Currency column containing the value of the sale.
store_cost    Currency column containing the cost to the store of the sale.
unit_sales    Numeric column containing the quantity sold.
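As a rough illustration of how the columns listed above behave as additive measures, the sketch below declares a table with the same column names in SQLite (via Python's sqlite3 module) and sums the three measures for one product across a group of stores. The column types and the sample rows are assumptions for the example and are not taken from the FoodMart 2000 database.

```python
import sqlite3

def make_sales_fact():
    """Build a tiny stand-in for sales_fact_1998 with made-up rows."""
    db = sqlite3.connect(":memory:")
    db.execute("""
        CREATE TABLE sales_fact_1998 (
            product_id INTEGER, time_id INTEGER, customer_id INTEGER,
            promotion_id INTEGER, store_id INTEGER,
            store_sales REAL, store_cost REAL, unit_sales REAL
        )""")
    db.executemany("INSERT INTO sales_fact_1998 VALUES (?,?,?,?,?,?,?,?)", [
        (10, 1, 100, 0, 1, 4.00, 1.50, 2),
        (10, 2, 101, 0, 1, 2.00, 0.75, 1),
        (10, 1, 102, 0, 2, 6.00, 2.25, 3),
        (20, 1, 100, 0, 1, 3.00, 1.00, 1),
    ])
    return db

def product_totals(db, product_id, store_ids):
    """Sum the additive measures for one product over a group of stores."""
    placeholders = ",".join("?" for _ in store_ids)
    return db.execute(
        "SELECT SUM(store_sales), SUM(store_cost), SUM(unit_sales) "
        "FROM sales_fact_1998 "
        f"WHERE product_id = ? AND store_id IN ({placeholders})",
        [product_id, *store_ids],
    ).fetchone()
```

A nonadditive measure such as quantity on hand could not be totaled this way across time periods; it would need an average, a latest value, or a similar technique instead of SUM.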
In this fact table, each entry represents the sale of a specific product on a specific day to a specific customer in accordance with a specific promotion at a specific store. The business measurements captured are the value of the sale, the cost to the store, and the quantity sold.
The most useful measures to include in a fact table are numbers that are additive. Additive measures allow summary information to be obtained by adding various quantities of the measure, such as the sales of a specific item at a group of stores for a particular time period. Nonadditive measures such as inventory quantity-on-hand values can also be used in fact tables, but different summarization techniques must then be used.
Aggregation in Fact Tables
Aggregation is the process of calculating summary data from detail records. It is often tempting to reduce the size of fact tables by aggregating data into summary records when the fact table is created. However, when data is summarized in the fact table, detailed information is no longer directly available to the analyst. If detailed information is needed, the detail rows that were summarized will have to be identified and located, possibly in the source system that provided the data. Fact table data should be maintained at the finest granularity possible. Aggregating data in the fact table should only be done after considering the consequences. Mixing aggregated and detailed data in the fact table can cause issues and complications when using the data warehouse. For example, a sales order often contains several line items and may contain a discount, tax, or shipping cost that is applied to the order total instead of individual line items, yet the quantities and item identification are recorded at the line item level. Summarization queries become more complex in this situation, and tools such as Analysis Services often require the creation of special filters to deal with the mixture of granularity. There are two approaches that can be used in this situation. One approach is to allocate the order level values to line items based on value, quantity, or shipping weight. Another approach is to create two fact tables, one containing data at the line item level, the other containing the order level information. The order identification key should be carried in the detail fact table so the two tables can be related. The order table can then be used as a dimension table to the detail table, with the order-level values considered as attributes of the order level in the dimension hierarchy.
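The first approach above, allocating order-level values to line items, might look like the following sketch when the allocation is based on line value. The function name and the rounding policy (two decimal places, with any remainder placed on the last line) are assumptions for illustration; production code would more likely use decimal arithmetic than floats for currency.

```python
def allocate_by_value(line_values, order_level_amount):
    """Spread an order-level amount (e.g. shipping) across line items
    in proportion to each line's value."""
    total = sum(line_values)
    if total == 0:
        raise ValueError("cannot allocate against a zero order total")
    shares = [round(order_level_amount * v / total, 2) for v in line_values]
    # Rounding can leave a small remainder; put it on the last line so the
    # allocated shares still add up to the original order-level amount.
    remainder = order_level_amount - sum(shares)
    shares[-1] = round(shares[-1] + remainder, 2)
    return shares
```

For an order with line values 50, 30, and 20 and a shipping cost of 10, the lines receive 5, 3, and 2 respectively, so the shipping cost can then be summarized at the line item grain along with the other measures.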
Aggregation Tables
Aggregation tables are tables that contain summaries of fact table information. These tables are used to improve query performance when SQL is used as the query mechanism. OLAP technology, such as that provided by Microsoft SQL Server 2000 Analysis Services, eliminates the need for such tables. Analysis Services creates OLAP cubes that contain preaggregated summaries so that queries can be answered quickly, regardless of the level of summarization required to answer the query. It is not necessary to create aggregation tables in the data warehouse when Analysis Services is used to provide presentation services. Analysis Services creates aggregations as necessary and stores them in tables in the data warehouse database or in internal multidimensional structures. For more information, see Analysis Services Overview.
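For the case where SQL is the query mechanism and no OLAP layer is present, an aggregation table can be sketched as below. This uses SQLite through Python's sqlite3 module; the table names sales_fact and agg_sales_by_store, the columns, and the sample rows are hypothetical, and the approach shown (a summary table rebuilt from the fact table) is only one simple way such a table might be maintained.

```python
import sqlite3

def build_aggregation_table(db):
    """Materialize per-store summaries so queries can skip the detail rows."""
    db.execute("""
        CREATE TABLE agg_sales_by_store AS
        SELECT store_id,
               SUM(store_sales) AS total_sales,
               SUM(unit_sales)  AS total_units
        FROM sales_fact
        GROUP BY store_id
    """)

def demo_aggregation():
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE sales_fact "
        "(store_id INTEGER, product_id INTEGER, store_sales REAL, unit_sales INTEGER)")
    db.executemany("INSERT INTO sales_fact VALUES (?,?,?,?)",
                   [(1, 10, 4.0, 2), (1, 20, 3.0, 1), (2, 10, 6.0, 3)])
    build_aggregation_table(db)
    return db.execute(
        "SELECT store_id, total_sales, total_units "
        "FROM agg_sales_by_store ORDER BY store_id").fetchall()
```

The summary table must be refreshed whenever new fact rows are loaded, which is part of the maintenance burden that preaggregated OLAP cubes are said above to remove.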
Dimension Tables
Dimension tables contain attributes that describe fact records in the fact table. Some of these attributes provide descriptive information; others are used to specify how fact table data should be summarized to provide useful information to the analyst. Dimension tables contain hierarchies of attributes that aid in summarization. For example, a dimension containing product information would often contain a hierarchy that separates products into categories such as food, drink, and nonconsumable items, with each of these categories further subdivided a number of times until the individual product SKU is reached at the lowest level. Dimensional modeling produces dimension tables in which each table contains fact attributes that are independent of those in other dimensions. For example, a customer dimension table contains data about customers, a product dimension table contains information about products, and a store dimension table contains information about stores. Queries use attributes in dimensions to specify a view into the fact information. For example, a query might use the product, store, and time dimensions to ask the question "What was the cost of nonconsumable goods sold in the northeast region in 1999?" Subsequent queries might drill down along one or more dimensions to examine more detailed data, such as "What was the cost of kitchen products in New York City in the third quarter of 1999?" In these examples, the dimension tables are used to specify how a measure (cost) in the fact table is to be summarized. Columns in a dimension table can be used to categorize the information into hierarchical levels. For example, a dimension table for stores in the FoodMart 2000 sample database includes the following columns that specify the hierarchy levels.
Column         Description
store_country  Specifies the country or region in which the store is located. This is the country level of the hierarchy.
store_state    Specifies the state in which the store is located. This is the state level of the hierarchy.
store_city     Specifies the city or province in which the store is located. This is the city level of the hierarchy.
store_id       Specifies the individual store. This is the lowest level of the hierarchy. This field contains the primary key of the store dimension table and is used to join the dimension table to the fact table.
store_name     Specifies the name of the store. The values in this column are used to identify the store to users in a readable form.
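The hierarchy columns in the table above are what make drill-down queries possible: the same measure is grouped by a higher or lower level column. The sketch below shows this with SQLite through Python's sqlite3 module. The store rows, the simplified sales_fact table, and the cost_by_level helper are made up for the example and are not taken from FoodMart 2000.

```python
import sqlite3

# Hierarchy levels from highest to lowest, using the column names above.
LEVELS = ("store_country", "store_state", "store_city", "store_id")

def make_store_db():
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE store (store_id INTEGER PRIMARY KEY, store_name TEXT,
                            store_city TEXT, store_state TEXT, store_country TEXT);
        CREATE TABLE sales_fact (store_id INTEGER, store_cost REAL);
        INSERT INTO store VALUES (1, 'Store 1', 'Seattle',  'WA', 'USA'),
                                 (2, 'Store 2', 'Tacoma',   'WA', 'USA'),
                                 (3, 'Store 3', 'Portland', 'OR', 'USA');
        INSERT INTO sales_fact VALUES (1, 10.0), (1, 5.0), (2, 7.0), (3, 4.0);
    """)
    return db

def cost_by_level(db, level):
    """Summarize store_cost at one level of the store hierarchy."""
    if level not in LEVELS:
        # The column name is interpolated into SQL, so only accept known levels.
        raise ValueError(f"unknown hierarchy level: {level}")
    return db.execute(
        f"SELECT s.{level}, SUM(f.store_cost) "
        "FROM sales_fact f JOIN store s ON s.store_id = f.store_id "
        f"GROUP BY s.{level} ORDER BY s.{level}"
    ).fetchall()
```

Calling cost_by_level first with store_state and then with store_city corresponds to drilling down one level along the store dimension.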
Other columns not shown provide additional attribute information. For information about how dimension tables are used in OLAP cubes built using Microsoft SQL Server 2000 Analysis Services, see Dimensions.
Varieties of Dimension Tables
The preceding example illustrates a dimension table that contains a balanced hierarchy that is separated into regular levels. Other types of dimension tables contain less balanced information, such as part-breakdown structures or organization charts in which the hierarchy is represented by parent-child relationships instead of an array of levels.
Surrogate Keys
It is important that primary keys of dimension tables remain stable. It is strongly recommended that surrogate keys be created and used for primary keys for all dimension tables. Surrogate keys are keys that are maintained within the data warehouse instead of keys taken from source data systems. There are several reasons for the use of surrogate keys:
- Data tables in various source systems may use different keys for the same entity. Legacy systems that provide historical data might have used a different numbering system than a current online transaction processing system. A surrogate key uniquely identifies each entity in the dimension table regardless of its source key. A separate field can be used to contain the key used in the source system. Systems developed independently in company divisions may not use the same keys, or they may use keys that conflict with data in the systems of other divisions. This situation may not cause problems when each division independently reports summary data, but it cannot be permitted in the data warehouse where data is consolidated.
- This situation is usually less likely than others, but some systems have been known to reuse keys belonging to obsolete data. However, the key may still be in use in historical data in the data warehouse, and the same key cannot be used to identify different entities.
- Changes in organizational structures may move keys in the hierarchy. This can be a common situation. For example, if a salesperson is transferred from one region to another, the company may prefer to track two things: sales data for the salesperson with the person's original region for data prior to the transfer date, and sales data for the salesperson in the person's new region after the transfer date. To represent this organization of data, the salesperson's record must exist in two places in the sales force dimension table, which is not possible if the salesperson's company employee identification number is used as the primary key for the dimension table. A surrogate key allows the same salesperson to participate in different locations in the dimension hierarchy. In this case, the salesperson will be represented twice in the dimension table with two different surrogate keys. These surrogate keys are used to join the salesperson's records to the sets of facts appropriate to the various locations in the hierarchy occupied by the salesperson. The employee's identification number should be carried in a separate column in the table so information about the employee can be reviewed or summarized regardless of the number of times the employee's record appears in the dimension table. Dimensions that exhibit this type of change are called slowly changing dimensions. Another example of a situation that causes this type of change is the creation of a new version of a product, such as a reduced-fat version of a food item. The item will receive a new SKU or Uniform Product Code (UPC), but may retain most of the same attributes of the original item, which is still manufactured and sold. The appropriate use of surrogate keys can allow the two versions of the item to be summarized together or separately.
The implementation and management of surrogate keys is the responsibility of the data warehouse. OLTP systems are rarely affected by these situations, and the purpose of these keys is to accurately track history in the data warehouse. Surrogate keys are maintained in the data preparation area during the data transformation process.
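The salesperson transfer described above can be sketched with a small dimension table that carries both a surrogate key and the source system's employee number. The example uses SQLite through Python's sqlite3 module; the table and column names (salesperson_dim, salesperson_key, employee_id) and the sample rows are hypothetical.

```python
import sqlite3

def make_salesforce_db():
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE salesperson_dim (
            salesperson_key INTEGER PRIMARY KEY,  -- surrogate key, assigned in the warehouse
            employee_id     TEXT,                 -- key carried over from the source system
            name            TEXT,
            region          TEXT
        );
        CREATE TABLE sales_fact (salesperson_key INTEGER, sale_value REAL);
        -- One employee appears twice: before and after a transfer.
        INSERT INTO salesperson_dim VALUES (1, 'E042', 'Kim', 'East'),
                                           (2, 'E042', 'Kim', 'West');
        -- Facts join to whichever dimension row was current when the sale occurred.
        INSERT INTO sales_fact VALUES (1, 100.0), (1, 50.0), (2, 70.0);
    """)
    return db

def sales_by(db, column):
    """Summarize sales by 'region' or by 'employee_id'."""
    if column not in ("region", "employee_id"):
        raise ValueError(column)
    return db.execute(
        f"SELECT d.{column}, SUM(f.sale_value) "
        "FROM sales_fact f JOIN salesperson_dim d "
        "ON d.salesperson_key = f.salesperson_key "
        f"GROUP BY d.{column} ORDER BY d.{column}"
    ).fetchall()
```

Grouping by region keeps the pre-transfer and post-transfer sales apart, while grouping by employee_id recombines them, which is the reason the text gives for carrying the employee number in a separate column.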
Referential Integrity
Referential integrity must be maintained between all dimension tables and the fact table. Each fact record contains foreign keys that relate to primary keys in the dimension tables. Every fact record must have a related record in every dimension table used with that fact table. Missing records in a dimension table can cause facts to be ignored when the dimension table is joined to the fact table to respond to queries or for the population of OLAP cubes. Queries can return inconsistent results if records are missing in one or more dimension tables. Queries that join a defective dimension table to the fact table will exclude facts whereas queries that do not join the defective dimension table will include those facts.
Shared Dimensions
A data warehouse must provide consistent information for similar queries. One method to maintain consistency is to create dimension tables that are shared and used by all components and data marts in the data warehouse. Candidates for shared dimensions include customers, time, products, and geographical dimensions such as the store dimension in the example earlier in this topic. For example, requiring that all OLAP cubes and data marts use the same shared time dimension enforces consistency of results summarized by time.
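The inconsistency described under Referential Integrity above can be reproduced in a few lines. The sketch below uses SQLite through Python's sqlite3 module with hypothetical product and sales_fact tables, one of whose fact rows has no matching dimension row, and shows a simple check for such orphaned facts.

```python
import sqlite3

def make_db_with_orphan():
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE sales_fact (product_id INTEGER, store_sales REAL);
        INSERT INTO product VALUES (1, 'Bread');
        -- product_id 2 has no row in the dimension table
        INSERT INTO sales_fact VALUES (1, 5.0), (2, 3.0);
    """)
    return db

def total_without_join(db):
    """Total that does not touch the defective dimension: includes every fact."""
    return db.execute("SELECT SUM(store_sales) FROM sales_fact").fetchone()[0]

def total_with_join(db):
    """Total through an inner join: silently drops the orphaned fact."""
    return db.execute(
        "SELECT SUM(f.store_sales) FROM sales_fact f "
        "JOIN product p ON p.product_id = f.product_id").fetchone()[0]

def orphaned_fact_keys(db):
    """List fact keys that have no matching dimension row."""
    rows = db.execute(
        "SELECT DISTINCT f.product_id FROM sales_fact f "
        "LEFT JOIN product p ON p.product_id = f.product_id "
        "WHERE p.product_id IS NULL ORDER BY f.product_id").fetchall()
    return [r[0] for r in rows]
```

A check like orphaned_fact_keys could be run in the data preparation area before each load; declaring foreign key constraints in the database is another option, at some cost to load performance.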
Indexes
Indexes play an important role in data warehouse performance, as they do in any relational database. Every dimension table must be indexed on its primary key. Indexes on other columns such as those that identify levels in the hierarchical structure can also be useful in the performance of some specialized queries. The fact table must be indexed on the composite primary key made up of the foreign keys of the dimension tables. These are the primary indexes needed for most data warehouse applications because of the simplicity of star and snowflake schemas. Special query and reporting requirements may indicate the need for additional indexes.
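As a small illustration of the indexes described above, the sketch below creates a dimension table, a fact table with a composite primary key, and one additional index on a hierarchy level column. It uses SQLite through Python's sqlite3 module, so the index behavior shown is SQLite's and not SQL Server 2000's; the table, column, and index names are hypothetical.

```python
import sqlite3

def create_indexed_schema():
    db = sqlite3.connect(":memory:")
    db.executescript("""
        -- Dimension table keyed on its primary key.
        CREATE TABLE store (store_id INTEGER PRIMARY KEY,
                            store_city TEXT, store_state TEXT);
        -- Fact table keyed on the composite of its dimension foreign keys.
        CREATE TABLE sales_fact (
            product_id INTEGER, time_id INTEGER, store_id INTEGER,
            store_sales REAL,
            PRIMARY KEY (product_id, time_id, store_id)
        );
        -- Optional extra index on a hierarchy level used by specialized queries.
        CREATE INDEX ix_store_state ON store (store_state);
    """)
    return db

def index_names(db, table):
    """Return the names of the indexes SQLite reports for a table."""
    # PRAGMA index_list rows have the index name in the second column.
    return sorted(row[1] for row in db.execute(f"PRAGMA index_list({table})"))
```

In SQLite the composite primary key is backed by an automatically created index, so the fact table reports one index even though no CREATE INDEX statement names it.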
Using Data Warehouses
Organizations collect data in the normal course of business operations. The purpose of a data warehouse is to consolidate and organize this data so it can be analyzed and used to support business decisions. In many cases a data warehouse contains the living history of the organization. Data warehouses usually contain historical data, often collected from a variety of disparate sources such as online transaction processing (OLTP) systems, legacy systems, text files, or spreadsheets. A data warehouse combines this data, cleanses it for accuracy and consistency, and organizes it for ease and efficiency of querying. Some definitions of a data warehouse include several elements such as a data preparation area, the cleansing process, the database that holds the data warehouse data, and the tools that organize and present the data to client applications. Other definitions restrict the data warehouse to the database that contains the data warehouse data. In large data warehousing applications, data is often segmented into specialized components, called data marts, that address individual components of the organization. Some definitions consider data marts to be part of the data warehouse; other definitions consider them to be separate entities. The intended meaning of the term data warehouse is usually clear from the context in which it is used. The topics in this section generally use the broadest definition and address individual elements as components within the context of the data warehouse. Data warehousing is an advanced and complex technology. A complete treatment of data warehousing is not possible in this document, but many excellent books and training courses are available to enhance your understanding. The topics in this section discuss the elements and processes of data warehousing and identify the many powerful tools provided by Microsoft SQL Server 2000 to help you in the task of creating, using, and maintaining a data warehouse.
Topic                                      Description
SQL Server 2000 Tools for Data Warehouses  Describes tools provided by SQL Server 2000 that are commonly used in data warehouse applications.
Parts of a Data Warehouse                  Describes the elements that make up a data warehouse.
Creating a Data Warehouse                  Describes the steps in creating a data warehouse.
Using a Data Warehouse                     Describes the tools and methods used to prepare data for presentation and to provide client access to the information.
                                           Describes the processes involved in updating data in the data warehouse and modifying the presentation of information to users.
SQL Server 2000 Tools for Data Warehouses
Microsoft SQL Server 2000 provides many tools that support database applications. Some of these tools are used more often than others in data warehouse applications, and some are specifically designed to address special needs of data warehouses. The tools listed here are commonly used in data warehouse applications, although most are also applicable to other database applications. Many other tools not mentioned here can often be used to solve specific problems in data warehouse applications. General descriptions of the tools commonly used in data warehouse applications are provided here with links to more detailed information about the tools themselves. The uses of these tools in data warehouse applications are specifically discussed in other topics in this section.
Relational Database
Data warehouses use relational database technology as the foundation for their design, construction, and maintenance. The core component of SQL Server 2000 is a powerful and full-featured relational database engine. SQL Server 2000 provides many tools for design and manipulation of relational databases, regardless of the applications for which the databases are used. Information about the many relational database management tools is provided throughout the SQL Server 2000 documentation. For more information, see SQL Server 2000 Features.
Data Transformation Services
Data warehouse applications require the transformation of data from many sources into a cohesive, consistent set of data configured appropriately for use in data warehouse operations. SQL Server 2000 provides a powerful tool for such tasks, Data Transformation Services (DTS). DTS can access data from a wide variety of sources and transform it using built-in and custom transformation specifications. For more information, see DTS Overview.
Replication
Database replication is a powerful tool with many uses. Often used to distribute data and coordinate updates of distributed data in online transaction processing (OLTP) systems, replication can also be used in data warehouses. Some potential data warehouse applications of replication are the distribution of data from a central data warehouse to data marts, and the updating of data warehouse data from the data preparation area. For more information, see Replication Overview.
Analysis Services
Data warehouses collect and organize enterprise data to support organizational decision-making through analysis. SQL Server 2000 Analysis Services provides online analytical processing (OLAP) technology to organize massive amounts of data warehouse data for rapid analysis by client tools, and sophisticated data mining technology to analyze and discover information within the data warehouse data. For more information, see Analysis Services Overview.
English Query
English Query provides access to data warehouse data using English language queries such as "Show me the sales for stores in California for 1996 through 1998." English Query is a development tool for creating client applications that transform English language into the syntax of SQL to query relational databases or the syntax of Multidimensional Expressions (MDX) to query OLAP cubes. You can develop English Query models specific to your data warehouse to reduce sophisticated and complex SQL or MDX queries to simple English questions. For more information, see English Query Overview.
Meta Data Services
Many of the various tools in SQL Server 2000 store meta data in a centralized repository in the msdb system database. SQL Server 2000 Meta Data Services provides a browser for viewing this meta data and application interfaces for use in developing custom meta data applications. For more information, see Meta Data Services Overview.
There are several physical and functional elements that make up a data warehouse. The topics in this section discuss these elements.
Topic: Description
Data Marts: Describes data marts, which contain portions of data warehouse data for specialized purposes.
Relational Databases: Describes the roles and uses of relational databases in data warehouses.
Data Sources: Describes various sources of organizational data typically used in data warehouses.
Data Preparation Area: Describes the area where data extracted from data sources is prepared for use in a data warehouse.
Presentation Services: Describes the services that organize and analyze data warehouse information and make it available to client applications.
End-User Analysis: Describes the use of client applications to access and analyze information in a data warehouse.
Data Marts
SQL Server 2000
In some data warehouse implementations, a data mart is a miniature data warehouse; in others, it is just one segment of the data warehouse. Data marts are often used to provide information to functional segments of the organization. Typical examples are data marts for the sales department, the inventory and shipping department, the finance department, upper level management, and so on. Data marts can also be used to segment data warehouse data to reflect a geographically compartmentalized business in which each region is relatively autonomous. For example, a large service organization may treat regional operating centers as individual business units, each with its own data mart that contributes to the master data warehouse. Data marts are sometimes designed as complete individual data warehouses and contribute to the overall organization as a member of a distributed data warehouse. In other designs, data marts receive data from a master data warehouse through periodic updates, in which case the data mart functionality is often limited to presentation services for clients. Regardless of the functionality provided by data marts, they must be designed as components of the master data warehouse so that data organization, format, and schemas are consistent throughout the data warehouse. Inconsistent table designs, update mechanisms, or dimension hierarchies can prevent data from being reused throughout the data warehouse, and they can result in
inconsistent reports from the same data. For example, it is unlikely that summary reports produced from a finance department data mart that organizes the sales force by management reporting structure will agree with summary reports produced from a sales department data mart that organizes the same sales force by geographical region. It is not necessary to impose one view of data on all data marts to achieve consistency; it is usually possible to design consistent schemas and data formats that permit rich varieties of data views without sacrificing interoperability. For example, the use of a standard format and organization for time, customer, and product data does not preclude data marts from presenting information in the diverse perspectives of inventory, sales, or financial analysis. Data marts should be designed from the perspective that they are components of the data warehouse regardless of their individual functionality or construction. This provides consistency and usability of information throughout the organization. Microsoft SQL Server 2000 tools used for a data mart may include any of the tools used for data warehouses depending on how the data mart is designed. If the data mart is created and maintained locally and participates in the organization's data warehouse as an independent contributor, its creation and maintenance will involve all the operations of a data warehouse. If the data mart is a local access point for data distributed from a central data warehouse, only a subset of the tools may be involved. Distributing Data Warehouse Data to Data Marts If data warehouse data is maintained in a central data warehouse, the data is prepared and loaded into the data warehouse at the central site and then distributed to local data marts. SQL Server Agent and Data Transformation Services (DTS) can be used to schedule and perform data transfers, including filtering data appropriate to the data mart and updating the appropriate tables in the data mart. 
DTS packages can also be created and scheduled to update OLAP cubes in the data mart after new data is received from the central data warehouse. Some data warehouse distribution scenarios may also use replication to coordinate and maintain data mart data.
Data Sources
Data warehouses are intended to provide information to decision makers. To do so, data warehouses must gather and consolidate data from many sources in the organization into a consistent set of data that accurately reflects the organization's business operation and history. Organizations often have multiple online transaction processing (OLTP) systems to capture daily business operations. These OLTP systems are seldom designed at the same time as data warehouses. They may even be designed by different organizations, which is often the case when organizations grow through acquisitions and mergers. Database schemas and data element identification keys often vary from database to database. For example, the customer table in the OLTP of an acquired company may contain many of the same customers and products as the acquiring company but use a different identification system. Data extracted from these OLTP systems must be transformed into a common representation. Legacy systems that have been in use for many years often contain denormalized data as well as unusual data identification designs and limited query flexibility. Data critical for business analysis may even reside on individual desktop computers in personal databases and spreadsheets, especially in organizations that developed and grew without a central information technology group. Such data must also be captured into the data warehouse. Sources of data to be used in the data warehouse must be identified and techniques developed for extracting the data from them. Data Transformation Services (DTS) provides powerful tools for extracting and transforming data from diverse data sources. For more information, see DTS Overview and DTS Basics.
Data to be used in the data warehouse must be extracted from the data sources, cleansed and formatted for consistency, and transformed into the data warehouse schema. The data preparation area, sometimes called the data staging area, is a relational database into which data is extracted from the data sources, transformed into common formats, checked for consistency and referential integrity, and made ready for loading into the data warehouse database. The data preparation area and the data warehouse database can be combined in some data warehouse implementations as long as the cleansing and transformation operations do not interfere with the performance or operation of serving the end users of the data warehouse data. Performing the
preparation operations in source databases is rarely an option because of the diversity of data sources and the processing load that data preparation can impose on online transaction processing systems. The relational database used for data preparation, regardless of where it is performed, must have powerful data manipulation and transformation capabilities such as those provided by Microsoft SQL Server 2000. After the initial load of a data warehouse, the data preparation area is used on an ongoing basis to prepare new data for updating the data warehouse. In most data warehouse systems, these ongoing operations are performed on a periodic basis, often scheduled to minimize performance impact on the operational data source systems. The use of a data preparation area that is separated from the data sources and the data warehouse promotes effective data warehouse management. Attempting to transform data in the data source systems can interfere with online transaction processing (OLTP) performance, and many legacy systems do not have effective or easily implemented transformation capabilities. Reconciliation of inconsistencies in data extracted from various sources can rarely be accomplished until the data is collected in a common database, at which time data integrity errors can more easily be identified and rectified. The data preparation area should isolate raw data from the data warehouse data to preserve the integrity of the data warehouse and permit it to perform its primary function of preparing information for presentation and supporting access by clients. If the data warehouse database is used for data preparation, care should be taken to avoid introducing errors into the data warehouse data and to minimize the effect of data preparation processing on the performance of the data warehouse. Many data warehouse database operations require sophisticated queries and the processing of large amounts of data; data cleansing can interfere with these operations.
The data preparation area is a relational database that serves as a general work area for the data preparation operations. It will contain tables that relate source data keys to surrogate keys used in the data warehouse, tables of transformation data, and many temporary tables. It will also contain the processes and procedures, such as Data Transformation Services (DTS) packages, that extract data from source data systems.
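As a rough illustration of the tables that relate source data keys to surrogate keys, the following Python sketch keeps such a mapping in memory. The system names, key formats, and the surrogate_key helper are hypothetical; in a real data preparation area this mapping would live in a relational table and be maintained by the load procedures.

```python
from itertools import count

# Hypothetical key map: (source system, source key) -> warehouse surrogate key.
key_map = {}
_next_key = count(1)


def surrogate_key(source_system, source_key):
    """Return the surrogate key for a source record, assigning one on first sight."""
    natural_key = (source_system, source_key)
    if natural_key not in key_map:
        key_map[natural_key] = next(_next_key)
    return key_map[natural_key]


# Two OLTP systems identify their customers in different ways.
incoming = [
    ("billing", "C-1001"),
    ("acquired_crm", "7734"),
    ("billing", "C-1001"),  # the same source record arriving again
]
keys = [surrogate_key(system, key) for system, key in incoming]
print(keys)
```

The sketch only guarantees that a given source record always maps to the same surrogate key. Deciding that two records from different systems describe the same customer is a separate matching step that it does not attempt.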
OLAP Terminology
Let's take a simple sales database and convert it to a multidimensional database. Every time company XYZ makes a sale, the salesperson records the date, the sales amount, the product sold, and the customer that bought it. When we convert this information into a multidimensional database, the numeric values become cells, or measures. In this example, the only measure is sales amount. A multidimensional database organizes the attributes of a sale (e.g., date, product, and customer) into dimensions (e.g., a time dimension, a product dimension, and a customer dimension). The individual items within a dimension are members. For example, split pea soup is a member in the product dimension. A multidimensional database stores measures, dimensions, and their members in a cube. A cube is similar to a relational table but can have more than two dimensions. (For an illustration of a cube, see Bob Pfeiff's "OLAP: Resistance Is Futile," page 22.)

The members in the dimensions are stored in hierarchies. The original simple sales database has a date entry for each product sold. When we add that date information to a cube, we organize all the individual dates into a hierarchy. We can choose to organize the dates into a standard calendar or a fiscal calendar. A standard calendar combines all the dates in May into a time member called May. We can group May with April and June to form Quarter 2, and we can group Quarter 2 with the other three quarters to form a member called Year 1999.

The terms roll up and drill down describe actions we can perform on members of a dimension. For example, if we drill down on Quarter 3, we see Quarter 3's direct descendants in the hierarchy (i.e., July, August, and September). If we roll up August, we see August's parent in the hierarchy (i.e., Quarter 3) and the siblings of August's parent (i.e., Quarter 1, Quarter 2, and Quarter 4). We can use rolled up as a synonym for aggregated to. For example, July, August, and September are rolled up to Quarter 3.
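To make rolling up concrete, here is a minimal Python sketch that aggregates a handful of sales transactions to different levels of a standard calendar hierarchy. The transactions, amounts, and the roll_up function are invented for illustration; a real OLAP server precomputes and stores such aggregations rather than recalculating them on each request.

```python
from collections import defaultdict

# Hypothetical transactions: ((year, month, day), product, sales amount).
transactions = [
    ((1999, 5, 23), "split pea soup", 12.0),
    ((1999, 5, 23), "split pea soup", 8.0),
    ((1999, 6, 2), "split pea soup", 5.0),
    ((1999, 8, 14), "split pea soup", 20.0),
]


def quarter(month):
    """Map a month number (1-12) to its standard calendar quarter (1-4)."""
    return (month - 1) // 3 + 1


def roll_up(level):
    """Sum the sales measure for every (time member, product) pair at one level."""
    totals = defaultdict(float)
    for (year, month, day), product, amount in transactions:
        if level == "day":
            member = (year, month, day)
        elif level == "month":
            member = (year, month)
        elif level == "quarter":
            member = (year, quarter(month))
        elif level == "year":
            member = (year,)
        else:
            raise ValueError(f"unknown level: {level}")
        totals[(member, product)] += amount
    return dict(totals)


by_day = roll_up("day")
by_quarter = roll_up("quarter")
by_year = roll_up("year")
print(by_quarter)
```

Drilling down from a quarter to its months corresponds to calling roll_up("month") and keeping only the months that belong to that quarter.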
In a cube, any measure is available for any combination of members. This availability means that we can ask for the sales for product A in Quarter 3 or the sales for product A on May 23. The cube aggregates the sales values to determine the values at higher levels in the hierarchy. Even sales on May 23 is an aggregate because multiple transactions might have occurred on that day. A cube doesn't store the original transactions; it stores only the aggregations. When you create a cube, you choose how many aggregations the cube will have. The number of aggregations affects your MDX queries' performance.

Slicing and dicing is the ability to move between different combinations of dimensions when viewing data with an OLAP browser or report browser. Multidimensional analysis tools organize the data in two primary ways: in multiple dimensions and in hierarchies. Slicing and dicing means combining and recombining the dimensions to see different slices of the information. Picture slicing a three-dimensional cube of information in order to see what values are contained in the middle layer. Slicing and dicing a cube allows an end user to do the same thing with multiple dimensions.

For example, if one user at the executive level is looking at sales at the country level, another user may want the same data at the city level for a particular country. In this case, slice and dice provides the option of simply changing the dimension from "Country" to "City" while everything else remains the same.
PIVOTING, AND SLICING & DICING ANALYSIS

Slicing means taking a slice out of a cube for a selected dimension (for example, product or customer segment), a value of that dimension (for example, home furnishings), and one or more measures (sales value, sales revenue, sales units) or KPIs (sales productivity). Dicing means viewing the slices from different angles, for example, revenue for different products within a given state, or revenue for different states for a given product. One form of slicing and dicing is called pivoting.

Pivot is a term familiar from Excel. A pivot is the standard, basic look and feel of the views you create on OLAP cubes. It lets you add width and depth to your view of the data. A pivot is a two-dimensional layout of the summary data: the X and Y axes are the dimensions, and the cell at the intersection of any two dimension values contains the value of the measure.

Here is an example of how you can slice and dice through a pivot:

Step 1: Starting layout. You can have a product list on the Y axis (say 10 products) and the quarters (say four quarters) on the X axis. You can have sales value as the measure shown in the table at the intersection of a given product and a quarter. You will have a 10 x 4 matrix.

Step 2: Adding depth cross-dimensionally. Taking a step further, you can add a dimension of locations under the product to give it more depth. You can then have different locations (say 3 locations) for each product row. You will now have a 30 (3 locations for each of the 10 products) x 4 (quarters) matrix.

Step 3: Adding depth within a single dimension. You can also add another level, such as months under quarters. Now you will have a 30 x 12 matrix (3 months for each quarter). You can also specify whether you want sub-totals for every dimension. For example, you can have sub-totals for locations, products, months, and quarters.

Step 4: Pivoting on an axis. You can also pivot your view, moving the product + location combination to the X axis and the quarter + month combination to the Y axis.

Step 5: Adding width. Referring back to the starting layout, you can also add dimensions in 'width' instead of 'depth'. For example, instead of having the location dimension under the product, you can add the location dimension adjacent to the product dimension. The Y axis will then have 10 rows (for the 10 products) plus 3 rows (for the 3 locations), giving a 13 x 4 matrix.
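The matrix sizes quoted in the five steps can be checked with a short Python sketch. The product, location, and month labels are placeholders chosen for illustration; only the counts (10 products, 3 locations, 4 quarters, 3 months per quarter) come from the example.

```python
from itertools import product

products = [f"Product {i}" for i in range(1, 11)]  # 10 products
locations = ["North", "South", "West"]             # 3 locations (placeholder names)
quarters = ["Q1", "Q2", "Q3", "Q4"]
months = {
    "Q1": ["Jan", "Feb", "Mar"],
    "Q2": ["Apr", "May", "Jun"],
    "Q3": ["Jul", "Aug", "Sep"],
    "Q4": ["Oct", "Nov", "Dec"],
}


def shape(rows, cols):
    """Return (row count, column count) of a pivot layout."""
    return len(rows), len(cols)


# Step 1: products on the Y axis, quarters on the X axis.
step1 = shape(products, quarters)

# Step 2: nest locations under each product (depth across dimensions).
product_location = list(product(products, locations))
step2 = shape(product_location, quarters)

# Step 3: nest months under each quarter (depth within one dimension).
quarter_month = [(q, m) for q in quarters for m in months[q]]
step3 = shape(product_location, quarter_month)

# Step 4: pivot by swapping the two axes.
step4 = shape(quarter_month, product_location)

# Step 5: from the starting layout, place locations next to products (width).
step5 = shape(products + locations, quarters)

print(step1, step2, step3, step4, step5)
```

Nesting multiplies the number of rows or columns, while adding a dimension side by side only adds to it, which is why step 2 yields 30 rows and step 5 yields 13.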
of this session is a definition/description of 30 OLAP terms. Many of these concepts are needed by everybody who uses OLAP. Others are only important for people who are developing an OLAP system. If you're just starting out with OLAP, try browsing a cube as you're reading these definitions. You need to see how dimensions, levels, measures, etc. actually look to an OLAP end-user.

Here is my Top 30 List of OLAP Concepts. For each one I give a short-hand definition (which is highlighted) and a lengthier description.

OLAP, Browsing, Data Mining, Cube, Dimension, Level, Drilling Up, Drilling Down, Drill-Through, Hierarchy, Member, Set, Member Property (Attribute), Child Members, Slicing, Dicing, Tuple, Measure (Fact), Calculated Measures, Cell, Current Member, Dimension Table, Fact Table, Star Schema, Snowflaking, MDX, Unbalanced Hierarchy, Ragged Hierarchy, Actions, Local Cube

OLAP (The Acronym) - OnLine Analytical Processing
The logical meaning of OLAP would seem to include any computer application that is used to analyze data, but nobody really uses it like that. OLAP is one of the worst acronyms I have ever seen; when you tell someone what the letters stand for, you still don't have the slightest idea what the software actually looks like.

OLAP (Popular Definition) - A Million Spreadsheets in A Box
This is what you can tell people who have never seen OLAP before. OLAP is basically a spreadsheet tool, pretty powerful and flexible, but its basic purpose is to show spreadsheets. The key for OLAP is the ability to navigate to different views of the data. You don't have to ask your technical people every time you want to see your data in a new way. Your OLAP tool allows you to move quickly and easily from one perspective to another. Do you see some information that interests you? With OLAP you can look at that data from a more detailed (or a more general) perspective.

OLAP (Another Popular Definition) - Reporting With Something Extra
A standard paper-based executive report shows two dimensions: a cost center displayed on each column, an account displayed on each row, with dollar figures in the middle. An OLAP cube adds extra dimensions: trends over time, accounts displayed for different geographical areas or for different products. You can easily move among different perspectives and between a more detailed or a more general perspective.

OLAP (Technical Definition) - Fast, Interactive Browsing of Multidimensional, Multi-Level Data
OLAP always involves multiple dimensions, which should have multiple levels. (If you don't have levels, your OLAP browsing doesn't have much power.) OLAP has to be fast. OLAP is interactive. OLAP browsing is something done by a human analyst. As the analysts view the data, they can ask new questions of the data and receive immediate answers. If you have a data analysis application that doesn't return results of new queries (almost) immediately, you don't have OLAP. How fast is fast enough? Less than 5 seconds is OK, but less than 1 second is a lot better. The value of Analysis Services is that it makes this fast query response possible. The data is organized in multidimensional structures, with some of the summary (aggregated) views in the data calculated ahead of time.

If I had my way, we would throw the term OLAP away and replace it with something like "Interactive Spreadsheeting." Or maybe we should replace the term "OLAP Browsing" with "Data Browsing."

Data Mining - Computer Analysis of Data that is Not Interactive
You can use Analysis Services for both OLAP and data mining. Both are methods of analyzing data. The big difference is that OLAP is used interactively. A person browses OLAP data to find the significant information. Data mining is a process where the computer analyzes data and then reports the significant results back to the analyst.
(NOTE: There is at least one Analysis Services client tool that allows a user to switch back and forth between OLAP and data mining. The user can start out with data mining and then analyze the significant findings with OLAP. Insights gained from OLAP browsing can then be used for additional data mining.)

Cube - A Multidimensional Way to Look at Data
The cube is the primary OLAP structure used to view data. It is analogous to a table in the relational database world. The term cube implies three dimensions, but OLAP folk have stretched the concept a little bit. An Analysis Services cube can have 128 dimensions (though it's probably not a good idea to have that many).

Dimensions (Definition #1) - The Perspectives Used for Looking at the Data
Dimensions are the answer to the question "How do you want to see your data?" Here are some examples of dimensions: Product, Time, Store, Customer Age, Customer Income, Employee.

Dimensions (Definition #2) - The Categories You Use for the Columns and Rows in the Spreadsheet
Do you want to see how different products have done in different time periods? Put the different products on the columns and the different time periods on the rows (or the other way around). How about looking at the age of customers who are shopping at different stores? Use a different row for each age group and a separate column for each store.

Dimensions Have Levels Which Are Used for Drilling Down and Drilling Up
If you just had dimensions, OLAP browsing would be kind of dull. The value of OLAP comes in having good levels for your dimensions. Levels let you see the general view of things and the detailed view of things. If you notice that sales are higher in a particular month, you may want to drill down to see if they were higher in a particular part of the month. You also might want to drill up to a higher level, to see if the data patterns are valid on a wider scale.

Levels are Organized Into Hierarchies
We should really say that dimensions have hierarchies and it's the hierarchies that have levels. But many dimensions have only one hierarchy, and when they do, that hierarchy can be ignored, for the most part. But there are times when a single dimension has more than one hierarchy, such as when the levels of a Time dimension are organized by Calendar Year and by Fiscal Year. Note: Some OLAP tools don't handle hierarchies very well. You might not see them in the one you're using, or they might just appear as if they were separate dimensions.

Levels Have Members - the Labels for the Columns and Rows in the Spreadsheet
If you have a dimension called Time, one of the levels might be called Month, and the members of the level Month would be January, February, March, etc. If you have a dimension called Product, you could have a level called Department, and the members of the level Department could be Hardware, Plumbing, Lumber, etc.

Drilling Down to Child Members
Drilling down is what you do when you're looking at data at a particular level and you want to see more detailed data. When you drill down to see the details at the next lower level, the members you see are called child members. For example, if you drill down on Quarter 1, you will see the child members January, February, and March. A whole variety of family terms are used to describe the relationship between members within a dimension: parent, sibling, cousin, descendants, ancestors, etc.
Drill-through - Jumping from OLAP Back to the Source Data
In OLAP you can drill up and drill down to view different levels in the data. Some OLAP systems also allow you to jump back to the source data. You can select data you find interesting in the cube and drill through to the source data to view extra detail.

A Collection of Members is Called a Set
You nearly always want to see more than one column and more than one row in a spreadsheet. The group of members that is shown on the columns and the group of members that is shown on the rows are called sets. Sets can be defined in OLAP in a large number of different ways. Here are a few examples:
1. By listing each member: {January, February, March}
2. By a family relationship, such as the children of Quarter 1: [Quarter 1].Children
3. All the members of a level: Time.Month.Members
4. All the members of a dimension: Time.Members

Members Have Extra Information Called Member Properties (or Attributes)
You can display extra information about members of any level by creating member properties. An Employee level could have member properties such as Phone Number, Birth Date, and Hire Date. A City level could have a member property Population, which could be used to calculate per capita sales.

OLAP Filtering Is Called Slicing
You can use a dimension for slicing when it isn't being displayed on the columns or rows of the spreadsheet. You can slice on any member from any level of the dimension. If you slice on the member January in the Month level of the Time dimension, you will only see data from January.

Dimensions Can Be Combined For Dicing
You can put more than one dimension on the rows or on the columns. You will see one row for every combination of the members from each of the dimensions. If you put the children of Quarter 1 from the Time dimension and the members of the Product Family level of the Product dimension on the rows, you would have nine rows, as follows:
January Food
January Drink
January Non-Consumable
February Food
February Drink
February Non-Consumable
March Food
March Drink
March Non-Consumable
When you are cutting the data with one member, it's called slicing. When you cut your data with sets of members from two or more dimensions, it's called dicing, like you're cutting up vegetables from every direction.

The Members Defining a Row (or a Column) Are Called a Tuple
In geometry, a point is defined by its x-y coordinates (x, y). You can think of x and y as being members of the x-dimension and the y-dimension. In OLAP, each row or column is defined by the members that are used for that row or column. The first row in the previous example is defined by the tuple (January, Food).

A Group of Tuples Is Called a Set
The 9 tuples in the example are also called a set. Before, I said that a group of members makes a set. That's true, sort of, but it's really a group of tuples that makes a set. In the simplest case, those tuples only have a single member. To summarize members, tuples, and sets:
1. The rows and the columns of an OLAP spreadsheet are always defined by a set of tuples.
2. Each row or column is defined by a single tuple.
3. Each tuple is defined by a single member from each of one or more dimensions. In a simple case, when only a single dimension is used for a row or column, the tuple has only a single member.
(NOTE: You don't really have to understand the concepts of tuples and sets when you're first starting with OLAP, but they're very important when creating calculations, so you might as well start thinking about them. If you don't see the point of these concepts at first, don't worry. But if you start hearing about them now, you'll have an easier time when creating calculations a little later.)

The Numbers Are Called Measures (or Facts)
The numbers in the OLAP spreadsheet are called measures. When setting up OLAP cubes, these values are also often called facts. Typical measures or facts would be: Sales Dollars, Sales Count, Profit, Hours of Work.

The Numbers are Displayed in the Cells of a Cube
The spaces in the OLAP spreadsheet are called cells (like the cells in an Excel spreadsheet).

Calculated Measures - Applying Formulas to Multidimensional Data
Some measures are more interesting when you combine them with other measures or analyze them from the perspective of a particular dimension. With calculated measures you can view data across different time periods or calculate the percentage one particular product contributes to total sales.
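The relationship between members, tuples, and sets, and the idea of a calculated measure, can be sketched in a few lines of Python. This is an analogy only, not MDX: the sales figures are invented, and share_of_total is a hypothetical stand-in for a percent-of-total calculated measure.

```python
from itertools import product

# Two sets of members, one from each dimension.
quarter1_children = ["January", "February", "March"]     # children of Quarter 1
product_families = ["Food", "Drink", "Non-Consumable"]   # Product Family level

# Dicing: combining the two sets gives a set of tuples, one tuple per row.
row_set = list(product(quarter1_children, product_families))

# Invented Sales Dollars measure for each tuple.
sales = {row: 100 for row in row_set}
sales[("January", "Food")] = 200

total_sales = sum(sales.values())


def share_of_total(row):
    """Calculated measure: percentage that one tuple contributes to total sales."""
    return 100 * sales[row] / total_sales


for row in row_set:
    print(row, sales[row], share_of_total(row))
```

Each element of row_set plays the role of a tuple such as (January, Food), and the list as a whole plays the role of the set that defines the rows.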
Each Cell Has One (And Only One!) Current Member From Every Dimension
Every cell in a cube is defined by one member from every dimension in the cube. That one member is called the current member for that dimension. The current member for each dimension is determined as follows:
1. If the dimension is used on the columns or on the rows, the current member is the member for that dimension used in the tuple defining the column or row.
2. If the dimension is used for slicing, the current member is the member used in the slice.
3. If the dimension is not used for columns, rows, or slicing, the current member is the default member for the dimension. The default member is often the All level member. When the All level member is used, the practical effect is to ignore that particular dimension in the display of the data.

Dimension Information Is Stored In Dimension Tables
The information used for OLAP dimensions is contained in relational database tables which are often called dimension tables. Dimension tables have the following types of fields:
1. Key fields, used to join the dimension tables to the fact table (and to other dimension tables when snowflaking).
2. Level name fields, used to store the member names for the levels. For example, the Time dimension table could have a field called Month, which would have values such as January, February, March, etc.
3. Level Order Key fields, used to store integer values used to order the members of the levels (if necessary). For example, the Time dimension table could have a field called Month Order Key, which could have a value of 1 for January, 2 for February, 3 for March, etc.
4. Member Property fields, used to store the member property information. A Time dimension could have a field called Day Count, which would store the number of days for each month.

Information for the Measures Is Stored in the Fact Table
The information used for the measures is contained in a relational database table which is called the fact table. Whenever you make a new cube in the Analysis Manager, the first thing you're asked to do is identify the fact table. Fact tables have the following types of fields:
1. Key fields connecting the fact table to a dimension table for each of the dimensions. Every fact must be connected to one particular member from the lowest level of every one of the dimensions in the cube.
2. Measure fields, containing the numeric values used for the measures.

The Star Schema - A Multidimensional Data Structure in a Relational Database
The combination of a fact table with its dimension tables is called a star schema. The fact table is at the center of the star, while each dimension table represents a point of the star.

Snowflaking the Star Schema (The Snowflake Schema)
The basic star schema has a simple structure where all the information for each dimension is contained in a single table. Snowflaking is the process of dividing information for one dimension among two or more tables. The resulting data structure is often called a snowflake schema. (I actually prefer to call this structure a star schema that has some snowflaking, but I think my preference for this terminology is a minority opinion.)

Analysis Services comes with a sample database called FoodMart 2000. This OLAP database is based on a Microsoft Access database, which is located, by default, at C:\Program Files\Microsoft Analysis Services\Samples\foodmart 2000.mdb. Take a look at this database to see dimension tables and fact tables.
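For readers without access to FoodMart 2000, the following Python sketch builds a tiny star schema in an in-memory SQLite database and aggregates a measure by two dimension levels. The table names, column names, and data are hypothetical and are not taken from the FoodMart 2000 schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: key field, level name fields, level order key.
CREATE TABLE time_dim (
    time_id     INTEGER PRIMARY KEY,
    month       TEXT,
    month_order INTEGER,
    quarter     TEXT
);
CREATE TABLE product_dim (
    product_id     INTEGER PRIMARY KEY,
    product_name   TEXT,
    product_family TEXT
);
-- Fact table: one key field per dimension plus a measure field.
CREATE TABLE sales_fact (
    time_id       INTEGER REFERENCES time_dim(time_id),
    product_id    INTEGER REFERENCES product_dim(product_id),
    sales_dollars REAL
);
INSERT INTO time_dim VALUES (1, 'January', 1, 'Quarter 1'),
                            (2, 'February', 2, 'Quarter 1');
INSERT INTO product_dim VALUES (1, 'Split pea soup', 'Food'),
                               (2, 'Cola', 'Drink');
INSERT INTO sales_fact VALUES (1, 1, 10.0), (1, 1, 2.5), (1, 2, 4.0), (2, 1, 6.0);
""")

# Join the fact table to its dimension tables and aggregate the measure.
rows = conn.execute("""
    SELECT t.month, p.product_family, SUM(f.sales_dollars)
    FROM sales_fact AS f
    JOIN time_dim AS t ON t.time_id = f.time_id
    JOIN product_dim AS p ON p.product_id = f.product_id
    GROUP BY t.month_order, t.month, p.product_family
    ORDER BY t.month_order, p.product_family
""").fetchall()

for row in rows:
    print(row)
conn.close()
```

The month_order column plays the role of a level order key: it keeps January ahead of February, which an alphabetical sort on the month name would not do.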
Right-click on the Sales cube from the FoodMart 2000 database in the Analysis Manager and select Edit. Switch to the Schema tab. You will see a star schema that has snowflaking in the Product dimension.

MDX is the Querying Language for OLAP
A special language has been developed for OLAP. This language looks similar to SQL, but it has many unique features. MDX is used for several purposes:
1. To create calculated members: members which are not in the data, but can be derived from the existing members and measures.
2. To retrieve data from a cube.
3. To create local cube files.
4. To create calculated sets.
5. To define security rules for accessing data.
6. To create actions.
Unbalanced Hierarchy (Parent-Child Dimension) - A Dimension With Levels That Can Appear Or Disappear

Most dimensions in Analysis Services have a fixed number of levels. But a parent-child dimension can have an indefinite number of levels. This is useful for hierarchical structures that are constantly changing, such as who's supervising whom in a company.

Ragged Hierarchy - A Dimension Where Some Parents Have Missing Children (But Do Have Grandchildren)

It's fine in the United States to show a hierarchy that has the Country level, the State level, and the City level. But what happens to countries that don't have states? A ragged hierarchy allows you to skip levels where there aren't any real values.

Actions - Moving Beyond OLAP Browsing To Doing Something

OLAP browsing has often been a fairly passive activity, but Analysis Services actions allow you to accomplish things right from inside your cubes. An action could allow you to open a web page for a customer. You could find a product that needed ordering and order it on the spot. You could find a salesperson with a high level of sales and send a congratulatory e-mail. You can do it all with actions inside your cubes.

Local Cubes Are Files That Can Be Used For Disconnected Access to Cube Data

OLAP cubes are stored in an Analysis Server. Applications can browse these cubes by connecting to the Analysis Server. But cube data can also be put into local cube files, which can be used when users do not have a connection to the Analysis Server. To the users, browsing a local cube file appears to be the same thing as browsing a cube stored in an Analysis Server. I think local cubes are particularly useful in a pilot project, when not everyone is hooked up to an Analysis Server yet.
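To make the parent-child dimension from earlier in this list more concrete, here is a minimal Python sketch. The employee names and the helper functions are hypothetical, and this is not how Analysis Services stores such a dimension internally. The point is only that each row names its own parent, so the number of levels depends on the data rather than on the table design.

```python
# Hypothetical employee table: each row names its own supervisor (None = top).
supervisor_of = {
    "Ana": None,
    "Ben": "Ana",
    "Cy": "Ben",
    "Dee": "Ana",
}


def depth(employee, parents):
    """Count how many supervisors sit above an employee."""
    levels = 0
    while parents[employee] is not None:
        employee = parents[employee]
        levels += 1
    return levels


def levels_in_use(parents):
    """Number of levels the hierarchy currently needs."""
    return max(depth(e, parents) for e in parents) + 1


before = levels_in_use(supervisor_of)
supervisor_of["Eve"] = "Cy"  # a new hire under Cy adds a level, no schema change
after = levels_in_use(supervisor_of)
```

Adding one row is enough to make a new level appear, which is the behavior a fixed-level dimension cannot offer.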
The OLAP 30

Those are my top 30 OLAP concepts. If you understand those terms, you're a long way toward understanding how OLAP data is structured and used.

Finding the Measures, Dimensions, Levels, and Member Properties

The next step is finding measures, dimensions, and levels in a relational database. It's easy to build cubes if you already have your data organized into a star schema, as is the case with the FoodMart 2000 database. If you have your data in a normal relational database, it's not so obvious what should be used for measures, dimensions, levels, and member properties. There are two ways of deciding on the structures that you are going to include in your cubes:
1. Deciding on OLAP structures by looking at the spreadsheets that are currently being used.
2. Deciding on OLAP structures by looking at the elements that are available in the source data.

Both strategies are important. If the current spreadsheets are being used, the information in them should probably be included in the OLAP cubes. But the source data may have additional information which has been ignored in existing reports. When you're building an OLAP system, you have the opportunity to give your users more of the data than they've seen before.
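The first strategy can be sketched in a few lines of Python. The reading assumed here is mine rather than a rule stated in the text: row and column headings of an existing spreadsheet suggest dimensions, and the numbers in the cells suggest a measure. The report layout and the unpivot helper are hypothetical.

```python
# A hypothetical spreadsheet report: products down the side, quarters across.
report = {
    "header": ["Product", "Q1", "Q2"],
    "rows": [["Cola", 100, 120], ["Bread", 40, 55]],
}


def unpivot(sheet, row_dimension, column_dimension, measure):
    """Turn a cross-tab into one fact record per cell."""
    column_members = sheet["header"][1:]
    facts = []
    for row in sheet["rows"]:
        row_member, values = row[0], row[1:]
        for column_member, value in zip(column_members, values):
            facts.append({
                row_dimension: row_member,
                column_dimension: column_member,
                measure: value,
            })
    return facts


facts = unpivot(report, "product", "quarter", "sales")
```

Each resulting record has two dimension members and one measure value, which is the shape a fact table row would take. Comparing these records with the columns available in the source data is one way to spot information the current reports leave out.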