Data Mining
Topic: Data Warehouse Dimensional Modeling Concepts
Reg no.: 11005440
Roll no.: D1R06A07
Course code: CAP-618
Course Instructor: Mrs. Rajni Bhalla
Date of Submission: 16 Nov, 2012
Signature: Neha Kapoor
Contents
Dimensional Modeling Concept
Fact Table- The central linkage in Dimensional Modeling
Dimension Table- What does and should it contain
Dimensional Modeling Process
Dimensional Model
Star Schema using Star Query
Snow-Flake Schema in Dimensional Modeling
Fact Constellation Schema
How data modeling is different from an ER diagram?
Benefits of Data Modeling
Conclusion
Fact Table and Dimension Tables in a Dimensional Model Schema

Let's consider a data warehouse cube with four dimensions and two measures. For every combination of values of the four dimensions, the cube holds two measure values. For example:

[City (X), Product (Y), Channel (Z), Month] = [Sales (Quantity), Sales (Value)]

or, with concrete values:

[NY, Standard Desk-top, Mail, September 2005] = [2000 units, $15000]

In the dimensional modeling schema, the fact table stores the measure values for every combination of dimensions at the lowest level of granularity. The dimension tables hold the details of each dimension: its attributes, including all the higher-level hierarchies. Each dimension table is linked to the fact table through a dimension key, which is the primary key of the dimension table at its lowest level of granularity.
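The cube coordinate above can be pictured as a lookup from one member per dimension to a pair of measures. The following is a minimal sketch only; the names Coordinate, Measures, cube and lookup are hypothetical and chosen for illustration, and a real warehouse would hold this in a fact table rather than in memory.

```python
from typing import NamedTuple, Optional


class Coordinate(NamedTuple):
    """One member from each of the four dimensions (hypothetical names)."""
    city: str
    product: str
    channel: str
    month: str


class Measures(NamedTuple):
    """The two measures stored at each cube cell."""
    sales_quantity: int
    sales_value: float


# The single cell given in the text: [NY, Standard Desk-top, Mail, September 2005]
cube = {
    Coordinate("NY", "Standard Desk-top", "Mail", "September 2005"): Measures(2000, 15000.0),
}


def lookup(city: str, product: str, channel: str, month: str) -> Optional[Measures]:
    """Return the measures at one coordinate, or None if the cell is empty."""
    return cube.get(Coordinate(city, product, channel, month))


cell = lookup("NY", "Standard Desk-top", "Mail", "September 2005")
print(cell.sales_quantity, cell.sales_value)
```

Each key of the dictionary plays the role of the combined dimension keys of a fact table row, and each value plays the role of its measures.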
A dimension table should contain the following:

The hierarchy attributes: Consider a business hierarchy for the location dimension that runs from pin code to city to district to state to country. Each element of the hierarchy becomes an attribute of the dimension.

Textual as well as code attributes: For example, the location code as well as the location name. Both are required because different users rely on them for different reasons. A power user could be looking for the location code, whereas an end user could be looking for a more explicit header.

All parallel hierarchies: A product could have different hierarchies depending on whether the CFO or the head of sales is looking at it. Including all of them enables analysis on every hierarchy as well as across hierarchies.

Surrogate primary key (the link to the fact table): Surrogate keys are used because production keys could change or be reused. For example, a bill number could be reused after 5 years, or a part number (especially in FMCG) could be reused after a few years.

Production or source system key: This is required for auditability or for linking back to the extraction data and source systems.
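As a rough illustration of the points above, a location dimension might be laid out as follows. This is a sketch under assumptions: the table name location_dim, its column names and the sample rows are all hypothetical, and SQLite is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE location_dim (
        location_key  INTEGER PRIMARY KEY,  -- surrogate key, referenced by the fact table
        location_code TEXT NOT NULL,        -- production / source system key, kept for audit
        location_name TEXT NOT NULL,        -- textual attribute for end users
        pin_code      TEXT,                 -- hierarchy attributes, lowest to highest level
        city          TEXT,
        district      TEXT,
        state         TEXT,
        country       TEXT
    )
""")

# Hypothetical rows: the source system has reused the code LOC-001 for a new location.
rows = [
    ("LOC-001", "Downtown Store", "10001", "New York", "New York County", "NY", "USA"),
    ("LOC-001", "Harbor Store", "02108", "Boston", "Suffolk County", "MA", "USA"),
]
conn.executemany(
    "INSERT INTO location_dim "
    "(location_code, location_name, pin_code, city, district, state, country) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)",
    rows,
)

# The reused production key still maps to two distinct surrogate keys.
keys = [
    row[0]
    for row in conn.execute(
        "SELECT location_key FROM location_dim WHERE location_code = ? ORDER BY location_key",
        ("LOC-001",),
    )
]
print(keys)
```

Because the fact table would reference location_key rather than location_code, a reused production code does not merge the history of two different locations.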
The process of dimensional modeling builds on a 4-step design method that helps to ensure the usability of the dimensional model and of the data warehouse. The design starts from the actual business process that the data warehouse should cover.

Choose the business process
The first step is to describe the business process on which the model builds. This could, for instance, be a sales situation in a retail store. The business process can be described in plain text, in basic Business Process Modeling Notation (BPMN), or with other design guides such as the Unified Modeling Language (UML).

Declare the grain
After describing the business process, the next step is to declare the grain of the model. The grain is the exact description of what the dimensional model should focus on, that is, what a single fact table row represents. This could, for instance, be "an individual line item on a customer slip from a retail store". To clarify what the grain means, pick the central process and describe it in one sentence. The grain (sentence) is what you are going to build your dimensions and fact table from. You might find it necessary to go back to this step and alter the grain when you gain new information on what your model is supposed to deliver.

Identify the dimensions
The third step is to define the dimensions of the model. The dimensions must be defined within the grain declared in the second step. Dimensions are the foundation of the fact table: they supply the descriptive context for each fact. Typically dimensions are nouns like date, store, and inventory, and each dimension stores its own descriptive data. For example, the date dimension could contain data such as year, month, and weekday.

Identify the facts
After defining the dimensions, the fourth step is to identify the numeric facts that will populate each fact table row, alongside the keys to the dimensions. This step is closely tied to the business users of the system, since the facts are where they get access to the data stored in the data warehouse. Therefore most facts are numeric, additive figures such as quantity or cost.
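The four steps can be written down as a small design record before any table is created. The sketch below is illustrative only: the class DimensionalDesign, the function fact_row, the choice of date, store and product as dimensions, and the sample line item are all assumptions, not part of the method itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DimensionalDesign:
    """Records the outcome of the four design steps (hypothetical structure)."""
    business_process: str                                 # step 1
    grain: str                                            # step 2
    dimensions: List[str] = field(default_factory=list)   # step 3
    facts: List[str] = field(default_factory=list)        # step 4


design = DimensionalDesign(
    business_process="Retail store sales",
    grain="One row per individual line item on a customer slip",
)
design.dimensions += ["date", "store", "product"]
design.facts += ["quantity", "sales_amount"]


def fact_row(line_item: dict) -> dict:
    """Keep only the dimension keys and facts declared in the design."""
    keys = {f"{d}_key": line_item[f"{d}_key"] for d in design.dimensions}
    measures = {f: line_item[f] for f in design.facts}
    return {**keys, **measures}


# A hypothetical line item from a source system, with one field the model does not need.
line_item = {
    "date_key": 20121116, "store_key": 7, "product_key": 42,
    "quantity": 3, "sales_amount": 29.97, "cashier_note": "gift wrap",
}
row = fact_row(line_item)
print(row)
```

Writing the grain down first makes it easy to check that every dimension and fact in the later steps is defined at the level of one line item.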
A typical fact table contains keys and measures. For example, in the sample schema, the fact table, sales, contains the measures quantity_sold, amount, and average, and the keys time_key, item_key, branch_key, and location_key. The dimension tables are time, branch, item, and location. A star join is a primary key to foreign key join of the dimension tables to a fact table. The main advantages of star schemas are that they:
Provide a direct and intuitive mapping between the business entities being analyzed by end users and the schema design.
Provide highly optimized performance for typical star queries.
Are widely supported by a large number of business intelligence tools, which may anticipate or even require that the data warehouse schema contains dimension tables.
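To make the star join concrete, the sample schema can be sketched in SQLite. The fact table sales and its keys and measures follow the text; the _dim suffix on the dimension tables, their attribute columns and all sample rows are assumptions made for this example, and the average measure is left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, month TEXT, year INTEGER);
    CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT);
    CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT);
    CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city TEXT);

    -- Fact table: one foreign key per dimension plus the measures.
    CREATE TABLE sales (
        time_key      INTEGER REFERENCES time_dim(time_key),
        item_key      INTEGER REFERENCES item_dim(item_key),
        branch_key    INTEGER REFERENCES branch_dim(branch_key),
        location_key  INTEGER REFERENCES location_dim(location_key),
        quantity_sold INTEGER,
        amount        REAL
    );

    INSERT INTO time_dim VALUES (1, 'September', 2005), (2, 'October', 2005);
    INSERT INTO item_dim VALUES (1, 'Standard Desk-top');
    INSERT INTO branch_dim VALUES (1, 'Main branch');
    INSERT INTO location_dim VALUES (1, 'NY'), (2, 'LA');
    INSERT INTO sales VALUES (1, 1, 1, 1, 2000, 15000.0),
                             (2, 1, 1, 1, 500, 3750.0),
                             (1, 1, 1, 2, 100, 750.0);
""")

# Star query: the fact table joined to its dimensions on their primary keys,
# filtered and grouped by dimension attributes.
star_query = """
    SELECT l.city, t.month, SUM(s.quantity_sold), SUM(s.amount)
    FROM sales AS s
    JOIN time_dim     AS t ON s.time_key = t.time_key
    JOIN item_dim     AS i ON s.item_key = i.item_key
    JOIN location_dim AS l ON s.location_key = l.location_key
    WHERE i.item_name = 'Standard Desk-top' AND t.year = 2005
    GROUP BY l.city, t.month
    ORDER BY l.city, t.month
"""
result = conn.execute(star_query).fetchall()
for row in result:
    print(row)
```

Every join in the query has the same shape (fact foreign key to dimension primary key), which is what lets query tools generate such statements mechanically.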
The snowflake schema is a more complex data warehouse model than a star schema, and is a type of star schema. It is called a snowflake schema because the diagram of the schema resembles a snowflake. Snowflake schemas normalize dimensions to eliminate redundancy. That is, the dimension data is grouped into multiple tables instead of one large table. For example, a location dimension table in a star schema might be normalized into a location table and a city table in a snowflake schema. While this saves space, it increases the number of dimension tables and requires more foreign key joins. The result is more complex queries and reduced query performance. The figure above presents a graphical representation of a snowflake schema.
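The location example can be sketched as follows to show the extra join that normalization introduces. The location and city tables come from the text; their columns, the reduced sales fact table and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The location dimension is normalized: city attributes move to their own table.
    CREATE TABLE city (
        city_key  INTEGER PRIMARY KEY,
        city_name TEXT,
        state     TEXT,
        country   TEXT
    );
    CREATE TABLE location (
        location_key INTEGER PRIMARY KEY,
        street       TEXT,
        city_key     INTEGER REFERENCES city(city_key)
    );
    CREATE TABLE sales (
        location_key  INTEGER REFERENCES location(location_key),
        quantity_sold INTEGER
    );

    INSERT INTO city VALUES (1, 'New York', 'NY', 'USA');
    INSERT INTO location VALUES (10, '1 First St', 1), (11, '2 Second St', 1);
    INSERT INTO sales VALUES (10, 5), (11, 7);
""")

# Reaching the city name now takes two joins instead of one.
rows = conn.execute("""
    SELECT c.city_name, SUM(s.quantity_sold)
    FROM sales AS s
    JOIN location AS l ON s.location_key = l.location_key
    JOIN city     AS c ON l.city_key = c.city_key
    GROUP BY c.city_name
""").fetchall()
print(rows)
```

The city name is stored once instead of once per location, at the price of a second join in every query that groups or filters by city.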
The fact constellation schema is used mainly for aggregate fact tables, or where we want to split a fact table for better comprehension. A fact table is split only when we want to focus on aggregation over a few facts and dimensions.
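One way to picture this is a detailed fact table and an aggregate fact table that share a dimension. The sketch below is an assumption-laden example: the table names time_dim, item_dim, sales and monthly_sales, their columns and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT);
    CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT);

    -- Detailed fact table at the day grain.
    CREATE TABLE sales (time_key INTEGER, item_key INTEGER, quantity_sold INTEGER);

    -- Aggregate fact table at the month grain; it shares item_dim with sales.
    CREATE TABLE monthly_sales (month TEXT, item_key INTEGER, quantity_sold INTEGER);

    INSERT INTO item_dim VALUES (1, 'Standard Desk-top');
    INSERT INTO time_dim VALUES (1, '2005-09-01', '2005-09'),
                                (2, '2005-09-02', '2005-09'),
                                (3, '2005-10-01', '2005-10');
    INSERT INTO sales VALUES (1, 1, 10), (2, 1, 5), (3, 1, 7);

    -- Populate the aggregate from the detailed facts.
    INSERT INTO monthly_sales
        SELECT t.month, s.item_key, SUM(s.quantity_sold)
        FROM sales AS s
        JOIN time_dim AS t ON s.time_key = t.time_key
        GROUP BY t.month, s.item_key;
""")

monthly = conn.execute(
    "SELECT month, item_key, quantity_sold FROM monthly_sales ORDER BY month"
).fetchall()
print(monthly)
```

Queries that only need monthly totals can read the smaller aggregate table, while both fact tables stay consistent because they are joined to the same item dimension.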
An E-R diagram (used in OLTP or transactional systems) is a highly normalized model, even at the logical level, whereas a dimensional model gathers most of the attributes and hierarchies of a dimension into a single entity. An E-R diagram is a complex maze of hundreds of entities linked with each other, whereas a dimensional model is a logically grouped set of star schemas. An E-R diagram is split according to entities; a dimensional model is split according to dimensions and facts. In an E-R diagram, all attributes of an entity, textual as well as numeric, belong to the entity table, whereas in a dimensional model a 'dimension' entity has mostly textual attributes and a 'fact' entity has mostly numeric attributes.
Dimensional modeling is a better approach for a data warehouse than a standard (E-R) data model.
The dimensional model has a number of important data warehouse advantages that the ER model lacks. The first advantage of the dimensional model is that it offers a standard framework and a standard type of join. All dimensions can be thought of as symmetrically equal entry points into the fact table. The logical design can be done independently of expected query patterns. The user interfaces are symmetrical, the query strategies are symmetrical, and the SQL generated against the dimensional model is symmetrical. In other words, you will never find attributes in fact tables or facts in dimension tables. If you see a non-fact field in the fact table, you can assume that it is a key to a dimension table.
The second advantage of the dimensional model is that it is smoothly extensible to accommodate unexpected new data elements and new design decisions. All existing tables (both fact and dimension) can be changed in place by simply adding new data rows to the table. Data should not have to be reloaded. Typically, no query tool or reporting tool needs to be reprogrammed to accommodate the change. All old applications continue to run without yielding different results. You can make the following graceful changes to the design after the data warehouse is up and running:
Adding new unanticipated facts (that is, new additive numeric fields in the fact table), as long as they are consistent with the fundamental grain of the existing fact table.
Adding completely new dimensions, as long as there is a single value of that dimension defined for each existing fact record.
Adding new, unanticipated dimensional attributes.
Breaking existing dimension records down to a lower level of granularity from a certain point in time forward.
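The first two changes in this list can be tried out on a small table. This is a sketch under assumptions: the text does not say how the changes are applied, so the use of SQL ALTER TABLE here, the column discount_amount, the table promotion_dim and its 'No promotion' row are all hypothetical choices for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (item_key INTEGER, quantity_sold INTEGER);
    INSERT INTO sales VALUES (1, 10), (1, 5);
""")

# An existing report, written before the schema was extended.
old_query = "SELECT item_key, SUM(quantity_sold) FROM sales GROUP BY item_key"
before = conn.execute(old_query).fetchall()

conn.executescript("""
    -- A new additive fact at the same grain as the existing rows.
    ALTER TABLE sales ADD COLUMN discount_amount REAL DEFAULT 0.0;

    -- A new dimension; key 0 gives every existing fact row a single defined value.
    CREATE TABLE promotion_dim (promotion_key INTEGER PRIMARY KEY, promotion_name TEXT);
    INSERT INTO promotion_dim VALUES (0, 'No promotion');
    ALTER TABLE sales ADD COLUMN promotion_key INTEGER DEFAULT 0;
""")

# The old report runs unchanged and returns the same result.
after = conn.execute(old_query).fetchall()
print(before, after)
```

The default key for the new dimension is what satisfies the condition that each existing fact record has a single value of that dimension.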
The third advantage of the dimensional model is that there is a body of standard approaches for handling common modeling situations in the business world. Each of these situations has a well-understood set of alternatives that can be specifically programmed in report writers, query tools, and other user interfaces. These modeling situations include:
Slowly changing dimensions, where a 'constant' dimension such as Product or Customer actually evolves slowly and asynchronously. Dimensional modeling provides specific techniques for handling slowly changing dimensions, depending on the business environment.
Heterogeneous products, where a business such as a bank needs to:
o Track a number of different lines of business together within a single common set of attributes and facts, and at the same time
o Describe and measure the individual lines of business in highly idiosyncratic ways using incompatible measures.
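The text does not name a particular technique for slowly changing dimensions. One widely used option is the approach commonly called Type 2, which keeps history by adding a new dimension row with a new surrogate key. The sketch below assumes that technique; the function apply_type2_change, the column names and the sample customer are hypothetical.

```python
from datetime import date

# A customer dimension held as a list of rows (hypothetical columns).
customer_dim = [
    {"customer_key": 1, "customer_id": "C100", "city": "NY",
     "valid_from": date(2005, 1, 1), "valid_to": None, "is_current": True},
]


def apply_type2_change(dim, customer_id, new_city, change_date):
    """Expire the current row for customer_id and append a new version.

    Returns the surrogate key that new fact rows should reference.
    """
    current = next(
        r for r in dim if r["customer_id"] == customer_id and r["is_current"]
    )
    if current["city"] == new_city:
        return current["customer_key"]  # nothing changed, keep the same row

    # Close the old version instead of overwriting it.
    current["valid_to"] = change_date
    current["is_current"] = False

    # Add the new version under a new surrogate key.
    new_key = max(r["customer_key"] for r in dim) + 1
    dim.append({
        "customer_key": new_key, "customer_id": customer_id, "city": new_city,
        "valid_from": change_date, "valid_to": None, "is_current": True,
    })
    return new_key


new_key = apply_type2_change(customer_dim, "C100", "LA", date(2005, 9, 1))
print(new_key, len(customer_dim))
```

Facts recorded before the change keep pointing at the old row, so historical reports still show the customer in the earlier city.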