Datadeling
In the following star schema example, the fact table is at the center. It contains keys to every dimension table (Dealer_ID, Model_ID, Date_ID, Product_ID, Branch_ID) along with measures such as Units_Sold and Revenue.
Every dimension in a star schema is represented by exactly one dimension table.
Each dimension table contains a set of attributes.
Each dimension table is joined to the fact table using a foreign key.
The dimension tables are not joined to each other.
The fact table contains keys and measures.
The dimension tables are not normalized. For instance, in the above figure, Country_ID does not have a Country lookup table as an OLTP design would have.
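As a minimal sketch of the star layout described above, using SQLite (Python's built-in sqlite3 module); the table and column names are assumptions modelled on the example, not from a real system:

```python
import sqlite3

# Hypothetical star schema: one central fact table whose foreign keys
# point at denormalized dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_dealer (Dealer_ID INTEGER PRIMARY KEY, Dealer_Name TEXT, Country TEXT);
CREATE TABLE dim_date   (Date_ID   INTEGER PRIMARY KEY, Year INTEGER, Month INTEGER);
CREATE TABLE fact_sales (
    Dealer_ID  INTEGER REFERENCES dim_dealer(Dealer_ID),
    Date_ID    INTEGER REFERENCES dim_date(Date_ID),
    Units_Sold INTEGER,
    Revenue    REAL
);
""")
conn.execute("INSERT INTO dim_dealer VALUES (1, 'City Motors', 'USA')")
conn.execute("INSERT INTO dim_date VALUES (10, 2023, 6)")
conn.execute("INSERT INTO fact_sales VALUES (1, 10, 5, 125000.0)")

# A single direct join answers "revenue per dealer".
row = conn.execute("""
    SELECT d.Dealer_Name, SUM(f.Revenue)
    FROM fact_sales f JOIN dim_dealer d ON f.Dealer_ID = d.Dealer_ID
    GROUP BY d.Dealer_Name
""").fetchone()
print(row)  # ('City Motors', 125000.0)
```

Note how the dealer dimension keeps Country inline rather than in a lookup table, which is exactly the denormalization the star schema calls for.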
In the following Snowflake Schema example, Country is further normalized into an individual
table.
Example of Snowflake Schema
The main benefit of the snowflake schema is that it uses less disk space.
It is easier to implement new dimensions added to the schema.
Query performance is reduced because of the additional tables.
The primary challenge you will face while using the snowflake schema is the extra maintenance effort required by the additional lookup tables.
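A minimal sketch of this normalization with SQLite, reusing the illustrative names from the star example above: Country moves into its own lookup table, so the same revenue-by-country question now costs one extra join:

```python
import sqlite3

# Snowflaking the dealer dimension: Country gets its own lookup table
# (names are illustrative, not from a real system).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_country (Country_ID INTEGER PRIMARY KEY, Country_Name TEXT);
CREATE TABLE dim_dealer  (Dealer_ID INTEGER PRIMARY KEY, Dealer_Name TEXT,
                          Country_ID INTEGER REFERENCES dim_country(Country_ID));
CREATE TABLE fact_sales  (Dealer_ID INTEGER, Revenue REAL);
""")
conn.execute("INSERT INTO dim_country VALUES (1, 'USA')")
conn.execute("INSERT INTO dim_dealer VALUES (1, 'City Motors', 1)")
conn.execute("INSERT INTO fact_sales VALUES (1, 125000.0)")

# Two joins instead of one: fact -> dealer -> country.
row = conn.execute("""
    SELECT c.Country_Name, SUM(f.Revenue)
    FROM fact_sales f
    JOIN dim_dealer d  ON f.Dealer_ID = d.Dealer_ID
    JOIN dim_country c ON d.Country_ID = c.Country_ID
    GROUP BY c.Country_Name
""").fetchone()
print(row)  # ('USA', 125000.0)
```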
As you can see in the above example, there are two fact tables:
1. Revenue
2. Product
Overlapping dimensions can be found as forks in hierarchies. A fork happens when an entity acts as a parent in two different dimensional hierarchies. Fork entities are then identified as classification entities with one-to-many relationships.
Schema Types In Data Warehouse Modeling – Star & SnowFlake Schema
1. Star Schema
2. SnowFlake Schema
3. Galaxy Schema
4. Star Cluster Schema
This is the simplest and most effective schema in a data warehouse. A fact table in the center
surrounded by multiple dimension tables resembles a star in the Star Schema model.
The fact table maintains one-to-many relations with all the dimension tables. Every row in a fact
table is associated with its dimension table rows with a foreign key reference.
Due to the above reason, navigation among the tables in this model is easy for querying
aggregated data. An end-user can easily understand this structure. Hence all the Business
Intelligence (BI) tools greatly support the Star schema model.
While designing star schemas, the dimension tables are purposefully de-normalized. They are wide, with many attributes, to store the contextual data for better analysis and reporting.
Queries use very simple joins while retrieving the data, and thereby query performance is increased.
It is simple to retrieve data for reporting, at any point of time for any period.
If there are many changes in the requirements, it is not recommended to modify and reuse the existing star schema in the long run.
Data redundancy is higher because the tables are not hierarchically divided.
An end-user can request a report using Business Intelligence tools. All such requests will be
processed by creating a chain of “SELECT queries” internally. The performance of these queries
will have an impact on the report execution time.
Due to normalized dimension tables, the ETL system has to load a greater number of tables.
You may need complex joins to perform a query due to the number of tables added. Hence
query performance will be degraded.
A galaxy schema is also known as Fact Constellation Schema. In this schema, multiple fact
tables share the same dimension tables. The arrangement of fact tables and dimension tables
looks like a collection of stars in the Galaxy schema model.
This type of schema is used for sophisticated requirements and for aggregated fact tables that are too complex to be supported by the Star schema or the SnowFlake schema. This schema is difficult to maintain due to its complexity.
A SnowFlake schema with many dimension tables may need more complex joins while
querying. A star schema with fewer dimension tables may have more redundancy. Hence, a star
cluster schema came into the picture by combining the features of the above two schemas.
The star schema is the base for designing a star cluster schema: a few essential dimension tables from the star schema are snowflaked, and this in turn forms a more stable schema structure.
Star and SnowFlake are the most frequently used schemas in DW. Star schema is preferred if BI
tools allow business users to easily interact with the table structures with simple queries. The
SnowFlake schema is preferred if BI tools are more complicated for the business users to interact
directly with the table structures due to more joins and complex queries.
You can go ahead with the SnowFlake schema either if you want to save some storage space or if
your DW system has optimized tools to design this schema.
Given below are the key differences between Star schema and SnowFlake schema.
Star Schema: A single fact table is surrounded by multiple dimension tables.
SnowFlake Schema: A single fact table is surrounded by multiple hierarchies of dimension tables.
Star Schema: Queries use direct joins between fact and dimensions to fetch the data.
SnowFlake Schema: Queries use complex joins between fact and dimensions to fetch the data.
=============================*==================================
Data modelling is the process of creating a model for the data to be stored in a database. It is a conceptual representation of data objects and their associations with one another.
Logical: Defines how the system should be implemented regardless of the DBMS. This model is
typically created by data architects and business analysts. The purpose is to develop a technical
map of rules and data structures.
Physical: This data model describes how the system will be implemented using a specific
DBMS system. This model is typically created by DBA and developers. The purpose is the
actual implementation of the database.
The fact represents quantitative data. For example, the net amount which is due. A fact table
contains numerical data as well as foreign keys from dimensional tables.
There are two main types of data modelling schemas: 1) Star Schema, and 2) Snowflake Schema.
Denormalization is used when there are many joins involving the table while retrieving data. It is used to construct a data warehouse.
Dimensions represent qualitative data. For example, product, class, plan, etc. A dimension table
has textual or descriptive attributes. For example, the product category and product name are two
attributes of the product dimension table.
A factless fact table is a table having no fact measurements. It contains only the dimension keys.
OLTP vs OLAP:
OLTP is an online transactional system; OLAP is an online analysis and data retrieving process.
OLTP is characterized by a large number of short online transactions; OLAP is characterized by a large volume of data.
OLTP uses a traditional DBMS; OLAP uses a data warehouse.
Tables in an OLTP database are normalized; the tables in OLAP are not normalized.
OLTP response time is in milliseconds; OLAP response time is in seconds to minutes.
OLTP is designed for real-time business operations; OLAP is designed for the analysis of business measures by category and attributes.
A table is a collection of rows and columns, where each column has a datatype. A table contains related data in a tabular format.
A composite primary key is a primary key in which more than one table column is used.
A primary key is a column or group of columns that uniquely identifies each row in the table. The value of a primary key must not be null. Every table must contain one primary key.
A foreign key is a group of attributes used to link parent and child tables. The value of the foreign key column in the child table refers to the value of the primary key in the parent table.
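The parent/child link described above can be demonstrated with SQLite (illustrative table names; SQLite needs foreign-key enforcement switched on explicitly):

```python
import sqlite3

# Minimal parent/child sketch: a child row's foreign key must match a
# primary-key value in the parent table.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default in SQLite
conn.executescript("""
CREATE TABLE parent (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE child  (id INTEGER PRIMARY KEY,
                     parent_id INTEGER REFERENCES parent(id));
""")
conn.execute("INSERT INTO parent VALUES (1, 'Finance')")
conn.execute("INSERT INTO child VALUES (100, 1)")      # valid: parent 1 exists

try:
    conn.execute("INSERT INTO child VALUES (101, 99)") # no parent 99
    violated = False
except sqlite3.IntegrityError:
    violated = True
print(violated)  # True
```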
A data mart is a condensed version of a data warehouse and is designed for use by a specific
department, unit, or set of users in an organization. E.g., marketing sales, HR, or finance.
The types of normal forms are: 1) first normal form, 2) second normal form, 3) third normal form, 4) Boyce-Codd normal form, and 5) fourth and fifth normal forms.
Forward engineering is a technical term used to describe the process of automatically translating a logical model into a physical implementation.
PDAP is a data cube that stores data as a summary. It helps the user to analyse data quickly, and the data in PDAP is stored in a way that reporting can be done with ease.
Discrete data is finite or defined data, e.g., gender or telephone numbers. Continuous data is data that changes in a continuous and ordered manner, e.g., age.
A time series algorithm is a method to predict continuous values of data in a table over time. E.g., the performance of one employee can be used to forecast profit or influence.
Bitmap indexes are a special type of database index that uses bitmaps (bit arrays) to answer
queries by executing bitwise operations.
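A toy sketch of the idea in Python, where each distinct value of a low-cardinality column gets one bit array (here a plain integer) and a bitwise AND answers a two-predicate query; the data is invented:

```python
# Toy bitmap index: bit i of a value's bitmap is set when row i holds that value.
rows = ["M", "F", "M", "M", "F"]                    # low-cardinality column
region = ["east", "east", "west", "east", "west"]   # another such column

def build_bitmap(column):
    bitmaps = {}
    for i, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << i)
    return bitmaps

gender_bm = build_bitmap(rows)
region_bm = build_bitmap(region)

# Rows where gender = 'M' AND region = 'east', via one bitwise AND.
hits = gender_bm["M"] & region_bm["east"]
matching_rows = [i for i in range(len(rows)) if hits & (1 << i)]
print(matching_rows)  # [0, 3]
```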
Data warehousing is a process for collecting and managing data from varied sources. It provides
meaningful business enterprise insights. Data warehousing is typically used to connect and
analyse data from heterogeneous sources. It is the core of the BI system, which is built for data
analysis and reporting.
A junk dimension combines two or more related low-cardinality attributes into one dimension. These are usually Boolean or flag values.
One-to-One Relationships
One-to-Many Relationships
Many-to-One Relationships
Many-to-Many Relationships
Data mining is a multi-disciplinary skill that uses machine learning, statistics, AI, and database
technology. It is all about discovering unsuspected / previously unknown relationships amongst
the data.
44) What is the difference between logical data model and physical data model?
Different types of constraints include unique, not null, foreign key, composite key, and check constraints, etc.
It helps you to manage business data by normalizing it and defining its attributes.
Data modelling integrates the data of various systems to reduce data redundancy.
It enables you to create an efficient database design.
Data modelling helps the organization's departments to function as a team.
It facilitates to access data with ease.
Describes data needs for a single project but could integrate with other logical data
models based on the scope of the project.
Designed and developed independently from the DBMS.
Data attributes will have datatypes with exact precisions and length.
Normalization is applied to the model, typically up to 3NF.
The physical data model describes the data needs of a single project or application. It may be integrated with other physical data models based on project scope.
Data model contains relationships between tables that address cardinality and nullability
of the relationships.
Developed for a specific version of a DBMS, location, data storage, or technology to be
used in the project.
Columns should have exact datatypes, lengths assigned, and default values.
Primary and foreign keys, views, indexes, access profiles, and authorizations, etc. are
defined.
Two types of data modelling techniques are: 1) entity-relationship (E-R) Model, and 2) UML
(Unified Modelling Language).
The object-oriented database model is a collection of objects. These objects can have associated
features as well as methods.
The network model is built on the hierarchical model. It allows more than one relationship to link records, which means a record can appear in multiple sets. It is possible to construct sets of parent records and child records, and each record can belong to multiple sets, which enables you to model complex table relationships.
Hashing is a technique used to search index values and retrieve the desired data. It helps to calculate the direct location of data recorded on disk without using an index structure.
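A minimal hash-bucket sketch in Python (the bucket count and keys are invented); the key's hash computes the bucket directly, so only one bucket is probed instead of walking an index:

```python
# Toy hash table: hash(key) % NUM_BUCKETS gives the "direct location".
NUM_BUCKETS = 8
buckets = [[] for _ in range(NUM_BUCKETS)]

def put(key, value):
    buckets[hash(key) % NUM_BUCKETS].append((key, value))

def get(key):
    # Probe exactly one bucket; no tree or index traversal.
    for k, v in buckets[hash(key) % NUM_BUCKETS]:
        if k == key:
            return v
    return None

put("EMP-1001", "Alice")
put("EMP-1002", "Bob")
print(get("EMP-1002"))  # Bob
```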
A business or natural key is a field that uniquely identifies an entity. For example, client ID, employee number, email, etc.
When more than one field is used to represent a key, it is referred to as a compound key.
63) What is the difference between primary key and foreign key?
Keys help you to identify any row of data in a table. In a real-world application, a table
could contain thousands of records.
Keys ensure that you can uniquely identify a table record despite these challenges.
Keys allow you to establish and identify the relationships between tables.
Keys help you to enforce identity and integrity in the relationship.
An artificial key that aims to uniquely identify each record is called a surrogate key. These keys are used when you don't have a natural primary key. They do not lend any meaning to the data in the table, and a surrogate key is usually an integer.
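This can be sketched with SQLite, where an INTEGER PRIMARY KEY generates the surrogate values automatically; the customer table and email natural key are assumptions for illustration:

```python
import sqlite3

# Surrogate-key sketch: the database generates a meaningless unique integer;
# the natural key (email) is kept as a column but is not the primary key.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE customer (
    customer_sk INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    email TEXT UNIQUE                               -- natural key
)""")
conn.execute("INSERT INTO customer (email) VALUES ('a@example.com')")
conn.execute("INSERT INTO customer (email) VALUES ('b@example.com')")
keys = [r[0] for r in
        conn.execute("SELECT customer_sk FROM customer ORDER BY customer_sk")]
print(keys)  # [1, 2]
```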
Alternate key is a column or group of columns in a table that uniquely identifies every row in
that table. A table can have multiple choices for a primary key, but only one can be set as the
primary key. All the keys which are not primary key are called an Alternate Key.
Fourth normal form is a level of database normalization where there must be no non-trivial multivalued dependencies other than on a candidate key.
A table is in 5th normal form only if it is in 4th normal form, and it cannot be decomposed into
any number of smaller tables without loss of data.
Normalization is a database design technique that organizes tables in a manner that reduces
redundancy and dependency of data. It divides larger tables into smaller tables and links them
using relationships.
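A small Python sketch of the idea with invented data: the repeated department name is moved into its own table, and the employee rows keep only the key:

```python
# One wide table repeats the department name for every employee.
wide = [
    ("Alice", "D1", "Finance"),
    ("Bob",   "D1", "Finance"),
    ("Carol", "D2", "Sales"),
]

# Split into two smaller linked tables: departments stored once,
# employees reference them by dept_id.
departments = {dept_id: dept_name for _, dept_id, dept_name in wide}
employees = [(name, dept_id) for name, dept_id, _ in wide]

# 'Finance' is now stored once instead of twice.
print(departments)   # {'D1': 'Finance', 'D2': 'Sales'}
print(employees[0])  # ('Alice', 'D1')
```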
MySQL
Microsoft Access
Oracle
PostgreSQL
dBase
FoxPro
SQLite
IBM DB2
Microsoft SQL Server.
A Relational Database Management System is software used to store data in the form of tables. In this kind of system, data is managed and stored in rows and columns, known as tuples and attributes. RDBMS is a powerful data management system and is widely used across the world.
76) What are the advantages of data model?
The main goal of designing a data model is to make sure that the data objects offered by the functional team are represented accurately.
The data model should be detailed enough to be used for building the physical database.
The information in the data model can be used for defining the relationship between
tables, primary and foreign keys, and stored procedures.
Data Model helps businesses to communicate within and across organizations.
The data model helps to document data mappings in the ETL process.
It helps to recognize correct sources of data to populate the model.
To develop a data model, one should know the characteristics of the physically stored data.
This is a navigational system that makes application development and management complex, and it requires detailed knowledge of the underlying data.
Even small changes made in the structure require modification of the entire application.
There is no set data manipulation language in a DBMS.
The aggregate table contains aggregated data that can be calculated using functions such as: 1) Average, 2) MAX, 3) Count, 4) SUM, and 5) MIN.
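A sketch of building such an aggregate table with SQLite, applying the aggregation functions listed above to invented sales rows:

```python
import sqlite3

# Build a small aggregate table with AVG, MAX, COUNT, SUM, and MIN.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("east", 300.0), ("west", 50.0)])
conn.execute("""
CREATE TABLE sales_agg AS
SELECT region, AVG(amount) avg_amt, MAX(amount) max_amt,
       COUNT(*) cnt, SUM(amount) total, MIN(amount) min_amt
FROM sales GROUP BY region
""")
east = conn.execute("SELECT * FROM sales_agg WHERE region = 'east'").fetchone()
print(east)  # ('east', 200.0, 300.0, 2, 400.0, 100.0)
```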
A conformed dimension is a dimension which is designed in a way that can be used across many
fact tables in various areas of a data warehouse.
There are two types of Hierarchies: 1) Level based hierarchies and 2) Parent-child hierarchies.
82) What is the difference between a data mart and data warehouse?
Data mart vs Data warehouse:
A data mart focuses on a single subject area of business; a data warehouse focuses on multiple areas of the business.
A data mart is used to make tactical decisions for business growth; a data warehouse helps business owners take strategic decisions.
A data mart follows the bottom-up model; a data warehouse follows a top-down model.
A data mart's data comes from one data source; a data warehouse's data comes from more than one heterogeneous data source.
XMLA (XML for Analysis) is considered the standard for accessing data in Online Analytical Processing (OLAP).
A junk dimension helps to store low-cardinality flag and indicator data that does not fit properly into the main tables of the schema.
The situation where a secondary node selects its target using ping time, or where the closest node is a secondary, is called chained data replication.
A virtual data warehouse gives a collective view of the completed data. A virtual data warehouse does not have historical data. It is considered a logical data model containing metadata.
A snapshot is a complete visualization of data at the time the data extraction process begins.
The ability of a system to extract, cleanse, and transfer data in two directions is called a bidirectional extract.
==================================*================================
Answer: Data Modelling is the diagrammatic representation showing how the entities are
related to each other. It is the initial step towards database design. We first create the conceptual
model, then the logical model and finally move to the physical model.
Generally, the data models are created in the data analysis & design phase of the software development life cycle.
Answer: There are three types of data models – conceptual, logical and physical. The level of
complexity and detail increases from conceptual to logical to a physical data model.
The conceptual model shows a very basic high level of design while the physical data model
shows a very detailed view of design.
Conceptual Model will be just portraying entity names and entity relationships. Figure 1
shown in the later part of this article depicts a conceptual model.
Logical Model will be showing up entity names, entity relationships, attributes, primary
keys and foreign keys in each entity. Figure 2 shown inside question#4 in this article
depicts a logical model.
Physical Data Model will be showing primary keys, foreign keys, table names, column
names and column data types. This view actually elaborates how the model will be
actually implemented in the database.
Q #3) Throw some light on your experience in Data Modelling with respect to projects you
have worked on till date?
Note: This was the very first question in one of my Data Modelling interviews. So, before you
step into the interview discussion, you should have a very clear picture of how data modeling fits
into the assignments you have worked upon.
Answer: I have worked on a project for a health insurance provider company where we had interfaces built in Informatica that transform and process the data fetched from the Facets database and send useful information out to vendors.
Note: Facets is an end-to-end solution to manage all the information for the health care industry. The Facets database in my project was created with SQL Server 2012.
We had different entities that were linked together. These entities were subscriber, member,
healthcare provider, claim, bill, enrollment, group, eligibility, plan/product, commission,
capitation, etc.
Below is the conceptual data model showing what the project looked like at a high level.
Figure 1:
Each of the data entities has its own data attributes. For Example, a data attribute of the provider will be the provider identification number, a few data attributes of the membership will be subscriber ID and member ID, one of the data attributes of a claim will be the claim ID, each healthcare product or plan will have a unique product ID, and so on.
Q #4) What are the different design schemas in Data Modelling? Explain with the
example?
Star Schema
Snowflake Schema
The simplest of the schemas is star schema where we have a fact table in the center that
references multiple dimension tables around it. All the dimension tables are connected to the fact
table. The primary key in all dimension tables acts as a foreign key in the fact table.
The ER diagram (see Figure 2) of this schema resembles the shape of a star and that is why this
schema is named as a star schema.
Figure 2:
The star schema is quite simple, flexible and it is in de-normalized form.
In a snowflake schema, the level of normalization increases. The fact table here remains the
same as in the star schema. However, the dimension tables are normalized. Due to several layers
of dimension tables, it looks like a snowflake, and thus it is named as snowflake schema.
Figure 3:
Q #5) Which scheme did you use in your project & why?
Since a star schema is in de-normalized form, you require fewer joins for a query. The query is simple and runs faster in a star schema. Coming to the snowflake schema, since it is in normalized form, it will require more joins compared to a star schema; the query will be complex and execution will be slower than in a star schema.
Another significant difference between these two schemas is that snowflake schema does not
contain redundant data and thus it is easy to maintain. On the contrary, star schema has a high
level of redundancy and thus it is difficult to maintain.
Now, which one to choose for your project? If the purpose of your project is to do more of
dimension analysis, you should go for snowflake schema. For Example, if you need to find out
that “how many subscribers are tied to a particular plan which is currently active?” – go with
the snowflake model.
If the purpose of your project is to do more of a metrics analysis, you should go with a star
schema. For Example, if you need to find out that “what is the claim amount paid to a
particular subscriber?” – go with a star schema.
Answer: Dimensions represent qualitative data. For Example, plan, product, class are all
dimensions.
A dimension table contains descriptive or textual attributes. For Example, the product category
& product name are the attributes of the product dimension.
For Example, the net amount due is a fact. A fact table contains numerical data and foreign keys
from related dimensional tables. An example of the fact table can be seen from Figure 2 shown
above.
Q #9) What are the different types of dimensions you have come across? Explain each of
them in detail with an example?
a) Conformed Dimension: A conformed dimension is one that is shared across multiple fact tables. For Example, if the subscriber dimension is connected to two fact tables (billing and claim), then the subscriber dimension would be treated as a conformed dimension.
b) Junk Dimension: It is a dimension table comprising of attributes that don’t have a place in
the fact table or in any of the current dimension tables. Generally, these are properties like flags
or indicators.
For Example, it can be a member eligibility flag set as ‘Y’ or ‘N’, or any other indicator set as true/false, any specific comments, etc. If we keep all such indicator attributes in the fact table then its size gets increased. So, we combine all such attributes and put them in a single dimension table called a junk dimension, having unique junk IDs with a possible combination of all the indicator values.
c) Role-Playing Dimension: These are the dimensions that are utilized for multiple purposes in
the same database.
For Example, a date dimension can be used for “Date of Claim”, “Billing date” or “Plan Term
date”. So, such a dimension will be called a Role-playing dimension. The primary key of the
Date dimension will be associated with multiple foreign keys in the fact table.
d) Slowly Changing Dimension (SCD): These are the most important amongst all the dimensions. These are the dimensions whose attribute values vary with time. Below are the various types of SCDs:
Type-0: These are the dimensions where attribute value remains steady with time. For
Example, Subscriber’s DOB is a type-0 SCD because it will always remain the same
irrespective of the time.
Type-1: These are the dimensions where the previous value of the attribute is replaced by
the current value. No history is maintained in the Type-1 dimension. For
Example, Subscriber’s address (where the business requires to keep the only current
address of subscriber) can be a Type-1 dimension.
Type-2: These are the dimensions where unlimited history is preserved. For Example,
Subscriber’s address (where the business requires to keep a record of all the previous
addresses of the subscriber). In this case, multiple rows for a subscriber will be inserted
in the table with his/her different addresses. There will be some column(s) that will
identify the current address. For Example, ‘Start date’ and ‘End date’. The row where
‘End date’ value will be blank would contain the subscriber’s current address and all
other rows will be having previous addresses of the subscriber.
Type-3: These are the type of dimensions where limited history is preserved. And we use
an additional column to maintain the history. For Example, Subscriber’s address (where
the business requires to keep a record of current & just one previous address). In this
case, we can split the ‘address’ column into two different columns: ‘current address’ and ‘previous address’. So, instead of having multiple rows, we will have just one row showing the current as well as the previous address of the subscriber.
Type-4: In this type of dimension, the historical data is preserved in a separate table. The
main dimension table holds only the current data. For Example, the main dimension
table will have only one row per subscriber holding its current address. All other previous
addresses of the subscriber will be kept in the separate history table. This type of
dimension is hardly ever used.
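The Type-2 mechanics described above (close the old row, insert the new one, find the current row by its open end date) can be sketched in Python with invented subscriber data:

```python
from datetime import date

# Type-2 SCD sketch: each address change closes the current row and
# appends a new open-ended row; end_date None marks the current row.
dim_subscriber = [
    # (subscriber_id, address, start_date, end_date)
    ("S1", "12 Oak St", date(2020, 1, 1), None),
]

def change_address(rows, sub_id, new_address, change_date):
    for i, (sid, addr, start, end) in enumerate(rows):
        if sid == sub_id and end is None:          # close the current row
            rows[i] = (sid, addr, start, change_date)
    rows.append((sub_id, new_address, change_date, None))

change_address(dim_subscriber, "S1", "98 Elm Ave", date(2023, 5, 1))

current = [r for r in dim_subscriber if r[3] is None]
print(len(dim_subscriber), current[0][1])  # 2 98 Elm Ave
```

Both the old and the new address are preserved, and the row with a blank end date is always the subscriber's current address, just as described for Type-2.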
e) Degenerate Dimension: Instead of keeping such an attribute separately in a dimension table and putting an additional join, we put this attribute in the fact table directly as a key. Since it does not have its own dimension table, it can never act as a foreign key in the fact table.
Q #10) Give your idea regarding factless fact? And why do we use it?
Answer: Factless fact table is a fact table that contains no fact measure in it. It has only the
dimension keys in it.
At times, certain situations may arise in the business where you need to have a factless fact table.
For Example, suppose you are maintaining an employee attendance record system, you can
have a factless fact table having three keys.
Employee_ID
Department_ID
Time_ID
You can see that the above table does not contain any measure. Now, if you want to answer the below question, you can do so easily using the above single factless fact table rather than having two separate fact tables:
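A minimal SQLite sketch of this attendance table, using the three keys listed above and invented rows; a plain COUNT answers questions such as how many days an employee attended:

```python
import sqlite3

# Factless fact table: only dimension keys, no numeric measure.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE fact_attendance
                (Employee_ID TEXT, Department_ID TEXT, Time_ID INTEGER)""")
conn.executemany("INSERT INTO fact_attendance VALUES (?, ?, ?)", [
    ("E1", "D1", 20230601),
    ("E1", "D1", 20230602),
    ("E2", "D1", 20230601),
])

# Counting rows answers "how many days did E1 attend in D1?"
days = conn.execute("""SELECT COUNT(*) FROM fact_attendance
                       WHERE Employee_ID = 'E1' AND Department_ID = 'D1'
                    """).fetchone()[0]
print(days)  # 2
```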
Answer: OLTP stands for the Online Transaction Processing System & OLAP stands for the
Online Analytical Processing System. OLTP maintains the transactional data of the business &
is highly normalized generally. On the contrary, OLAP is for analysis and reporting purposes &
it is in de-normalized form.
This difference between OLAP and OLTP also guides your choice of schema design. If your system is OLTP, you should go with a star schema design, and if your system is OLAP, you should go with a snowflake schema.
Answer: Data marts are, for the most part, intended for a single branch of business. They are designed for individual departments.
For Example, I used to work for a health insurance provider company that had different
departments in it like Finance, Reporting, Sales and so forth.
We had a data warehouse that was holding the information pertaining to all these departments, and then we had a few data marts built on top of this data warehouse. These data marts were specific to each department. In simple words, you can say that a data mart is a subset of a data warehouse.
Non-additive measures are the ones on top of which no aggregation function can be applied. For
Example, a ratio or a percentage column; a flag or an indicator column present in fact table
holding values like Y/N, etc. is a non-additive measure.
Semi- additive measures are the ones on top of which some (but not all) aggregation functions
can be applied. For Example, fee rate or account balance.
Additive measures are the ones on top of which all aggregation functions can be applied. For
Example, units purchased.
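A small numeric illustration with invented rows: the units column sums meaningfully across rows, while summing the ratio column directly gives a wrong figure, so the ratio must be recomputed from its additive parts:

```python
# Additive vs non-additive measures on invented fact rows.
rows = [
    # (units_purchased, returns, return_ratio)
    (100, 10, 0.10),
    (300, 15, 0.05),
]

total_units = sum(r[0] for r in rows)   # additive: 400, meaningful
naive_ratio = sum(r[2] for r in rows)   # non-additive: 0.15, meaningless
# The correct overall ratio comes from the additive components:
true_ratio = sum(r[1] for r in rows) / total_units
print(total_units, round(true_ratio, 4))  # 400 0.0625
```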
Answer: Surrogate Key is a unique identifier or a system-generated sequence number key that
can act as a primary key. It can be a column or a combination of columns. Unlike a primary key,
it is not picked up from the existing application data fields.
Answer: It is not mandatory for a database to be in 3NF. However, if your purpose is easy maintenance of data, less redundancy, and efficient access, then you should go with a normalized database.
Q #16) Have you ever came across the scenario of recursive relationships? If yes, how did
you handle it?
Answer: A recursive relationship occurs in the case where an entity is related to itself. Yes, I
have come across such a scenario.
Talking about the health care domain, it is a possibility that a health care provider (say, a doctor)
is a patient to any other health care provider. Because, if the doctor himself falls ill and needs
surgery, he will have to visit some other doctor for getting the surgical treatment.
So, in this case, the entity health care provider is related to itself. A foreign key referencing the health care provider's number will have to be present in each member's (patient's) record.
Q #17) List out a few common mistakes encountered during Data Modelling?
Building massive data models: Large data models are likely to have more design faults. Try to restrict your data model to no more than 200 tables.
Lack of purpose: If you do not know what your business solution is intended for, you might come up with an incorrect data model. So, having clarity on the business purpose is very important to come up with the right data model.
Inappropriate use of surrogate keys: A surrogate key should not be used unnecessarily. Use a surrogate key only when the natural key cannot serve the purpose of a primary key.
Unnecessary de-normalization: Don’t denormalize until and unless you have a solid &
clear business reason to do so because de-normalization creates redundant data which is
difficult to maintain.
Q #18) What is the number of child tables that can be created out from a single parent
table?
Answer: The number of child tables that can be created out of the single parent table is equal to
the number of fields/columns in the parent table that are non-keys.
Q #19) Employee health details are hidden from his employer by the health care provider.
Which level of data hiding is this? Conceptual, physical or external?
Answer: Generally, the fact table is in normalized form and the dimension table is in de-
normalized form.
Q #21) What particulars you would need to come up with a conceptual model in a health
care domain project?
Answer: For a health care project, below details would suffice the requirement to design a basic
conceptual model
Q #22) Tricky one: If a unique constraint is applied to a column then will it throw an error
if you try to insert two nulls into it?
Answer: No, it will not throw an error in this case because one null value is not considered equal to another null value. So, more than one null can be inserted in the column without any error.
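This behaviour can be checked directly with SQLite (note that some databases, such as SQL Server, treat NULLs in a UNIQUE column differently and allow only one):

```python
import sqlite3

# Two NULLs go into a UNIQUE column without error; a duplicate
# non-NULL value is rejected.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (code TEXT UNIQUE)")
conn.execute("INSERT INTO t VALUES (NULL)")
conn.execute("INSERT INTO t VALUES (NULL)")     # second NULL: allowed
nulls = conn.execute("SELECT COUNT(*) FROM t WHERE code IS NULL").fetchone()[0]

conn.execute("INSERT INTO t VALUES ('A')")
try:
    conn.execute("INSERT INTO t VALUES ('A')")  # duplicate non-NULL: rejected
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(nulls, duplicate_rejected)  # 2 True
```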
Answer: Yes, let’s say we have these different entities – vehicle, car, bike, economy car, family
car, sports car.
Here, a vehicle is a super-type entity. Car and bike are its sub-type entities. Furthermore, economy cars, sports cars, and family cars are sub-type entities of their super-type entity, car.
A super-type entity is the one that is at a higher level. Sub-type entities are ones that are grouped
together on the basis of certain characteristics. For Example, all bikes are two-wheelers and all
cars are four-wheelers. And since both are vehicles, so their super-type entity is ‘vehicle’.
Answer: Metadata is data about data. It tells you what kind of data is actually stored in the
system, what is its purpose and for whom it is intended.
===========================*===============================================