Oral Questions LP II
Star Schema
• A Star Schema in a data warehouse has one fact table at the center of the star and a number of associated dimension tables. It is known as a star schema because its structure resembles a star. The star schema is the simplest type of data warehouse schema. It is also known as a Star Join Schema and is optimized for querying large data sets.
• Consider, for example, the sales data of a company with respect to four dimensions, namely time, item, branch, and location.
• There is a fact table at the center. It contains the keys to each of the four dimension tables.
• The fact table also contains the measure attributes, namely dollars sold and units sold.
Example: Each dimension has only one dimension table, and each table holds a set of attributes. For example, the location dimension table contains the attribute set {location_key, street, city, province_or_state, country}. This constraint may cause data redundancy. For example, "Vancouver" and "Victoria" are both cities in the Canadian province of British Columbia, so their entries repeat the same values in the province_or_state and country attributes.
Snowflake Schema
A Snowflake Schema in a data warehouse is a logical arrangement of tables in a multidimensional database such that the ER diagram resembles a snowflake shape. A snowflake schema is an extension of a star schema in which the dimension tables are normalized, splitting their data into additional tables.
Unlike a star schema, the dimension tables in a snowflake schema are normalized. For example, the item dimension table of the star schema is normalized and split into two dimension tables, namely the item and supplier tables.
• Now the item dimension table contains the attributes item_key, item_name, type, brand, and supplier_key.
• The supplier_key is linked to the supplier dimension table, which contains the attributes supplier_key and supplier_type.
• Develop and implement databases, data collection systems, data analytics and other strategies that optimize statistical efficiency and quality
• Acquire data from primary or secondary data sources and maintain databases/data systems
• Identify, analyze, and interpret trends or patterns in complex data sets
• Filter and “clean” data by reviewing computer reports, printouts, and performance indicators to locate and correct code problems
• Work with management to prioritize business and information needs
• Locate and define new process improvement opportunities
• The KNN algorithm is based on feature similarity, whereas K-means divides objects into clusters (such that each object is in exactly one cluster, not several).
Let’s understand the difference in a better way using the example of a crocodile and an alligator.
KNN Algorithm:
You can differentiate between a crocodile and an alligator based on their characteristics. If the features of the unknown animal are more like those of a crocodile, then it is a crocodile!
K-means clustering:
K-means performs a division of objects into clusters that are “similar” among themselves and “dissimilar” to the objects belonging to other clusters.
Example
The Euclidean distance between (−1, 2, 3) and (4, 0, −3) is
√((−1 − 4)² + (2 − 0)² + (3 − (−3))²) = √(25 + 4 + 36) = √65 ≈ 8.06
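A minimal Python sketch of this calculation (the hand-rolled euclidean function is for illustration; math.dist in Python 3.8+ computes the same thing):

    import math

    def euclidean(p, q):
        # square root of the sum of squared coordinate differences
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    print(euclidean((-1, 2, 3), (4, 0, -3)))  # sqrt(25 + 4 + 36) = sqrt(65) ≈ 8.06

The same distance function is what KNN uses to find the nearest neighbours and what K-means uses to assign points to the closest cluster centroid.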
• Support (s) –
The number of transactions that include the items in both the {X} and {Y} parts of the rule, as a percentage of the total number of transactions. It is a measure of how frequently the collection of items occurs together, as a percentage of all transactions.
• Confidence (c) –
The ratio of the number of transactions that include all items in {A} as well as all items in {B} to the number of transactions that include all items in {A}; that is, Confidence(A ⇒ B) = Support(A ∪ B) / Support(A).
Example
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
In this example, A = Milk and B = Diaper.
• To calculate support, count how many times A and B occur together in the dataset, i.e. how many transactions contain both Milk and Diaper:
count(A ∪ B) = 3
• Then divide by the total number of transactions:
Total = 5
So, Support = 3/5 = 0.6 = 60%
Confidence = Support(A ∪ B) / Support(A)
A and B together (Milk with Diaper) occur in the dataset 3 times, so Support(A ∪ B) = 3/5 = 0.6.
A alone (Milk) occurs in 4 of the 5 transactions, so Support(A) = 4/5 = 0.8.
Confidence = 0.6 / 0.8 = 0.75 = 75%
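A short Python sketch that reproduces these numbers from the transaction table above (the list literal copies the table; the two helper functions are generic):

    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Diaper", "Beer", "Eggs"},
        {"Milk", "Diaper", "Beer", "Coke"},
        {"Bread", "Milk", "Diaper", "Beer"},
        {"Bread", "Milk", "Diaper", "Coke"},
    ]

    def support(itemset):
        # fraction of transactions that contain every item in the itemset
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(antecedent, consequent):
        # support of the combined itemset divided by support of the antecedent
        return support(antecedent | consequent) / support(antecedent)

    print(support({"Milk", "Diaper"}))       # 3/5 = 0.6
    print(confidence({"Milk"}, {"Diaper"}))  # 0.6 / 0.8 = 0.75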
The Apriori algorithm, proposed by R. Agrawal and R. Srikant, was among the first algorithms for frequent itemset mining. It uses two steps, “join” and “prune”, to reduce the search space.
TID     Items
T-1000  M, O, N, K, E, Y
T-1001  D, O, N, K, E, Y
T-1002  M, A, K, E
T-1003  M, U, C, K, Y
T-1004  C, O, O, K, E
3.9.1 Candidate Table C1 (count of each item):
Item  Count
M     3
O     4
N     2
E     4
Y     3
D     1
A     1
U     1
C     2
K     5
3.9.2 Frequent Itemset Table L1 (compare C1 with min support = 3):
Item  Count
M     3
O     4
K     5
E     4
Y     3
3.9.3 Candidate Table C2:
Itemset  Count
MO       1
MK       3
ME       2
MY       2
OK       3
OE       3
OY       2
KE       4
KY       3
EY       2
• Now again compare C2 with min support = 3.
3.9.4 Frequent Itemset Table L2:
Itemset  Count
MK       3
OK       3
OE       3
KE       4
KY       3
3.9.5 Candidate Table C3:
Itemset  Count
M,K,O    1
M,K,E    2
M,K,Y    2
O,K,E    3
O,K,Y    2
K,E,Y    2
3.9.6 Comparing C3 with min support = 3 leaves {O,K,E} (count 3) as the only frequent 3-itemset, L3.
3.9.7 Now create association rules with support and confidence for {O,K,E}
• Confidence(X ⇒ Y) = Support(X ∪ Y) / Support(X)
Rule     Support  Confidence
O^K ⇒ E  3        3/3 = 100%
O^E ⇒ K  3        3/3 = 100%
To see a worked example of the Apriori algorithm, also check this video: https://youtu.be/LZii6N4vGDs
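For reference, a compact Python sketch of the join-and-prune loop on the transactions above (itemsets are frozensets; min_support is the count threshold of 3 used in the tables; this is a teaching sketch, not a tuned implementation):

    from itertools import combinations

    transactions = [set("MONKEY"), set("DONKEY"), set("MAKE"), set("MUCKY"), set("COOKE")]
    min_support = 3

    def frequent(candidates):
        # count each candidate's support and keep those meeting the threshold
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    # C1 -> L1: start from single items
    level = frequent({frozenset({item}) for t in transactions for item in t})
    while level:
        print({"".join(sorted(c)): n for c, n in level.items()})  # L1, L2, L3 ...
        items = sorted(set().union(*level))
        k = len(next(iter(level))) + 1
        # join step: build candidates one item larger; prune step: a candidate
        # survives only if every (k-1)-subset of it was frequent at the previous level
        level = frequent({frozenset(c) for c in combinations(items, k)
                          if all(frozenset(s) in level for s in combinations(c, k - 1))})

Running it prints the L1, L2, and L3 tables above, ending with {O,K,E} at count 3.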
1. Airline:
In airline systems, a data warehouse is used for operational purposes such as crew assignment, analysis of route profitability, frequent-flyer program promotions, etc.
2. Banking:
It is widely used in the banking sector to manage the resources available on desk effectively. A few banks also use it for market research and for performance analysis of products and operations.
3. Healthcare:
The healthcare sector also uses data warehouses to strategize and predict outcomes, generate patient treatment reports, share data with tie-in insurance companies, medical aid services, etc.
4. Public sector:
In the public sector, a data warehouse is used for intelligence gathering. It helps government agencies maintain and analyze tax records and health policy records for every individual.
5. Investment and insurance sector:
In this sector, the warehouses are primarily used to analyze data patterns and customer trends, and to track market movements.
6. Retail chain:
In retail chains, a data warehouse is widely used for distribution and marketing. It also helps track items and customer buying patterns, manage promotions, and determine pricing policy.
7. Telecommunication:
A data warehouse is used in this sector for product promotions, sales decisions, and distribution decisions.
8. Hospitality industry:
This industry uses warehouse services to design and estimate advertising and promotion campaigns, targeting clients based on their feedback and travel patterns.
• OLAP stands for "Online Analytical Processing." OLAP allows users to analyze database information from multiple database systems at one time. While relational databases are considered to be two-dimensional, OLAP data is multidimensional, meaning the information can be compared in many different ways. For example, a company might compare its computer sales in June with sales in July, then compare those results with the sales from another location, which might be stored in a different database.
• In order to process database information using OLAP, an OLAP server is required to organize and compare the information. Clients can analyze different sets of data using functions built into the OLAP server. Some popular OLAP server products include Oracle Express Server and Hyperion Solutions Essbase. Because of its powerful data analysis capabilities, OLAP is often used for data mining, which aims to discover new relationships between different sets of data.
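A toy illustration of the idea in Python, assuming pandas is available (the month, location, and unit figures are invented; the point is that one table of facts can be sliced along either dimension, or both):

    import pandas as pd

    # hypothetical sales facts: one row per (month, location) observation
    sales = pd.DataFrame({
        "month":    ["Jun", "Jul", "Jun", "Jul"],
        "location": ["Store A", "Store A", "Store B", "Store B"],
        "units":    [120, 150, 90, 110],
    })

    # pivot the same data two different ways, as an OLAP cube would
    print(sales.pivot_table(values="units", index="month", columns="location", aggfunc="sum"))
    print(sales.groupby("location")["units"].sum())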
Metadata is "data that provides information about other data". In other words, it is "data
about data". Many distinct types of metadata exist, including descriptive metadata,
structural metadata, administrative metadata, reference metadata, statistical metadata
and legal metadata.
1. What is data?
• Data commonly refers to ‘raw’ data: a collection of text, numbers and symbols with no meaning. Data therefore has to be processed, or provided with a context, before it can have meaning.
• Example
• 3, 6, 9, 12
2. What is information?
Information is the result of processing data, usually by computer. This results in facts, which enable the processed data to be used in context and have meaning. Information is data that has meaning.
Example
• 161.2, 175.3, 166.4, 164.7, 169.3
Only when we assign a context or meaning does the data become information. It all becomes meaningful when we are told:
• 161.2, 175.3, 166.4, 164.7, 169.3 are the heights of 15-year-old students.
3. What is knowledge?
Knowledge is the understanding gained by interpreting and applying information; it combines information with experience and context so that it can be used to make judgments, predictions and decisions.
A dimension table is a table that contains the attributes of the measurements stored in fact tables. It holds hierarchies, categories and logic that can be used to traverse the nodes.
A fact table is the primary table in a dimensional model. It contains the measurements, metrics, and facts about a business process.
ETL is an abbreviation of Extract, Transform and Load. ETL software reads data from a specified data source and extracts a desired subset of it. Next, it transforms the data using rules and lookup tables, converting it to the desired state. Finally, the load function writes the resulting data to the target database.
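A minimal Python sketch of the three stages (the file name, column names, and cleanup rule are all hypothetical; production ETL tools wrap this same pattern in much more machinery):

    import csv
    import sqlite3

    # Extract: read rows from a hypothetical source file
    with open("sales.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    # Transform: keep a desired subset and convert types with a simple rule
    clean = [(r["region"], float(r["amount"])) for r in rows if r["amount"]]

    # Load: write the result into the target database
    con = sqlite3.connect("warehouse.db")
    con.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?)", clean)
    con.commit()
    con.close()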
A data mart contains a subset of organization-wide data. This subset of data is valuable to specific groups of an organization. In other words, we can say that a data mart contains data specific to a particular group.
26. Accuracy, Error Rate, precision, Recall, Sensitivity, Specificity in multiclass classification
• Accuracy: It gives you the overall accuracy of the model, meaning the fraction of the
total samples that were correctly classified by the classifier. To calculate accuracy, use
the following formula: (TP+TN)/(TP+TN+FP+FN).
• Misclassification Rate: It tells you what fraction of predictions were incorrect. It is also
known as Classification Error. You can calculate it
using (FP+FN)/(TP+TN+FP+FN) or (1-Accuracy).
• Precision: It tells you what fraction of predictions as a positive class were actually
positive. To calculate precision, use the following formula: TP/(TP+FP).
• Recall: It tells you what fraction of all positive samples were correctly predicted as
positive by the classifier. It is also known as True Positive Rate (TPR), Sensitivity,
Probability of Detection. To calculate Recall, use the following formula: TP/(TP+FN).
• Specificity: It tells you what fraction of all negative samples are correctly predicted as
negative by the classifier. It is also known as True Negative Rate (TNR). To calculate
specificity, use the following formula: TN/(TN+FP).
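These formulas in Python, applied to hypothetical one-vs-rest confusion-matrix counts for a single class (in multiclass classification you compute them per class and then average):

    # hypothetical one-vs-rest counts for one class
    TP, TN, FP, FN = 40, 45, 5, 10
    total = TP + TN + FP + FN

    accuracy    = (TP + TN) / total  # 85/100 = 0.85
    error_rate  = (FP + FN) / total  # 1 - accuracy = 0.15
    precision   = TP / (TP + FP)     # 40/45 ≈ 0.889
    recall      = TP / (TP + FN)     # sensitivity / TPR: 40/50 = 0.80
    specificity = TN / (TN + FP)     # TNR: 45/50 = 0.90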
27. Explain the techniques to improve the efficiency of the Apriori algorithm
• Hash-based technique
• Transaction reduction
• Partitioning
• Sampling
• Dynamic itemset counting
28. In the context of data warehousing, what is data transformation?
In data transformation, the data are transformed or consolidated into forms appropriate for mining. Data transformation can involve the following: smoothing, aggregation, generalization, normalization, and attribute construction.
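As one concrete case, min-max normalization rescales a numeric attribute onto a new range such as [0, 1]; a quick Python sketch with made-up values:

    def min_max(values, new_min=0.0, new_max=1.0):
        lo, hi = min(values), max(values)
        # map each value linearly from [lo, hi] onto [new_min, new_max]
        return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

    print(min_max([200, 300, 400, 600, 1000]))  # [0.0, 0.125, 0.25, 0.5, 1.0]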
29. What are the steps involved in the KDD (Knowledge Discovery in Databases) process?
1. Data cleaning
2. Data integration
3. Data selection
4. Data transformation
5. Data mining
6. Pattern evaluation
7. Knowledge presentation
2. Regression: Regression is the process of finding a model or function that maps the data to continuous real values instead of classes or discrete values. It can also identify the movement of the distribution based on historical data. Because a regression predictive model predicts a quantity, the skill of the model must be reported as an error in those predictions.
Take a similar example in regression, where we estimate the possibility of rain in a particular region with the help of parameters recorded earlier. Then there is a probability associated with the rain.
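A tiny least-squares sketch in Python showing a regression model that predicts a quantity and is scored by an error rather than by accuracy (the humidity/rainfall numbers are invented):

    # hypothetical (humidity %, rainfall mm) observations
    data = [(60, 2.0), (70, 4.5), (80, 6.0), (90, 9.5)]
    xs, ys = zip(*data)

    # closed-form simple linear regression: y = a*x + b
    n = len(data)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in data) / sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x

    # report the model's skill as an error on its predictions (mean absolute error)
    mae = sum(abs((a * x + b) - y) for x, y in data) / n
    print(f"rain(85) ≈ {a * 85 + b:.1f} mm, MAE = {mae:.2f}")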